Neural speech codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs our proposed Semantic Anchoring mechanism. This design strategically decouples the quantization of semantic content from that of acoustic detail. Semantic anchoring is achieved via a lightweight projector that aligns acoustic features with a frozen, large-scale mHuBERT codebook, injecting linguistic priors while guaranteeing full codebook utilization. For acoustic detail, a residual activation module with SimVQ then enables a single-layer quantizer (the acoustic path) to faithfully recover fine-grained information. At just 1.5 kbps, SACodec establishes a new state of the art by excelling in both fidelity and semantics: subjective listening tests confirm that its reconstruction quality is perceptually comparable to ground-truth audio, while its tokens demonstrate substantially improved semantic richness in downstream tasks.
Primary: Unknown
All Institutions: Unknown
The paper presents SACodec, a novel neural speech codec that addresses the trade-off between acoustic fidelity and semantic richness at low bitrates, achieving state-of-the-art performance in both areas. The technical contributions, particularly the asymmetric dual-quantizer design and the integration of semantic anchoring, represent a meaningful advancement in the field of neural speech coding, with potential applications in various audio processing tasks.
The methodology introduces a novel asymmetric dual-quantizer architecture that effectively separates semantic and acoustic information, addressing the limitations of traditional vector quantization methods. The use of a fixed mHuBERT codebook for semantic anchoring is innovative and allows for better utilization of semantic information without the risk of codebook collapse. The integration of SimVQ for acoustic detail recovery is a significant advancement, ensuring high fidelity at low bitrates. The overall design is well-structured and demonstrates a clear understanding of the challenges in low-bitrate speech coding.
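To make the anchoring mechanism concrete, here is a minimal PyTorch sketch of such an asymmetric dual quantizer; the dimensions, the straight-through estimator, and the SimVQ-style codebook reparameterization are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticAnchorQuantizer(nn.Module):
    def __init__(self, codebook, feat_dim=512, sem_dim=768, acoustic_codes=1024):
        super().__init__()
        self.register_buffer("sem_codebook", codebook)      # frozen mHuBERT unit embeddings
        self.projector = nn.Linear(feat_dim, sem_dim)        # lightweight semantic projector
        self.unproject = nn.Linear(sem_dim, feat_dim)
        # SimVQ-style acoustic codebook: a frozen base codebook re-parameterized
        # through a learnable linear map (our reading of SimVQ; details may differ).
        self.register_buffer("base_codes", torch.randn(acoustic_codes, feat_dim))
        self.code_proj = nn.Linear(feat_dim, feat_dim, bias=False)

    @staticmethod
    def nearest(x, codebook):
        idx = torch.cdist(x, codebook).argmin(dim=-1)        # nearest code per frame
        q = codebook[idx]
        return x + (q - x).detach(), idx                     # straight-through estimator

    def forward(self, feats):                                # feats: (T, feat_dim)
        # Semantic path: anchor projected features to the frozen codebook.
        sem_q, sem_idx = self.nearest(self.projector(feats), self.sem_codebook)
        # Acoustic path: quantize the residual with the re-parameterized codebook.
        residual = feats - self.unproject(sem_q)
        ac_q, ac_idx = self.nearest(residual, self.code_proj(self.base_codes))
        # Commitment / codebook losses (omitted here) would train code_proj in practice.
        return self.unproject(sem_q) + ac_q, (sem_idx, ac_idx)

# Usage with a random stand-in for the frozen mHuBERT codebook:
vq = SemanticAnchorQuantizer(codebook=torch.randn(1000, 768))
recon, (sem_ids, ac_ids) = vq(torch.randn(50, 512))
```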
The experiments are comprehensive, utilizing a robust dataset (LibriTTS) and comparing against multiple state-of-the-art codecs. The results show SACodec achieving superior performance in both reconstruction quality and semantic richness, validated through objective metrics and subjective listening tests. The ablation studies provide strong evidence for the effectiveness of the proposed components, enhancing the credibility of the findings.
The paper provides sufficient implementation details, including architecture specifics and training procedures, which should facilitate reproducibility. The code is available on GitHub, further aiding in the validation of results by other researchers.
The study is limited to English language data, which may restrict the applicability of the findings to other languages. Additionally, while the model shows promise, its integration into real-world applications and downstream systems remains to be fully validated.
The advancements presented in SACodec have significant implications for real-time speech applications, particularly in scenarios requiring low-bitrate transmission without sacrificing quality. This could enhance communication technologies, improve accessibility, and facilitate the deployment of speech models in resource-constrained environments.
Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces false-positive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equal-error rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 on the In-the-Wild set. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positive-rate reductions of 0.116 (38.8%), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report how consistent the results are across cross-validation folds and margin-based measures of class separation, using identical settings for both models. The only modification is the kernel; the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.
Primary: University of Maryland, Baltimore County
All Institutions: University of Maryland, Baltimore County
The paper presents a novel approach to audio deepfake detection using quantum-kernel SVMs, demonstrating significant improvements in detection accuracy and reliability. The rigorous methodology and comprehensive experimental evaluation contribute valuable insights into the potential of quantum computing in enhancing machine learning models for security applications.
The methodology is robust, employing a controlled "kernel-swap" approach that isolates the effects of the quantum kernel from other variables such as model size and optimization techniques. The use of mel-spectrograms as features is appropriate for audio data, and the application of PCA for dimensionality reduction is well-justified. The experimental design, including stratified 5-fold cross-validation across multiple datasets, strengthens the validity of the findings. However, the reliance on classical simulation for quantum kernel computation may limit the practical applicability of the proposed method in real-time scenarios.
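For readers who want to see what the kernel swap amounts to in code, the sketch below pairs a classical RBF SVM with a quantum-kernel SVM on identical features. The ZZ feature map, PCA width, and dummy data are illustrative assumptions (the paper's exact preprocessing may differ), and in practice EER/FPR would be computed from per-fold decision scores rather than the accuracies printed here.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from qiskit.circuit.library import ZZFeatureMap
from qiskit_machine_learning.kernels import FidelityQuantumKernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 80))          # stand-in for mel-spectrogram feature vectors
y = rng.integers(0, 2, size=200)        # 0 = bona fide, 1 = spoof

n_qubits = 8                            # one qubit per PCA component (illustrative)
fmap = ZZFeatureMap(feature_dimension=n_qubits, reps=2)
qkernel = FidelityQuantumKernel(feature_map=fmap)     # evaluated on a classical simulator

for train, test in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    scaler = StandardScaler().fit(X[train])
    pca = PCA(n_components=n_qubits).fit(scaler.transform(X[train]))
    Xtr, Xte = (pca.transform(scaler.transform(X[s])) for s in (train, test))

    svm_rbf = SVC(kernel="rbf").fit(Xtr, y[train])             # classical baseline
    K_tr = qkernel.evaluate(x_vec=Xtr)                         # quantum Gram matrix
    K_te = qkernel.evaluate(x_vec=Xte, y_vec=Xtr)
    svm_q = SVC(kernel="precomputed").fit(K_tr, y[train])      # identical SVM, swapped kernel

    # EER would be derived from decision_function scores; accuracy shown for brevity.
    print("RBF acc:", svm_rbf.score(Xte, y[test]), "QSVM acc:", svm_q.score(K_te, y[test]))
```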
The experiments are well-structured, comparing the performance of QSVM against classical SVMs across four diverse datasets. The reported results demonstrate significant improvements in equal-error rates (EER) and false-positive rates (FPR) for QSVM, indicating its effectiveness in audio deepfake detection. The statistical analysis, including effect sizes, adds rigor to the evaluation. However, the paper could benefit from additional comparisons with modern deep learning architectures to contextualize the performance of QSVM further.
The paper provides detailed implementation details, including the use of specific quantum feature maps and the computational setup. However, the lack of a publicly accessible code repository or demo limits reproducibility. The authors mention using Qiskit Machine Learning, which is a positive aspect, but sharing the code would enhance transparency and facilitate further research.
One limitation is the focus on small datasets, which may not fully represent the challenges faced in larger, more diverse real-world applications. Additionally, the quantum kernel's computation on classical hardware may not reflect the potential advantages of quantum computing in practice. The paper also does not address adversarial robustness or real-time constraints, which are critical for deployment in security-sensitive environments.
The findings have significant implications for the fields of audio forensics and cybersecurity, where reliable detection of deepfakes is crucial. The proposed method could enhance existing detection systems, making them more robust against adversarial attacks and variability in recording conditions. As quantum computing technology advances, the integration of quantum kernels into machine learning pipelines may lead to further breakthroughs in various applications beyond audio deepfake detection.
Text-guided audio editing aims to modify specific acoustic events while strictly preserving non-target content. Despite recent progress, existing approaches remain fundamentally limited. Training-free methods often suffer from signal degradation caused by diffusion inversion, while training-based methods, although achieving higher generation quality, are severely constrained by the scarcity of high-quality paired data and task formulations that cover only a narrow subset of editing operations. In addition, standard architectures typically decouple text and audio processing, limiting the ability to align instructions with specific acoustic contexts. To address these challenges, we propose MMEdit, an audio-language-model-driven framework for unified audio editing. We systematically extend task definitions to cover a comprehensive range of editing operations, including addition, replacement, removal, reordering, and attribute modification. Furthermore, we design a scalable data synthesis pipeline to construct large-scale paired datasets with fine-grained event-level annotations. To capture complex editing semantics, we integrate a Qwen2-Audio encoder with an MMDiT-based generator, enabling precise cross-modal alignment and localized editing. Experimental results demonstrate that our method achieves superior editing localization accuracy, robust instruction following, and high fidelity in non-edited regions.
Primary: Shanghai Jiao Tong University
All Institutions: MoE Key Lab of Artificial Intelligence, Shanghai AI Laboratory, Nanjing University
The main contribution of this paper is the development of MMEdit, a unified framework for multi-type audio editing that leverages an Audio Language Model to enhance instruction-following capabilities and editing fidelity. This work significantly advances the state-of-the-art in audio editing by addressing existing limitations and expanding the range of editable operations, thus holding substantial promise for practical applications in audio processing.
The proposed MMEdit framework is innovative in its integration of an Audio Language Model (ALM) with a multimodal generator, addressing the limitations of existing audio editing methods by allowing for a broader range of editing operations and improved instruction-following capabilities. The systematic extension of editing tasks and the development of a scalable data synthesis pipeline for generating high-quality paired datasets are significant contributions that enhance the framework's applicability and robustness.
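As an illustration of what a single item in such a synthesis pipeline could look like, the sketch below constructs one "addition" triple of source audio, natural-language instruction, and edited target with an event-level annotation; the mixing gain, instruction template, and synthetic signals are hypothetical and not taken from the authors' pipeline.

```python
import numpy as np

def make_addition_pair(base, event, sr, start_s, gain_db=-3.0, label="dog bark"):
    """Mix `event` into `base` at `start_s` seconds and return a paired example."""
    edited = base.copy()
    start = int(start_s * sr)
    end = min(start + len(event), len(edited))
    edited[start:end] += (10 ** (gain_db / 20.0)) * event[: end - start]
    edited = np.clip(edited, -1.0, 1.0)
    instruction = f"Add a {label} starting at {start_s:.1f} seconds."
    annotation = {"op": "add", "event": label, "onset": start_s,
                  "offset": start_s + (end - start) / sr}
    return {"source": base, "instruction": instruction,
            "target": edited, "annotation": annotation}

sr = 16000
base = 0.05 * np.random.randn(10 * sr)                          # stand-in background clip
event = 0.3 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr)      # stand-in event clip
pair = make_addition_pair(base, event, sr, start_s=2.5)
```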
The experiments are well-designed, utilizing both objective metrics and subjective evaluations to assess the performance of MMEdit against existing baselines. The use of a large-scale synthetic dataset and the inclusion of diverse editing tasks demonstrate the framework's versatility. Results indicate that MMEdit consistently outperforms competitors in terms of fidelity and instruction adherence, showcasing its effectiveness in practical scenarios.
The authors have committed to open-sourcing their evaluation benchmark and data generation pipeline, which is a positive step towards reproducibility. However, the paper could benefit from more detailed implementation specifics and code availability to facilitate independent validation of results.
While the framework shows promise, it has limitations regarding generation diversity and fine-grained temporal alignment, particularly for complex overlapping audio events. These challenges could hinder its effectiveness in more intricate real-world applications.
The advancements in text-guided audio editing presented in this paper have the potential to significantly impact various fields, including music production, sound design, and audio post-production, by providing more intuitive and efficient tools for audio manipulation. The ability to edit audio based on natural language instructions could democratize audio editing, making it accessible to users without technical expertise.
Rare words remain a critical bottleneck for speech-to-text systems. While direct fine-tuning improves recognition of target words, it often incurs high cost, catastrophic forgetting, and limited scalability. To address these challenges, we propose a training-free paradigm based on task vectors for rare word recognition and translation. By defining task vectors as parameter differences and introducing word-level task vector arithmetic, our approach enables flexible composition of rare-word capabilities, greatly enhancing scalability and reusability. Extensive experiments across multiple domains show that the proposed method matches or surpasses fine-tuned models on target words, improves general performance by about 5 BLEU, and mitigates catastrophic forgetting.
Primary: Institute of Artificial Intelligence (TeleAI)
All Institutions: Institute of Artificial Intelligence (TeleAI)
The main contribution of this paper is the introduction of a training-free paradigm based on task vectors for rare word recognition and translation in speech models, which enhances scalability and mitigates catastrophic forgetting. This work is significant as it proposes a novel approach that could transform how speech models handle rare vocabulary, potentially leading to broader applications in various fields.
The proposed methodology introduces a novel approach to rare word recognition and translation in speech models using task vectors, which represent parameter shifts rather than requiring fine-tuning. This method is innovative as it allows for the flexible composition of capabilities across multiple rare words without incurring the costs associated with traditional fine-tuning. The introduction of word-level task vector arithmetic is a significant advancement that enhances scalability and reusability. However, while the theoretical foundation is solid, the paper could benefit from a more detailed exploration of the implications of task vector arithmetic on model performance across diverse datasets.
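The core arithmetic is simple enough to state in a few lines. The sketch below follows the usual task-vector definition (parameter difference between a fine-tuned checkpoint and its base) and a linear composition rule; the scaling coefficients and the toy model are purely illustrative, not the paper's exact setup.

```python
import torch

def task_vector(base_state, tuned_state):
    """tau = theta_finetuned - theta_base, computed per parameter tensor."""
    return {k: tuned_state[k] - base_state[k] for k in base_state}

def apply_task_vectors(base_state, vectors, alphas):
    """theta = theta_base + sum_i alpha_i * tau_i (training-free composition)."""
    merged = {k: v.clone() for k, v in base_state.items()}
    for tau, a in zip(vectors, alphas):
        for k in merged:
            if merged[k].is_floating_point():    # skip integer buffers (e.g., step counters)
                merged[k] = merged[k] + a * tau[k]
    return merged

# Toy demo: a base linear model and two hypothetical rare-word fine-tunes.
base = torch.nn.Linear(4, 4)
ft_a, ft_b = torch.nn.Linear(4, 4), torch.nn.Linear(4, 4)
tau_a = task_vector(base.state_dict(), ft_a.state_dict())
tau_b = task_vector(base.state_dict(), ft_b.state_dict())
base.load_state_dict(apply_task_vectors(base.state_dict(), [tau_a, tau_b], [0.5, 0.5]))
```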
The experiments are extensive and cover multiple domains, demonstrating the effectiveness of the proposed method against fine-tuned models. The use of BLEU scores and Character Error Rate (CER) as evaluation metrics is appropriate for the tasks at hand. The results indicate that the task-vector-based models not only match but sometimes exceed the performance of fine-tuned models, which is a strong endorsement of the method. However, the paper lacks a thorough analysis of the specific datasets used and their characteristics, which would provide better context for the results.
The paper provides a general overview of the experimental setup, including training configurations and evaluation metrics. However, it lacks detailed implementation instructions or code availability, which are crucial for reproducibility. The absence of a demo or project URL further limits the ability of other researchers to replicate the findings.
One notable limitation is the potential for parameter conflicts when combining multiple task vectors, which the authors acknowledge but do not fully explore. Additionally, while the method shows promise, it may not generalize well to all types of rare words or domains, particularly those that are highly specialized or context-dependent. The paper could also benefit from a discussion on the computational efficiency of the proposed method compared to traditional fine-tuning.
The implications of this research are significant, particularly for applications in automatic speech recognition (ASR) and automatic speech translation (AST) in specialized fields such as medicine and law. By addressing the challenges of rare word recognition, this work could enhance the accessibility and accuracy of speech technologies in critical domains, ultimately improving user experience and outcomes.
Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on downstream recognition tasks. We propose SemDAC, a semantic-aware neural audio codec that leverages semantic codebooks as effective priors for speech compression. In SemDAC, the first quantizer in a residual vector quantization (RVQ) stack is distilled from HuBERT features to produce semantic tokens that capture phonetic content, while subsequent quantizers model residual acoustics. A FiLM-conditioned decoder reconstructs audio conditioned on the semantic tokens, improving efficiency in the use of acoustic codebooks. Despite its simplicity, this design proves highly effective: SemDAC outperforms DAC across perceptual metrics and achieves lower WER when running Whisper on reconstructed speech, all while operating at substantially lower bitrates (e.g., 0.95 kbps vs. 2.5 kbps for DAC). These results demonstrate that semantic codebooks provide an effective inductive bias for neural speech compression, producing compact yet recognition-friendly representations.
Primary: NYU Shanghai
All Institutions: NYU Shanghai
The main contribution of this paper is the introduction of SemDAC, a semantic-aware neural audio codec that effectively integrates semantic codebooks into the speech compression process, leading to improved efficiency and recognition accuracy. This work represents a meaningful advancement in the field of audio processing, particularly in the context of neural codecs, by demonstrating the importance of semantic information in enhancing audio quality at lower bitrates.
The proposed methodology in SemDAC is innovative, as it integrates semantic codebooks into the neural audio codec framework, specifically leveraging HuBERT features to enhance the quantization process. The use of a FiLM-conditioned decoder to incorporate semantic tokens into the reconstruction process is a significant advancement over traditional approaches that primarily focus on acoustic fidelity. The asymmetric design of the quantization layers, separating semantic and acoustic representations, is a thoughtful approach that addresses the inefficiencies of existing codecs.
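As a concrete reference for the conditioning mechanism, the sketch below shows a generic FiLM layer driven by embeddings of the semantic tokens; the embedding size, channel count, and placement inside the decoder are assumptions rather than SemDAC's exact configuration.

```python
import torch
import torch.nn as nn

class FiLMLayer(nn.Module):
    """Predict per-channel scale and shift from the semantic-token embedding."""
    def __init__(self, sem_dim, channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(sem_dim, 2 * channels)

    def forward(self, x, sem_emb):
        # x: (B, C, T) acoustic feature map; sem_emb: (B, T, sem_dim)
        gamma, beta = self.to_gamma_beta(sem_emb).chunk(2, dim=-1)   # (B, T, C) each
        return gamma.transpose(1, 2) * x + beta.transpose(1, 2)

# Usage: modulate a decoder block's features with embeddings of the semantic codes.
sem_tokens = torch.randint(0, 500, (2, 100))          # first-quantizer (semantic) codes
sem_emb = nn.Embedding(500, 256)(sem_tokens)          # (B, T, 256)
feats = torch.randn(2, 128, 100)                      # (B, C, T) decoder activations
out = FiLMLayer(sem_dim=256, channels=128)(feats, sem_emb)
```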
The experiments are well-structured, utilizing the LibriSpeech dataset, which is a standard benchmark in speech processing. The comparison against the DAC and Opus codecs across various bitrates provides a comprehensive evaluation of SemDAC's performance. The results demonstrate clear advantages in perceptual metrics and word error rates, particularly at lower bitrates, which is crucial for practical applications in speech compression.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which should facilitate reproducibility. However, the absence of a public code repository or demo URL limits the practical reproducibility of the results, as other researchers would need to implement the model from scratch based on the provided descriptions.
While the paper presents a compelling approach, it does not extensively discuss potential limitations or scenarios where SemDAC may underperform compared to other codecs. Additionally, the reliance on HuBERT features may limit the model's applicability to languages or dialects for which such pretrained models are not available.
The implications of this work are significant, as it addresses the growing need for efficient speech compression methods that maintain high perceptual quality and intelligibility, especially in low-bitrate scenarios. This could benefit various applications, including voice communication over constrained networks, speech recognition systems, and assistive technologies.
Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We present GenTSE, a two-stage decoder-only generative LM approach for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more faithful, content-aligned target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further employ DPO to better align outputs with human perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of GenTSE, a novel two-stage generative language model for target speaker extraction that effectively separates semantic and acoustic generation, improving speech quality and intelligibility. This work represents a meaningful advancement in the field of audio processing and generative modeling, addressing critical challenges in speaker extraction tasks.
The paper introduces a two-stage generative language model architecture for target speaker extraction (TSE), which is innovative in its separation of semantic and acoustic token generation. The use of Frozen-LM Conditioning (FLC) to mitigate exposure bias and Direct Preference Optimization (DPO) to align outputs with human perceptual preferences are significant methodological advancements. The hierarchical modeling approach effectively reduces the complexity of direct acoustic modeling, which is a notable strength.
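For context on the preference-alignment step, the sketch below implements the standard DPO objective on sequence log-likelihoods; whether GenTSE uses exactly this formulation and this beta is an assumption.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """logp_* are summed log-probabilities of whole token sequences under the policy LM;
    ref_logp_* are the same sequences scored by the frozen reference LM."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Usage with dummy per-sample sequence log-probs (batch of 4 preference pairs):
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
```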
The experiments are well-structured, utilizing the Libri2Mix dataset, a standard benchmark for TSE. The results demonstrate that GenTSE outperforms existing models in terms of speech quality, intelligibility, and speaker consistency, which is quantitatively supported by various metrics. The ablation studies provide insights into the importance of each component of the model.
The implementation details are provided, including model architectures, training configurations, and evaluation metrics. However, the absence of a public code repository or demo limits the reproducibility of the results. The paper could benefit from providing access to code or models to facilitate further research.
While the paper presents a strong approach, it does not address potential scalability issues or the computational cost associated with training the two-stage model. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or languages.
The advancements in TSE have significant implications for applications in speech processing, including voice recognition, telecommunications, and assistive technologies. The ability to extract a target speaker's voice from a mixture could enhance user experiences in various audio applications.
Sound separation (SS) and target sound extraction (TSE) are fundamental techniques for addressing complex acoustic scenarios. While existing SS methods struggle with determining the unknown number of sound sources, TSE approaches require precisely specified clues to achieve optimal performance. This paper proposes a unified framework that synergistically combines SS and TSE to overcome their individual limitations. Our architecture employs two complementary components: 1) An Encoder-Decoder Attractor (EDA) network that automatically infers both the source count and corresponding acoustic clues for SS, and 2) A multi-modal fusion network that precisely interprets diverse user-provided clues (acoustic, semantic, or visual) for TSE. Through joint training with cross-task consistency constraints, we establish a unified latent space that bridges both paradigms. During inference, the system adaptively operates in either fully autonomous SS mode or clue-driven TSE mode. Experiments demonstrate strong performance on both tasks, with a notable 1.4 dB SDR improvement in SS over the baseline and 86% TSE accuracy.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to unify sound separation and extraction tasks, demonstrating significant improvements in performance while addressing key limitations of existing methods. The integration of multi-modal clues and adaptive inference modes positions this work as a valuable contribution to the field of audio processing and machine learning.
The proposed USE framework integrates sound separation (SS) and target sound extraction (TSE) into a unified model, addressing the limitations of existing methods that require predefined source counts or high-quality clues. The architecture employs an Encoder-Decoder Attractor (EDA) network for inferring source counts and a multi-modal fusion network for interpreting various user-provided clues. The joint training strategy with cross-task consistency constraints is innovative, allowing the model to adaptively switch between SS and TSE modes during inference. However, the complexity of the architecture may pose challenges in practical applications, especially in real-time scenarios.
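To ground the attractor idea, the sketch below follows the generic EEND-EDA recipe: an LSTM encoder summarizes frame embeddings, an LSTM decoder fed zero vectors emits one attractor per source, and a linear-plus-sigmoid head gives each attractor an existence probability. How the paper's EDA differs from this generic form is not specified here, so the sizes and stopping rule are assumptions.

```python
import torch
import torch.nn as nn

class EDA(nn.Module):
    def __init__(self, dim=256, max_sources=8):
        super().__init__()
        self.encoder = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, batch_first=True)
        self.exists = nn.Linear(dim, 1)
        self.max_sources = max_sources

    def forward(self, frames, threshold=0.5):            # frames: (B, T, dim)
        _, state = self.encoder(frames)                  # summarize the mixture
        zeros = frames.new_zeros(frames.size(0), self.max_sources, frames.size(2))
        attractors, _ = self.decoder(zeros, state)       # (B, max_sources, dim)
        probs = torch.sigmoid(self.exists(attractors)).squeeze(-1)
        keep = probs > threshold                         # attractors judged to exist
        return attractors, probs, keep

frames = torch.randn(1, 200, 256)                        # frame embeddings of a mixture
attractors, probs, keep = EDA()(frames)                  # estimated clues + source count
```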
The experiments conducted on universal sound datasets demonstrate significant improvements in both SS and TSE tasks, with a reported 1.4 dB SDR improvement in SS and 86% accuracy in TSE. The use of diverse datasets, including the AudioSet and custom 2Mix and 3Mix datasets, strengthens the evaluation. However, the paper lacks detailed comparisons with a broader range of existing models, which could provide a clearer context for the reported improvements.
The paper outlines the training and evaluation strategies, including the two-stage training process and the datasets used. However, the absence of a publicly available code repository limits reproducibility. The methodology is described in detail, but without access to the actual implementation, it may be challenging for others to replicate the results.
The performance of the USE model is noted to be influenced by the cleanliness of the training data, which may affect its adaptability to various tasks. Additionally, the reliance on multiple modalities for clues could complicate its deployment in scenarios where such data is not readily available. The paper also does not address the computational efficiency of the model in real-time applications, which is critical for practical use.
The USE framework has the potential to significantly enhance audio processing applications, particularly in environments with complex sound mixtures, such as surveillance, multimedia content creation, and assistive technologies for the hearing impaired. By effectively separating and extracting sounds from various sources, it could lead to advancements in audio analysis and understanding, fostering further research in the field.
Neural vocoders and codecs reconstruct waveforms from acoustic representations, which directly impact the audio quality. Among existing methods, upsampling-based time-domain models are superior in both inference speed and synthesis quality, achieving state-of-the-art performance. Still, despite their success in producing perceptually natural sound, their synthesis fidelity remains limited due to the aliasing artifacts brought by the inadequately designed model architectures. In particular, the unconstrained nonlinear activation generates an infinite number of harmonics that exceed the Nyquist frequency, resulting in "folded-back" aliasing artifacts. The widely used upsampling layer, ConvTranspose, copies the mirrored low-frequency parts to fill the empty high-frequency region, resulting in "mirrored" aliasing artifacts. Meanwhile, the combination of its inherent periodicity and the mirrored DC bias also brings "tonal artifact," resulting in constant-frequency ringing. This paper aims to solve these issues from a signal processing perspective. Specifically, we apply oversampling and anti-derivative anti-aliasing to the activation function to obtain its anti-aliased form, and replace the problematic ConvTranspose layer with resampling to avoid the "tonal artifact" and eliminate aliased components. Based on our proposed anti-aliased modules, we introduce Pupu-Vocoder and Pupu-Codec, and release high-quality pre-trained checkpoints to facilitate audio generation research. We build a test signal benchmark to illustrate the effectiveness of the anti-aliased modules, and conduct experiments on speech, singing voice, music, and audio to validate our proposed models. Experimental results confirm that our lightweight Pupu-Vocoder and Pupu-Codec models can easily outperform existing systems on singing voice, music, and audio, while achieving comparable performance on speech.
Primary: Chinese University of Hong Kong
All Institutions: Chinese University of Hong Kong, Aalto University, Spellbrush, Acoustic Lab
The main contribution of this paper is the introduction of anti-aliased activation functions and upsampling methods that significantly improve the fidelity of neural audio synthesis models. This work represents a meaningful advancement in the field of audio generation, bridging the gap between signal processing and deep learning to tackle a well-known challenge in audio synthesis.
The paper introduces a novel approach to address aliasing artifacts in neural audio synthesis by applying anti-derivative anti-aliasing (ADAA) techniques to activation functions and replacing the ConvTranspose layer with a resampling method. This approach is grounded in signal processing principles, which adds a layer of theoretical robustness to the proposed models, Pupu-Vocoder and Pupu-Codec. The methodology is well-structured and clearly articulated, showcasing a thoughtful integration of signal processing concepts into deep learning architectures.
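The anti-derivative idea can be written compactly. The sketch below applies first-order ADAA to a tanh activation using its antiderivative log(cosh(x)), replacing f(x[n]) with (F(x[n]) - F(x[n-1])) / (x[n] - x[n-1]) and falling back to f at near-equal consecutive samples; the oversampling step described in the abstract is omitted, so treat this as a schematic of the principle rather than the Pupu-Vocoder module.

```python
import torch

def adaa_tanh(x, eps=1e-6):
    """First-order ADAA: average of tanh over [x[n-1], x[n]] via its antiderivative."""
    F1 = torch.log(torch.cosh(x))                 # antiderivative of tanh (log-cosh);
    x_prev = torch.roll(x, shifts=1, dims=-1)     # a numerically stable log-cosh is
    F_prev = torch.roll(F1, shifts=1, dims=-1)    # preferable for large inputs
    diff = x - x_prev
    safe = diff.abs() > eps
    denom = torch.where(safe, diff, torch.ones_like(diff))
    y = torch.where(safe, (F1 - F_prev) / denom, torch.tanh(0.5 * (x + x_prev)))
    first = torch.zeros_like(x, dtype=torch.bool)
    first[..., 0] = True                          # no previous sample at index 0
    return torch.where(first, torch.tanh(x), y)

y = adaa_tanh(torch.randn(1, 16000))              # anti-aliased activation output
```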
The experimental setup is comprehensive, utilizing a diverse set of datasets across multiple domains (speech, singing voice, music, and audio). The results demonstrate the superiority of the proposed models over existing systems, with detailed metrics provided for both objective and subjective evaluations. The use of a test signal benchmark to validate the effectiveness of the anti-aliased modules is a strong point, as it provides clear evidence of the improvements made.
The paper includes sufficient implementation details, including model architectures, training configurations, and evaluation metrics, which facilitates reproducibility. However, the absence of a public code repository limits the ease of access for researchers wishing to replicate the results.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of audio synthesis challenges. Additionally, while the proposed models outperform existing systems, the paper does not address potential trade-offs in computational efficiency or real-time performance, which are critical in practical applications.
The proposed methods have significant implications for audio synthesis applications, including speech synthesis, music generation, and sound design. By effectively mitigating aliasing artifacts, the research can enhance audio quality in various domains, potentially influencing both academic research and industry practices.
Many existing audio processing and generation models rely on task-specific architectures, resulting in fragmented development efforts and limited extensibility. It is therefore promising to design a unified framework capable of handling multiple tasks, while providing robust instruction and audio understanding and high-quality audio generation. This requires a compatible paradigm design, a powerful backbone, and a high-fidelity audio reconstruction module. To meet these requirements, this technical report introduces QuarkAudio, a decoder-only autoregressive (AR) LM-based generative framework that unifies multiple tasks. The framework includes a unified discrete audio tokenizer, H-Codec, which incorporates self-supervised learning (SSL) representations into the tokenization and reconstruction process. We further propose several improvements to H-Codec, such as a dynamic frame-rate mechanism and extending the audio sampling rate to 48 kHz. QuarkAudio unifies tasks by using task-specific conditional information as the conditioning sequence of the decoder-only LM, and predicting discrete target audio tokens in an AR manner. The framework supports a wide range of audio processing and generation tasks, including speech restoration (SR), target speaker extraction (TSE), speech separation (SS), voice conversion (VC), and language-queried audio source separation (LASS). In addition, we extend downstream tasks to universal free-form audio editing guided by natural language instructions (including speech semantic editing and audio event editing). Experimental results show that H-Codec achieves high-quality audio reconstruction with a low frame rate, improving both the efficiency and performance of downstream audio generation, and that QuarkAudio delivers competitive or comparable performance to state-of-the-art task-specific or multi-task systems across multiple tasks.
Primary: Alibaba Group
All Institutions: Alibaba Group, Zhejiang University, Tongyi AI Lab
The paper presents QuarkAudio, a unified framework for audio processing and generation that effectively combines multiple tasks into a single architecture, showcasing innovative methodologies and promising experimental results. The comprehensive approach to audio understanding and generation positions it as a significant advancement in the field of machine learning for audio applications.
The paper introduces QuarkAudio, a unified framework for audio processing and generation that employs a decoder-only autoregressive model. The methodology is sound, integrating a novel discrete audio tokenizer (H-Codec) that utilizes self-supervised learning for improved audio representation. The dual-stream codec design, which separates acoustic and semantic features, is innovative and addresses the limitations of existing models. The dynamic frame-rate mechanism and the extension of the audio sampling rate to 48 kHz are notable enhancements that contribute to the framework's robustness and efficiency.
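To make the unified-conditioning paradigm concrete, the sketch below shows one plausible way to serialize a task tag, conditioning tokens, and target audio tokens into a single decoder-only training sequence with the loss restricted to the target span; the token layout, vocabulary offsets, and task IDs are hypothetical, not the released implementation.

```python
import torch
import torch.nn.functional as F

TASK_TAGS = {"TSE": 0, "SS": 1, "VC": 2, "LASS": 3, "SR": 4}    # hypothetical task ids
SEP_ID, COND_OFFSET, TARGET_OFFSET = 5, 100, 5000               # hypothetical vocab layout

def build_example(task, cond_tokens, target_tokens):
    seq = torch.cat([
        torch.tensor([TASK_TAGS[task]]),
        cond_tokens + COND_OFFSET,          # e.g., mixture / reference audio tokens
        torch.tensor([SEP_ID]),
        target_tokens + TARGET_OFFSET,      # discrete target audio tokens
    ])
    labels = seq.clone()
    labels[: len(cond_tokens) + 2] = -100   # ignore loss on the conditioning prefix
    return seq, labels

seq, labels = build_example("TSE", torch.randint(0, 1024, (150,)),
                            torch.randint(0, 1024, (150,)))
# With any decoder-only LM producing logits of shape (T, vocab):
logits = torch.randn(seq.numel(), 10000)
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)   # next-token loss
```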
The experiments are comprehensive, utilizing a diverse set of datasets for training and evaluation across multiple audio tasks. The results demonstrate competitive performance against state-of-the-art models, with detailed metrics provided for various tasks, including speech restoration and audio editing. The use of extensive datasets and the two-stage training strategy further bolster the credibility of the findings.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and evaluation metrics. However, the absence of a clear mention of the code's reproducibility status could hinder independent validation. The URLs provided for the demo and project repository are helpful for those interested in replicating the results.
While the framework shows promise, the paper acknowledges challenges in semantic alignment and the potential for improved performance in speech semantic editing tasks. Additionally, the reliance on large datasets may limit accessibility for smaller research groups.
The QuarkAudio framework has the potential to significantly advance the field of audio processing by providing a unified model capable of handling various tasks, which could streamline development efforts and enhance the quality of audio generation applications. Its implications extend to areas such as speech synthesis, audio editing, and interactive applications, making it a valuable contribution to the field.
The goal of this paper is to provide a new perspective on speech modeling by incorporating perceptual invariances such as amplitude scaling and temporal shifts. Conventional generative formulations often treat each dataset sample as a fixed representative of the target distribution. From a generative standpoint, however, such samples are only one among many perceptually equivalent variants within the true speech distribution. To address this, we propose Linear Projection Conditional Flow Matching (LP-CFM), which models targets as projection-aligned elongated Gaussians along perceptually equivalent variants. We further introduce Vector Calibrated Sampling (VCS) to keep the sampling process aligned with the line-projection path. In neural vocoding experiments across model sizes, data scales, and sampling steps, the proposed approach consistently improves over the conventional optimal transport CFM, with particularly strong gains in low-resource and few-step scenarios. These results highlight the potential of LP-CFM and VCS to provide more robust and perceptually grounded generative modeling of speech.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology
The main contribution of this paper is the introduction of LP-CFM, a novel framework for speech modeling that incorporates perceptual invariances, leading to improved generative performance in various scenarios. This work presents a meaningful step forward in the field of audio machine learning, providing a fresh perspective on how to model speech more effectively.
The methodology introduced in this paper, LP-CFM, is innovative in its approach to modeling speech by leveraging perceptual invariances. The authors propose a novel generative framework that treats speech samples as variants within a broader distribution, effectively addressing limitations in conventional models that do not account for perceptual equivalences. The introduction of Vector Calibrated Sampling (VCS) further enhances the alignment of the sampling process with the proposed model, showcasing a thoughtful integration of theoretical concepts into practical implementation.
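Since the paper positions itself against optimal-transport CFM, the sketch below writes out that conventional baseline objective (linear interpolation path, velocity regression). LP-CFM replaces the fixed target sample with a projection-aligned elongated Gaussian over perceptually equivalent variants; that construction is not reproduced here.

```python
import torch

def ot_cfm_loss(model, x1, cond):
    """x1: target features (B, D); cond: conditioning input (e.g., mel frames)."""
    x0 = torch.randn_like(x1)                   # noise sample
    t = torch.rand(x1.size(0), 1)               # uniform time
    x_t = (1.0 - t) * x0 + t * x1               # straight-line probability path
    v_target = x1 - x0                          # constant velocity along the path
    v_pred = model(x_t, t, cond)
    return ((v_pred - v_target) ** 2).mean()

# Usage with a toy velocity model (a real network would take (x_t, t, cond)):
model = lambda x, t, c: x + t + c
loss = ot_cfm_loss(model, torch.randn(8, 80), torch.zeros(8, 80))
```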
The experimental section is robust, demonstrating the efficacy of LP-CFM across various model sizes, data scales, and sampling steps. The results indicate significant improvements over existing methods, particularly in low-resource scenarios. The thorough evaluation across different conditions strengthens the claims made by the authors and provides a solid foundation for the proposed approach.
While the paper outlines the methodology and results clearly, it lacks detailed implementation specifics that would facilitate reproducibility. The absence of a public code repository or demo URL limits the ability of other researchers to replicate the findings, which is a critical aspect of scientific research.
One limitation of the study is the potential overfitting to specific datasets, as the results are primarily derived from neural vocoding experiments. Additionally, the paper does not extensively address the scalability of the proposed method to larger datasets or more complex speech tasks, which could be a concern for practical applications.
The implications of this research are significant for the field of speech modeling, particularly in applications where perceptual quality is paramount, such as voice synthesis and speech recognition. By addressing perceptual invariances, the proposed methods could lead to advancements in more natural-sounding speech generation and improved performance in low-resource environments.
Large speech generation models are evolving from single-speaker, short-sentence synthesis to multi-speaker, long-form conversation generation. Current long-form speech generation models are predominantly constrained to dyadic, turn-based interactions. To address this, we introduce JoyVoice, a novel anthropomorphic foundation model designed for flexible, boundary-free synthesis of up to eight speakers. Unlike conventional cascaded systems, JoyVoice employs a unified E2E-Transformer-DiT architecture that utilizes autoregressive hidden representations directly as diffusion inputs, enabling holistic end-to-end optimization. We further propose an MM-Tokenizer operating at a low frame rate of 12.5 Hz, which integrates multitask semantic and MMSE losses to effectively model both semantic and acoustic information. Additionally, the model incorporates robust text front-end processing via large-scale data perturbation. Experiments show that JoyVoice achieves state-of-the-art results in multilingual generation (Chinese, English, Japanese, Korean) and zero-shot voice cloning. JoyVoice achieves top-tier results on both the Seed-TTS-Eval benchmark and multi-speaker long-form conversational voice cloning tasks, demonstrating superior audio quality and generalization. It achieves significant improvements in prosodic continuity for long-form speech, rhythm richness in multi-speaker conversations, and paralinguistic naturalness, in addition to superior intelligibility. We encourage readers to listen to the demo at https://jea-speech.github.io/JoyVoice
Primary: SpeechTeam
All Institutions: SpeechTeam
The main contribution of this paper is the introduction of JoyVoice, a novel anthropomorphic multi-speaker conversational synthesis model that significantly enhances the quality and flexibility of long-form speech generation. The comprehensive analysis highlights the innovative methodology, strong experimental validation, and potential for impactful applications in the field of audio synthesis.
The paper introduces JoyVoice, a novel end-to-end Transformer-DiT architecture that innovatively integrates autoregressive hidden representations for diffusion inputs, allowing for a more cohesive synthesis of multi-speaker conversations. The use of a low-bitrate MM-Tokenizer to manage both semantic and acoustic information is particularly noteworthy, as it addresses the challenges of modeling complex conversational dynamics. The methodology is well-structured and demonstrates a clear advancement over traditional cascaded systems, although further elaboration on the training process and hyperparameter tuning would enhance understanding.
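The end-to-end coupling can be pictured as follows: the autoregressive backbone's hidden states, rather than re-quantized tokens, are handed directly to the diffusion decoder as conditioning, so gradients from the diffusion loss reach the LM. The toy sketch below illustrates only this interface; the stand-in Transformer and module sizes are assumptions, not the JoyVoice architecture.

```python
import torch
import torch.nn as nn

class ARToDiffusionBridge(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.ar_lm = nn.TransformerEncoder(layer, num_layers=2)   # stand-in AR backbone
        self.dit_cond_proj = nn.Linear(dim, dim)                  # feeds the DiT conditioning

    def forward(self, token_emb):                                 # (B, T, dim)
        causal = nn.Transformer.generate_square_subsequent_mask(token_emb.size(1))
        hidden = self.ar_lm(token_emb, mask=causal)               # AR hidden representations
        return self.dit_cond_proj(hidden)                         # continuous DiT inputs

cond = ARToDiffusionBridge()(torch.randn(2, 100, 512))            # (B, T, dim) conditioning
```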
The experiments are robust, showcasing JoyVoice's performance across multiple languages and tasks, including multilingual generation and zero-shot voice cloning. The results indicate significant improvements in audio quality, prosodic continuity, and paralinguistic naturalness, which are critical for conversational synthesis. However, the paper could benefit from a more detailed comparison with baseline models and additional metrics to substantiate the claims of superiority.
While the paper provides a demo URL, it lacks detailed implementation specifics, such as code availability or dataset descriptions, which are essential for reproducibility. Clearer guidelines on the training process and data preprocessing would enhance the ability of other researchers to replicate the results.
The paper acknowledges limitations, such as potential overfitting in specific scenarios and the need for extensive computational resources for training. However, a more thorough discussion of the limitations related to the model's scalability and adaptability to diverse conversational contexts would be beneficial.
JoyVoice has significant potential applications in areas such as virtual assistants, gaming, and interactive storytelling, where natural and engaging multi-speaker interactions are crucial. Its advancements could lead to more immersive user experiences and broaden the accessibility of conversational AI technologies.
We introduce Perception Encoder Audiovisual, PE-AV, a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV makes several key contributions to extend representations to audio, and natively support joint embeddings across audio-video, audio-text, and video-text modalities. PE-AV's unified cross-modal embeddings enable novel tasks such as speech retrieval, and set a new state of the art across standard audio and video benchmarks. We unlock this by building a strong audiovisual data engine that synthesizes high-quality captions for O(100M) audio-video pairs, enabling large-scale supervision consistent across modalities. Our audio data includes speech, music, and general sound effects, avoiding single-domain limitations common in prior work. We exploit ten pairwise contrastive objectives, showing that scaling cross-modality and caption-type pairs strengthens alignment and improves zero-shot performance. We further develop PE-A-Frame by fine-tuning PE-AV with frame-level contrastive objectives, enabling fine-grained audio-frame-to-text alignment for tasks such as sound event detection.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of the PE-AV framework, which significantly advances the state of audiovisual perception through innovative contrastive learning techniques and large-scale data synthesis. This work is poised to impact various applications in multimodal machine learning and sets a new benchmark for future research in the field.
The proposed methodology introduces the Perception Encoder Audiovisual (PE-AV) framework, which effectively leverages scaled contrastive learning to create unified cross-modal embeddings. The use of ten pairwise contrastive objectives is innovative, as it enhances the alignment across different modalities (audio, video, text) and allows for novel tasks such as speech retrieval. The integration of a large-scale audiovisual data engine that synthesizes high-quality captions for a vast number of audio-video pairs is a significant methodological advancement, addressing the limitations of prior works that often focused on single-domain data.
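For readers who want a concrete picture, the "pairwise contrastive objectives" can be thought of as a CLIP-style symmetric InfoNCE loss summed over every modality and caption-type pair. The sketch below is a generic illustration under that assumption, not PE-AV's actual loss code; the function names and the temperature value are placeholders.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(a, b, temperature=0.07):
    """CLIP-style symmetric contrastive loss for one pair of modalities.
    a, b: (B, D) embeddings of paired items (e.g. audio clips and their captions)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def multi_pair_loss(embeddings, pairs):
    """Sum the pairwise loss over every modality / caption-type pair, e.g.
    pairs = [("audio", "video"), ("audio", "text"), ("video", "text"), ...]."""
    return sum(symmetric_info_nce(embeddings[x], embeddings[y]) for x, y in pairs)
```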
The experiments conducted demonstrate the effectiveness of PE-AV across various standard audio and video benchmarks, achieving state-of-the-art results. The evaluation of downstream applications, particularly in sound event detection, showcases the practical applicability of the model. However, the paper could benefit from more detailed descriptions of the datasets used and the specific metrics employed for evaluation, as well as comparisons with additional baselines to strengthen the claims of superiority.
The paper lacks sufficient implementation details that would facilitate reproducibility. There is no mention of code availability or specific configurations used during training and evaluation, which are crucial for other researchers to replicate the results. Providing a project URL or a GitHub repository would significantly enhance the reproducibility of the findings.
One limitation noted is the reliance on a large-scale dataset that may not be readily available to all researchers, potentially hindering further exploration of the proposed methods. Additionally, while the approach shows promise, the paper does not address potential biases in the data or the implications of using such a large-scale audiovisual dataset.
The implications of this research are significant, as it opens up new avenues for multimodal learning applications in areas such as human-computer interaction, content retrieval, and automatic captioning systems. The ability to align audio and video representations could lead to advancements in accessibility technologies and enhance the user experience in multimedia applications.
Speech intelligibility assessment is essential for many speech-related applications. However, most objective intelligibility metrics are intrusive, as they require clean reference speech in addition to the degraded or processed signal for evaluation. Furthermore, existing metrics such as STOI are primarily designed for normal-hearing listeners, and their predictive accuracy for hearing-impaired speech intelligibility remains limited. On the other hand, the GESI (Gammachirp Envelope Similarity Index) can be used to estimate intelligibility for hearing-impaired listeners, but it is also intrusive, as it depends on reference signals. This requirement limits its applicability in real-world scenarios. To overcome this limitation, this study proposes DeepGESI, a non-intrusive deep learning-based model capable of accurately and efficiently predicting the speech intelligibility of hearing-impaired listeners without requiring any clean reference speech. Experimental results demonstrate that, under the test conditions of the 2nd Clarity Prediction Challenge (CPC2) dataset, the GESI scores predicted by DeepGESI exhibit a strong correlation with the actual GESI scores. In addition, the proposed model achieves a substantially faster prediction speed compared to conventional methods.
Primary: Faculty of Systems Engineering
All Institutions: Faculty of Systems Engineering
The main contribution of this paper is the introduction of DeepGESI, a non-intrusive deep learning model for predicting speech intelligibility in hearing-impaired listeners, which significantly advances the field by providing a practical solution to a longstanding challenge in speech assessment. The technical contributions, particularly in methodology and experimental validation, position this work as a valuable addition to the literature on speech intelligibility metrics.
The proposed DeepGESI model is innovative in its non-intrusive approach to predicting speech intelligibility for hearing-impaired listeners. The architecture effectively combines acoustic feature extraction through STFT and learnable filterbanks, along with an attention mechanism to capture salient information. The use of Maxout activation functions and Rotary Position Embedding for positional encoding demonstrates a thoughtful adaptation of existing techniques to improve performance in speech intelligibility tasks. However, while the methodology is sound, it heavily relies on the underlying assumptions of the GESI framework, which may limit its generalizability.
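A minimal sketch of a non-intrusive predictor in this spirit is given below, assuming an STFT front end, a small Transformer encoder, attentive pooling, and a Maxout output head. It deliberately omits the paper's learnable filterbank and rotary position embedding, and every hyperparameter is illustrative rather than taken from DeepGESI.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout activation: element-wise max over k linear pieces."""
    def __init__(self, d_in, d_out, k=2):
        super().__init__()
        self.k = k
        self.lin = nn.Linear(d_in, d_out * k)

    def forward(self, x):
        y = self.lin(x)
        return y.view(*y.shape[:-1], -1, self.k).max(dim=-1).values

class NonIntrusivePredictor(nn.Module):
    """Hypothetical non-intrusive predictor: STFT magnitudes -> Transformer
    encoder -> attentive pooling -> scalar intelligibility score."""
    def __init__(self, n_fft=512, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.n_fft = n_fft
        self.proj = nn.Linear(n_fft // 2 + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.attn_pool = nn.Linear(d_model, 1)
        self.head = nn.Sequential(Maxout(d_model, d_model), nn.Linear(d_model, 1))

    def forward(self, wav):                                # wav: (B, T), degraded signal only
        window = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, hop_length=self.n_fft // 2,
                          window=window, return_complex=True).abs()    # (B, F, frames)
        h = self.encoder(self.proj(spec.transpose(1, 2)))  # (B, frames, d_model)
        w = torch.softmax(self.attn_pool(h), dim=1)        # attention weights over frames
        return self.head((w * h).sum(dim=1)).squeeze(-1)   # predicted GESI score
```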
The experiments leverage the CPC2 dataset, which is appropriate for the task and allows for robust evaluation against both seen and unseen datasets. The reported metrics (MSE, LCC, SRCC) indicate strong performance, particularly in terms of correlation with ground truth values. The paper provides sufficient detail on the experimental setup, including the computational resources used, which enhances the credibility of the results. However, the absence of comparisons with a broader range of existing non-intrusive metrics limits the contextual understanding of DeepGESI's performance.
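For clarity, the reported metrics can be computed as follows; this is the standard definition of MSE, LCC, and SRCC (via SciPy), not code from the paper.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def score_predictions(pred, target):
    """MSE, linear correlation (LCC), and Spearman rank correlation (SRCC)
    between predicted and reference GESI scores."""
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    mse = float(np.mean((pred - target) ** 2))
    lcc, _ = pearsonr(pred, target)
    srcc, _ = spearmanr(pred, target)
    return {"MSE": mse, "LCC": float(lcc), "SRCC": float(srcc)}
```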
The paper outlines the architecture and training process in sufficient detail, allowing for potential reproduction of the results. However, the lack of publicly available code or a demo URL is a significant drawback, as it hinders other researchers from validating the findings independently. Including a GitHub repository or similar resource would greatly enhance reproducibility.
The study acknowledges that DeepGESI was not fine-tuned using subjective listening tests, which is a critical aspect for validating intelligibility predictions. Additionally, while the model shows strong performance on the CPC2 dataset, its effectiveness in more diverse or real-world acoustic conditions remains untested. The reliance on a single dataset may also raise concerns about overfitting and generalizability.
The implications of this research are significant, particularly for the development of hearing aids and cochlear implants, as it addresses a critical gap in non-intrusive intelligibility assessment for hearing-impaired individuals. By enabling real-time processing and reducing the need for clean reference speech, DeepGESI could facilitate advancements in speech technology applications, improving accessibility for hearing-impaired listeners in various environments.
Detecting synthetic speech is challenging when labeled data are scarce and recording conditions vary. Existing end-to-end deep models often overfit or fail to generalize, and while kernel methods can remain competitive, their performance heavily depends on the chosen kernel. Here, we show that using a quantum kernel in audio deepfake detection reduces false-positive rates without increasing model size. Quantum feature maps embed data into high-dimensional Hilbert spaces, enabling the use of expressive similarity measures and compact classifiers. Building on this motivation, we compare quantum-kernel SVMs (QSVMs) with classical SVMs using identical mel-spectrogram preprocessing and stratified 5-fold cross-validation across four corpora (ASVspoof 2019 LA, ASVspoof 5 (2024), ADD23, and an In-the-Wild set). QSVMs achieve consistently lower equal-error rates (EER): 0.183 vs. 0.299 on ASVspoof 5 (2024), 0.081 vs. 0.188 on ADD23, 0.346 vs. 0.399 on ASVspoof 2019, and 0.355 vs. 0.413 In-the-Wild. At the EER operating point (where FPR equals FNR), these correspond to absolute false-positive-rate reductions of 0.116 (38.8%), 0.107 (56.9%), 0.053 (13.3%), and 0.058 (14.0%), respectively. We also report the consistency of results across cross-validation folds, along with margin-based measures of class separation, using identical settings for both models. The only modification is the kernel; the features and SVM remain unchanged, no additional trainable parameters are introduced, and the quantum kernel is computed on a conventional computer.
Primary: University of Maryland, Baltimore County
All Institutions: University of Maryland, Baltimore County
The paper presents a novel approach to audio deepfake detection using quantum-kernel SVMs, demonstrating significant improvements in detection accuracy and reliability. The rigorous methodology and comprehensive experimental evaluation contribute valuable insights into the potential of quantum computing in enhancing machine learning models for security applications.
The methodology is robust, employing a controlled "kernel-swap" approach that isolates the effects of the quantum kernel from other variables such as model size and optimization techniques. The use of mel-spectrograms as features is appropriate for audio data, and the application of PCA for dimensionality reduction is well-justified. The experimental design, including stratified 5-fold cross-validation across multiple datasets, strengthens the validity of the findings. However, the reliance on classical simulation for quantum kernel computation may limit the practical applicability of the proposed method in real-time scenarios.
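The kernel-swap design can be summarized in a few lines: identical features, preprocessing, and SVM, with only the kernel replaced. The sketch below is a hedged illustration; quantum_kernel_matrix is a placeholder for a fidelity-based quantum kernel simulated classically (e.g. with Qiskit Machine Learning), and the PCA dimensionality is an assumed value.

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def quantum_kernel_matrix(A, B):
    """Placeholder for K[i, j] = |<phi(a_i)|phi(b_j)>|^2 from a quantum feature
    map, simulated on classical hardware (e.g. via Qiskit Machine Learning)."""
    raise NotImplementedError

def kernel_swap_experiment(X_train, y_train, X_test):
    # Identical preprocessing for both models: standardize, then PCA to a small
    # number of components suitable for the quantum feature map.
    prep = make_pipeline(StandardScaler(), PCA(n_components=8))
    Z_train, Z_test = prep.fit_transform(X_train), prep.transform(X_test)

    # Classical baseline: ordinary SVM with an RBF kernel.
    classical = SVC(kernel="rbf").fit(Z_train, y_train)

    # "Kernel swap": the same SVM, but fed a precomputed quantum kernel matrix.
    qsvm = SVC(kernel="precomputed").fit(quantum_kernel_matrix(Z_train, Z_train), y_train)

    return (classical.decision_function(Z_test),
            qsvm.decision_function(quantum_kernel_matrix(Z_test, Z_train)))
```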
The experiments are well-structured, comparing the performance of QSVM against classical SVMs across four diverse datasets. The reported results demonstrate significant improvements in equal-error rates (EER) and false-positive rates (FPR) for QSVM, indicating its effectiveness in audio deepfake detection. The statistical analysis, including effect sizes, adds rigor to the evaluation. However, the paper could benefit from additional comparisons with modern deep learning architectures to contextualize the performance of QSVM further.
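Since the headline numbers are EER gaps, it may help to spell out how the false-positive-rate reductions follow from them. The snippet below uses a standard EER computation (scikit-learn ROC utilities; the labels and scores are hypothetical), and the comment reproduces the arithmetic behind the reported reductions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the ROC operating point where the false-positive rate equals the
    false-negative rate (1 - TPR)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fpr - fnr))
    return float((fpr[i] + fnr[i]) / 2.0)

# At this operating point FPR == EER, so the reported reductions follow from the
# EER gap, e.g. on ASVspoof 5: 0.299 - 0.183 = 0.116 absolute, and
# 0.116 / 0.299 ≈ 38.8 % relative.
```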
The paper provides detailed implementation details, including the use of specific quantum feature maps and the computational setup. However, the lack of a publicly accessible code repository or demo limits reproducibility. The authors mention using Qiskit Machine Learning, which is a positive aspect, but sharing the code would enhance transparency and facilitate further research.
One limitation is the focus on small datasets, which may not fully represent the challenges faced in larger, more diverse real-world applications. Additionally, the quantum kernel's computation on classical hardware may not reflect the potential advantages of quantum computing in practice. The paper also does not address adversarial robustness or real-time constraints, which are critical for deployment in security-sensitive environments.
The findings have significant implications for the fields of audio forensics and cybersecurity, where reliable detection of deepfakes is crucial. The proposed method could enhance existing detection systems, making them more robust against adversarial attacks and variability in recording conditions. As quantum computing technology advances, the integration of quantum kernels into machine learning pipelines may lead to further breakthroughs in various applications beyond audio deepfake detection.
Target speaker extraction (TSE) aims to isolate a desired speaker's voice from a multi-speaker mixture using auxiliary information such as a reference utterance. Although recent advances in diffusion and flow-matching models have improved TSE performance, these methods typically require multi-step sampling, which limits their practicality in low-latency settings. In this work, we propose MeanFlow-TSE, a one-step generative TSE framework trained with mean-flow objectives, enabling fast and high-quality generation without iterative refinement. Building on the AD-FlowTSE paradigm, our method defines a flow between the background and target source that is governed by the mixing ratio (MR). Experiments on the Libri2Mix corpus show that our approach outperforms existing diffusion- and flow-matching-based TSE models in separation quality and perceptual metrics while requiring only a single inference step. These results demonstrate that mean-flow-guided one-step generation offers an effective and efficient alternative for real-time target speaker extraction. Code is available at https://github.com/rikishimizu/MeanFlow-TSE.
Primary: Xilin
All Institutions: Xilin
The paper presents MeanFlow-TSE, a novel one-step generative framework for target speaker extraction that significantly improves performance and efficiency over existing methods. The comprehensive evaluation of the methodology and results highlights its potential impact on real-time audio applications, marking a meaningful advancement in the field.
The proposed MeanFlow-TSE framework introduces a novel one-step generative approach to target speaker extraction (TSE) by leveraging mean-flow objectives. This method stands out by eliminating the need for iterative refinement, which is common in existing models, thus addressing latency issues in real-time applications. The integration of mixing ratio-aware training and adaptive weight loss strategies demonstrates a thoughtful approach to improving model performance while maintaining computational efficiency. The use of curriculum learning for transitioning training objectives is a sophisticated technique that enhances stability and performance.
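The one-step inference can be pictured with a short sketch. The assumptions here are the reviewer's, not statements from the paper: the flow is taken to run from the background source at t = 0 to the target at t = 1, the mixture is placed on that path at the mixing ratio, and u_net is a hypothetical network predicting the average (mean-flow) velocity over an interval, conditioned on a reference-speaker embedding.

```python
import torch

def one_step_extract(u_net, mixture, ref_emb, mr):
    """Single-step target extraction with a mean-flow network (illustrative
    sketch, not the released code). mixture: (B, T) waveforms or latents,
    ref_emb: (B, D) reference-speaker embedding, mr: (B,) mixing ratios."""
    t = mr                                  # assumed: mixture lies at position mr on the path
    r = torch.ones_like(mr)                 # assumed: t = 1 corresponds to the clean target
    u = u_net(mixture, ref_emb, t, r)       # average velocity over [t, r], shape (B, T)
    return mixture + (r - t).unsqueeze(-1) * u   # one jump: x_tgt ≈ x_mix + (1 - mr) * u
```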
The experiments conducted on the Libri2Mix dataset are comprehensive, comparing MeanFlow-TSE against a variety of state-of-the-art models. The reported results indicate significant improvements in separation quality and perceptual metrics, showcasing the effectiveness of the proposed method. The use of multiple evaluation metrics (SI-SDR, PESQ, ESTOI, etc.) provides a well-rounded assessment of model performance, reinforcing the claims made by the authors. The thorough benchmarking against existing models adds credibility to the findings.
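For reference, SI-SDR, the primary separation metric reported, can be computed with the standard definition below; this is generic metric code, not the paper's evaluation script.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB (standard definition, on zero-mean signals)."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference              # projection of the estimate onto the reference
    noise = estimate - target
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```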
The paper provides sufficient implementation details, including the architecture, training configurations, and optimization strategies, which are essential for reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the results. However, the absence of a demo URL limits immediate accessibility to the model's capabilities.
While the paper presents a robust framework, it does not address potential limitations in handling diverse acoustic environments or the scalability of the model to multi-channel scenarios. Additionally, the reliance on the Libri2Mix dataset may limit the generalizability of the findings to other real-world applications.
The advancements in TSE presented in this work have significant implications for applications in automatic speech recognition, hearing aids, and telecommunications, particularly in noisy environments. The efficiency of the MeanFlow-TSE framework could facilitate real-time applications, making it a valuable contribution to the field of audio processing.