Audio is a fundamental modality for analyzing speech, music, and environmental sounds. Although pretrained audio models have significantly advanced audio understanding, they remain fragile in real-world settings where data distributions shift over time. In this work, we present the first systematic benchmark for audio continual learning (CL) with pretrained models (PTMs), together with a comprehensive analysis of its unique challenges. Unlike in vision, where parameter-efficient fine-tuning (PEFT) has proven effective for CL, directly transferring such strategies to audio leads to poor performance. This stems from a fundamental property of audio backbones: they focus on low-level spectral details rather than structured semantics, causing severe upstream-downstream misalignment. Through extensive empirical study, we identify analytic classifiers with first-session adaptation (FSA) as a promising direction, but also reveal two major limitations: representation saturation in coarse-grained scenarios and representation drift in fine-grained scenarios. To address these challenges, we propose PACE, a novel method that enhances FSA via a regularized analytic classifier and enables multi-session adaptation through adaptive subspace-orthogonal PEFT for improved semantic alignment. In addition, we introduce spectrogram-based boundary-aware perturbations to mitigate representation overlap and improve stability. Experiments on six diverse audio CL benchmarks demonstrate that PACE substantially outperforms state-of-the-art baselines, marking an important step toward robust and scalable audio continual learning with PTMs.
Primary: Tsinghua University
All Institutions: Tsinghua University
The main contribution of this paper is the introduction of PACE, a novel framework for pretrained audio continual learning that effectively addresses the unique challenges posed by audio data distributions. This work significantly advances the field by providing a systematic approach to continual learning in audio, demonstrating state-of-the-art performance across multiple benchmarks while offering a foundation for future research in this area.
The methodology presented in this paper is robust and innovative, addressing the unique challenges of continual learning (CL) in audio contexts, particularly the upstream-downstream misalignment that has hindered previous approaches. The introduction of PACE, which combines improved first-session adaptation (FSA) with multi-session adaptation (MSA) and boundary-aware regularization, is a significant advancement. The paper meticulously details the design choices behind each component, demonstrating a clear understanding of the audio domain's intricacies. The use of analytic classifiers and adaptive subspace-orthogonal PEFT is particularly noteworthy, as it showcases a tailored approach to audio CL that diverges from traditional vision-based methods.
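To make the analytic-classifier component concrete, the sketch below shows the generic ridge-style closed-form fit used for first-session adaptation and the exemplar-free recursive update used in later sessions. The variable names, regularization form, and update rule are standard analytic-learning machinery and are assumptions here, not PACE's exact formulation (which adds its own regularizer and subspace-orthogonal PEFT on top).

```python
import numpy as np

def fit_analytic_classifier(feats, labels, num_classes, gamma=1.0):
    """Closed-form ridge classifier over frozen backbone features (first session).

    feats: (N, D) embeddings, labels: (N,) integer classes, gamma: ridge strength.
    Returns W (D, C) with logits = feats @ W, plus the inverse regularized
    autocorrelation matrix needed for later recursive updates.
    """
    N, D = feats.shape
    Y = np.eye(num_classes)[labels]                      # one-hot targets (N, C)
    R_inv = np.linalg.inv(feats.T @ feats + gamma * np.eye(D))
    W = R_inv @ feats.T @ Y
    return W, R_inv

def update_analytic_classifier(W, R_inv, feats, labels, num_classes):
    """Exemplar-free recursive update for a new session (Woodbury identity).

    If this session introduces new classes, zero-pad W with extra columns first.
    """
    Y = np.eye(num_classes)[labels]
    K = np.linalg.inv(np.eye(len(feats)) + feats @ R_inv @ feats.T)
    R_inv = R_inv - R_inv @ feats.T @ K @ feats @ R_inv   # updated inverse
    W = W + R_inv @ feats.T @ (Y - feats @ W)             # updated weights
    return W, R_inv
```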
The experimental evaluation is thorough, employing six diverse audio CL benchmarks that effectively highlight the strengths and weaknesses of the proposed method. The results consistently demonstrate that PACE outperforms state-of-the-art methods, providing strong empirical evidence for its effectiveness. The ablation studies further reinforce the validity of the proposed components, illustrating how each contributes to the overall performance. However, the paper could benefit from additional comparisons with more recent methods in the audio domain, if available.
The authors commit to releasing their code and benchmarks, which is a positive aspect for reproducibility. The detailed descriptions of the experimental setup, including hyperparameters and dataset configurations, enhance the likelihood that other researchers can replicate the results. However, the absence of a demo or interactive component limits immediate accessibility for broader audiences.
One limitation is the potential for overfitting in fine-grained tasks, as indicated by the authors. The paper also acknowledges that while PACE narrows the gap to joint training, it does not completely eliminate it, suggesting that further improvements could be made. Additionally, the reliance on specific pretrained models may limit the generalizability of the findings across different audio tasks.
The implications of this work are significant, particularly for applications in speech recognition, audio event detection, and environmental sound understanding. By addressing the challenges of continual learning in audio, the proposed methods could enhance the robustness and adaptability of audio models in real-world scenarios, leading to more effective and reliable systems.
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Nanjing University
The paper presents HoliAntiSpoof, a pioneering framework that integrates holistic speech spoofing analysis with ALLMs, significantly advancing the field of audio anti-spoofing. The innovative approach and comprehensive evaluation demonstrate its potential to enhance speech security and understanding of spoofing behaviors.
The paper introduces HoliAntiSpoof, a novel framework that reformulates speech anti-spoofing as a unified text generation task using an audio large language model (ALLM). This approach allows for holistic analysis of spoofing techniques, integrating authenticity classification, spoofing method identification, and semantic influence analysis. The methodology is innovative as it combines traditional signal-level detection with semantic reasoning, addressing a gap in existing research that primarily focuses on binary classification. The introduction of the DailyTalkEdit dataset to support semantic analysis is a significant contribution, allowing for more realistic evaluations of spoofing impacts in conversational contexts.
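As an illustration of the "unified text generation" reformulation, the snippet below sketches how a training example might pair an audio clip with an instruction and a structured textual target covering authenticity, method, affected attributes, and semantic impact. The field names and wording are hypothetical, not the released prompt format.

```python
def build_antispoof_example(audio_path, label):
    """Cast holistic anti-spoofing analysis as instruction-following text
    generation (illustrative; field names and wording are assumptions)."""
    instruction = (
        "Listen to the speech clip and report: (1) bona fide or spoofed, "
        "(2) the likely spoofing method, (3) which speech attributes were "
        "manipulated, and (4) the semantic impact of the manipulation."
    )
    response = (
        f"Authenticity: {label['authenticity']}. "
        f"Method: {label['method']}. "
        f"Affected attributes: {', '.join(label['attributes'])}. "
        f"Semantic impact: {label['semantic_impact']}"
    )
    return {"audio": audio_path, "prompt": instruction, "response": response}
```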
The experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across various settings, including in-domain and out-of-domain evaluations. The authors provide extensive results that validate the effectiveness of their model, particularly in terms of robustness to domain shifts. The use of multiple datasets, including their newly proposed ones, strengthens the experimental design. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The authors have made their data and code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training procedures, which could hinder full reproducibility for other researchers.
One limitation is the reliance on the quality of the datasets, particularly the DailyTalkEdit, which may not cover all possible spoofing scenarios. Additionally, while the model shows promise in generalization, the performance on truly unseen spoofing methods and languages remains to be fully validated. The paper also does not address potential adversarial uses of the methodology, which could be a concern given the nature of the research.
The research has significant implications for speech security, particularly in combating the rising threats posed by speech deepfakes. By providing a more nuanced understanding of spoofing techniques and their semantic impacts, the framework could enhance the development of more robust detection systems. However, there is a risk that the methodologies developed could also be exploited by malicious actors to improve spoofing techniques.
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM, reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.
Primary: Karlsruhe Institute of Technology
All Institutions: Karlsruhe Institute of Technology
The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
The methodology presented in EarResp-ANS is innovative, leveraging LMS-based adaptive noise suppression to enhance the accuracy of respiration rate estimation from in-ear audio signals. The decision to avoid neural networks and audio streaming is commendable, as it addresses energy efficiency and privacy concerns, which are critical in wearable technology. The paper provides a clear description of the signal processing techniques used, although further details on the implementation specifics would enhance understanding.
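Since the central signal-processing idea is LMS-based adaptive noise cancellation, a minimal reference implementation is sketched below; the filter length, step size, and choice of reference signal are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def lms_noise_canceller(primary, reference, num_taps=64, mu=1e-3):
    """LMS adaptive noise cancellation (sketch, not the paper's configuration).

    primary:   in-ear microphone signal (respiration plus leaked ambient noise)
    reference: noise reference (e.g., an outward-facing microphone)
    Returns the error signal, i.e., the noise-suppressed respiration estimate.
    """
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    w = np.zeros(num_taps)
    out = np.zeros_like(primary)
    for n in range(num_taps, len(primary)):
        x = reference[n - num_taps:n][::-1]   # most recent reference samples
        y = w @ x                             # current estimate of the noise leak
        e = primary[n] - y                    # residual: respiration component
        w = w + mu * e * x                    # LMS weight update
        out[n] = e
    return out
```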
The experimental setup is robust, involving 18 participants and testing under various realistic acoustic conditions. The reported results, including a global MAE of 0.84 CPM and improved performance with outlier rejection, demonstrate the system's effectiveness. However, the sample size could be considered limited for broader generalizability, and additional metrics could provide a more comprehensive performance evaluation.
The paper lacks sufficient detail regarding the implementation of the system, which could hinder reproducibility. While the methodology is described, specific parameters, configurations, and the dataset used for training and validation are not thoroughly detailed, making it challenging for other researchers to replicate the study.
One limitation is the relatively small participant pool, which may not capture the variability in respiration rates across different demographics. Additionally, the performance under extreme noise conditions could be further explored, as the current evaluation focuses on a limited range of acoustic environments.
The potential applications of this technology are significant, particularly in health monitoring and wellness, as it allows for unobtrusive and continuous monitoring of a vital sign that is often overlooked. The system's design prioritizes user privacy and energy efficiency, making it suitable for widespread adoption in consumer devices.
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
Primary: Meta
All Institutions: Meta
The main contribution of this paper is the introduction of SiTok, a novel speech tokenizer that utilizes a diffusion autoencoder to achieve high-quality speech representation and reconstruction while maintaining low bit and token rates. This work significantly advances the field of speech processing by addressing key challenges in existing methodologies and providing a robust framework for future research and applications.
The proposed methodology of the Speech Diffusion Tokenizer (SiTok) is innovative, leveraging a diffusion autoencoder to jointly optimize quantization and reconstruction. The introduction of semantic regularization through a CTC decoder is a significant advancement, allowing the model to maintain semantic integrity while achieving high compression rates. The architecture effectively combines the strengths of diffusion models with the need for efficient speech tokenization, addressing the limitations of previous approaches that often relied on heuristic compromises. The design choices, such as the use of mel-spectrograms and the focus on low token rates, are well-justified and align with the objectives of scalable language modeling.
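The semantic regularization idea, supervising the quantized latents with a CTC objective so they retain transcript-level content, can be sketched as a small auxiliary head; the layer names and dimensions below are assumptions, not SiTok's implementation.

```python
import torch
import torch.nn as nn

class SemanticCTCRegularizer(nn.Module):
    """Auxiliary CTC head over tokenizer latents (illustrative sketch)."""
    def __init__(self, latent_dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(latent_dim, vocab_size + 1)   # +1 for the CTC blank
        self.ctc = nn.CTCLoss(blank=vocab_size, zero_infinity=True)

    def forward(self, latents, latent_lens, targets, target_lens):
        # latents: (B, T, D) quantized encoder outputs at the low token rate
        log_probs = self.proj(latents).log_softmax(-1).transpose(0, 1)  # (T, B, V+1)
        return self.ctc(log_probs, targets, latent_lens, target_lens)
```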
The experiments conducted are extensive, utilizing a large dataset of 2 million hours of speech, which enhances the robustness of the findings. The paper provides a comprehensive evaluation across various tasks, including speech reconstruction, emotion recognition, and automatic speech recognition, demonstrating that SiTok outperforms existing baselines significantly. The results are well-presented, with clear metrics for comparison, and the ablation studies effectively highlight the contributions of different components of the model.
The paper includes detailed descriptions of the model architecture, training settings, and evaluation protocols, which are crucial for reproducibility. The authors have made efforts to ensure that their work can be replicated, which is commendable. However, the absence of a publicly available code repository limits the ease of reproducibility for practitioners in the field.
While the proposed model shows promising results, it may still face challenges in real-world applications, such as the potential for overfitting due to the large number of parameters (1.6B) and the reliance on extensive training data. Additionally, the computational efficiency during inference, although improved with shortcut fine-tuning, may still be a concern for deployment in resource-constrained environments. The paper does not address the ethical implications of misuse in generating synthetic speech, which is an important consideration in today's landscape.
The development of SiTok has significant implications for speech technology, particularly in applications such as automatic speech recognition, text-to-speech systems, and conversational agents. By enabling high-fidelity audio reconstruction at low bit rates, this work could enhance accessibility and usability in various domains, including assistive technologies and real-time communication systems. The potential for misuse, such as generating deceptive synthetic speech, highlights the need for responsible deployment and monitoring of such technologies.
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener's viewpoint. The audio encoder then captures the spatial motion and temporal 4D sound source trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
DynFOA presents a significant advancement in the generation of spatial audio for complex acoustic environments. The integration of visual and acoustic processing through a conditional diffusion model marks a notable contribution to the field, addressing critical challenges in immersive audio rendering.
The methodology presented in DynFOA is robust, integrating a multi-modal approach that combines visual processing with audio generation through conditional diffusion. The use of 3D Gaussian Splatting for scene reconstruction is particularly innovative, allowing for a detailed understanding of the environment that enhances acoustic fidelity. The model's architecture, which includes separate encoders for video and audio, effectively captures the complexities of dynamic sound sources in 360-degree videos. However, the reliance on specific datasets and the complexity of the model may limit its applicability in diverse real-world scenarios.
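For readers less familiar with the output format, the snippet below shows the standard first-order ambisonics encoding of a mono source for a given direction (ACN channel order, SN3D normalization). It illustrates only the direct-path spatial cue; the occlusion, reflection, and reverberation modeling that DynFOA adds on top is not shown.

```python
import numpy as np

def encode_foa(mono, azimuth, elevation):
    """Encode a mono signal into first-order ambisonics (ACN/SN3D convention).

    azimuth, elevation: source direction in radians relative to the listener.
    Returns the four channels in ACN order (W, Y, Z, X).
    """
    w = mono                                         # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)   # left-right
    z = mono * np.sin(elevation)                     # up-down
    x = mono * np.cos(azimuth) * np.cos(elevation)   # front-back
    return np.stack([w, y, z, x])
```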
The experimental evaluation is comprehensive, utilizing a well-structured dataset (Dyn360) that includes various acoustic scenarios. The results demonstrate a clear superiority of DynFOA over baseline methods across multiple metrics, including spatial accuracy and acoustic fidelity. The inclusion of both objective metrics and user studies strengthens the findings, providing a balanced view of the model's performance. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the training process. While the methodology is described in depth, the absence of a public repository or demo limits the ability of other researchers to replicate the results.
Key limitations include the model's performance in uncontrolled environments, as the experiments were primarily conducted in indoor settings. Additionally, the approach may not generalize well to different acoustic conditions, such as underwater environments or those with varying material properties. The reliance on specific datasets could also introduce biases that affect the generalizability of the findings.
The potential applications of DynFOA are significant, particularly in the fields of virtual reality, augmented reality, and immersive media. By improving the realism of spatial audio, this work can enhance user experiences in gaming, film, and educational applications. The integration of visual and acoustic modalities could pave the way for more immersive storytelling and interactive experiences.
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.
Primary: unknown
All Institutions: unknown
The paper presents a novel framework for modeling sound propagation using latent embeddings, significantly improving memory efficiency and maintaining perceptual quality in audio rendering. The technical contributions, particularly the integration of Riemannian metric learning, position this work as a meaningful advancement in the field of audio machine learning, with practical applications in immersive environments.
The paper introduces the Reciprocal Latent Fields (RLF) framework, which innovatively utilizes a volumetric grid of trainable latent embeddings to encode and predict acoustic parameters. The methodology emphasizes the importance of acoustic reciprocity by employing symmetric functions in the decoding process. The use of Riemannian metric learning to enhance the accuracy of acoustic phenomena reproduction is a notable advancement over simpler Euclidean models. The approach is well-structured, with clear definitions and justifications for the chosen methods, including the training process and the architecture of the decoders.
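The reciprocity constraint can be illustrated with a decoder that only sees swap-invariant combinations of the source and receiver embeddings, as in the minimal sketch below; this is a stand-in for the general idea and does not include the Riemannian-metric decoders the paper studies.

```python
import torch
import torch.nn as nn

class SymmetricDecoder(nn.Module):
    """Predicts acoustic parameters from source and receiver embeddings via a
    swap-invariant combination, so outputs are identical when the two are
    exchanged (acoustic reciprocity). A minimal sketch with assumed sizes."""
    def __init__(self, latent_dim, num_params, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, z_src, z_rcv):
        # sum and absolute difference are invariant to swapping the two inputs
        feats = torch.cat([z_src + z_rcv, (z_src - z_rcv).abs()], dim=-1)
        return self.mlp(feats)
```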
The experimental validation is robust, featuring a variety of models and configurations tested across two distinct environments (Audio Gym and Wwise Audio Lab). The results demonstrate significant memory efficiency gains while maintaining high fidelity in sound reproduction, as evidenced by both quantitative metrics and qualitative assessments through MUSHRA-like listening tests. The paper provides a thorough analysis of the performance of different models, comparing their accuracy and computational costs effectively.
While the paper details the methodology and experimental setup comprehensively, it lacks explicit URLs for code or data repositories, which could hinder reproducibility. The description of the training data generation and model training processes is clear, but without access to the actual implementation, independent verification of results may be challenging.
The primary limitations identified include the lack of implementation for spatial compression of the latent fields and the restriction to static geometries, which limits the applicability of the RLF framework in dynamic environments. The authors acknowledge these limitations and suggest future work to address them, indicating an awareness of the framework's current constraints.
The RLF framework has significant implications for real-time audio rendering in virtual environments, particularly in gaming and simulation contexts. By reducing memory requirements while maintaining high-quality sound reproduction, this work could enhance user experiences in immersive environments. The potential for extending the framework to other reciprocal quantities also opens avenues for further research and applications beyond acoustics.
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS, which features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of ARCHI-TTS, a novel non-autoregressive text-to-speech model that effectively addresses the challenges of text-speech alignment and computational efficiency through innovative architectural components. The comprehensive analysis of its technical contributions, methodology, and results positions it as a significant advancement in the TTS domain, with potential for impactful applications in various audio synthesis tasks.
The methodology proposed in ARCHI-TTS is innovative, combining a semantic aligner with a flow-matching decoder to address the challenges of text-speech alignment and inference efficiency in TTS systems. The use of a low-token-rate representation derived from a Variational Autoencoder (VAE) is a significant advancement, allowing for a more compact representation of audio data while maintaining quality. The architecture's reliance on a transformer-based semantic aligner to create self-supervised text-aligned semantic representations is a novel approach that enhances the model's ability to generate coherent and contextually relevant speech. The integration of an auxiliary CTC loss to bolster semantic understanding further demonstrates a thoughtful approach to improving the model's performance.
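The encoder-feature reuse that drives the inference speedup amounts to running the condition encoder once and feeding the cached features to every denoising step, roughly as sketched below; the function and argument names are assumptions, not the ARCHI-TTS API.

```python
import torch

@torch.no_grad()
def synthesize(condition_encoder, decoder, text_tokens, latent_shape, num_steps=16):
    """Illustrative flow-matching inference loop with a cached condition."""
    cond = condition_encoder(text_tokens)        # computed a single time
    x = torch.randn(latent_shape)                # start from noise
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = torch.full((latent_shape[0],), step * dt)
        v = decoder(x, t, cond)                  # velocity estimate, reusing cond
        x = x + dt * v                           # Euler step of the flow ODE
    return x
```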
The experimental evaluation is robust, utilizing a large-scale multilingual dataset (100k hours) for training and multiple established benchmarks for testing. The reported results, including a WER of 1.98% on the LibriSpeech-PC test-clean and competitive performance on the SeedTTS test set, indicate that ARCHI-TTS outperforms several state-of-the-art models while using fewer computational resources. The inclusion of ablation studies adds depth to the evaluation, providing insights into the contributions of various architectural components. However, the paper could benefit from more extensive subjective evaluations to further validate the quality of the generated speech.
The paper provides sufficient details regarding the model configuration, training process, and evaluation metrics, which should facilitate reproducibility. The authors mention the use of specific hardware (8 RTX 5090 GPUs) and training duration, which are valuable for replicating the experiments. However, the lack of a direct link to the code repository limits accessibility for other researchers wishing to reproduce the results.
While the proposed model shows promising results, it does exhibit some limitations, such as slightly lagging behind other state-of-the-art models in subjective quality evaluations. The reliance on a specific dataset (Emilia) may also limit the generalizability of the findings. Additionally, the computational efficiency improvements come at the cost of some performance degradation, which may need further exploration.
The advancements presented in ARCHI-TTS have significant implications for the field of TTS and audio synthesis, particularly in enhancing the efficiency and quality of speech generation. The model's ability to perform zero-shot synthesis with high fidelity could lead to broader applications in voice cloning, audiobooks, and interactive voice response systems. As TTS technology continues to evolve, the methodologies introduced in this paper could influence future research directions and commercial applications.
Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.
Primary: University of California
All Institutions: University of California
The main contribution of this paper is the introduction of STACodec, a novel audio codec that integrates semantic information through a unique token assignment mechanism, achieving a balance between acoustic fidelity and semantic capability. This work significantly advances the state-of-the-art in audio codecs by addressing the limitations of existing hybrid models and providing a clear pathway for future research in multimodal audio processing.
The methodology presented in STACodec is innovative, integrating semantic token assignment (STA) into the first layer of residual vector quantization (RVQ-1) to enhance both acoustic fidelity and semantic information in audio codecs. The introduction of the Semantic Pre-Distillation (SPD) module is particularly noteworthy, as it reduces reliance on SSL-based tokenizers and improves inference efficiency. The methodology is well-structured, with clear explanations of the architecture and training objectives, although some equations and references to figures are incomplete in the provided text.
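The semantic token assignment step can be pictured as forcing the first RVQ layer to emit SSL-derived tokens and letting the remaining layers quantize the residual. The sketch below conveys that data flow with hypothetical module names; it is not the released code.

```python
import torch

def encode_with_sta(acoustic_latents, semantic_tokens, semantic_codebook, rvq_rest):
    """Semantic token assignment, sketched under assumed interfaces.

    acoustic_latents: (B, T, D) encoder outputs
    semantic_tokens:  (B, T) SSL-derived token ids assigned to RVQ-1
    semantic_codebook: embedding lookup for RVQ-1 codes
    rvq_rest: callable for RVQ layers 2..N, returning (codes, quantized)
    """
    first = semantic_codebook(semantic_tokens)       # (B, T, D) RVQ-1 embeddings
    residual = acoustic_latents - first              # what the semantic layer misses
    rest_codes, rest_quantized = rvq_rest(residual)  # acoustic residual layers
    return [semantic_tokens] + rest_codes, first + rest_quantized
```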
The experimental evaluation is robust, utilizing a comprehensive dataset (LibriSpeech) and employing multiple metrics (PESQ, STOI, ViSQOL) for audio reconstruction quality, as well as downstream tasks like ASR and intent classification. The results demonstrate that STACodec outperforms existing hybrid codecs, indicating effective integration of semantic information without significant degradation of audio quality. However, the paper could benefit from more detailed statistical analysis of results and comparisons with additional baseline methods.
The paper provides a reasonable level of detail regarding the training configurations, model architectures, and evaluation metrics, which supports reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the findings. However, the absence of specific hyperparameter settings and training procedures might hinder complete reproducibility.
One limitation is the reliance on the LibriSpeech dataset, which may not fully represent the diversity of real-world audio scenarios. Additionally, while the SPD module improves efficiency, it may introduce trade-offs in reconstruction fidelity, which the authors acknowledge but do not explore in depth. The paper could also address potential scalability issues when applying STACodec to larger or more complex datasets.
The proposed STACodec has significant implications for the fields of audio processing and machine learning, particularly in applications involving speech recognition and multimodal language models. By effectively balancing acoustic fidelity and semantic information, STACodec could enhance the performance of various audio-related tasks, making it a valuable contribution to the development of more sophisticated audio codecs.
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of HyperPotter, a hypergraph-based framework for audio deepfake detection that effectively captures high-order interactions, demonstrating substantial improvements over existing methods. This work represents a meaningful advancement in the field of audio deepfake detection, with the potential to influence future research directions and applications.
The proposed HyperPotter framework introduces a novel approach to audio deepfake detection by leveraging hypergraphs to model high-order interactions (HOIs). This is a significant departure from traditional methods that focus primarily on local features or pairwise relations. The use of clustering-based hyperedges with class-aware prototype initialization is innovative and suggests a deeper understanding of the relationships between features. However, the paper could benefit from a more detailed explanation of the hypergraph construction process and the specific clustering techniques employed.
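One plausible reading of the clustering-based hyperedge construction is sketched below: cluster node features, initialize the centers from per-class prototypes when labels are available, and record memberships in an incidence matrix. The k-means choice and hard assignment are simplifying assumptions, since the paper does not spell out these details here.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hyperedges(node_feats, labels=None, num_edges=8):
    """Clustering-based hyperedge construction (illustrative sketch).

    Each cluster becomes one hyperedge; with labels, cluster centers are
    initialized from per-class prototypes ("class-aware" initialization).
    Returns the (N, E) incidence matrix H.
    """
    node_feats = np.asarray(node_feats)
    if labels is not None:
        labels = np.asarray(labels)
        init = np.stack([node_feats[labels == c].mean(0) for c in np.unique(labels)])
        num_edges = len(init)
    else:
        init = "k-means++"
    km = KMeans(n_clusters=num_edges, init=init, n_init=1).fit(node_feats)
    H = np.zeros((len(node_feats), num_edges))
    H[np.arange(len(node_feats)), km.labels_] = 1.0   # hard hyperedge membership
    return H
```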
The experiments are extensive, covering 11 datasets and demonstrating a relative gain of 22.15% over baseline methods, as well as a 13.96% improvement over state-of-the-art methods on challenging cross-domain datasets. This breadth of evaluation is commendable and indicates robust performance across various scenarios. However, the paper lacks a detailed comparison with other recent methodologies in the field, which could provide further context for the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Clear guidelines on how to replicate the experiments, including hyperparameter settings and dataset access, would enhance the paper's impact.
One limitation is the potential complexity of the hypergraph model, which may require significant computational resources and expertise to implement. Additionally, while the results are promising, the paper does not address the scalability of the approach or its performance in real-time applications.
The implications of this research are significant, particularly in the context of increasing audio deepfake threats. The ability to detect sophisticated audio manipulations could enhance security in various applications, including media verification, cybersecurity, and content authenticity. The methodology could also inspire further research into high-order interactions in other domains beyond audio.
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy where the SST learns to compress information in a progressive manner--advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
Primary: Nankai University
All Institutions: Nankai University, Alibaba International Digital Commerce, University of Exeter
Speech-XL presents a significant advancement in long-form speech understanding through its innovative use of Speech Summarization Tokens and curriculum learning strategies. This work not only addresses critical limitations in existing models but also sets the stage for future developments in efficient audio processing methodologies.
The methodology presented in Speech-XL is innovative, particularly with the introduction of the Speech Summarization Token (SST) for compressing long-form audio data. The model effectively addresses the limitations of existing Large Speech Language Models (LSLMs) by leveraging a curriculum learning approach to progressively train the SST for varying compression ratios. This structured training strategy enhances the model's ability to maintain semantic integrity while reducing memory usage. The dual-adapter bridge architecture is also a notable contribution, allowing for effective integration of acoustic and semantic features into the LLM's framework.
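The memory saving comes from keeping only the key/value pairs associated with each interval's Speech Summarization Token and discarding the per-frame entries, roughly as sketched below; the actual cache layout is model-specific and the shapes here are assumptions.

```python
import torch

def compress_interval_kv(keys, values, sst_positions):
    """Retain only SST-position KV pairs for one processed speech interval.

    keys, values: (num_layers, num_heads, seq_len, head_dim) for the interval.
    sst_positions: indices of the Speech Summarization Tokens in that interval.
    """
    idx = torch.as_tensor(sst_positions, dtype=torch.long)
    return keys.index_select(-2, idx), values.index_select(-2, idx)
```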
The experimental setup is robust, utilizing significant datasets like LongSpeech and AudioMarathon to evaluate the model's performance across various tasks. The results indicate that Speech-XL outperforms existing models in several benchmarks, demonstrating its effectiveness in long-form audio understanding. The comparative analysis with upper-bound models and other state-of-the-art systems provides a clear picture of its capabilities, although the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper outlines a clear training process and provides details on the datasets used, model architecture, and training parameters. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should consider releasing the model and training scripts to enhance transparency and allow for independent verification of results.
One limitation is the reliance on a relatively small training dataset for certain tasks, which may affect the generalizability of the model across diverse audio contexts. Additionally, the model's performance in out-of-domain evaluations suggests that it may struggle with audio types not represented in the training data. The authors acknowledge the need for broader training data to fully leverage the SST mechanism's potential.
The advancements in long-form speech understanding have significant implications for various applications, including transcription services, virtual assistants, and accessibility technologies. By improving the efficiency and accuracy of processing long audio sequences, Speech-XL could enhance user experiences in these domains. The work also opens avenues for future research into more sophisticated audio processing techniques that could benefit from the SST framework.
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The paper presents the Audio ControlNet framework, which enhances text-to-audio generation and editing capabilities through lightweight auxiliary networks, achieving state-of-the-art performance with efficient parameter usage. The methodology and results indicate a meaningful contribution to the field of audio generation, with significant implications for creative industries.
The paper introduces the Audio ControlNet framework, which innovatively builds on pre-trained text-to-audio (T2A) models by integrating lightweight auxiliary networks for fine-grained control over audio attributes such as loudness, pitch, and sound events. The two proposed architectures, T2A-ControlNet and T2A-Adapter, are well-structured, with T2A-Adapter demonstrating efficiency through fewer parameters while maintaining high performance. The methodology is sound, leveraging established techniques from the ControlNet paradigm and adapting them to the audio domain, thus showcasing a thoughtful approach to enhancing existing models without extensive retraining.
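The adapter design follows the familiar ControlNet recipe of a small, zero-initialized side network added to a frozen backbone; the sketch below shows that pattern for a frame-aligned control signal, with layer names and sizes as assumptions rather than the T2A-Adapter architecture.

```python
import torch
import torch.nn as nn

class ControlAdapter(nn.Module):
    """Lightweight adapter injecting a control signal (e.g., a loudness or pitch
    contour) into one block of a frozen T2A backbone. The zero-initialized
    output projection makes the adapter a no-op at the start of training."""
    def __init__(self, control_dim, hidden_dim):
        super().__init__()
        self.encode = nn.Sequential(nn.Linear(control_dim, hidden_dim), nn.SiLU())
        self.out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, backbone_hidden, control):
        # control: (B, T, control_dim), aligned with the backbone's hidden states
        return backbone_hidden + self.out(self.encode(control))
```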
The experiments are comprehensive, utilizing the AudioSet-Strong dataset for both training and evaluation, which is appropriate given the task. The results indicate that T2A-Adapter achieves state-of-the-art performance in sound event detection metrics, outperforming existing models while using significantly fewer parameters. The paper includes both objective metrics (F1 scores) and subjective evaluations (MOS), providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with a broader range of baseline models to further validate its claims.
The authors mention plans to release models, code, dataset pipelines, and benchmarks, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to enhance clarity and facilitate replication by other researchers.
The paper acknowledges limitations, such as the computational constraints that prevented exhaustive hyperparameter searches and the focus on a limited set of control conditions. Additionally, the reliance on generalization for multi-condition control at inference time may not be robust across all scenarios. Future work is suggested to explore richer control signals and more comprehensive multi-condition training.
The framework has significant potential applications in sound design, music creation, and video production, where precise audio generation and editing are crucial. The ability to manipulate audio attributes with fine granularity can enhance creative workflows and enable new forms of audio content generation. However, ethical considerations regarding the misuse of generated audio, such as impersonation or disinformation, must be addressed to ensure responsible deployment.
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
Primary: Huawei Noah's Ark Lab
All Institutions: Huawei Noah's Ark Lab
This paper presents a comprehensive investigation into the application of Sparse Autoencoders for interpreting audio models, significantly advancing the understanding of audio representations and their alignment with human cognitive processes. The innovative methodology and rigorous experimental evaluation contribute valuable insights to the field of machine learning in audio processing.
The paper employs Sparse Autoencoders (SAEs) to analyze the activations of Whisper and HuBERT models, providing a systematic approach to feature extraction and interpretability in audio processing. The methodology includes a comprehensive evaluation of feature stability, interpretability, and practical applications, which is a significant advancement in the field. The use of various metrics for validation and the introduction of novel techniques for feature steering and EEG correlation analysis enhance the robustness of the methodology.
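For context, a minimal TopK sparse autoencoder over frozen encoder activations looks like the sketch below; the paper's exact SAE variant, dictionary size, and sparsity mechanism may differ.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    """Minimal TopK sparse autoencoder over frozen model activations (generic
    recipe, not necessarily the paper's configuration)."""
    def __init__(self, act_dim, dict_size, k=32):
        super().__init__()
        self.enc = nn.Linear(act_dim, dict_size)
        self.dec = nn.Linear(dict_size, act_dim)
        self.k = k

    def forward(self, acts):
        h = torch.relu(self.enc(acts))
        topk = torch.topk(h, self.k, dim=-1)                  # keep k largest features
        codes = torch.zeros_like(h).scatter_(-1, topk.indices, topk.values)
        recon = self.dec(codes)
        loss = ((recon - acts) ** 2).mean()                   # reconstruction objective
        return recon, codes, loss
```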
The experiments are well-structured, utilizing a diverse corpus of audio data for training and evaluation. The authors demonstrate the effectiveness of SAEs in capturing semantic and paralinguistic information, with results showing a substantial reduction in false detections when steering Whisper's features. The correlation with EEG activity adds a neuroscientific dimension to the findings, indicating a deeper understanding of audio processing in relation to human cognition.
The paper provides detailed implementation information, including model architectures, training setups, and hyperparameters, which supports reproducibility. The availability of code and checkpoints on GitHub further enhances the potential for other researchers to replicate the study and build upon its findings.
The paper acknowledges limitations in its scope, including a focus on specific classification tasks and the exclusion of larger model variants due to computational constraints. Additionally, the auto-interpretation method's reliance on a captioning model trained primarily on music and sound data may lead to generic interpretations of speech-related features.
The findings have significant implications for audio processing applications, particularly in improving speech recognition systems and understanding human auditory processing. The techniques developed could be applied to various domains, including speech enhancement, emotion recognition, and environmental sound classification, potentially leading to advancements in human-computer interaction and accessibility technologies.
The paper presents a comprehensive investigation into the application of Sparse Autoencoders for interpreting audio models, significantly advancing the understanding of audio representation learning. The methodology and results contribute to the field by enhancing model interpretability and practical utility, particularly in speech recognition tasks.
The paper employs Sparse Autoencoders (SAEs) to analyze the activations of Whisper and HuBERT models, a novel approach in the audio domain. The methodology is comprehensive, involving a multi-faceted evaluation of feature stability, interpretability, and practical utility. The use of distributional similarity metrics and the introduction of a novel steering technique to mitigate hallucinations in speech models are particularly noteworthy. The SAE architecture and training setup are well-detailed, ensuring clarity in the implementation.
The experiments are robust, utilizing a diverse dataset of approximately 2.8k hours of audio, which includes speech, music, and environmental sounds. The evaluation of SAE features across various tasks, including gender identification and emotion recognition, demonstrates the practical applicability of the proposed methods. The results indicate significant improvements in model performance and interpretability, particularly in reducing false positives in speech detection.
The paper provides sufficient details on the training setup, hyperparameters, and evaluation metrics, which enhances reproducibility. The availability of code and checkpoints on GitHub further supports this aspect, allowing other researchers to replicate the experiments and build upon the findings.
The paper acknowledges limitations in the scope of classification tasks evaluated and the focus on base model variants, suggesting that broader applications and larger architectures were not explored due to computational constraints. Additionally, the auto-interpretation method's reliance on a music-trained captioning model limits its effectiveness in capturing fine-grained speech features.
The findings have significant implications for audio processing, particularly in enhancing the interpretability of neural models and improving real-world applications like speech recognition. The correlation of SAE features with human EEG activity opens avenues for interdisciplinary research, linking machine learning with neuroscience.
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often occur in small, localized regions along the temporal dimension of speech. This makes fine-grained frame modeling essential for accurately detecting subtle spoofing cues. In this work, we propose fine-grained frame modeling (FGFM) for MHSA-based speech deepfake detection, where the most informative frames are first selected through a multi-head voting (MHV) module. These selected frames are then refined via a cross-layer refinement (CLR) module to enhance the model's ability to learn subtle spoofing cues. Experimental results demonstrate that our method outperforms the baseline model and achieves Equal Error Rate (EER) of 0.90%, 1.88%, and 6.64% on the LA21, DF21, and ITW datasets, respectively. These consistent improvements across multiple benchmarks highlight the effectiveness of our fine-grained modeling for robust speech deepfake detection.
Primary: Hanoi University of Science and Technology
All Institutions: Hanoi University of Science and Technology, Nanyang Technological University
The paper presents a novel approach to speech deepfake detection through fine-grained frame modeling, significantly improving the ability to capture subtle artifacts. This work is a meaningful contribution to the field of audio processing and machine learning, addressing critical challenges in the detection of synthetic speech.
The proposed methodology introduces a novel fine-grained frame modeling (FGFM) approach that effectively enhances the multi-head self-attention (MHSA) mechanism for speech deepfake detection. The integration of the multi-head voting (MHV) module to select salient frames and the cross-layer refinement (CLR) module to aggregate information across layers is innovative. This dual approach addresses the limitations of conventional MHSA by focusing on localized artifacts, which are critical for detecting subtle spoofing cues. The methodology is well-structured and builds upon existing transformer architectures, demonstrating a clear understanding of the challenges in deepfake detection.
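As a rough illustration of the frame-selection idea, the sketch below implements one plausible reading of multi-head voting: each attention head nominates its most-attended frames, and the frames with the most votes are retained. The voting rule, top-k values, and tensor shapes are assumptions for illustration, not the paper's exact MHV module.

```python
# Hedged sketch of frame selection by voting across MHSA attention heads.
import torch

def multi_head_vote(attn: torch.Tensor, k_per_head: int = 16, k_keep: int = 32):
    """attn: (heads, frames, frames) attention matrix from one MHSA layer."""
    heads, frames, _ = attn.shape
    # Salience of each frame = total attention it receives, per head.
    salience = attn.sum(dim=1)                       # (heads, frames)
    votes = torch.zeros(frames)
    for h in range(heads):
        top = salience[h].topk(k_per_head).indices   # frames nominated by head h
        votes[top] += 1
    return votes.topk(k_keep).indices                # most-voted frames

attn = torch.softmax(torch.randn(8, 200, 200), dim=-1)
selected = multi_head_vote(attn)
```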
The experimental evaluation is robust, utilizing multiple datasets (ASVspoof 2021 LA, DF, and ITW) to validate the effectiveness of the proposed method. The reported Equal Error Rates (EER) indicate significant improvements over baseline models, showcasing the method's effectiveness across diverse conditions. The inclusion of ablation studies further strengthens the evaluation, providing insights into the contributions of individual components of the proposed framework.
The paper provides sufficient detail regarding the experimental setup, including model configurations and training procedures, which supports reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings. Future work should consider making the implementation accessible to enhance reproducibility.
While the proposed method shows promising results, it may still be sensitive to variations in the quality of the input audio, such as background noise or recording conditions. Additionally, the reliance on specific datasets may limit the generalizability of the findings to real-world applications. The paper could benefit from a discussion on how the model performs under such conditions.
The implications of this research are significant, particularly in the context of biometric security and misinformation. As deepfake technology becomes more sophisticated, effective detection methods are crucial for safeguarding against potential abuses in various sectors, including finance and communication. The proposed FGFM approach could contribute to the development of more reliable detection systems, thereby enhancing trust in voice-based interactions.
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for their socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 unsafe and synthetic spoken dialogues in English that consist of 3-10 turns, by having a single dialogue turn including content with one of 8 harmful categories (e.g., violence) and on one of 5 grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs: Qwen2-Audio, Audio Flamingo 3, and MERaLiON as zero-shot judges that output a scalar safety score in [0,1] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges' sensitivity to detecting unsafe content, the specificity in ordering severity levels, and the stability of the score in dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.
Primary: Technion–Israel Institute of Technology
All Institutions: Technion–Israel Institute of Technology, Carnegie Mellon University
The main contribution of this paper is the introduction of a controlled benchmark and systematic study of large audio-language models (LALMs) as automated safety judges for multi-turn spoken dialogues. This work addresses a critical gap in the evaluation of spoken dialogue systems, highlighting the importance of audio-specific cues and transcription fidelity in assessing socially harmful content. The comprehensive analysis of model performance across various configurations provides valuable insights for practitioners in the field.
The methodology presented in this paper is robust and innovative, focusing on the generation of unsafe spoken dialogues and the evaluation of large audio-language models (LALMs) as safety judges. The controlled generation of unsafe dialogue variants, along with the systematic benchmarking of LALMs across different modalities, is a significant contribution to the field. The use of human raters to validate the generated unsafe dialogues and the severity scale adds credibility to the findings. The paper also effectively addresses the challenges of audio-specific cues and transcription errors, which are often overlooked in text-centric assessments.
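A minimal judging harness of the kind described above might look as follows; the prompt wording, the regex-based score parsing, and the turn-stability proxy are illustrative assumptions, and `query_judge` stands in for any LALM or LLM call.

```python
# Illustrative harness for scoring multi-turn dialogues with a zero-shot judge.
import re
from statistics import mean
from typing import Callable, List

PROMPT = ("Rate the following dialogue for socially harmful content "
          "(violence, harassment, hate, ...). Reply with a single number "
          "between 0 (safe) and 1 (severely unsafe).\n\n{dialogue}")

def safety_score(dialogue: str, query_judge: Callable[[str], str]) -> float:
    reply = query_judge(PROMPT.format(dialogue=dialogue))
    match = re.search(r"\d*\.?\d+", reply)
    score = float(match.group()) if match else 0.0
    return min(max(score, 0.0), 1.0)       # clip to [0, 1]

def turn_stability(per_turn_scores: List[float]) -> float:
    # Simple stability proxy: mean absolute change between consecutive turns.
    deltas = [abs(a - b) for a, b in zip(per_turn_scores, per_turn_scores[1:])]
    return mean(deltas) if deltas else 0.0
```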
The experimental evaluation is thorough, with a well-defined dataset of 24,000 dialogues and a clear methodology for assessing the performance of the LALMs. The results reveal important trade-offs between sensitivity, specificity, and stability across different models and modalities. The use of various prompting strategies to optimize performance further demonstrates a comprehensive approach to evaluating the models. However, the paper could benefit from more detailed statistical analysis and comparisons with existing benchmarks in the field.
The paper mentions plans to release the dataset and code, which is crucial for reproducibility. However, specific implementation details, such as the exact configurations used for the LALMs and the human raters' instructions, should be more explicitly stated to facilitate replication of the study. The inclusion of supplementary materials or appendices would enhance reproducibility.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world dialogues. Additionally, the potential for bias in the generated unsafe dialogues and the subjective nature of human ratings could impact the validity of the findings. The paper also acknowledges the risk of misuse of the benchmark data, which is an important ethical consideration.
The findings of this research have significant implications for the development of safer spoken dialogue systems and voice agents. By providing a systematic approach to evaluating harmful content in multi-turn dialogues, the work aims to improve the safety and reliability of voice interfaces. However, the potential for misuse of the generated data and the reliance on automated judges without human oversight could lead to unintended consequences in real-world applications.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this paper is the introduction of UniAudio 2.0, a unified audio language model that leverages a novel tokenization strategy and specialized architecture to achieve strong performance in both understanding and generation tasks. This work represents a meaningful advancement in the field of audio language modeling, addressing key challenges and setting the stage for future research in audio processing and generation.
The paper proposes a novel audio tokenizer, ReasoningCodec, which effectively separates audio representations into reasoning and reconstruction tokens. This dual-token approach allows for higher-level abstractions while maintaining fidelity in audio reconstruction. The architecture's functional layer specialization is a significant methodological advancement, optimizing the processing of audio and text tokens across different transformer layers, which is a departure from the traditional uniform approach. The introduction of auditory sentences as a means to unify task construction is innovative and enhances the model's ability to handle complex audio tasks.
The authors conducted extensive experiments across various speech, sound, and music tasks, demonstrating competitive performance on in-domain evaluations. The model's ability to generalize to unseen tasks in few-shot and zero-shot settings is particularly noteworthy, showcasing its robustness and versatility. However, the paper could benefit from more detailed quantitative results and comparisons with state-of-the-art models to better contextualize its performance.
The authors commit to providing demo, code, and checkpoints, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics and hyperparameter settings that would facilitate full reproducibility by other researchers.
The paper acknowledges potential risks associated with misuse of the technology, such as impersonation and copyright issues. However, it does not delve deeply into the technical limitations of the model itself, such as potential biases in the training data or the scalability of the approach to more complex audio tasks.
The proposed model has significant implications for applications in creative assistance, human-computer interaction, and audio generation. However, the authors rightly caution against potential misuse, emphasizing the need for responsible deployment practices to mitigate risks associated with audio generation technologies.
The main contribution of this work is the development of UniAudio 2.0, a unified audio language model that effectively integrates understanding and generation tasks through innovative tokenization and architecture strategies. This paper represents a meaningful advancement in the field of audio language models, addressing key challenges and providing a robust framework for future research and applications.
The paper introduces a novel audio tokenizer, ReasoningCodec, which effectively separates audio into reasoning and reconstruction tokens, addressing the limitations of existing discrete tokenizers. The proposed unified autoregressive architecture with functional layer specialization enhances the model's ability to process both audio and text, allowing for improved understanding and generation. The introduction of auditory sentences as a method for multi-task training is particularly innovative, as it facilitates the integration of diverse audio tasks without the need for extensive manual task design.
The authors report extensive experiments on a large dataset comprising 100B text tokens and 60B audio tokens, demonstrating competitive performance on various tasks. The few-shot and zero-shot generalization capabilities are particularly noteworthy, indicating the model's robustness and versatility across different audio-related tasks. However, specific metrics and comparisons with baseline models could be more thoroughly detailed to strengthen the claims of performance.
The paper mentions that demo, code, and checkpoints will be made available, which is a positive aspect for reproducibility. However, the absence of a detailed description of the experimental setup, hyperparameters, and model training procedures limits the ease with which others can replicate the results.
The paper acknowledges potential risks associated with audio generation, such as misuse and copyright issues, but it could benefit from a more in-depth discussion of the limitations of the proposed model itself, including any biases in the training data or challenges in the generalization to highly diverse audio tasks.
The implications of this research are significant, as it opens avenues for advanced applications in creative assistance, human-computer interaction, and audio content generation. However, the authors rightly highlight the ethical considerations and potential for misuse, which need to be addressed as the technology develops.
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. In view of this, we in this paper present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To enhance generalization further, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the introduction of URSA-GAN, a unified framework for robust speech adaptation that effectively addresses domain mismatches in ASR and SE through innovative use of dual-embedding architectures and GANs. This work significantly advances the state of the art in speech processing, providing a scalable solution for real-world applications.
The proposed URSA-GAN framework presents a novel approach to address the challenges of domain adaptation in ASR and SE by leveraging a dual-embedding architecture that captures noise and channel characteristics. This method is innovative in its use of generative adversarial networks (GANs) combined with dynamic stochastic perturbation for enhanced robustness. The architecture is well-structured, with a clear delineation of roles for the noise encoder, channel encoder, and generator, which collectively facilitate effective domain adaptation. The introduction of instance-level embeddings and the use of feature-wise linear modulation (FiLM) for conditioning the generator on noise and channel characteristics are particularly noteworthy. However, the complexity of the model may pose challenges in practical applications.
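The FiLM-style conditioning can be sketched as below, where concatenated noise and channel embeddings produce per-channel scales and shifts for the generator's feature maps; all dimensions are illustrative assumptions rather than URSA-GAN's actual configuration.

```python
# Minimal FiLM-conditioning sketch: domain embeddings modulate generator
# feature maps via per-channel scale and shift.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim: int, n_channels: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(cond_dim, 2 * n_channels)

    def forward(self, features: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, time); cond: (batch, cond_dim)
        gamma, beta = self.to_scale_shift(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * features + beta.unsqueeze(-1)

film = FiLM(cond_dim=256, n_channels=64)
feats = torch.randn(2, 64, 400)
noise_emb, channel_emb = torch.randn(2, 128), torch.randn(2, 128)
cond = torch.cat([noise_emb, channel_emb], dim=-1)   # concatenated domain embeddings
out = film(feats, cond)
```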
The experiments conducted are extensive and cover a variety of datasets and scenarios, demonstrating the effectiveness of URSA-GAN in improving ASR and SE performance under mismatched conditions. The results show significant improvements in character error rates and perceptual metrics, validating the framework's robustness. The evaluation metrics used are appropriate, and the comparative analysis against baseline models and previous works strengthens the claims made by the authors. However, the paper could benefit from more detailed ablation studies to further clarify the contributions of individual components.
The paper provides a comprehensive description of the methodology, including the architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ability of other researchers to replicate the experiments fully. Clearer documentation of hyperparameters and training configurations would enhance reproducibility.
One limitation is the reliance on pre-trained models for the noise and channel encoders, which may not generalize well to all domains. Additionally, the model's complexity could hinder its deployment in real-time applications, especially on resource-constrained devices. The performance gap between URSA-GAN and models trained on labeled target-domain data suggests that while the framework is effective, it may still require some labeled data for optimal performance.
The proposed framework has significant implications for real-world applications of ASR and SE, particularly in environments with varying noise and channel conditions. By improving the robustness of these systems, URSA-GAN could enhance user experiences in various domains, including telecommunications, voice assistants, and hearing aids. The approach also opens avenues for further research in domain adaptation techniques across different audio processing tasks.
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PFluxTTS, a hybrid TTS system that effectively combines duration-guided and alignment-free models to improve naturalness and stability in speech synthesis. This work represents a meaningful step forward in addressing key challenges in the field of text-to-speech technology, particularly in cross-lingual applications.
The proposed methodology of PFluxTTS is innovative, combining a dual-decoder architecture that integrates both duration-guided and alignment-free models through inference-time vector-field fusion. This hybrid approach effectively addresses the stability-naturalness trade-off prevalent in existing TTS systems. The use of FLUX-based speech-prompt embeddings for robust cross-lingual voice cloning is a significant advancement, allowing the model to maintain speaker identity across languages without relying on prompt transcripts. Additionally, the integration of a modified PeriodWave vocoder with super-resolution capabilities to synthesize high-quality audio at 48 kHz from low-rate mel features is a noteworthy enhancement.
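A minimal sketch of inference-time vector-field fusion follows, assuming a simple Euler integrator and an equal-weight blend of the two decoders' predicted velocities; the fusion weight, step count, and toy decoders are placeholders rather than PFluxTTS internals.

```python
# Sketch: two flow-matching decoders predict velocities and an Euler
# integrator follows their weighted blend at inference time.
import torch

def fused_sample(x0, decoder_a, decoder_b, cond, steps=32, alpha=0.5):
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        v = alpha * decoder_a(x, t, cond) + (1 - alpha) * decoder_b(x, t, cond)
        x = x + dt * v                      # Euler step along the fused field
    return x

# Toy stand-ins for the duration-guided and alignment-free decoders.
decoder_a = lambda x, t, c: -x
decoder_b = lambda x, t, c: torch.zeros_like(x)
mel = fused_sample(torch.randn(2, 80, 100), decoder_a, decoder_b, cond=None)
```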
The experimental evaluation is comprehensive, utilizing a variety of datasets that reflect real-world challenges in TTS, particularly in cross-lingual scenarios. The authors provide both subjective and objective metrics to assess performance, demonstrating that PFluxTTS outperforms several state-of-the-art systems in terms of naturalness and speaker similarity. The use of statistical significance tests to validate the results adds rigor to the findings. However, the reliance on a limited number of baselines may restrict the generalizability of the conclusions.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to replicate the results fully. The authors could improve reproducibility by providing access to their training data and model checkpoints.
One limitation of the study is the potential overfitting to the specific datasets used for training and evaluation, which may not represent the full diversity of real-world speech. Additionally, while the system shows robustness in challenging conditions, the performance on extremely noisy or low-quality inputs is not thoroughly explored. The authors also note that the model's performance may vary with different languages, which could limit its applicability in multilingual contexts.
The advancements presented in PFluxTTS have significant implications for applications in AI dubbing, virtual assistants, and accessibility technologies. By improving cross-lingual voice cloning and audio quality, the system can enhance user experience in multilingual environments, making technology more inclusive. Furthermore, the research contributes to the ongoing development of high-fidelity TTS systems, which can benefit various industries, including entertainment, education, and customer service.
The main contribution of this paper is the introduction of PACE, a novel framework for pretrained audio continual learning that effectively addresses the unique challenges posed by audio data distributions. This work significantly advances the field by providing a systematic approach to continual learning in audio, demonstrating state-of-the-art performance across multiple benchmarks while offering a foundation for future research in this area.
The methodology presented in this paper is robust and innovative, addressing the unique challenges of continual learning (CL) in audio contexts, particularly the upstream-downstream misalignment that has hindered previous approaches. The introduction of PACE, which combines improved first-session adaptation (FSA) with multi-session adaptation (MSA) and boundary-aware regularization, is a significant advancement. The paper meticulously details the design choices behind each component, demonstrating a clear understanding of the audio domain's intricacies. The use of analytic classifiers and adaptive subspace-orthogonal PEFT is particularly noteworthy, as it showcases a tailored approach to audio CL that diverges from traditional vision-based methods.
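For readers unfamiliar with analytic classifiers, the sketch below shows a basic regularized (ridge-style) closed-form fit over frozen backbone features, accumulated session by session; it is a simplified illustration of the general idea, not PACE's exact estimator, and the feature dimension, class count, and regularization strength are assumptions.

```python
# Minimal ridge-style analytic classifier over frozen backbone features,
# refit in closed form after each session's statistics are accumulated.
import numpy as np

class AnalyticClassifier:
    def __init__(self, feat_dim: int, n_classes: int, reg: float = 1e2):
        self.G = reg * np.eye(feat_dim)           # regularized Gram matrix
        self.C = np.zeros((feat_dim, n_classes))  # feature-label correlation
        self.W = np.zeros((feat_dim, n_classes))

    def fit_session(self, feats: np.ndarray, labels: np.ndarray):
        """feats: (n, feat_dim); labels: (n,) integer class ids."""
        Y = np.eye(self.C.shape[1])[labels]       # one-hot targets
        self.G += feats.T @ feats
        self.C += feats.T @ Y
        self.W = np.linalg.solve(self.G, self.C)  # closed-form ridge refit

    def predict(self, feats: np.ndarray) -> np.ndarray:
        return (feats @ self.W).argmax(axis=1)

clf = AnalyticClassifier(feat_dim=768, n_classes=50)
clf.fit_session(np.random.randn(200, 768), np.random.randint(0, 10, 200))
```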
The experimental evaluation is thorough, employing six diverse audio CL benchmarks that effectively highlight the strengths and weaknesses of the proposed method. The results consistently demonstrate that PACE outperforms state-of-the-art methods, providing strong empirical evidence for its effectiveness. The ablation studies further reinforce the validity of the proposed components, illustrating how each contributes to the overall performance. However, the paper could benefit from additional comparisons with more recent methods in the audio domain, if available.
The authors commit to releasing their code and benchmarks, which is a positive aspect for reproducibility. The detailed descriptions of the experimental setup, including hyperparameters and dataset configurations, enhance the likelihood that other researchers can replicate the results. However, the absence of a demo or interactive component limits immediate accessibility for broader audiences.
One limitation is the potential for overfitting in fine-grained tasks, as indicated by the authors. The paper also acknowledges that while PACE narrows the gap to joint training, it does not completely eliminate it, suggesting that further improvements could be made. Additionally, the reliance on specific pretrained models may limit the generalizability of the findings across different audio tasks.
The implications of this work are significant, particularly for applications in speech recognition, audio event detection, and environmental sound understanding. By addressing the challenges of continual learning in audio, the proposed methods could enhance the robustness and adaptability of audio models in real-world scenarios, leading to more effective and reliable systems.
We propose a data-driven sparse recovery framework for hybrid spherical linear microphone arrays using singular value decomposition (SVD) of the transfer operator. The SVD yields orthogonal microphone and field modes, reducing to spherical harmonics (SH) in the SMA-only case, while incorporating LMAs introduces complementary modes beyond SH. Modal analysis reveals consistent divergence from SH across frequency, confirming the improved spatial selectivity. Experiments in reverberant conditions show reduced energy-map mismatch and angular error across frequency, distance, and source count, outperforming SMA-only and direct concatenation. The results demonstrate that SVD-modal processing provides a principled and unified treatment of hybrid arrays for robust sparse sound-field reconstruction.
Primary: The University of Sydney
All Institutions: The University of Sydney
The main contribution of this paper is the introduction of a unified SVD-modal framework for sparse sound field reconstruction using hybrid microphone arrays, which significantly improves spatial selectivity and robustness in reverberant environments. This work provides a principled approach that advances the state of the art in audio processing and sound field analysis, addressing key limitations of existing methods.
The proposed methodology leverages singular value decomposition (SVD) to derive a unified modal solution for sound field reconstruction using hybrid spherical-linear microphone arrays. This approach is innovative as it generalizes existing spherical harmonic (SH) processing while introducing complementary modes from linear microphone arrays (LMAs). The paper effectively integrates theoretical foundations with practical applications, demonstrating a clear understanding of the challenges posed by reverberant environments and the limitations of previous methods. The modal analysis and the use of a well-conditioned dictionary for sparse recovery are particularly noteworthy, as they provide a robust framework for addressing the underdetermined nature of the problem.
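The modal construction can be illustrated as follows: the hybrid-array transfer matrix is decomposed by SVD, the well-conditioned modes are retained, and measurements are projected into the modal domain before sparse recovery. The transfer matrix here is random stand-in data, and the array size, grid size, and truncation threshold are assumptions.

```python
# Sketch of the SVD-modal idea for a hybrid SMA+LMA array.
import numpy as np

n_mics, n_grid = 36, 512                 # hybrid-array channels, candidate directions
H = np.random.randn(n_mics, n_grid) + 1j * np.random.randn(n_mics, n_grid)

U, s, Vh = np.linalg.svd(H, full_matrices=False)
k = int(np.sum(s > 0.01 * s[0]))         # keep well-conditioned modes
U_k, s_k, Vh_k = U[:, :k], s[:k], Vh[:k]

x = np.random.randn(n_grid) * (np.random.rand(n_grid) < 0.01)  # sparse source field
p = H @ x                                # simulated microphone measurements
b = U_k.conj().T @ p / s_k               # measurements projected into the modal domain
# b now feeds a standard sparse solver (e.g., OMP/LASSO) over the modal dictionary Vh_k.
```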
The experimental evaluation is comprehensive, utilizing simulations in reverberant conditions to assess the performance of the proposed method against baseline techniques such as SMA-only and residue refinement. The metrics employed, including energy map mismatch and angular error, are appropriate for the task and provide a clear indication of the method's effectiveness. The results consistently demonstrate the advantages of the SVD-modal framework, particularly in terms of spatial accuracy and robustness under varying conditions, which strengthens the paper's claims.
The paper lacks specific implementation details that would facilitate reproducibility, such as access to the datasets used for training and testing, or the code for the proposed algorithm. While the methodology is well described, the absence of a project URL or demo limits the ability of other researchers to replicate the findings. Clearer documentation and sharing of resources would enhance reproducibility.
One limitation of the study is the reliance on simulated environments, which may not fully capture the complexities of real-world acoustic conditions. Additionally, the trade-off between energy-map fidelity and localization accuracy when varying the number of modes could be further explored. The paper suggests future work on optimal mode selection, indicating that the current approach may not be universally applicable across all scenarios.
The proposed framework has significant implications for audio processing applications, particularly in environments where accurate sound field reconstruction is critical, such as in virtual reality, augmented reality, and advanced audio capture technologies. By improving the spatial resolution and robustness of sound field reconstruction, this work could enhance user experiences in immersive audio applications and contribute to advancements in spatial audio technologies.
Emotional expression in human speech is nuanced and compositional, often involving multiple, sometimes conflicting, affective cues that may diverge from linguistic content. In contrast, most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. While activation steering via latent direction vectors offers a promising solution, it remains unclear whether emotion representations are linearly steerable in TTS, where steering should be applied within hybrid TTS architectures, and how such complex emotion behaviors should be evaluated. This paper presents the first systematic analysis of activation steering for emotional control in hybrid TTS models, introducing a quantitative, controllable steering framework, and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis. Our results demonstrate, for the first time, that emotional prosody and expressive variability are primarily synthesized by the TTS language module instead of the flow-matching module, and also provide a lightweight steering approach for generating natural, human-like emotional speech.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, Wuhan University, The Hong Kong University of Science and Technology (Guangzhou), The University of Auckland
The paper presents a pioneering approach to emotional TTS through activation steering, significantly advancing the field by enabling composable emotional expression and challenging existing paradigms in TTS architecture. The methodology is innovative, and while the experimental results are promising, further validation and implementation details would strengthen the contributions to the field.
The paper introduces a novel framework for emotional TTS that leverages activation steering via latent direction vectors. This approach is significant as it allows for composable and controllable emotional expression, addressing the limitations of existing TTS systems that typically enforce a single emotion per utterance. The methodology is well-structured, systematically analyzing the linear steerability of emotion representations and proposing a quantitative steering framework. The introduction of multi-rater evaluation protocols is particularly noteworthy, as it enhances the assessment of emotional synthesis quality.
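A minimal sketch of the activation-steering idea follows, assuming a mean-difference direction and a PyTorch forward hook that shifts one layer's output along it; the layer choice and steering scale are illustrative, not the paper's calibrated settings.

```python
# Hedged sketch of activation steering via a latent direction vector.
import torch
import torch.nn as nn

def emotion_direction(h_emotion: torch.Tensor, h_neutral: torch.Tensor) -> torch.Tensor:
    d = h_emotion.mean(dim=0) - h_neutral.mean(dim=0)
    return d / d.norm()

def add_steering_hook(layer: nn.Module, direction: torch.Tensor, scale: float = 4.0):
    def hook(module, inputs, output):
        return output + scale * direction        # shift activations along the direction
    return layer.register_forward_hook(hook)

# Toy demonstration on a stand-in layer.
layer = nn.Linear(16, 16)
d = emotion_direction(torch.randn(100, 16) + 1.0, torch.randn(100, 16))
handle = add_steering_hook(layer, d, scale=2.0)
steered = layer(torch.randn(3, 16))
handle.remove()
```

Composable mixed emotions would then amount to adding a weighted sum of several such directions, with the weights controlling the blend.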
The experiments conducted are robust, demonstrating the effectiveness of the proposed method in generating mixed-emotion synthesis and addressing text-emotion mismatches. The results indicate that emotional prosody is primarily synthesized by the TTS language module, which is a significant finding that challenges previous assumptions about TTS architecture. However, the paper could benefit from more extensive datasets and comparisons with state-of-the-art systems to further validate the claims.
The paper lacks detailed implementation information that would facilitate reproducibility. While the methodology is described, the absence of specific parameters, datasets, and code availability limits the ability of other researchers to replicate the results. Including a supplementary material section with these details would enhance the paper's reproducibility.
One limitation of the study is the potential overfitting to the datasets used for training and evaluation, which may not generalize well to all types of emotional speech. Additionally, the paper does not thoroughly address the computational efficiency of the proposed method, which is crucial for real-time applications.
The implications of this research are significant for various applications, including virtual assistants, gaming, and mental health support systems, where nuanced emotional expression can enhance user experience. The ability to generate human-like emotional speech can lead to more engaging and relatable interactions in AI systems.
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
Primary: Meta
All Institutions: Meta, Institut Polytechnique de Paris
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative approaches. This work significantly advances the field by providing a more coherent and integrated method for audio-visual alignment, with promising applications across multiple domains.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological advancement by reframing visually-guided acoustic highlighting as a generative problem rather than a discriminative one. This shift allows for a more nuanced approach to audio remixing, addressing the inherent ambiguities present in the task. The introduction of a rollout loss to mitigate prediction errors during iterative flow-based generation is a clever solution to the problem of trajectory drift, enhancing the stability of the model. The conditioning module that integrates audio and visual cues is also a noteworthy innovation that enables more effective cross-modal source selection.
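The rollout idea can be sketched as a short differentiable unrolling of the learned vector field with an end-point penalty; the step count, toy field, and loss form are assumptions for illustration, not the paper's exact objective.

```python
# Sketch of a rollout-style penalty for flow matching: unroll a short Euler
# trajectory and penalize the end-point drift from the target sample.
import torch

def rollout_loss(v_theta, x0, x1, cond, steps: int = 4):
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + dt * v_theta(x, t, cond)        # differentiable unrolled integration
    return ((x - x1) ** 2).mean()               # drift penalty at the final step

v_theta = lambda x, t, c: (c - x)               # toy field pulling toward the condition
x0, x1 = torch.randn(8, 128), torch.randn(8, 128)
loss = rollout_loss(v_theta, x0, x1, cond=x1)
```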
The paper provides extensive quantitative and qualitative evaluations, demonstrating that the CFM framework consistently outperforms existing state-of-the-art methods. The experimental design appears robust, utilizing a variety of datasets to validate the effectiveness of the proposed approach. However, specific details regarding the datasets used and the metrics for evaluation could be elaborated upon to strengthen the findings.
The paper lacks detailed implementation specifics that would facilitate reproducibility. While the methodology is described, there are no links to code repositories or supplementary materials that would allow other researchers to replicate the experiments. Providing such resources would significantly enhance the paper's impact and utility in the research community.
One limitation is the potential for the model to overfit to the training data, especially given the complexity of the generative task. Additionally, the paper does not address the computational efficiency of the proposed method, which could be a concern for real-time applications. The reliance on visual cues may also limit the model's applicability in scenarios where visual information is not available or is of low quality.
The implications of this research are substantial, particularly in fields such as multimedia content creation, virtual reality, and assistive technologies for the hearing impaired. By improving the alignment of audio and visual elements, the proposed framework could enhance user experiences in various applications, making it a valuable contribution to the intersection of audio processing and machine learning.
The main contribution of this paper is the introduction of a novel generative framework for visually-guided acoustic highlighting, which effectively addresses the limitations of existing discriminative models. The innovative methodology, combined with promising experimental results, positions this work as a significant advancement in the intersection of audio and visual machine learning.
The proposed Conditional Flow Matching (CFM) framework represents a significant methodological shift from traditional discriminative models to a generative approach for visually-guided acoustic highlighting. The introduction of a rollout loss to mitigate error propagation in iterative flow-based generation is an innovative solution to a common problem in generative modeling. Additionally, the conditioning module that integrates audio and visual cues before vector field regression is a thoughtful enhancement that allows for explicit cross-modal source selection, which is crucial for the task at hand.
The authors conducted extensive quantitative and qualitative evaluations, demonstrating that their method consistently outperforms the previous state-of-the-art discriminative approach. However, the paper would benefit from a more detailed description of the datasets used, including their size, diversity, and relevance to the task. The evaluation metrics employed should also be clearly defined to allow for reproducibility and comparison with future work.
The paper lacks sufficient implementation details that would allow other researchers to reproduce the results. While the methodology is described, specifics regarding hyperparameters, training procedures, and the computational resources used are not provided. Including a supplementary material section with this information or a link to a code repository would significantly enhance reproducibility.
One limitation of the proposed method is its reliance on the quality of the visual input, which may not always be reliable in real-world scenarios. Additionally, the complexity of the model may lead to longer inference times, which could be a drawback for real-time applications. The authors should also address potential overfitting issues, especially given the generative nature of the approach.
The implications of this research extend beyond audio-visual alignment, potentially influencing fields such as multimedia content creation, augmented reality, and assistive technologies for the hearing impaired. By improving the coherence between audio and visual stimuli, this work could enhance user experiences in various applications, making it a valuable contribution to the field.
Respiratory rate (RR) is a key vital sign for clinical assessment and mental well-being, yet it is rarely monitored in everyday life due to the lack of unobtrusive sensing technologies. In-ear audio sensing is promising due to its high social acceptance and the amplification of physiological sounds caused by the occlusion effect; however, existing approaches often fail under real-world noise or rely on computationally expensive models. We present EarResp-ANS, the first system enabling fully on-device, real-time RR estimation on commercial earphones. The system employs LMS-based adaptive noise suppression (ANS) to attenuate ambient noise while preserving respiration-related acoustic components, without requiring neural networks or audio streaming, thereby explicitly addressing the energy and privacy constraints of wearable devices. We evaluate EarResp-ANS in a study with 18 participants under realistic acoustic conditions, including music, cafeteria noise, and white noise up to 80 dB SPL. EarResp-ANS achieves robust performance with a global MAE of 0.84 CPM, reduced to 0.47 CPM via automatic outlier rejection, while operating with less than 2% processor load directly on the earphone.
Primary: Karlsruhe Institute of Technology
All Institutions: Karlsruhe Institute of Technology
The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
The methodology presented in EarResp-ANS is innovative, leveraging LMS-based adaptive noise suppression to enhance the accuracy of respiration rate estimation from in-ear audio signals. The decision to avoid neural networks and audio streaming is commendable, as it addresses energy efficiency and privacy concerns, which are critical in wearable technology. The paper provides a clear description of the signal processing techniques used, although further details on the implementation specifics would enhance understanding.
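For context, the core of an LMS-style adaptive noise canceller can be sketched in a few lines. The normalized LMS variant shown below, the filter order, the step size, and the use of an outer-microphone reference signal are assumptions made for illustration, not the authors' configuration.

```python
import numpy as np

def nlms_cancel(primary, reference, order=64, mu=0.1, eps=1e-8):
    """primary: in-ear mic (respiration + leaked ambient noise);
    reference: outer mic capturing the ambient noise.
    Returns the error signal, i.e. the noise-suppressed in-ear recording."""
    w = np.zeros(order)
    out = np.zeros_like(primary, dtype=float)
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]           # most recent reference samples
        y = np.dot(w, x)                           # estimate of noise reaching the ear canal
        e = primary[n] - y                         # residual: respiration-dominated component
        w += (mu / (np.dot(x, x) + eps)) * e * x   # normalized LMS weight update
        out[n] = e
    return out
```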
The experimental setup is robust, involving 18 participants and testing under various realistic acoustic conditions. The reported results, including a global MAE of 0.84 CPM and improved performance with outlier rejection, demonstrate the system's effectiveness. However, the sample size could be considered limited for broader generalizability, and additional metrics could provide a more comprehensive performance evaluation.
The paper lacks sufficient detail regarding the implementation of the system, which could hinder reproducibility. While the methodology is described, specific parameters, configurations, and the dataset used for training and validation are not thoroughly detailed, making it challenging for other researchers to replicate the study.
One limitation is the relatively small participant pool, which may not capture the variability in respiration rates across different demographics. Additionally, the performance under extreme noise conditions could be further explored, as the current evaluation focuses on a limited range of acoustic environments.
The potential applications of this technology are significant, particularly in health monitoring and wellness, as it allows for unobtrusive and continuous monitoring of a vital sign that is often overlooked. The system's design prioritizes user privacy and energy efficiency, making it suitable for widespread adoption in consumer devices. The main contribution of this paper is the development of EarResp-ANS, a novel system for real-time respiration rate estimation using in-ear audio sensing, which effectively addresses noise interference and energy constraints in wearable devices. This work represents a meaningful advancement in the field of unobtrusive health monitoring technologies, combining innovative signal processing techniques with practical applications in everyday life.
Audio foundation models learn general-purpose audio representations that facilitate a wide range of downstream tasks. While the performance of these models has greatly increased for conventional single-channel, dry audio clips, their success in real-world acoustic environments with reverberation and noise is limited. Furthermore, most audio foundation models ignore the spatial dimension of real-world acoustic environments, ruling out tasks involving sound localization. To address these limitations, we propose GRAM: a general-purpose real-world audio model that employs a multi-channel masked autoencoder to efficiently learn spatial audio representations. We evaluate GRAM and other audio foundation models in a standardized manner on high-quality simulations of naturalistic, spatial acoustic environments as well as on recordings of real-world environments, and release two complementary benchmark task suites: NatHEAR and RealSELD. Our results demonstrate that GRAM outperforms all state-of-the-art self-supervised audio foundation models on NatHEAR and on the clean, single-channel HEAR benchmark, while using only a fraction of the training data. GRAM also shows state-of-the-art localization performance in simulated environments and generalizes efficiently to real-world recordings in RealSELD. Taken together, GRAM presents a significant advance toward robust spatial audio foundation models for real-world environments.
Primary: Donders Institute, Radboud University
All Institutions: Donders Institute, Radboud University, Mortimer B Zuckerman Institute, Columbia University
The paper presents GRAM, a significant advancement in spatial audio representation, demonstrating state-of-the-art performance in real-world environments while addressing the limitations of existing audio foundation models. The comprehensive methodology and rigorous evaluation contribute to its potential impact on the field of machine learning and audio processing.
The paper presents GRAM, a multi-channel masked autoencoder designed to learn spatial audio representations. The methodology is well-structured, employing a novel training pipeline that utilizes high-quality simulations of real-world sound environments. The use of a masked autoencoder to reconstruct spatial audio features is innovative, particularly in the context of audio foundation models, which typically overlook spatial dimensions. The introduction of two benchmark suites, NatHEAR and RealSELD, adds significant value by providing standardized evaluation metrics for audio models in complex environments.
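As an illustration of this style of pretraining objective, the sketch below shows masked-autoencoder reconstruction over flattened multi-channel spectrogram patches; the patching, mask ratio, and the `encoder`/`decoder` interfaces are placeholders and may differ from GRAM's actual design.

```python
import torch

def mae_step(encoder, decoder, patches, mask_ratio=0.75):
    """patches: (B, N, D) flattened patches of a (channels x freq x time) spectrogram.
    `encoder` and `decoder` are placeholder callables; the decoder is assumed to
    take (latent, kept_idx, masked_idx, N) and return reconstructions for the
    masked patches -- a hypothetical interface, not GRAM's."""
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    idx = torch.rand(B, N, device=patches.device).argsort(dim=1)   # random patch order per item
    keep, masked = idx[:, :n_keep], idx[:, n_keep:]
    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                       # encode only the visible patches
    recon = decoder(latent, keep, masked, N)        # fill in mask tokens and decode
    target = torch.gather(patches, 1, masked.unsqueeze(-1).expand(-1, -1, D))
    return torch.mean((recon - target) ** 2)        # reconstruction loss on masked patches only
```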
The experiments are comprehensive, comparing GRAM against state-of-the-art models across various tasks in both simulated and real-world environments. The results demonstrate GRAM's superior performance in sound localization and general-purpose audio representation tasks, achieving state-of-the-art results while requiring less training data. The inclusion of ablation studies further strengthens the evaluation by providing insights into the impact of different model components and training strategies.
The paper provides sufficient details regarding the training process, model architecture, and evaluation metrics, which enhances reproducibility. The authors have made their code and datasets available, which is a positive aspect for the community. However, some specific hyperparameter settings and configurations could be more explicitly detailed to facilitate easier replication of results.
One limitation noted is the inadequate resolution of mel-spectrograms for binaural inputs, which may have impacted localization performance. Additionally, while the model shows promise in real-world applications, its performance in highly complex acoustic environments with significant noise interference remains to be fully explored.
The advancements made by GRAM could significantly impact various applications, including audio-visual scene understanding, robotics, and ambient intelligence systems. By improving the robustness of audio models in real-world environments, this work could enhance user experiences in smart environments and contribute to the development of more sophisticated auditory perception systems. The paper presents GRAM, a significant advancement in spatial audio representation, demonstrating state-of-the-art performance in real-world environments while addressing the limitations of existing audio foundation models. The comprehensive methodology and rigorous evaluation contribute to its potential impact on the field of machine learning and audio processing.
Audio-visual video highlight detection aims to automatically identify the most salient moments in videos by leveraging both visual and auditory cues. However, existing models often underutilize the audio modality, focusing on high-level semantic features while failing to fully leverage the rich, dynamic characteristics of sound. To address this limitation, we propose a novel framework, Dual-Pathway Audio Encoders for Video Highlight Detection (DAViHD). The dual-pathway audio encoder is composed of a semantic pathway for content understanding and a dynamic pathway that captures spectro-temporal dynamics. The semantic pathway extracts high-level information by identifying the content within the audio, such as speech, music, or specific sound events. The dynamic pathway employs a frequency-adaptive mechanism that evolves over time to jointly model spectral and temporal dynamics, enabling it to identify transient acoustic events through salient spectral bands and rapid energy changes. We integrate the novel audio encoder into a full audio-visual framework and achieve new state-of-the-art performance on the large-scale Mr.HiSum benchmark. Our results demonstrate that a sophisticated, dual-faceted audio representation is key to advancing the field of highlight detection.
Primary: Seoul National University
All Institutions: Seoul National University
The main contribution of this work is the introduction of a dual-pathway audio encoder that effectively captures both semantic and dynamic audio features for improved video highlight detection. This innovative approach not only sets a new benchmark in performance but also addresses critical limitations in existing methodologies, paving the way for future research in audio-visual learning.
The proposed methodology, DAViHD, introduces a dual-pathway audio encoder that effectively disentangles audio signals into semantic and dynamic components. This innovative approach allows for a more nuanced understanding of audio features, addressing a significant gap in existing models that often overlook the dynamic characteristics of sound. The use of frequency-adaptive mechanisms and the integration of self-attention in the audio feature fusion process are notable advancements that enhance the model's ability to capture salient moments in videos.
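To illustrate the idea of pairing a semantic branch with a frequency-adaptive dynamic branch, the following sketch shows one plausible realization; the layer sizes, the sigmoid band gate, and the self-attention fusion are illustrative assumptions and should not be read as the published DAViHD architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathwayAudioEncoder(nn.Module):
    """Toy dual-pathway encoder: a semantic branch over pretrained embeddings and a
    dynamic branch with a learned per-band gate over mel-spectrogram differences."""

    def __init__(self, sem_dim=768, n_mels=64, d_model=256):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, d_model)          # projects pretrained semantic features
        self.band_gate = nn.Parameter(torch.zeros(n_mels))   # frequency-adaptive band weighting
        self.dyn_conv = nn.Conv1d(n_mels, d_model, kernel_size=5, padding=2)
        self.fuse = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)

    def forward(self, sem_feats, mel):
        """sem_feats: (B, T, sem_dim) from a pretrained audio model; mel: (B, n_mels, T)."""
        semantic = self.sem_proj(sem_feats)                            # (B, T, d_model)
        delta = F.pad(mel[..., 1:] - mel[..., :-1], (1, 0))            # frame-to-frame energy change
        gated = delta * torch.sigmoid(self.band_gate)[None, :, None]   # emphasize salient bands
        dynamic = self.dyn_conv(gated).transpose(1, 2)                 # (B, T, d_model)
        return self.fuse(torch.cat([semantic, dynamic], dim=1))        # joint self-attention fusion
```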
The experimental setup is robust, utilizing large-scale datasets (TVSum and Mr.HiSum) to validate the proposed model. The results demonstrate significant improvements over baseline models, achieving state-of-the-art performance metrics. The thorough comparison against various existing methods, including both audio-visual and visual-only models, strengthens the credibility of the findings. Additionally, the ablation studies provide clear insights into the contributions of different components of the model.
The paper provides detailed implementation details, including the architecture of the model, training parameters, and the datasets used. This level of transparency is crucial for reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results.
While the paper presents a compelling case for the dual-pathway approach, it does not extensively discuss potential limitations or scenarios where the model may underperform. Additionally, the reliance on pre-trained models for feature extraction could introduce biases from those models, which should be acknowledged.
The advancements in audio-visual highlight detection have significant implications for various applications, including content summarization, video retrieval, and recommendation systems. By improving the understanding of audio dynamics, this research could enhance user experiences in multimedia applications, making it a valuable contribution to the field. The main contribution of this work is the introduction of a dual-pathway audio encoder that effectively captures both semantic and dynamic audio features for improved video highlight detection. This innovative approach not only sets a new benchmark in performance but also addresses critical limitations in existing methodologies, paving the way for future research in audio-visual learning.
Front-end design for speech deepfake detectors primarily falls into two categories. Hand-crafted filterbank features are transparent but are limited in capturing high-level semantic details, often resulting in performance gaps compared to self-supervised (SSL) features. SSL features, in turn, lack interpretability and may overlook fine-grained spectral anomalies. We propose the WST-X series, a novel family of feature extractors that combines the best of both worlds via the wavelet scattering transform (WST), integrating wavelets with nonlinearities analogous to deep convolutional networks. We investigate 1D and 2D WSTs to extract acoustic details and higher-order structural anomalies, respectively. Experimental results on the recent and challenging Deepfake-Eval-2024 dataset indicate that WST-X outperforms existing front-ends by a wide margin. Our analysis reveals that a small averaging scale ($J$), combined with high frequency and directional resolutions ($Q$, $L$), is critical for capturing subtle artifacts. This underscores the value of translation-invariant and deformation-stable features for robust and interpretable speech deepfake detection.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland, Université PSL, Université de Paris, University of Chinese Academy of Sciences, University of Toronto
The WST-X series presents a novel and effective approach to speech deepfake detection by leveraging wavelet scattering transforms and self-supervised learning features. This work significantly advances the field by addressing the critical need for interpretable and robust detection methods in audio forensics.
The paper introduces the WST-X series, a novel approach that effectively combines wavelet scattering transforms with self-supervised learning features for speech deepfake detection. The methodology is well-structured, detailing the theoretical foundations of the wavelet scattering transform and its integration with SSL features. The dual-branch architecture (WST-X1 and WST-X2) is innovative, allowing for both parallel and cascaded processing of features, which enhances the model's ability to capture subtle acoustic artifacts. The careful selection of parameters (J, Q, M for 1D and J, L, M for 2D) demonstrates a thorough understanding of the underlying signal characteristics and their relevance to deepfake detection.
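As a point of reference, wavelet scattering coefficients of the kind discussed here can be extracted with Kymatio as sketched below; the specific $J$, $Q$, and $L$ values, the input length, and the random spectrogram stand-in for the 2D branch are illustrative choices, not necessarily those used for WST-X.

```python
import numpy as np
from kymatio.numpy import Scattering1D, Scattering2D

T = 2 ** 14                                    # ~1 s of 16 kHz audio, padded/truncated
wave = np.random.randn(T).astype(np.float32)   # placeholder waveform

# 1D scattering of the raw waveform: a small averaging scale J with a high Q
# retains fine acoustic detail while staying locally translation invariant.
s1d = Scattering1D(J=6, shape=T, Q=16)
feat_1d = s1d(wave)                            # shape (n_paths, T / 2**6)

# 2D scattering of a spectrogram-like image captures higher-order structural
# patterns; L sets the number of wavelet orientations.
spec = np.abs(np.random.randn(128, 128)).astype(np.float32)  # stand-in for a log-mel spectrogram
s2d = Scattering2D(J=3, shape=spec.shape, L=8)
feat_2d = s2d(spec)                            # shape (n_paths, 128 / 2**3, 128 / 2**3)
```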
The experimental setup is robust, utilizing the challenging Deepfake-Eval-2024 dataset, which is representative of real-world scenarios. The performance metrics chosen (minDCF, EER, F1-score, AUC) are appropriate for evaluating the effectiveness of the proposed methods. The results indicate significant performance improvements over traditional feature extraction methods, showcasing the advantages of the WST-X series in capturing fine-grained spectral anomalies. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods beyond the baseline features mentioned.
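For completeness, the equal error rate reported in such evaluations can be computed from detector scores as in the standard formulation sketched below; this is a generic implementation, not code from the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """labels: 1 for the target (e.g. spoofed) class, 0 otherwise;
    scores: higher means more likely target."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    i = np.nanargmin(np.abs(fnr - fpr))   # operating point where miss and false-alarm rates cross
    return 0.5 * (fpr[i] + fnr[i])
```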
The paper provides sufficient detail on the implementation of the WST-X series, including the choice of libraries (Kymatio, Librosa) and model configurations. However, the lack of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider making the code accessible to facilitate further research and validation.
One limitation is the reliance on the Deepfake-Eval-2024 dataset, which may not encompass all potential variations in deepfake generation techniques. Additionally, while the paper emphasizes interpretability, the complexity of the model may still pose challenges in fully understanding the decision-making process of the classifier. The paper does not address potential overfitting issues that may arise from the high-dimensional feature space.
The proposed WST-X series has significant implications for audio forensics and the detection of deepfake technologies, which are increasingly relevant in today's digital landscape. By improving the interpretability and robustness of speech deepfake detection systems, this work contributes to the ongoing efforts to combat misinformation and ensure the integrity of audio content. The WST-X series presents a novel and effective approach to speech deepfake detection by leveraging wavelet scattering transforms and self-supervised learning features. This work significantly advances the field by addressing the critical need for interpretable and robust detection methods in audio forensics.