Recent advances in speech synthesis and editing have made speech spoofing increasingly realistic and its detection increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Nanjing University
The paper presents HoliAntiSpoof, a pioneering framework that integrates holistic speech spoofing analysis with ALLMs, significantly advancing the field of audio anti-spoofing. The innovative approach and comprehensive evaluation demonstrate its potential to enhance speech security and understanding of spoofing behaviors.
The paper introduces HoliAntiSpoof, a novel framework that reformulates speech anti-spoofing as a unified text generation task using an audio large language model (ALLM). This approach allows for holistic analysis of spoofing techniques, integrating authenticity classification, spoofing method identification, and semantic influence analysis. The methodology is innovative as it combines traditional signal-level detection with semantic reasoning, addressing a gap in existing research that primarily focuses on binary classification. The introduction of the DailyTalkEdit dataset to support semantic analysis is a significant contribution, allowing for more realistic evaluations of spoofing impacts in conversational contexts.
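As a concrete illustration of the text-generation reformulation, the following minimal sketch pairs an audio clip with an instruction and a structured textual target covering authenticity, spoofing method, affected attributes, and semantic influence. The prompt wording, field names, and label vocabulary are illustrative assumptions, not the schema actually used by HoliAntiSpoof.

```python
# Hypothetical example: posing holistic anti-spoofing analysis as text generation.
# The prompt wording, field names, and label vocabulary below are illustrative
# assumptions, not the schema used by HoliAntiSpoof.

def build_example(audio_path: str, label: dict) -> dict:
    """Pair an audio clip with an instruction and a textual analysis target."""
    instruction = (
        "Analyze the recording. State whether it is bona fide or spoofed, "
        "identify the spoofing method if any, list the manipulated speech "
        "attributes, and describe the semantic impact of the edit."
    )
    target = (
        f"Authenticity: {label['authenticity']}. "
        f"Spoofing method: {label['method']}. "
        f"Affected attributes: {', '.join(label['attributes'])}. "
        f"Semantic influence: {label['semantic_influence']}"
    )
    return {"audio": audio_path, "prompt": instruction, "response": target}

example = build_example(
    "clip_0042.wav",
    {
        "authenticity": "spoofed",
        "method": "neural speech editing",
        "attributes": ["lexical content", "prosody"],
        "semantic_influence": "the speaker's refusal is changed into an agreement.",
    },
)
print(example["response"])
```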
The experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across various settings, including in-domain and out-of-domain evaluations. The authors provide extensive results that validate the effectiveness of their model, particularly in terms of robustness to domain shifts. The use of multiple datasets, including their newly proposed ones, strengthens the experimental design. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The authors have made their data and code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training procedures, which could hinder full reproducibility for other researchers.
One limitation is the reliance on the quality of the datasets, particularly the DailyTalkEdit, which may not cover all possible spoofing scenarios. Additionally, while the model shows promise in generalization, the performance on truly unseen spoofing methods and languages remains to be fully validated. The paper also does not address potential adversarial uses of the methodology, which could be a concern given the nature of the research.
The research has significant implications for speech security, particularly in combating the rising threats posed by speech deepfakes. By providing a more nuanced understanding of spoofing techniques and their semantic impacts, the framework could enhance the development of more robust detection systems. However, there is a risk that the methodologies developed could also be exploited by malicious actors to improve spoofing techniques.
Audio is a critical component of multimodal perception, and any truly intelligent system must demonstrate a wide range of auditory capabilities. These capabilities include transcription, classification, retrieval, reasoning, segmentation, clustering, reranking, and reconstruction. Fundamentally, each task involves transforming a raw audio signal into a meaningful 'embedding' - be it a single vector, a sequence of continuous or discrete representations, or another structured form - which then serves as the basis for generating the task's final response. To accelerate progress towards robust machine auditory intelligence, we present the Massive Sound Embedding Benchmark (MSEB): an extensible framework designed to evaluate the auditory components of any multimodal system. In its first release, MSEB offers a comprehensive suite of eight core tasks, with more planned for the future, supported by diverse datasets, including the new, large-scale Simple Voice Questions (SVQ) dataset. Our initial experiments establish clear performance headrooms, highlighting the significant opportunity to improve real-world multimodal experiences where audio is a core signal. We encourage the research community to use MSEB to assess their algorithms and contribute to its growth. The library is publicly hosted on GitHub.
Primary: Google Research
All Institutions: Google Research
The paper presents the Massive Sound Embedding Benchmark (MSEB), a comprehensive framework for evaluating auditory capabilities in multimodal systems. The proposed methodology and initial experiments highlight significant opportunities for improvement in machine auditory intelligence, although further details on implementation and rigorous benchmarking against existing methods would enhance its impact.
The paper introduces the Massive Sound Embedding Benchmark (MSEB), which is a novel framework aimed at evaluating auditory capabilities in multimodal systems. The methodology is well-structured, presenting eight core tasks that cover a wide range of audio processing capabilities. The inclusion of the Simple Voice Questions (SVQ) dataset is a significant addition, as it provides a large-scale resource for benchmarking. The tasks are clearly defined, and the framework is extensible, allowing for future enhancements. However, the paper could benefit from more detailed descriptions of the specific algorithms or techniques used to generate embeddings for each task.
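To give a flavor of the embedding-centric evaluations such a benchmark runs, the sketch below computes recall@k for a toy retrieval task from precomputed embeddings. It is a generic illustration under assumed inputs and does not reflect MSEB's actual API or task definitions.

```python
import numpy as np

# Generic illustration of an embedding-based retrieval metric of the kind an
# auditory benchmark might report; this is not MSEB's API.

def recall_at_k(query_emb: np.ndarray, doc_emb: np.ndarray, k: int = 5) -> float:
    """Fraction of queries whose matching document (same index) is in the top-k
    by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sims = q @ d.T                                    # (num_queries, num_docs)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
audio_embeddings = rng.normal(size=(100, 256))        # stand-in for a model's audio embeddings
text_embeddings = audio_embeddings + 0.1 * rng.normal(size=(100, 256))
print(f"recall@5 = {recall_at_k(audio_embeddings, text_embeddings):.3f}")
```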
The initial experiments reported in the paper establish performance benchmarks across the eight tasks, indicating clear performance headrooms. While the results are promising, the paper lacks detailed quantitative results and comparisons with existing benchmarks, which would strengthen the claims of improvement. Additionally, the experimental setup and metrics used for evaluation are not thoroughly discussed, which raises questions about the robustness of the findings.
The paper mentions that the library is publicly hosted on GitHub, which is a positive aspect for reproducibility. However, there is limited information on the specific implementation details, such as the versions of libraries used, the hardware setup, and the training procedures. This lack of detail could hinder other researchers from effectively reproducing the results.
One limitation is the potential overfitting to the benchmark tasks, as the initial experiments may not fully represent real-world scenarios. Furthermore, the paper does not address the scalability of the framework or how it performs with varying audio qualities and conditions. The reliance on a single dataset (SVQ) for initial experiments may also limit the generalizability of the findings.
The MSEB framework has the potential to significantly impact the field of machine auditory intelligence by providing a standardized way to evaluate and compare different algorithms. This could accelerate advancements in multimodal systems that rely on audio processing, with applications in areas such as human-computer interaction, accessibility technologies, and automated content generation.
Large audio-language models (LALMs) exhibit strong zero-shot capabilities in multiple downstream tasks, such as audio question answering (AQA) and abstract reasoning; however, these models still lag behind specialized models for certain discriminative tasks (e.g., audio classification). Recent studies show that sparse subsets of attention heads within an LALM can serve as strong discriminative feature extractors for downstream tasks such as classification via simple voting schemes. However, these methods assign uniform weights to all selected heads, implicitly assuming that each head contributes equally across all semantic categories. In this work, we propose Class-Conditional Sparse Attention Vectors for Large Audio-Language Models, a few-shot classification method that learns class-dependent importance weights over attention heads. This formulation allows individual heads to specialize in distinct semantic categories and to contribute to ensemble predictions proportionally to their estimated reliability. Experiments on multiple few-shot audio and audiovisual classification benchmarks and tasks demonstrate that our method consistently outperforms state-of-the-art uniform voting-based approaches by up to 14.52%, 1.53%, 8.35% absolute gains for audio classification, audio-visual classification, and spoofing detection respectively.
Primary: MIT-IBM Watson AI Lab
All Institutions: MIT-IBM Watson AI Lab, Tuebingen AI Center
The main contribution of this paper is the introduction of a class-dependent weighting mechanism for attention heads in large audio-language models, which significantly enhances their performance in few-shot classification tasks. This work represents a meaningful advancement in the field of audio processing and machine learning, addressing existing limitations in model performance and paving the way for future research in adaptive attention mechanisms.
The proposed method, Class-Conditional Sparse Attention Vectors, introduces a novel approach to weighting attention heads based on class-specific importance, which is a significant departure from previous methods that treated all heads equally. This class-dependent weighting mechanism allows for more nuanced feature extraction tailored to specific tasks, enhancing the model's performance in few-shot classification scenarios. The methodology is well-structured and builds upon existing frameworks in audio-language processing, demonstrating a clear understanding of the limitations of uniform voting schemes.
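To make the class-dependent weighting concrete, the sketch below assumes each selected attention head yields one feature vector per clip and votes via nearest class mean; per-head, per-class weights are then estimated from support-set accuracy so that reliable heads count more for the classes they handle well. The reliability estimator and nearest-mean voting are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import numpy as np

# Sketch of class-conditional weighting over attention heads. Each head produces
# a feature vector per clip and votes via nearest class mean; a head's weight for
# a class is its support-set accuracy on that class (an illustrative estimator).

def fit_head_weights(support_feats, support_labels, num_classes):
    """support_feats: (num_heads, n_support, dim). Returns class means and
    per-head, per-class reliability weights."""
    num_heads = support_feats.shape[0]
    means = np.stack([
        np.stack([support_feats[h][support_labels == c].mean(axis=0)
                  for c in range(num_classes)])
        for h in range(num_heads)
    ])                                                # (num_heads, num_classes, dim)
    weights = np.zeros((num_heads, num_classes))
    for h in range(num_heads):
        dists = np.linalg.norm(support_feats[h][:, None] - means[h][None], axis=-1)
        preds = dists.argmin(axis=1)
        for c in range(num_classes):
            mask = support_labels == c
            weights[h, c] = (preds[mask] == c).mean() # head h's reliability on class c
    return means, weights

def predict(query_feats, means, weights):
    """query_feats: (num_heads, n_query, dim); weighted soft voting over heads."""
    scores = np.zeros((query_feats.shape[1], weights.shape[1]))
    for h in range(query_feats.shape[0]):
        dists = np.linalg.norm(query_feats[h][:, None] - means[h][None], axis=-1)
        scores += weights[h] * (-dists)               # reliable heads count more per class
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
support = rng.normal(size=(6, 12, 16))                # 6 heads, 12 shots, 16-dim features
labels = np.repeat(np.arange(3), 4)                   # 3 classes, 4 shots each
means, w = fit_head_weights(support, labels, num_classes=3)
print(predict(rng.normal(size=(6, 5, 16)), means, w))
```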
The experiments conducted across various benchmarks for audio classification, audio-visual classification, and spoofing detection are robust. The reported improvements over state-of-the-art methods by notable margins (up to 14.52% in audio classification) indicate that the proposed method is not only effective but also competitive in real-world applications. However, the paper would benefit from a more detailed description of the datasets used and the specific metrics for evaluation to enhance transparency.
The paper lacks sufficient implementation details and code availability, which are critical for reproducibility. While the methodology is sound, without access to the code or a clear description of the experimental setup, it would be challenging for other researchers to replicate the results.
One limitation is the reliance on few-shot learning, which may not generalize well to all audio classification tasks, particularly those requiring extensive training data. Additionally, the paper does not address potential biases in the attention heads or the implications of class imbalance in the datasets used.
The implications of this research are significant for the development of more efficient audio-language models that can be applied in various domains, including accessibility technologies, automated content moderation, and interactive AI systems. By improving the performance of LALMs in discriminative tasks, this work could enhance user experiences in applications such as voice assistants and audio-based search engines.
While deep learning has advanced speech enhancement (SE), effective phase modeling remains challenging, as conventional networks typically operate within a flat Euclidean feature space, which makes it difficult to model the underlying circular topology of phase. To address this, we propose a manifold-aware magnitude-phase dual-stream framework that aligns the phase stream with its intrinsic circular geometry by enforcing a Global Rotation Equivariance (GRE) property. Specifically, we introduce a Magnitude-Phase Interactive Convolutional Module (MPICM) for modulus-based information exchange and a Hybrid-Attention Dual-FFN (HADF) bottleneck for unified feature fusion, both of which are designed to preserve GRE in the phase stream. Comprehensive evaluations are conducted across phase retrieval, denoising, dereverberation, and bandwidth extension tasks to validate the superiority of the proposed method over multiple advanced baselines. Notably, the proposed architecture reduces Phase Distance by over 20% in the phase retrieval task and improves PESQ by more than 0.1 in zero-shot cross-corpus denoising evaluations. The overall superiority is also established in universal SE tasks involving mixed distortions. Qualitative analysis further reveals that the learned phase features exhibit distinct periodic patterns, which are consistent with the intrinsic circular nature of the phase. The source code is available at https://github.com/wangchengzhong/RENet.
Primary: Institute of Acoustics, Chinese Academy of Sciences
All Institutions: Institute of Acoustics, Chinese Academy of Sciences, University of Chinese Academy of Sciences
The paper presents a significant advancement in speech enhancement through a novel phase modeling approach that respects the geometric properties of phase data. The methodology is innovative, and the results demonstrate substantial improvements over existing methods, marking a meaningful contribution to the field of audio processing and machine learning.
The paper introduces a novel manifold-aware framework for phase modeling in speech enhancement, emphasizing Global Rotation Equivariance (GRE) to address the circular topology of phase data. The methodology is well-structured, with two main components: the Magnitude-Phase Interactive Convolutional Module (MPICM) and the Hybrid-Attention Dual-FFN (HADF). These components facilitate effective interaction between magnitude and phase streams while preserving the intrinsic geometric properties of phase. The approach is innovative, as it fundamentally alters how phase information is processed in deep learning architectures, moving away from traditional Euclidean assumptions.
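The following toy check illustrates what the GRE constraint means in practice: a layer whose linear part uses real weights and whose gating depends only on the modulus commutes with a global rotation of the complex phase features. This is a minimal construction for illustration, not the MPICM or HADF modules themselves.

```python
import numpy as np

# Minimal numerical check of Global Rotation Equivariance (GRE) for a toy phase
# layer: a real-weighted linear map gated by the modulus. This is an illustrative
# construction, not the MPICM/HADF modules from the paper.

rng = np.random.default_rng(0)
dim_in, dim_out = 8, 8
W = rng.normal(size=(dim_out, dim_in))        # real weights act identically on Re and Im
W_gate = rng.normal(size=(dim_out, dim_in))

def phase_layer(z: np.ndarray) -> np.ndarray:
    """z: complex phase features. The gate depends only on |z|, so a global
    rotation of the input rotates the output by the same angle."""
    gate = np.tanh(W_gate @ np.abs(z))        # rotation-invariant gate
    return gate * (W @ z)                     # rotation-equivariant linear part

z = np.exp(1j * rng.uniform(0, 2 * np.pi, size=dim_in))   # unit-modulus phase vector
theta = 0.7
lhs = phase_layer(np.exp(1j * theta) * z)     # rotate input, then apply layer
rhs = np.exp(1j * theta) * phase_layer(z)     # apply layer, then rotate output
print(np.max(np.abs(lhs - rhs)))              # ~1e-16: GRE holds for this layer
```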
The authors conduct extensive experiments across various tasks, including phase retrieval, denoising, dereverberation, and bandwidth extension. They use established datasets like VoiceBank+DEMAND and DNS Challenge 2020, demonstrating the effectiveness of their method against multiple strong baselines. The results indicate significant improvements in phase modeling accuracy and perceptual quality metrics, showcasing the robustness of the proposed architecture in diverse acoustic conditions. However, the paper could benefit from more detailed comparisons with a wider range of state-of-the-art methods.
The paper provides a clear description of the proposed architecture and the experimental setup, including datasets and training configurations. The availability of the source code on GitHub enhances reproducibility, allowing other researchers to validate and build upon the work. However, specific hyperparameter settings and training details could be elaborated further to facilitate easier replication of results.
While the proposed method shows promising results, the paper does not address potential limitations such as the computational complexity of the model and its scalability to larger datasets. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other speech enhancement scenarios.
The proposed framework has significant implications for various applications in telecommunications, smart devices, and hearing aids, where effective speech enhancement is crucial. By improving phase modeling, the method could lead to advancements in real-time speech processing systems, enhancing user experience in noisy environments.
Time-frequency domain dual-path models have demonstrated strong performance and are widely used in source separation. Because their computational cost grows with the number of frequency bins, these models often use the band-split (BS) module in high-sampling-rate tasks such as music source separation (MSS) and cinematic audio source separation (CASS). The BS encoder compresses frequency information by encoding features for each predefined subband. It achieves effective compression by introducing an inductive bias that places greater emphasis on low-frequency parts. Despite its success, the BS module has two inherent limitations: (i) it is not input-adaptive, preventing the use of input-dependent information, and (ii) the parameter count is large, since each subband requires a dedicated module. To address these issues, we propose Spectral Feature Compression (SFC). SFC compresses the input using a single sequence modeling module, making it both input-adaptive and parameter-efficient. We investigate two variants of SFC, one based on cross-attention and the other on Mamba, and introduce inductive biases inspired by the BS module to make them suitable for frequency information compression. Experiments on MSS and CASS tasks demonstrate that the SFC module consistently outperforms the BS module across different separator sizes and compression ratios. We also provide an analysis showing that SFC adaptively captures frequency patterns from the input.
Primary: National Institute of Advanced Industrial Science and Technology (AIST)
All Institutions: National Institute of Advanced Industrial Science and Technology (AIST), Waseda University
The main contribution of this paper is the introduction of the Spectral Feature Compression module, which provides a novel, input-adaptive, and parameter-efficient approach to spectral feature compression for source separation tasks. This work represents a meaningful advancement in the field of audio processing, addressing key limitations of existing methods and demonstrating strong empirical results.
The paper introduces a novel approach to spectral feature compression through the Spectral Feature Compression (SFC) module, which utilizes sequence modeling techniques to create an input-adaptive and parameter-efficient method for source separation. The methodology is well-structured, addressing the limitations of the traditional band-split (BS) module by incorporating inductive biases and demonstrating the effectiveness of two variants based on cross-attention and Mamba. The approach is innovative in its attempt to adaptively capture frequency patterns, which is a significant advancement over previous methods.
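A minimal sketch of the cross-attention variant is given below: a small set of learned query tokens attends over all frequency-bin features, yielding an input-adaptive, parameter-shared compression in place of per-subband modules. The dimensions and the omission of the paper's band-inspired inductive biases are simplifying assumptions.

```python
import torch
import torch.nn as nn

# Sketch of cross-attention-based spectral feature compression: F per-frequency
# features are compressed to K learned query tokens by a single shared module.
# This simplified version omits the band-inspired inductive biases the paper adds.

class CrossAttnSpectralCompressor(nn.Module):
    def __init__(self, dim: int, num_tokens: int, num_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, freq_feats: torch.Tensor) -> torch.Tensor:
        """freq_feats: (batch, num_freq, dim) -> (batch, num_tokens, dim)."""
        q = self.queries.unsqueeze(0).expand(freq_feats.size(0), -1, -1)
        out, _ = self.attn(q, freq_feats, freq_feats)   # queries attend over frequency bins
        return self.norm(out)                           # input-adaptive, parameter-efficient

x = torch.randn(2, 481, 64)            # e.g., 481 frequency bins with 64-dim features
compressor = CrossAttnSpectralCompressor(dim=64, num_tokens=60)
print(compressor(x).shape)             # torch.Size([2, 60, 64])
```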
The experiments are comprehensive, evaluating the proposed SFC module against the BS module across various tasks, including music source separation (MSS) and cinematic audio source separation (CASS). The results consistently show that SFC outperforms BS across different separator sizes and compression ratios, indicating a robust experimental design. However, details on the datasets used and the specific metrics for evaluation could be elaborated further to enhance clarity.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the experimental setup. While the methodology is sound, the absence of a project URL or demo could hinder other researchers from replicating the results.
One limitation is the reliance on inductive biases inspired by the BS module, which may not generalize well to all types of audio signals. Additionally, while the SFC module shows promise, its performance in real-world scenarios beyond the tested datasets remains unverified.
The proposed method has significant implications for audio processing applications, particularly in enhancing the quality of source separation in music and cinematic audio. The input-adaptive nature of the SFC module could lead to more efficient and effective audio processing systems, potentially influencing both academic research and industry practices.
Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University
The paper presents ProtoDisent-TTS, a prototype-based disentanglement TTS framework that effectively synthesizes dysarthric speech while preserving speaker identity. The innovative methodology and promising experimental results position this work as a valuable contribution to the field of speech synthesis and assistive technologies.
The proposed ProtoDisent-TTS framework introduces a novel approach to disentangling speaker identity from dysarthric articulation by utilizing a prototype-based codebook and a dual-classifier objective. This method is innovative as it combines elements of text-to-speech synthesis with a clear focus on pathology, allowing for controlled speech generation. The use of a gradient reversal layer to enforce invariance of speaker embeddings to dysarthric attributes is particularly noteworthy, as it addresses a significant challenge in the field of speech synthesis.
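The gradient reversal mechanism itself is standard and can be sketched as follows: a pathology classifier is attached to the speaker embedding through a layer that flips gradients during backpropagation, so the encoder is pushed to remove pathology cues while a second classifier keeps the embedding informative about speaker identity. Layer sizes and loss weighting below are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Standard gradient reversal layer (GRL), as used in domain-adversarial training.
# The wiring below (a pathology classifier on the speaker embedding whose gradient
# is reversed into the encoder) follows the paper's description at a high level;
# layer sizes and the loss weighting are illustrative assumptions.

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # flip gradient sign on the way back

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

speaker_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 192))
speaker_head = nn.Linear(192, 10)      # predicts speaker identity (kept informative)
pathology_head = nn.Linear(192, 2)     # predicts healthy vs. dysarthric (made uninformative)

feats = torch.randn(4, 80)
spk_emb = speaker_encoder(feats)
loss = nn.functional.cross_entropy(speaker_head(spk_emb), torch.randint(0, 10, (4,))) \
     + nn.functional.cross_entropy(pathology_head(grad_reverse(spk_emb)),
                                    torch.randint(0, 2, (4,)))
loss.backward()                         # encoder is pushed to remove pathology cues
```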
The experiments conducted on the TORGO dataset are well-structured and demonstrate the effectiveness of the proposed framework. The results show consistent improvements in ASR performance and speaker identity preservation, validating the utility of synthetic data generated by ProtoDisent-TTS. However, the paper could benefit from more extensive comparisons with existing state-of-the-art methods to better contextualize the results.
The implementation details provided are thorough, including specifics on the architecture, training procedures, and hyperparameters. However, the absence of a publicly accessible code repository limits the reproducibility of the results. The authors mention using a pre-trained Index-TTS model, but it would be beneficial to provide access to this model or detailed instructions for replication.
One limitation of the study is the reliance on a relatively small dataset (TORGO), which may affect the generalizability of the findings. Additionally, while the framework shows promise for dysarthric speech synthesis, its performance in real-world applications and with diverse speaker populations remains to be evaluated.
The work has significant implications for assistive speech technologies, particularly for individuals with dysarthria. By enabling controllable and interpretable speech synthesis, the framework could enhance communication for those affected by speech disorders. This research could also inspire further studies in related areas, such as voice conversion and personalized speech synthesis.
While existing Singing Voice Synthesis systems achieve high-fidelity solo performances, they are constrained by global timbre control, failing to address dynamic multi-singer arrangement and vocal texture within a single song. To address this, we propose Tutti, a unified framework designed for structured multi-singer generation. Specifically, we introduce a Structure-Aware Singer Prompt to enable flexible singer scheduling evolving with musical structure, and propose Complementary Texture Learning via Condition-Guided VAE to capture implicit acoustic textures (e.g., spatial reverberation and spectral fusion) that are complementary to explicit controls. Experiments demonstrate that Tutti excels in precise multi-singer scheduling and significantly enhances the acoustic realism of choral generation, offering a novel paradigm for complex multi-singer arrangement. Audio samples are available at https://annoauth123-ctrl.github.io/Tutii_Demo/.
Primary: Wuhan University of Technology
All Institutions: Wuhan University of Technology, Tencent Inc.
The paper presents Tutti, a novel framework for dynamic multi-singer synthesis that significantly enhances the acoustic realism and artistic cohesion of choral generation. The innovative methodology and comprehensive experimental validation position this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in this paper is robust and innovative, introducing the Tutti framework for multi-singer synthesis. The Structure-Aware Singer Prompt and the Complementary Texture Learning via Condition-Guided VAE are significant contributions that address the limitations of existing Singing Voice Synthesis (SVS) systems. The integration of these components allows for dynamic scheduling of singers and captures complex vocal textures, which are crucial for realistic multi-singer arrangements. The use of a Latent Diffusion Transformer (DiT) backbone enhances the model's ability to manage long musical sequences effectively.
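A minimal sketch of the condition-guided VAE idea follows: the posterior encodes the acoustic feature together with the explicit condition while the prior is predicted from the condition alone, so the KL term steers the latent toward texture information that the explicit controls do not already explain. The tiny linear networks and Gaussian parameterization are illustrative assumptions, not Tutti's architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of a condition-guided VAE: the posterior sees both the audio
# feature and the explicit condition, the prior sees only the condition, so the
# latent is encouraged to carry complementary texture information. Dimensions and
# the Gaussian parameterization are illustrative, not Tutti's actual modules.

class ConditionGuidedVAE(nn.Module):
    def __init__(self, feat_dim=128, cond_dim=32, latent_dim=16):
        super().__init__()
        self.posterior = nn.Linear(feat_dim + cond_dim, 2 * latent_dim)
        self.prior = nn.Linear(cond_dim, 2 * latent_dim)
        self.decoder = nn.Linear(latent_dim + cond_dim, feat_dim)

    def forward(self, x, cond):
        mu_q, logvar_q = self.posterior(torch.cat([x, cond], -1)).chunk(2, -1)
        mu_p, logvar_p = self.prior(cond).chunk(2, -1)
        z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * logvar_q)
        recon = self.decoder(torch.cat([z, cond], -1))
        # KL(q || p) between two diagonal Gaussians
        kl = 0.5 * (logvar_p - logvar_q
                    + (torch.exp(logvar_q) + (mu_q - mu_p) ** 2) / torch.exp(logvar_p)
                    - 1).sum(-1)
        return nn.functional.mse_loss(recon, x) + kl.mean()

model = ConditionGuidedVAE()
loss = model(torch.randn(8, 128), torch.randn(8, 32))
loss.backward()
```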
The experimental setup is comprehensive, utilizing a large dataset for training and rigorous evaluation metrics, including both objective and subjective assessments. The results demonstrate significant improvements in multi-singer scheduling and acoustic realism compared to existing models. The ablation studies effectively highlight the contributions of each component of the proposed framework, reinforcing the importance of the adaptive fuser and texture learning in achieving high-quality synthesis.
The paper provides detailed implementation and training configurations, including model architecture, training parameters, and evaluation protocols. This level of detail supports reproducibility, allowing other researchers to replicate the experiments. However, the lack of a publicly available code repository limits accessibility for broader validation and experimentation.
The paper acknowledges limitations, such as the assumption that verse sections contain only a single singer, which may not reflect real-world scenarios. Additionally, the model's performance in melodicity and emotional expressiveness is noted as an area for improvement. These limitations suggest that while the framework is innovative, it may require further refinement to handle more complex musical arrangements.
The Tutti framework has the potential to significantly impact the field of music generation and synthesis, particularly in applications involving choral music and multi-singer arrangements. By enhancing the realism and expressiveness of synthesized singing voices, this research could facilitate advancements in music production, virtual performances, and interactive music applications. The implications extend to creative industries, education, and entertainment, where realistic vocal synthesis can enhance user experiences.
Current audio formats present a fundamental trade-off between file size and functionality: lossless formats like FLAC preserve quality but lack adaptability, while lossy formats reduce size at the cost of fidelity and offer no stem-level access. We introduce the Stem-Native Codec (SNC), a novel audio container format that stores music as independently encoded stems plus a low-energy mastering residual. By exploiting the lower information entropy of separated stems compared to mixed audio, SNC achieves a 38.2% file size reduction versus FLAC (7.76 MB vs. 12.55 MB for a 2:18 test track) while maintaining perceptual transparency (STOI = 0.996). Unlike existing formats, SNC enables context-aware adaptive playback, spatial audio rendering, and user-controlled remixing without requiring additional storage. Our experimental validation demonstrates that the stems-plus-residual architecture successfully decouples the conflicting requirements of compression efficiency and feature richness, offering a practical path toward next-generation audio distribution systems.
Primary: Wubble AI
All Institutions: Wubble AI
The main contribution of this paper is the introduction of the Stem-Native Codec (SNC), which innovatively combines efficient lossless audio storage with adaptive playback capabilities. This work presents a significant advancement in audio compression technology, addressing key limitations of existing formats and paving the way for future developments in audio distribution systems.
The methodology is well-structured, introducing the Stem-Native Codec (SNC) as a novel approach to audio storage that separates audio into independently encoded stems and a mastering residual. The theoretical framework is grounded in information theory, establishing a strong basis for the claim that separated stems have lower information entropy than mixed audio. The choice of using Opus for encoding stems is justified, and the detailed description of the encoding and decoding processes demonstrates a comprehensive understanding of audio compression techniques. However, the paper could benefit from clearer references to the sections mentioned in the contributions, as they are currently marked as [REF].
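The stems-plus-residual principle can be sketched in a few lines: stems are passed through a (lossy) codec, the residual between the original master and the sum of the decoded stems is stored alongside them, and decoding simply re-sums. The stand-in quantizer below is a placeholder for Opus, and parameters are illustrative, not SNC's actual configuration.

```python
import numpy as np

# Sketch of the stems-plus-residual idea: encode stems, then store a low-energy
# residual so the original master can be recovered at decode time. The "codec"
# here is a crude quantization stand-in for Opus; parameters are illustrative.

def fake_codec_roundtrip(x: np.ndarray) -> np.ndarray:
    """Stand-in for a lossy stem codec: 16-bit quantization as a rough proxy."""
    return np.round(x * 32767.0) / 32767.0

def encode(master: np.ndarray, stems: list) -> tuple:
    decoded_stems = [fake_codec_roundtrip(s) for s in stems]
    residual = master - sum(decoded_stems)        # low-energy mastering residual
    return decoded_stems, residual

def decode(decoded_stems, residual, gains=None):
    gains = gains or [1.0] * len(decoded_stems)   # user-controlled remixing hook
    return sum(g * s for g, s in zip(gains, decoded_stems)) + residual

rng = np.random.default_rng(0)
stems = [0.1 * rng.standard_normal(48000) for _ in range(4)]   # 1 s of audio, 4 stems
master = sum(stems)                                             # toy "mastered" mix
dec_stems, res = encode(master, stems)
print(np.max(np.abs(decode(dec_stems, res) - master)))          # ~0: exact reconstruction
```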
The experimental validation is robust, showcasing a significant file size reduction of 38.2% compared to FLAC while maintaining high perceptual quality (STOI = 0.996). The use of objective metrics such as spectral convergence and SNR adds credibility to the results. The paper effectively compares SNC with existing formats and highlights its advantages in terms of adaptive playback and spatial audio rendering. However, the experiments rely on a single test track, which may limit the generalizability of the findings.
The paper provides open-source encoder and decoder implementations, which is a strong point for reproducibility. The detailed encoding parameters and procedures are well-documented, allowing for potential replication of the results. However, the lack of a demo or project URL limits accessibility for interested researchers.
The primary limitation identified is the dependency on high-quality stems for effective encoding. The paper acknowledges that AI separation methods may introduce artifacts, which could affect the performance of SNC. Additionally, the decoding complexity is slightly higher than traditional formats, which may pose challenges for some applications. The need for standardized metadata schemas for adaptive playback features is also a potential barrier to widespread adoption.
The SNC has the potential to significantly influence music distribution by enabling smaller file sizes and enhanced playback experiences tailored to diverse environments. It opens up new avenues for artists to engage with their audience through remixing capabilities and adaptive features. The proposed format could also lead to reduced storage and bandwidth costs for streaming platforms, making advanced audio formats more accessible.
While recent years have witnessed rapid progress in speech synthesis, open-source singing voice synthesis (SVS) systems still face significant barriers to industrial deployment, particularly in terms of robustness and zero-shot generalization. In this report, we introduce SoulX-Singer, a high-quality open-source SVS system designed with practical deployment considerations in mind. SoulX-Singer supports controllable singing generation conditioned on either symbolic musical scores (MIDI) or melodic representations, enabling flexible and expressive control in real-world production workflows. Trained on more than 42,000 hours of vocal data, the system supports Mandarin Chinese, English, and Cantonese and consistently achieves state-of-the-art synthesis quality across languages under diverse musical conditions. Furthermore, to enable reliable evaluation of zero-shot SVS performance in practical scenarios, we construct SoulX-Singer-Eval, a dedicated benchmark with strict training-test disentanglement, facilitating systematic assessment in zero-shot settings.
Primary: Soul-AI Lab
All Institutions: Soul-AI Lab
SoulX-Singer represents a significant advancement in zero-shot singing voice synthesis, combining a large-scale dataset with innovative modeling techniques to achieve high-quality, flexible vocal generation across multiple languages. The comprehensive evaluation and robust methodology position this work as a valuable contribution to the field of machine learning and audio synthesis.
The methodology of SoulX-Singer is robust, leveraging a large-scale dataset of over 42,000 hours of vocal recordings to enhance zero-shot generalization capabilities. The dual-control mechanism (melody-control and score-control modes) is innovative, allowing for flexible synthesis based on different input types. The data processing pipeline is well-structured, ensuring high-quality vocal extraction and annotation, which is crucial for training effective models. The use of flow matching and a dedicated Singing Content Encoder to manage multimodal inputs is a significant advancement in the field.
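For readers unfamiliar with the generative backbone, the conditional flow-matching objective mentioned here can be sketched as follows: interpolate between a noise sample and the target acoustic features and regress the constant velocity under a conditioning vector. The tiny network and feature dimensions are placeholders, not SoulX-Singer's actual model or conditioning scheme.

```python
import torch
import torch.nn as nn

# Minimal conditional flow-matching objective: interpolate between noise x0 and
# data x1 and regress the velocity (x1 - x0) given a conditioning vector. The
# tiny MLP and feature sizes are illustrative, not SoulX-Singer's network.

velocity_net = nn.Sequential(nn.Linear(80 + 32 + 1, 256), nn.SiLU(), nn.Linear(256, 80))

def flow_matching_loss(x1: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """x1: target acoustic features (batch, 80); cond: content/score condition."""
    x0 = torch.randn_like(x1)                       # noise sample
    t = torch.rand(x1.size(0), 1)                   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    v_pred = velocity_net(torch.cat([xt, cond, t], dim=-1))
    return ((v_pred - (x1 - x0)) ** 2).mean()       # regress the constant velocity

loss = flow_matching_loss(torch.randn(16, 80), torch.randn(16, 32))
loss.backward()
```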
The experimental evaluation is thorough, utilizing two distinct benchmarks (GMO-SVS and SoulX-Singer-Eval) to assess performance across multiple dimensions, including melodic accuracy, intelligibility, and overall singing quality. The results consistently demonstrate that SoulX-Singer outperforms existing state-of-the-art models, showcasing its effectiveness in both controlled and zero-shot scenarios. The comprehensive metrics used for evaluation provide a clear picture of the model's capabilities.
The paper provides sufficient detail regarding the architecture, training process, and evaluation metrics, which supports reproducibility. The availability of the dataset and code on GitHub further enhances the potential for other researchers to replicate the study. However, the reliance on specific pretrained models for vocal extraction and transcription may pose some challenges in reproducing the exact results without access to those models.
One limitation of the study is the potential for voice impersonation and ethical concerns associated with the use of synthesized voices, which the authors acknowledge. Additionally, while the model shows strong performance across multiple languages, the dataset's composition may still limit its generalization to other languages or dialects not represented in the training data.
SoulX-Singer has significant implications for the music production industry, enabling creators to synthesize high-quality singing voices without the need for extensive vocal recordings. This technology could democratize music creation, allowing individuals without access to professional singers to produce high-quality vocal tracks. However, the ethical considerations surrounding voice synthesis and potential misuse must be addressed to ensure responsible deployment.
Speech tokenizers are foundational to speech language models, yet existing approaches face two major challenges: (1) balancing trade-offs between encoding semantics for understanding and acoustics for reconstruction, and (2) achieving low bit rates and low token rates. We propose Speech Diffusion Tokenizer (SiTok), a diffusion autoencoder that jointly learns semantic-rich representations through supervised learning and enables high-fidelity audio reconstruction with diffusion. We scale SiTok to 1.6B parameters and train it on 2 million hours of speech. Experiments show that SiTok outperforms strong baselines on understanding, reconstruction and generation tasks, at an extremely low token rate of 12.5 Hz and a bit rate of 200 bits per second.
Primary: Meta
All Institutions: Meta
The main contribution of this paper is the introduction of SiTok, a novel speech tokenizer that utilizes a diffusion autoencoder to achieve high-quality speech representation and reconstruction while maintaining low bit and token rates. This work significantly advances the field of speech processing by addressing key challenges in existing methodologies and providing a robust framework for future research and applications.
The proposed methodology of the Speech Diffusion Tokenizer (SiTok) is innovative, leveraging a diffusion autoencoder to jointly optimize quantization and reconstruction. The introduction of semantic regularization through a CTC decoder is a significant advancement, allowing the model to maintain semantic integrity while achieving high compression rates. The architecture effectively combines the strengths of diffusion models with the need for efficient speech tokenization, addressing the limitations of previous approaches that often relied on heuristic compromises. The design choices, such as the use of mel-spectrograms and the focus on low token rates, are well-justified and align with the objectives of scalable language modeling.
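A compact sketch of the two training signals described above follows: a diffusion-style reconstruction term on mel features and a CTC term that regularizes the learned representation toward transcripts. For brevity the CTC head here reads the continuous features directly, and all networks, sizes, the corruption path, and the loss weighting are placeholder assumptions rather than the actual 1.6B-parameter model.

```python
import torch
import torch.nn as nn

# Sketch of the two training signals described for SiTok: a diffusion-style
# reconstruction loss plus a CTC loss toward transcripts. Networks, sizes, the
# corruption path, and the loss weight are placeholders, not the actual model.

batch, frames, mel_dim, vocab = 4, 100, 80, 32
denoiser = nn.Linear(mel_dim + 1, mel_dim)          # stand-in for the diffusion decoder
ctc_head = nn.Linear(mel_dim, vocab)                # stand-in for the CTC decoder
ctc_criterion = nn.CTCLoss(blank=0)

mel = torch.randn(batch, frames, mel_dim)
t = torch.rand(batch, 1, 1)
noise = torch.randn_like(mel)
noisy = (1 - t) * mel + t * noise                   # toy corruption path
noise_pred = denoiser(torch.cat([noisy, t.expand(-1, frames, 1)], dim=-1))
recon_loss = ((noise_pred - noise) ** 2).mean()

log_probs = ctc_head(mel).log_softmax(-1).transpose(0, 1)   # (T, batch, vocab)
targets = torch.randint(1, vocab, (batch, 20))
ctc_loss = ctc_criterion(log_probs, targets,
                         torch.full((batch,), frames, dtype=torch.long),
                         torch.full((batch,), 20, dtype=torch.long))

total = recon_loss + 0.1 * ctc_loss                 # weighting is an assumption
total.backward()
```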
The experiments conducted are extensive, utilizing a large dataset of 2 million hours of speech, which enhances the robustness of the findings. The paper provides a comprehensive evaluation across various tasks, including speech reconstruction, emotion recognition, and automatic speech recognition, demonstrating that SiTok outperforms existing baselines significantly. The results are well-presented, with clear metrics for comparison, and the ablation studies effectively highlight the contributions of different components of the model.
The paper includes detailed descriptions of the model architecture, training settings, and evaluation protocols, which are crucial for reproducibility. The authors have made efforts to ensure that their work can be replicated, which is commendable. However, the absence of a publicly available code repository limits the ease of reproducibility for practitioners in the field.
While the proposed model shows promising results, it may still face challenges in real-world applications, such as the potential for overfitting due to the large number of parameters (1.6B) and the reliance on extensive training data. Additionally, the computational efficiency during inference, although improved with shortcut fine-tuning, may still be a concern for deployment in resource-constrained environments. The paper does not address the ethical implications of misuse in generating synthetic speech, which is an important consideration in today's landscape.
The development of SiTok has significant implications for speech technology, particularly in applications such as automatic speech recognition, text-to-speech systems, and conversational agents. By enabling high-fidelity audio reconstruction at low bit rates, this work could enhance accessibility and usability in various domains, including assistive technologies and real-time communication systems. The potential for misuse, such as generating deceptive synthetic speech, highlights the need for responsible deployment and monitoring of such technologies.
Spatial audio is crucial for creating compelling immersive 360-degree video experiences. However, generating realistic spatial audio, such as first-order ambisonics (FOA), from 360-degree videos in complex acoustic scenes remains challenging. Existing methods often overlook the dynamic nature and acoustic complexity of 360-degree scenes, fail to fully account for dynamic sound sources, and neglect complex environmental effects such as occlusion, reflections, and reverberation, which are influenced by scene geometries and materials. We propose DynFOA, a framework based on dynamic acoustic perception and conditional diffusion, for generating high-fidelity FOA from 360-degree videos. DynFOA first performs visual processing via a video encoder, which detects and localizes multiple dynamic sound sources, estimates their depth and semantics, and reconstructs the scene geometry and materials using 3D Gaussian Splatting. This reconstruction technique accurately models occlusion, reflections, and reverberation based on the geometries and materials of the reconstructed 3D scene and the listener's viewpoint. The audio encoder then captures the spatial motion of the sound sources as 4D spatio-temporal trajectories to fine-tune the diffusion-based FOA generator. The fine-tuned FOA generator adjusts spatial cues in real time, ensuring consistent directional fidelity during listener head rotation and complex environmental changes. Extensive evaluations demonstrate that DynFOA consistently outperforms existing methods across metrics such as spatial accuracy, acoustic fidelity, and distribution matching, while also improving the user experience. Therefore, DynFOA provides a robust and scalable approach to rendering realistic dynamic spatial audio for VR and immersive media applications.
Primary: Martha Stewart Enterprises
All Institutions: Martha Stewart Enterprises, Allied Widgets Research
DynFOA presents a significant advancement in the generation of spatial audio for complex acoustic environments. The integration of visual and acoustic processing through a conditional diffusion model marks a notable contribution to the field, addressing critical challenges in immersive audio rendering.
The methodology presented in DynFOA is robust, integrating a multi-modal approach that combines visual processing with audio generation through conditional diffusion. The use of 3D Gaussian Splatting for scene reconstruction is particularly innovative, allowing for a detailed understanding of the environment that enhances acoustic fidelity. The model's architecture, which includes separate encoders for video and audio, effectively captures the complexities of dynamic sound sources in 360-degree videos. However, the reliance on specific datasets and the complexity of the model may limit its applicability in diverse real-world scenarios.
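For context on the output format being generated, the snippet below encodes a mono source at a given direction into the four FOA channels using the common ACN channel order with SN3D normalization; it illustrates the target representation only, not DynFOA's generation model.

```python
import numpy as np

# Encoding a mono source at a given direction into first-order ambisonics (FOA),
# using the common ACN channel order with SN3D normalization. This illustrates
# the output format DynFOA generates; it is not the generation model itself.

def encode_foa(mono: np.ndarray, azimuth: float, elevation: float) -> np.ndarray:
    """mono: (samples,); angles in radians. Returns (4, samples) = [W, Y, Z, X]."""
    w = mono                                      # omnidirectional component
    y = mono * np.sin(azimuth) * np.cos(elevation)
    z = mono * np.sin(elevation)
    x = mono * np.cos(azimuth) * np.cos(elevation)
    return np.stack([w, y, z, x])

rng = np.random.default_rng(0)
source = rng.standard_normal(48000)               # 1 s of a toy mono source
foa = encode_foa(source, azimuth=np.pi / 4, elevation=0.0)
print(foa.shape)                                  # (4, 48000)
```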
The experimental evaluation is comprehensive, utilizing a well-structured dataset (Dyn360) that includes various acoustic scenarios. The results demonstrate a clear superiority of DynFOA over baseline methods across multiple metrics, including spatial accuracy and acoustic fidelity. The inclusion of both objective metrics and user studies strengthens the findings, providing a balanced view of the model's performance. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the training process. While the methodology is described in depth, the absence of a public repository or demo limits the ability of other researchers to replicate the results.
Key limitations include the model's performance in uncontrolled environments, as the experiments were primarily conducted in indoor settings. Additionally, the approach may not generalize well to different acoustic conditions, such as underwater environments or those with varying material properties. The reliance on specific datasets could also introduce biases that affect the generalizability of the findings.
The potential applications of DynFOA are significant, particularly in the fields of virtual reality, augmented reality, and immersive media. By improving the realism of spatial audio, this work can enhance user experiences in gaming, film, and educational applications. The integration of visual and acoustic modalities could pave the way for more immersive storytelling and interactive experiences.
Realistic sound propagation is essential for immersion in a virtual scene, yet physically accurate wave-based simulations remain computationally prohibitive for real-time applications. Wave coding methods address this limitation by precomputing and compressing impulse responses of a given scene into a set of scalar acoustic parameters, which can reach unmanageable sizes in large environments with many source-receiver pairs. We introduce Reciprocal Latent Fields (RLF), a memory-efficient framework for encoding and predicting these acoustic parameters. The RLF framework employs a volumetric grid of trainable latent embeddings decoded with a symmetric function, ensuring acoustic reciprocity. We study a variety of decoders and show that leveraging Riemannian metric learning leads to a better reproduction of acoustic phenomena in complex scenes. Experimental validation demonstrates that RLF maintains replication quality while reducing the memory footprint by several orders of magnitude. Furthermore, a MUSHRA-like subjective listening test indicates that sound rendered via RLF is perceptually indistinguishable from ground-truth simulations.
Primary: unknown
All Institutions: unknown
The paper presents a novel framework for modeling sound propagation using latent embeddings, significantly improving memory efficiency and maintaining perceptual quality in audio rendering. The technical contributions, particularly the integration of Riemannian metric learning, position this work as a meaningful advancement in the field of audio machine learning, with practical applications in immersive environments.
The paper introduces the Reciprocal Latent Fields (RLF) framework, which innovatively utilizes a volumetric grid of trainable latent embeddings to encode and predict acoustic parameters. The methodology emphasizes the importance of acoustic reciprocity by employing symmetric functions in the decoding process. The use of Riemannian metric learning to enhance the accuracy of acoustic phenomena reproduction is a notable advancement over simpler Euclidean models. The approach is well-structured, with clear definitions and justifications for the chosen methods, including the training process and the architecture of the decoders.
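The reciprocity constraint is easy to picture in code: if the decoder only sees symmetric combinations of the source and receiver embeddings, swapping the two endpoints cannot change the predicted parameters. The sketch below is a minimal stand-in rather than the paper's decoder.

```python
import torch
import torch.nn as nn

class SymmetricDecoder(nn.Module):
    """Decode acoustic parameters from two latent embeddings so that
    decode(z_src, z_rcv) == decode(z_rcv, z_src), i.e. acoustic reciprocity
    holds by construction."""

    def __init__(self, latent_dim: int, num_params: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_params),
        )

    def forward(self, z_src: torch.Tensor, z_rcv: torch.Tensor) -> torch.Tensor:
        # Sum and elementwise product are both invariant to swapping the inputs.
        sym = torch.cat([z_src + z_rcv, z_src * z_rcv], dim=-1)
        return self.mlp(sym)

decoder = SymmetricDecoder(latent_dim=32, num_params=4)
a, b = torch.randn(1, 32), torch.randn(1, 32)
print(torch.allclose(decoder(a, b), decoder(b, a)))  # True: reciprocity by design
```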
The experimental validation is robust, featuring a variety of models and configurations tested across two distinct environments (Audio Gym and Wwise Audio Lab). The results demonstrate significant memory efficiency gains while maintaining high fidelity in sound reproduction, as evidenced by both quantitative metrics and qualitative assessments through MUSHRA-like listening tests. The paper provides a thorough analysis of the performance of different models, comparing their accuracy and computational costs effectively.
While the paper details the methodology and experimental setup comprehensively, it lacks explicit URLs for code or data repositories, which could hinder reproducibility. The description of the training data generation and model training processes is clear, but without access to the actual implementation, independent verification of results may be challenging.
The primary limitations identified include the lack of implementation for spatial compression of the latent fields and the restriction to static geometries, which limits the applicability of the RLF framework in dynamic environments. The authors acknowledge these limitations and suggest future work to address them, indicating an awareness of the framework's current constraints.
The RLF framework has significant implications for real-time audio rendering in virtual environments, particularly in gaming and simulation contexts. By reducing memory requirements while maintaining high-quality sound reproduction, this work could enhance user experiences in immersive environments. The potential for extending the framework to other reciprocal quantities also opens avenues for further research and applications beyond acoustics.
Although diffusion-based, non-autoregressive text-to-speech (TTS) systems have demonstrated impressive zero-shot synthesis capabilities, their efficacy is still hindered by two key challenges: the difficulty of text-speech alignment modeling and the high computational overhead of the iterative denoising process. To address these limitations, we propose ARCHI-TTS that features a dedicated semantic aligner to ensure robust temporal and semantic consistency between text and audio. To overcome high computational inference costs, ARCHI-TTS employs an efficient inference strategy that reuses encoder features across denoising steps, drastically accelerating synthesis without performance degradation. An auxiliary CTC loss applied to the condition encoder further enhances the semantic understanding. Experimental results demonstrate that ARCHI-TTS achieves a WER of 1.98% on LibriSpeech-PC test-clean, and 1.47%/1.42% on SeedTTS test-en/test-zh with a high inference efficiency, consistently outperforming recent state-of-the-art TTS systems.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The main contribution of this paper is the introduction of ARCHI-TTS, a novel non-autoregressive text-to-speech model that effectively addresses the challenges of text-speech alignment and computational efficiency through innovative architectural components. The comprehensive analysis of its technical contributions, methodology, and results positions it as a significant advancement in the TTS domain, with potential for impactful applications in various audio synthesis tasks.
The methodology proposed in ARCHI-TTS is innovative, combining a semantic aligner with a flow-matching decoder to address the challenges of text-speech alignment and inference efficiency in TTS systems. The use of a low-token-rate representation derived from a Variational Autoencoder (VAE) is a significant advancement, allowing for a more compact representation of audio data while maintaining quality. The architecture's reliance on a transformer-based semantic aligner to create self-supervised text-aligned semantic representations is a novel approach that enhances the model's ability to generate coherent and contextually relevant speech. The integration of an auxiliary CTC loss to bolster semantic understanding further demonstrates a thoughtful approach to improving the model's performance.
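The efficiency argument rests on computing the condition encoder once and reusing its output at every denoising step. The following toy loop, with hypothetical module names and a plain Euler update standing in for the actual flow-matching sampler, shows the pattern.

```python
import torch
import torch.nn as nn

# Hypothetical modules standing in for the condition encoder and the
# flow-matching/diffusion decoder; names and shapes are illustrative only.
cond_encoder = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 512))
denoiser = nn.Linear(512 + 512 + 1, 512)  # consumes [x_t, condition, t]

def synthesize(text_emb: torch.Tensor, num_steps: int = 16) -> torch.Tensor:
    """Run iterative denoising while invoking the condition encoder only once."""
    cond = cond_encoder(text_emb)        # computed a single time ...
    x = torch.randn_like(cond)           # start from noise in latent space
    for step in range(num_steps):
        t = torch.full((x.shape[0], 1), step / num_steps)
        # ... and reused at every denoising step, avoiding repeated encoder passes.
        velocity = denoiser(torch.cat([x, cond, t], dim=-1))
        x = x + velocity / num_steps     # simple Euler update
    return x

latent = synthesize(torch.randn(2, 256))
print(latent.shape)  # torch.Size([2, 512])
```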
The experimental evaluation is robust, utilizing a large-scale multilingual dataset (100k hours) for training and multiple established benchmarks for testing. The reported results, including a WER of 1.98% on the LibriSpeech-PC test-clean and competitive performance on the SeedTTS test set, indicate that ARCHI-TTS outperforms several state-of-the-art models while using fewer computational resources. The inclusion of ablation studies adds depth to the evaluation, providing insights into the contributions of various architectural components. However, the paper could benefit from more extensive subjective evaluations to further validate the quality of the generated speech.
The paper provides sufficient details regarding the model configuration, training process, and evaluation metrics, which should facilitate reproducibility. The authors mention the use of specific hardware (8 RTX 5090 GPUs) and training duration, which are valuable for replicating the experiments. However, the lack of a direct link to the code repository limits accessibility for other researchers wishing to reproduce the results.
While the proposed model shows promising results, it does exhibit some limitations, such as slightly lagging behind other state-of-the-art models in subjective quality evaluations. The reliance on a specific dataset (Emilia) may also limit the generalizability of the findings. Additionally, the computational efficiency improvements come at the cost of some performance degradation, which may need further exploration.
The advancements presented in ARCHI-TTS have significant implications for the field of TTS and audio synthesis, particularly in enhancing the efficiency and quality of speech generation. The model's ability to perform zero-shot synthesis with high fidelity could lead to broader applications in voice cloning, audiobooks, and interactive voice response systems. As TTS technology continues to evolve, the methodologies introduced in this paper could influence future research directions and commercial applications.
Neural audio codecs are widely used for audio compression and can be integrated into token-based language models. Traditional codecs preserve acoustic details well but lack semantic information. Recent hybrid codecs attempt to incorporate semantic information through distillation, but this often degrades reconstruction performance, making it difficult to achieve both. To address this limitation, we introduce STACodec, a unified codec that integrates semantic information from self-supervised learning (SSL) models into the first layer of residual vector quantization (RVQ-1) via semantic token assignment (STA). To further eliminate reliance on SSL-based semantic tokenizers and improve efficiency during inference, we propose a semantic pre-distillation (SPD) module, which predicts semantic tokens directly for assignment to the first RVQ layer during inference. Experimental results show that STACodec outperforms existing hybrid codecs in both audio reconstruction and downstream semantic tasks, demonstrating a better balance between acoustic fidelity and semantic capability.
Primary: University of California
All Institutions: University of California
The main contribution of this paper is the introduction of STACodec, a novel audio codec that integrates semantic information through a unique token assignment mechanism, achieving a balance between acoustic fidelity and semantic capability. This work significantly advances the state-of-the-art in audio codecs by addressing the limitations of existing hybrid models and providing a clear pathway for future research in multimodal audio processing.
The methodology presented in STACodec is innovative, integrating semantic token assignment (STA) into the first layer of residual vector quantization (RVQ-1) to enhance both acoustic fidelity and semantic information in audio codecs. The introduction of the Semantic Pre-Distillation (SPD) module is particularly noteworthy, as it reduces reliance on SSL-based tokenizers and improves inference efficiency. The methodology is well-structured, with clear explanations of the architecture and training objectives, although some equations and references to figures are incomplete in the provided text.
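The semantic token assignment idea can be sketched as a residual vector quantizer whose first layer takes its code indices from an external semantic tokenizer rather than from a nearest-neighbour search, while later layers quantize the residual as usual. The code below is an illustrative simplification, not the STACodec implementation; the shapes, codebook sizes, and omission of the SPD predictor are assumptions.

```python
import torch

def rvq_with_semantic_first_layer(x, semantic_ids, codebooks):
    """Simplified residual vector quantization where the first layer's code
    indices come from an external semantic tokenizer; remaining layers
    quantize the residual with a nearest-neighbour search.

    x:            (T, D) frame features to quantize
    semantic_ids: (T,)   semantic token ids for the first codebook
    codebooks:    list of (K, D) tensors, one per RVQ layer
    """
    residual = x
    quantized = torch.zeros_like(x)
    all_ids = []
    for layer, cb in enumerate(codebooks):
        if layer == 0:
            ids = semantic_ids                    # semantic token assignment
        else:
            dists = torch.cdist(residual, cb)     # nearest code for the residual
            ids = dists.argmin(dim=-1)
        chosen = cb[ids]
        quantized = quantized + chosen
        residual = residual - chosen
        all_ids.append(ids)
    return quantized, all_ids

T, D, K = 50, 64, 256
codebooks = [torch.randn(K, D) for _ in range(4)]
x = torch.randn(T, D)
semantic_ids = torch.randint(0, K, (T,))
q, ids = rvq_with_semantic_first_layer(x, semantic_ids, codebooks)
print(q.shape, len(ids))  # torch.Size([50, 64]) 4
```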
The experimental evaluation is robust, utilizing a comprehensive dataset (LibriSpeech) and employing multiple metrics (PESQ, STOI, ViSQOL) for audio reconstruction quality, as well as downstream tasks like ASR and intent classification. The results demonstrate that STACodec outperforms existing hybrid codecs, indicating effective integration of semantic information without significant degradation of audio quality. However, the paper could benefit from more detailed statistical analysis of results and comparisons with additional baseline methods.
The paper provides a reasonable level of detail regarding the training configurations, model architectures, and evaluation metrics, which supports reproducibility. The availability of the code on GitHub further enhances the potential for other researchers to replicate the findings. However, the absence of specific hyperparameter settings and training procedures might hinder complete reproducibility.
One limitation is the reliance on the LibriSpeech dataset, which may not fully represent the diversity of real-world audio scenarios. Additionally, while the SPD module improves efficiency, it may introduce trade-offs in reconstruction fidelity, which the authors acknowledge but do not explore in depth. The paper could also address potential scalability issues when applying STACodec to larger or more complex datasets.
The proposed STACodec has significant implications for the fields of audio processing and machine learning, particularly in applications involving speech recognition and multimodal language models. By effectively balancing acoustic fidelity and semantic information, STACodec could enhance the performance of various audio-related tasks, making it a valuable contribution to the development of more sophisticated audio codecs.
Advances in AIGC technologies have enabled the synthesis of highly realistic audio deepfakes capable of deceiving human auditory perception. Although numerous audio deepfake detection (ADD) methods have been developed, most rely on local temporal/spectral features or pairwise relations, overlooking high-order interactions (HOIs). HOIs capture discriminative patterns that emerge from multiple feature components beyond their individual contributions. We propose HyperPotter, a hypergraph-based framework that explicitly models these synergistic HOIs through clustering-based hyperedges with class-aware prototype initialization. Extensive experiments demonstrate that HyperPotter surpasses its baseline by an average relative gain of 22.15% across 11 datasets and outperforms state-of-the-art methods by 13.96% on 4 challenging cross-domain datasets, demonstrating superior generalization to diverse attacks and speakers.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of HyperPotter, a hypergraph-based framework for audio deepfake detection that effectively captures high-order interactions, demonstrating substantial improvements over existing methods. This work represents a meaningful advancement in the field of audio deepfake detection, with the potential to influence future research directions and applications.
The proposed HyperPotter framework introduces a novel approach to audio deepfake detection by leveraging hypergraphs to model high-order interactions (HOIs). This is a significant departure from traditional methods that focus primarily on local features or pairwise relations. The use of clustering-based hyperedges with class-aware prototype initialization is innovative and suggests a deeper understanding of the relationships between features. However, the paper could benefit from a more detailed explanation of the hypergraph construction process and the specific clustering techniques employed.
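One simple way to realize clustering-based hyperedges is to cluster frame embeddings around provided prototypes and treat each cluster as a hyperedge, recording membership in an incidence matrix. The sketch below assumes hard, single-cluster membership and uses scikit-learn's KMeans with prototype initialization; the paper's actual construction and class-aware details may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_hyperedges(frame_feats: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Build a hypergraph incidence matrix by clustering frame features.

    frame_feats: (T, D) per-frame embeddings (the hypergraph nodes)
    prototypes:  (E, D) initial cluster centres, e.g. class-aware prototypes
    Returns H of shape (T, E) with H[t, e] = 1 if frame t belongs to hyperedge e.
    """
    num_edges = prototypes.shape[0]
    km = KMeans(n_clusters=num_edges, init=prototypes, n_init=1).fit(frame_feats)
    H = np.zeros((frame_feats.shape[0], num_edges), dtype=np.float32)
    H[np.arange(frame_feats.shape[0]), km.labels_] = 1.0
    return H

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 32)).astype(np.float32)
protos = rng.normal(size=(8, 32)).astype(np.float32)
H = build_hyperedges(feats, protos)
print(H.shape, H.sum(axis=1).min())  # (200, 8) 1.0 -- every frame joins one hyperedge
```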
The experiments are extensive, covering 11 datasets and demonstrating a relative gain of 22.15% over baseline methods, as well as a 13.96% improvement over state-of-the-art methods on challenging cross-domain datasets. This breadth of evaluation is commendable and indicates robust performance across various scenarios. However, the paper lacks a detailed comparison with other recent methodologies in the field, which could provide further context for the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Clear guidelines on how to replicate the experiments, including hyperparameter settings and dataset access, would enhance the paper's impact.
One limitation is the potential complexity of the hypergraph model, which may require significant computational resources and expertise to implement. Additionally, while the results are promising, the paper does not address the scalability of the approach or its performance in real-time applications.
The implications of this research are significant, particularly in the context of increasing audio deepfake threats. The ability to detect sophisticated audio manipulations could enhance security in various applications, including media verification, cybersecurity, and content authenticity. The methodology could also inspire further research into high-order interactions in other domains beyond audio.
Despite the growing success of Large Speech Language Models (LSLMs) in processing short-term acoustic signals, their extension to long-form audio understanding is severely bottlenecked. This limitation stems from the limited context length and the exorbitant memory footprints required for long-form inference. In this work, we propose Speech-XL, a new model that capitalizes on the intrinsic key-value (KV) sparsification capacity of Large Language Models (LLMs) to achieve high-ratio speech input compression. Specifically, we introduce a novel special token, the Speech Summarization Token (SST), for each speech interval to encapsulate the intra-interval speech information into its associated KV pairs. The SST module is trained via instruction fine-tuning, employing a curriculum learning strategy where the SST learns to compress information in a progressive manner--advancing from low-ratio (simple) to high-ratio (challenging) compression. Despite utilizing significantly less training data than other baselines, our model achieves highly competitive performance on major benchmarks, including LongSpeech and AUDIOMARATHON. By addressing the long-standing bottlenecks in long-form audio modeling, our approach offers a novel perspective on the condensation of extensive acoustic sequences.
Primary: Nankai University
All Institutions: Nankai University, Alibaba International Digital Commerce, University of Exeter
Speech-XL presents a significant advancement in long-form speech understanding through its innovative use of Speech Summarization Tokens and curriculum learning strategies. This work not only addresses critical limitations in existing models but also sets the stage for future developments in efficient audio processing methodologies.
The methodology presented in Speech-XL is innovative, particularly with the introduction of the Speech Summarization Token (SST) for compressing long-form audio data. The model effectively addresses the limitations of existing Large Speech Language Models (LSLMs) by leveraging a curriculum learning approach to progressively train the SST for varying compression ratios. This structured training strategy enhances the model's ability to maintain semantic integrity while reducing memory usage. The dual-adapter bridge architecture is also a notable contribution, allowing for effective integration of acoustic and semantic features into the LLM's framework.
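Two mechanics carry most of the idea: after an interval is processed, only the key-value pairs at the summarization-token positions are retained, and the target compression ratio is scheduled from easy to hard during training. The snippet below sketches both with hypothetical shapes and a simple linear curriculum; it is not the authors' training code.

```python
import torch

def compress_kv_to_sst(keys, values, sst_positions):
    """Keep only the key/value pairs at summarization-token positions.

    keys, values:  (num_layers, seq_len, d) cached attention states for one interval
    sst_positions: indices of the SST tokens within the interval
    After compression, later turns attend only to the retained SST entries.
    """
    idx = torch.as_tensor(sst_positions)
    return keys[:, idx, :], values[:, idx, :]

def curriculum_ratio(step, total_steps, low=4, high=64):
    """Progress the target compression ratio from easy (low) to hard (high)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return int(low + frac * (high - low))

layers, seq, d = 12, 400, 64
keys, values = torch.randn(layers, seq, d), torch.randn(layers, seq, d)
k_c, v_c = compress_kv_to_sst(keys, values, sst_positions=[99, 199, 299, 399])
print(k_c.shape, curriculum_ratio(step=500, total_steps=1000))
# torch.Size([12, 4, 64]) 34
```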
The experimental setup is robust, utilizing significant datasets like LongSpeech and AudioMarathon to evaluate the model's performance across various tasks. The results indicate that Speech-XL outperforms existing models in several benchmarks, demonstrating its effectiveness in long-form audio understanding. The comparative analysis with upper-bound models and other state-of-the-art systems provides a clear picture of its capabilities, although the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper outlines a clear training process and provides details on the datasets used, model architecture, and training parameters. However, the absence of a publicly accessible code repository or demo limits reproducibility. Future work should consider releasing the model and training scripts to enhance transparency and allow for independent verification of results.
One limitation is the reliance on a relatively small training dataset for certain tasks, which may affect the generalizability of the model across diverse audio contexts. Additionally, the model's performance in out-of-domain evaluations suggests that it may struggle with audio types not represented in the training data. The authors acknowledge the need for broader training data to fully leverage the SST mechanism's potential.
The advancements in long-form speech understanding have significant implications for various applications, including transcription services, virtual assistants, and accessibility technologies. By improving the efficiency and accuracy of processing long audio sequences, Speech-XL could enhance user experiences in these domains. The work also opens avenues for future research into more sophisticated audio processing techniques that could benefit from the SST framework.
Recent advances in speech synthesis and editing have made speech spoofing increasingly challenging. However, most existing methods treat spoofing as binary classification, overlooking that diverse spoofing techniques manipulate multiple, coupled speech attributes and their semantic effects. In this paper, we introduce HoliAntiSpoof, the first audio large language model (ALLM) framework for holistic speech anti-spoofing analysis. HoliAntiSpoof reformulates spoofing analysis as a unified text generation task, enabling joint reasoning over spoofing methods, affected speech attributes, and their semantic impacts. To support semantic-level analysis, we introduce DailyTalkEdit, a new anti-spoofing benchmark that simulates realistic conversational manipulations and provides annotations of semantic influence. Extensive experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across multiple settings, while preliminary results show that in-context learning further improves out-of-domain generalization. These findings indicate that ALLMs not only enhance speech spoofing detection performance but also enable interpretable analysis of spoofing behaviors and their semantic effects, pointing towards more trustworthy and explainable speech security. Data and code are publicly available.
Primary: Shanghai Artificial Intelligence Laboratory
All Institutions: Shanghai Artificial Intelligence Laboratory, Nanjing University
The paper presents HoliAntiSpoof, a pioneering framework that integrates holistic speech spoofing analysis with ALLMs, significantly advancing the field of audio anti-spoofing. The innovative approach and comprehensive evaluation demonstrate its potential to enhance speech security and understanding of spoofing behaviors.
The paper introduces HoliAntiSpoof, a novel framework that reformulates speech anti-spoofing as a unified text generation task using an audio large language model (ALLM). This approach allows for holistic analysis of spoofing techniques, integrating authenticity classification, spoofing method identification, and semantic influence analysis. The methodology is innovative as it combines traditional signal-level detection with semantic reasoning, addressing a gap in existing research that primarily focuses on binary classification. The introduction of the DailyTalkEdit dataset to support semantic analysis is a significant contribution, allowing for more realistic evaluations of spoofing impacts in conversational contexts.
The experiments demonstrate that HoliAntiSpoof outperforms conventional baselines across various settings, including in-domain and out-of-domain evaluations. The authors provide extensive results that validate the effectiveness of their model, particularly in terms of robustness to domain shifts. The use of multiple datasets, including their newly proposed ones, strengthens the experimental design. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The authors have made their data and code publicly available, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training procedures, which could hinder full reproducibility for other researchers.
One limitation is the reliance on the quality of the datasets, particularly the DailyTalkEdit, which may not cover all possible spoofing scenarios. Additionally, while the model shows promise in generalization, the performance on truly unseen spoofing methods and languages remains to be fully validated. The paper also does not address potential adversarial uses of the methodology, which could be a concern given the nature of the research.
The research has significant implications for speech security, particularly in combating the rising threats posed by speech deepfakes. By providing a more nuanced understanding of spoofing techniques and their semantic impacts, the framework could enhance the development of more robust detection systems. However, there is a risk that the methodologies developed could also be exploited by malicious actors to improve spoofing techniques.
We study the fine-grained text-to-audio (T2A) generation task. While recent models can synthesize high-quality audio from text descriptions, they often lack precise control over attributes such as loudness, pitch, and sound events. Unlike prior approaches that retrain models for specific control types, we propose to train ControlNet models on top of pre-trained T2A backbones to achieve controllable generation over loudness, pitch, and event roll. We introduce two designs, T2A-ControlNet and T2A-Adapter, and show that the T2A-Adapter model offers a more efficient structure with strong control ability. With only 38M additional parameters, T2A-Adapter achieves state-of-the-art performance on the AudioSet-Strong in both event-level and segment-level F1 scores. We further extend this framework to audio editing, proposing T2A-Editor for removing and inserting audio events at time locations specified by instructions. Models, code, dataset pipelines, and benchmarks will be released to support future research on controllable audio generation and editing.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The paper presents the Audio ControlNet framework, which enhances text-to-audio generation and editing capabilities through lightweight auxiliary networks, achieving state-of-the-art performance with efficient parameter usage. The methodology and results indicate a meaningful contribution to the field of audio generation, with significant implications for creative industries.
The paper introduces the Audio ControlNet framework, which innovatively builds on pre-trained text-to-audio (T2A) models by integrating lightweight auxiliary networks for fine-grained control over audio attributes such as loudness, pitch, and sound events. The two proposed architectures, T2A-ControlNet and T2A-Adapter, are well-structured, with T2A-Adapter demonstrating efficiency through fewer parameters while maintaining high performance. The methodology is sound, leveraging established techniques from the ControlNet paradigm and adapting them to the audio domain, thus showcasing a thoughtful approach to enhancing existing models without extensive retraining.
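The adapter design can be illustrated with a small control encoder whose output is added to the frozen backbone's hidden states through a zero-initialized projection, so that at the start of training the backbone's behaviour is unchanged. The sketch below follows that ControlNet-style recipe with assumed dimensions; it is not the released T2A-Adapter code.

```python
import torch
import torch.nn as nn

class T2AControlAdapter(nn.Module):
    """Lightweight adapter sketch: encode a frame-level control signal
    (e.g. loudness, pitch, or event roll) and add it to the frozen backbone's
    hidden states through a zero-initialized projection, so training starts
    from the unmodified backbone behaviour."""

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(control_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, backbone_hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # backbone_hidden: (B, T, hidden_dim), control: (B, T, control_dim)
        return backbone_hidden + self.zero_proj(self.encode(control))

adapter = T2AControlAdapter(control_dim=3, hidden_dim=256)
h = torch.randn(2, 100, 256)
ctrl = torch.randn(2, 100, 3)
print(torch.allclose(adapter(h, ctrl), h))  # True at init: the adapter is a no-op
```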
The experiments are comprehensive, utilizing the AudioSet-Strong dataset for both training and evaluation, which is appropriate given the task. The results indicate that T2A-Adapter achieves state-of-the-art performance in sound event detection metrics, outperforming existing models while using significantly fewer parameters. The paper includes both objective metrics (F1 scores) and subjective evaluations (MOS), providing a well-rounded assessment of model performance. However, the paper could benefit from more detailed comparisons with a broader range of baseline models to further validate its claims.
The authors mention plans to release models, code, dataset pipelines, and benchmarks, which is a positive step towards reproducibility. However, specific implementation details, such as hyperparameter settings and training configurations, could be more explicitly stated to enhance clarity and facilitate replication by other researchers.
The paper acknowledges limitations, such as the computational constraints that prevented exhaustive hyperparameter searches and the focus on a limited set of control conditions. Additionally, the reliance on generalization for multi-condition control at inference time may not be robust across all scenarios. Future work is suggested to explore richer control signals and more comprehensive multi-condition training.
The framework has significant potential applications in sound design, music creation, and video production, where precise audio generation and editing are crucial. The ability to manipulate audio attributes with fine granularity can enhance creative workflows and enable new forms of audio content generation. However, ethical considerations regarding the misuse of generated audio, such as impersonation or disinformation, must be addressed to ensure responsible deployment.
Sparse Autoencoders (SAEs) are powerful tools for interpreting neural representations, yet their use in audio remains underexplored. We train SAEs across all encoder layers of Whisper and HuBERT, provide an extensive evaluation of their stability and interpretability, and show their practical utility. Over 50% of the features remain consistent across random seeds, and reconstruction quality is preserved. SAE features capture general acoustic and semantic information as well as specific events, including environmental noises and paralinguistic sounds (e.g. laughter, whispering), and disentangle them effectively, requiring removal of only 19-27% of features to erase a concept. Feature steering reduces Whisper's false speech detections by 70% with negligible WER increase, demonstrating real-world applicability. Finally, we find SAE features correlated with human EEG activity during speech perception, indicating alignment with human neural processing. The code and checkpoints are available at https://github.com/audiosae/audiosae_demo.
Primary: Huawei Noah's Ark Lab
All Institutions: Huawei Noah's Ark Lab
This paper presents a comprehensive investigation into the application of Sparse Autoencoders for interpreting audio models, significantly advancing the understanding of audio representations and their alignment with human cognitive processes. The innovative methodology and rigorous experimental evaluation contribute valuable insights to the field of machine learning in audio processing.
The paper employs Sparse Autoencoders (SAEs) to analyze the activations of Whisper and HuBERT models, providing a systematic approach to feature extraction and interpretability in audio processing. The methodology includes a comprehensive evaluation of feature stability, interpretability, and practical applications, which is a significant advancement in the field. The use of various metrics for validation and the introduction of novel techniques for feature steering and EEG correlation analysis enhance the robustness of the methodology.
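For readers unfamiliar with the setup, a sparse autoencoder maps model activations into a much wider, sparsely active feature space and back, and feature steering edits one latent before decoding. The Top-K variant below is a generic sketch with assumed sizes, not the checkpointed models from the repository.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Sparse autoencoder sketch with a Top-K activation: encode model
    activations into a wide, sparse feature space and decode them back."""

    def __init__(self, d_model: int, d_features: int, k: int):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_features)
        self.dec = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        pre = torch.relu(self.enc(x))
        # Keep only the k strongest features per token; zero out the rest.
        vals, idx = pre.topk(self.k, dim=-1)
        return torch.zeros_like(pre).scatter_(-1, idx, vals)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.encode(x))

def steer(sae: TopKSAE, acts: torch.Tensor, feature_id: int, scale: float = 0.0):
    """Feature steering: rescale (here, ablate) one feature and map the edited
    code back to the model's activation space."""
    z = sae.encode(acts)
    z[..., feature_id] = z[..., feature_id] * scale
    return sae.dec(z)

sae = TopKSAE(d_model=512, d_features=4096, k=32)
acts = torch.randn(10, 512)          # e.g. encoder activations for 10 frames
steered = steer(sae, acts, feature_id=123, scale=0.0)
print(steered.shape)                 # torch.Size([10, 512])
```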
The experiments are well-structured, utilizing a diverse corpus of audio data for training and evaluation. The authors demonstrate the effectiveness of SAEs in capturing semantic and paralinguistic information, with results showing a substantial reduction in false detections when steering Whisper's features. The correlation with EEG activity adds a neuroscientific dimension to the findings, indicating a deeper understanding of audio processing in relation to human cognition.
The paper provides detailed implementation information, including model architectures, training setups, and hyperparameters, which supports reproducibility. The availability of code and checkpoints on GitHub further enhances the potential for other researchers to replicate the study and build upon its findings.
The paper acknowledges limitations in its scope, including a focus on specific classification tasks and the exclusion of larger model variants due to computational constraints. Additionally, the auto-interpretation method's reliance on a captioning model trained primarily on music and sound data may lead to generic interpretations of speech-related features.
The findings have significant implications for audio processing applications, particularly in improving speech recognition systems and understanding human auditory processing. The techniques developed could be applied to various domains, including speech enhancement, emotion recognition, and environmental sound classification, potentially leading to advancements in human-computer interaction and accessibility technologies.
Transformer-based models have shown strong performance in speech deepfake detection, largely due to the effectiveness of the multi-head self-attention (MHSA) mechanism. MHSA provides frame-level attention scores, which are particularly valuable because deepfake artifacts often occur in small, localized regions along the temporal dimension of speech. This makes fine-grained frame modeling essential for accurately detecting subtle spoofing cues. In this work, we propose fine-grained frame modeling (FGFM) for MHSA-based speech deepfake detection, where the most informative frames are first selected through a multi-head voting (MHV) module. These selected frames are then refined via a cross-layer refinement (CLR) module to enhance the model's ability to learn subtle spoofing cues. Experimental results demonstrate that our method outperforms the baseline model and achieves Equal Error Rate (EER) of 0.90%, 1.88%, and 6.64% on the LA21, DF21, and ITW datasets, respectively. These consistent improvements across multiple benchmarks highlight the effectiveness of our fine-grained modeling for robust speech deepfake detection.
Primary: Hanoi University of Science and Technology
All Institutions: Hanoi University of Science and Technology, Nanyang Technological University
The paper presents a novel approach to speech deepfake detection through fine-grained frame modeling, significantly improving the ability to capture subtle artifacts. This work is a meaningful contribution to the field of audio processing and machine learning, addressing critical challenges in the detection of synthetic speech.
The proposed methodology introduces a novel fine-grained frame modeling (FGFM) approach that effectively enhances the multi-head self-attention (MHSA) mechanism for speech deepfake detection. The integration of the multi-head voting (MHV) module to select salient frames and the cross-layer refinement (CLR) module to aggregate information across layers is innovative. This dual approach addresses the limitations of conventional MHSA by focusing on localized artifacts, which are critical for detecting subtle spoofing cues. The methodology is well-structured and builds upon existing transformer architectures, demonstrating a clear understanding of the challenges in deepfake detection.
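The multi-head voting step can be pictured as each attention head nominating its highest-scoring frames, with the frames that collect the most nominations passed on for refinement. The sketch below assumes per-head frame scores are already available and fixes arbitrary vote and selection budgets.

```python
import torch

def multi_head_voting(attn: torch.Tensor, votes_per_head: int, num_selected: int):
    """Select the most informative frames by letting each attention head vote.

    attn: (num_heads, T) per-head frame attention scores.
    Each head nominates its top `votes_per_head` frames; the frames with the
    most nominations overall are kept for fine-grained refinement.
    """
    num_heads, T = attn.shape
    top = attn.topk(votes_per_head, dim=-1).indices        # (num_heads, votes_per_head)
    votes = torch.zeros(T)
    votes.scatter_add_(0, top.reshape(-1), torch.ones(num_heads * votes_per_head))
    return votes.topk(num_selected).indices                # indices of selected frames

attn = torch.rand(8, 200)              # 8 heads, 200 frames
selected = multi_head_voting(attn, votes_per_head=20, num_selected=16)
print(selected.shape)                  # torch.Size([16])
```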
The experimental evaluation is robust, utilizing multiple datasets (ASVspoof 2021 LA, DF, and ITW) to validate the effectiveness of the proposed method. The reported Equal Error Rates (EER) indicate significant improvements over baseline models, showcasing the method's effectiveness across diverse conditions. The inclusion of ablation studies further strengthens the evaluation, providing insights into the contributions of individual components of the proposed framework.
The paper provides sufficient detail regarding the experimental setup, including model configurations and training procedures, which supports reproducibility. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings. Future work should consider making the implementation accessible to enhance reproducibility.
While the proposed method shows promising results, it may still be sensitive to variations in the quality of the input audio, such as background noise or recording conditions. Additionally, the reliance on specific datasets may limit the generalizability of the findings to real-world applications. The paper could benefit from a discussion on how the model performs under such conditions.
The implications of this research are significant, particularly in the context of biometric security and misinformation. As deepfake technology becomes more sophisticated, effective detection methods are crucial for safeguarding against potential abuses in various sectors, including finance and communication. The proposed FGFM approach could contribute to the development of more reliable detection systems, thereby enhancing trust in voice-based interactions.
Spoken dialogues with and between voice agents are becoming increasingly common, yet assessing them for socially harmful content such as violence, harassment, and hate remains text-centric and fails to account for audio-specific cues and transcription errors. We present LALM-as-a-Judge, the first controlled benchmark and systematic study of large audio-language models (LALMs) as safety judges for multi-turn spoken dialogues. We generate 24,000 synthetic unsafe spoken dialogues in English, each consisting of 3-10 turns, in which a single dialogue turn contains content from one of 8 harmful categories (e.g., violence) at one of 5 severity grades, from very mild to severe. On 160 dialogues, 5 human raters confirmed reliable unsafe detection and a meaningful severity scale. We benchmark three open-source LALMs (Qwen2-Audio, Audio Flamingo 3, and MERaLiON) as zero-shot judges that output a scalar safety score in [0,1] across audio-only, transcription-only, or multimodal inputs, along with a transcription-only LLaMA baseline. We measure the judges' sensitivity to detecting unsafe content, the specificity in ordering severity levels, and the stability of the score across dialogue turns. Results reveal architecture- and modality-dependent trade-offs: the most sensitive judge is also the least stable across turns, while stable configurations sacrifice detection of mild harmful content. Transcription quality is a key bottleneck: Whisper-Large may significantly reduce sensitivity for transcription-only modes, while largely preserving severity ordering. Audio becomes crucial when paralinguistic cues or transcription fidelity are category-critical. We summarize all findings and provide actionable guidance for practitioners.
Primary: Technion--Israel Institute of Technology
All Institutions: Technion--Israel Institute of Technology, Carnegie Mellon University
The main contribution of this paper is the introduction of a controlled benchmark and systematic study of large audio-language models (LALMs) as automated safety judges for multi-turn spoken dialogues. This work addresses a critical gap in the evaluation of spoken dialogue systems, highlighting the importance of audio-specific cues and transcription fidelity in assessing socially harmful content. The comprehensive analysis of model performance across various configurations provides valuable insights for practitioners in the field.
The methodology presented in this paper is robust and innovative, focusing on the generation of unsafe spoken dialogues and the evaluation of large audio-language models (LALMs) as safety judges. The controlled generation of unsafe dialogue variants, along with the systematic benchmarking of LALMs across different modalities, is a significant contribution to the field. The use of human raters to validate the generated unsafe dialogues and the severity scale adds credibility to the findings. The paper also effectively addresses the challenges of audio-specific cues and transcription errors, which are often overlooked in text-centric assessments.
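The three reported properties map onto simple statistics over the judge's scalar scores: detection rate on unsafe dialogues, rank correlation between scores and annotated severity, and score variability across turns. The helper below is one plausible way to compute them; the exact definitions and thresholds used in the paper are assumptions here.

```python
import numpy as np
from scipy.stats import spearmanr

def judge_metrics(scores, labels, severities, turn_scores, threshold=0.5):
    """Summarize a judge's behaviour from its scalar safety scores in [0, 1].

    scores:      (N,) score per dialogue
    labels:      (N,) 1 if the dialogue contains unsafe content, else 0
    severities:  (N,) annotated severity grade (e.g. 1-5) for unsafe dialogues
    turn_scores: (N, T) scores recomputed after each dialogue turn
    """
    scores, labels, severities = map(np.asarray, (scores, labels, severities))
    turn_scores = np.asarray(turn_scores)
    unsafe = labels == 1
    sensitivity = float((scores[unsafe] >= threshold).mean())   # unsafe detection rate
    rho, _ = spearmanr(scores[unsafe], severities[unsafe])      # severity ordering
    stability = float(turn_scores.std(axis=1).mean())           # lower = more stable
    return {"sensitivity": sensitivity,
            "severity_spearman": float(rho),
            "turn_std": stability}

rng = np.random.default_rng(0)
labels = np.array([1, 1, 1, 1, 0, 0])
severities = np.array([1, 2, 4, 5, 0, 0])
scores = np.array([0.55, 0.6, 0.8, 0.9, 0.2, 0.1])
turn_scores = rng.uniform(0.0, 1.0, size=(6, 5))
print(judge_metrics(scores, labels, severities, turn_scores))
```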
The experimental evaluation is thorough, with a well-defined dataset of 24,000 dialogues and a clear methodology for assessing the performance of the LALMs. The results reveal important trade-offs between sensitivity, specificity, and stability across different models and modalities. The use of various prompting strategies to optimize performance further demonstrates a comprehensive approach to evaluating the models. However, the paper could benefit from more detailed statistical analysis and comparisons with existing benchmarks in the field.
The paper mentions plans to release the dataset and code, which is crucial for reproducibility. However, specific implementation details, such as the exact configurations used for the LALMs and the human raters' instructions, should be more explicitly stated to facilitate replication of the study. The inclusion of supplementary materials or appendices would enhance reproducibility.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world dialogues. Additionally, the potential for bias in the generated unsafe dialogues and the subjective nature of human ratings could impact the validity of the findings. The paper also acknowledges the risk of misuse of the benchmark data, which is an important ethical consideration.
The findings of this research have significant implications for the development of safer spoken dialogue systems and voice agents. By providing a systematic approach to evaluating harmful content in multi-turn dialogues, the work aims to improve the safety and reliability of voice interfaces. However, the potential for misuse of the generated data and the reliance on automated judges without human oversight could lead to unintended consequences in real-world applications.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at https://dongchaoyang.top/UniAudio2Demo/.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this paper is the introduction of UniAudio 2.0, a unified audio language model that leverages a novel tokenization strategy and specialized architecture to achieve strong performance in both understanding and generation tasks. This work represents a meaningful advancement in the field of audio language modeling, addressing key challenges and setting the stage for future research in audio processing and generation.
The paper proposes a novel audio tokenizer, ReasoningCodec, which effectively separates audio representations into reasoning and reconstruction tokens. This dual-token approach allows for higher-level abstractions while maintaining fidelity in audio reconstruction. The architecture's functional layer specialization is a significant methodological advancement, optimizing the processing of audio and text tokens across different transformer layers, which is a departure from the traditional uniform approach. The introduction of auditory sentences as a means to unify task construction is innovative and enhances the model's ability to handle complex audio tasks.
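The factorized-token idea implies a two-stage decoding loop: first generate the short reasoning stream, then generate the denser reconstruction stream conditioned on it. The toy sketch below shows only that control flow, with a stand-in language-model step and invented token budgets; it is not UniAudio 2.0's architecture.

```python
from dataclasses import dataclass
from typing import List

EOS = -1  # hypothetical end-of-stream token id

@dataclass
class FactorizedTokens:
    """Two discrete streams per utterance: a short, text-aligned 'reasoning'
    stream for analysis/planning and a denser 'reconstruction' stream that
    carries the acoustic detail needed to rebuild the waveform."""
    reasoning: List[int]
    reconstruction: List[int]

def hierarchical_generate(prompt_tokens, lm_step, max_reasoning=64, max_recon=512):
    """Two-stage autoregressive sketch: decode the reasoning plan first, then
    decode reconstruction tokens conditioned on prompt + plan.
    `lm_step(context) -> next_token_id` is a stand-in for the language model."""
    reasoning = []
    while len(reasoning) < max_reasoning:
        nxt = lm_step(prompt_tokens + reasoning)
        if nxt == EOS:
            break
        reasoning.append(nxt)
    reconstruction = []
    while len(reconstruction) < max_recon:
        nxt = lm_step(prompt_tokens + reasoning + reconstruction)
        if nxt == EOS:
            break
        reconstruction.append(nxt)
    return FactorizedTokens(reasoning, reconstruction)

# Toy stand-in model: emits pseudo-random token ids until the budgets are hit.
def toy_lm(context):
    return (len(context) * 7919) % 1024

tokens = hierarchical_generate(prompt_tokens=[1, 2, 3], lm_step=toy_lm)
print(len(tokens.reasoning), len(tokens.reconstruction))  # 64 512
```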
The authors conducted extensive experiments across various speech, sound, and music tasks, demonstrating competitive performance on in-domain evaluations. The model's ability to generalize to unseen tasks in few-shot and zero-shot settings is particularly noteworthy, showcasing its robustness and versatility. However, the paper could benefit from more detailed quantitative results and comparisons with state-of-the-art models to better contextualize its performance.
The authors commit to providing demo, code, and checkpoints, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics and hyperparameter settings that would facilitate full reproducibility by other researchers.
The paper acknowledges potential risks associated with misuse of the technology, such as impersonation and copyright issues. However, it does not delve deeply into the technical limitations of the model itself, such as potential biases in the training data or the scalability of the approach to more complex audio tasks.
The proposed model has significant implications for applications in creative assistance, human-computer interaction, and audio generation. However, the authors rightly caution against potential misuse, emphasizing the need for responsible deployment practices to mitigate risks associated with audio generation technologies.
We study two foundational problems in audio language models: (1) how to design an audio tokenizer that can serve as an intermediate representation for both understanding and generation; and (2) how to build an audio foundation model that generalizes in few-shot and zero-shot settings, analogous to large language models. To this end, we make the following two contributions. First, we propose ReasoningCodec, a discrete audio codec that factorizes audio into (i) reasoning tokens, which encode text-aligned, high-level analysis and planning representations for audio understanding and hierarchical generation, and (ii) reconstruction tokens, which encode semantic-rich acoustic cues for high-fidelity waveform reconstruction. This design achieves understanding performance comparable to strong continuous representations while improving generation quality and reconstruction fidelity over prior discrete tokenizers. Second, we introduce a unified autoregressive architecture for text and audio, together with multi-stage training and multi-task data construction. Using this framework, we train UniAudio 2.0 on 100B text tokens and 60B audio tokens. Across a wide range of speech, sound, and music tasks, UniAudio 2.0 performs competitively on in-domain evaluations and demonstrates strong few-shot and zero-shot generalization to unseen tasks. Demo, code, and checkpoints will be available at \href{https://dongchaoyang.top/UniAudio2Demo/}{https://dongchaoyang.top/UniAudio2Demo/}.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Independent Researcher
The main contribution of this work is the development of UniAudio 2.0, a unified audio language model that effectively integrates understanding and generation tasks through innovative tokenization and architecture strategies. This paper represents a meaningful advancement in the field of audio language models, addressing key challenges and providing a robust framework for future research and applications.
The paper introduces a novel audio tokenizer, ReasoningCodec, which effectively separates audio into reasoning and reconstruction tokens, addressing the limitations of existing discrete tokenizers. The proposed unified autoregressive architecture with functional layer specialization enhances the model's ability to process both audio and text, allowing for improved understanding and generation. The introduction of auditory sentences as a method for multi-task training is particularly innovative, as it facilitates the integration of diverse audio tasks without the need for extensive manual task design.
The authors report extensive experiments on a large dataset comprising 100B text tokens and 60B audio tokens, demonstrating competitive performance on various tasks. The few-shot and zero-shot generalization capabilities are particularly noteworthy, indicating the model's robustness and versatility across different audio-related tasks. However, specific metrics and comparisons with baseline models could be more thoroughly detailed to strengthen the claims of performance.
The paper mentions that demo, code, and checkpoints will be made available, which is a positive aspect for reproducibility. However, the absence of a detailed description of the experimental setup, hyperparameters, and model training procedures limits the ease with which others can replicate the results.
The paper acknowledges potential risks associated with audio generation, such as misuse and copyright issues, but it could benefit from a more in-depth discussion of the limitations of the proposed model itself, including any biases in the training data or challenges in generalizing to highly diverse audio tasks.
The implications of this research are significant, as it opens avenues for advanced applications in creative assistance, human-computer interaction, and audio content generation. However, the authors rightly highlight the ethical considerations and potential for misuse, which need to be addressed as the technology develops.
Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer from severe performance degradation when confronted with domain shifts, particularly in the presence of unseen noise and channel distortions. To address this, we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture that consists of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, facilitating the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To further enhance generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
Primary: National Taiwan University
All Institutions: National Taiwan University
The main contribution of this paper is the introduction of URSA-GAN, a unified framework for robust speech adaptation that effectively addresses domain mismatches in ASR and SE through innovative use of dual-embedding architectures and GANs. This work significantly advances the state of the art in speech processing, providing a scalable solution for real-world applications.
The proposed URSA-GAN framework presents a novel approach to address the challenges of domain adaptation in ASR and SE by leveraging a dual-embedding architecture that captures noise and channel characteristics. This method is innovative in its use of generative adversarial networks (GANs) combined with dynamic stochastic perturbation for enhanced robustness. The architecture is well-structured, with a clear delineation of roles for the noise encoder, channel encoder, and generator, which collectively facilitate effective domain adaptation. The introduction of instance-level embeddings and the use of feature-wise linear modulation (FiLM) for conditioning the generator on noise and channel characteristics are particularly noteworthy. However, the complexity of the model may pose challenges in practical applications.
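The two mechanisms singled out by the review, FiLM conditioning on domain embeddings and dynamic stochastic perturbation, can be sketched compactly. The snippet below is an illustrative assumption of how such conditioning might be wired, not the paper's code; perturbation is modelled simply as scaled Gaussian noise added to the noise and channel embeddings at generation time, and all dimensions are made up for the example.

    # Illustrative sketch only: FiLM conditioning plus embedding perturbation.
    import torch
    import torch.nn as nn

    class FiLMLayer(nn.Module):
        """Feature-wise linear modulation: per-channel scale and shift
        predicted from a conditioning vector."""
        def __init__(self, feat_dim: int, cond_dim: int):
            super().__init__()
            self.to_gamma = nn.Linear(cond_dim, feat_dim)
            self.to_beta = nn.Linear(cond_dim, feat_dim)

        def forward(self, feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
            # feats: (B, T, feat_dim), cond: (B, cond_dim)
            gamma = self.to_gamma(cond).unsqueeze(1)   # (B, 1, feat_dim)
            beta = self.to_beta(cond).unsqueeze(1)
            return gamma * feats + beta

    def perturb(embedding: torch.Tensor, sigma: float = 0.1) -> torch.Tensor:
        """Dynamic stochastic perturbation (assumed form): jitter the domain
        embedding with Gaussian noise to encourage robustness to unseen domains."""
        return embedding + sigma * torch.randn_like(embedding)

    # Toy usage: condition generator features on perturbed domain embeddings.
    noise_emb, channel_emb = torch.randn(4, 128), torch.randn(4, 128)
    cond = torch.cat([perturb(noise_emb), perturb(channel_emb)], dim=-1)  # (4, 256)
    film = FiLMLayer(feat_dim=512, cond_dim=256)
    speech_feats = torch.randn(4, 100, 512)   # e.g. 100 frames of generator features
    modulated = film(speech_feats, cond)
    print(modulated.shape)                    # torch.Size([4, 100, 512])

Because FiLM only predicts a per-channel scale and shift from the conditioning vector, the generator backbone itself stays domain-agnostic, which is what keeps the dual-embedding design lightweight to adapt with limited in-domain data.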
The experiments conducted are extensive and cover a variety of datasets and scenarios, demonstrating the effectiveness of URSA-GAN in improving ASR and SE performance under mismatched conditions. The results show significant improvements in character error rates and perceptual metrics, validating the framework's robustness. The evaluation metrics used are appropriate, and the comparative analysis against baseline models and previous works strengthens the claims made by the authors. However, the paper could benefit from more detailed ablation studies to further clarify the contributions of individual components.
The paper provides a comprehensive description of the methodology, including the architecture, training process, and evaluation metrics, which facilitates reproducibility. However, the lack of a publicly available code repository or demo limits the ability of other researchers to replicate the experiments fully. Clearer documentation of hyperparameters and training configurations would enhance reproducibility.
One limitation is the reliance on pre-trained models for the noise and channel encoders, which may not generalize well to all domains. Additionally, the model's complexity could hinder its deployment in real-time applications, especially on resource-constrained devices. The performance gap between URSA-GAN and models trained on labeled target-domain data suggests that while the framework is effective, it may still require some labeled data for optimal performance.
The proposed framework has significant implications for real-world applications of ASR and SE, particularly in environments with varying noise and channel conditions. By improving the robustness of these systems, URSA-GAN could enhance user experiences in various domains, including telecommunications, voice assistants, and hearing aids. The approach also opens avenues for further research in domain adaptation techniques across different audio processing tasks.
We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of PFluxTTS, a hybrid TTS system that effectively combines duration-guided and alignment-free models to improve naturalness and stability in speech synthesis. This work represents a meaningful step forward in addressing key challenges in the field of text-to-speech technology, particularly in cross-lingual applications.
The proposed methodology of PFluxTTS is innovative, combining a dual-decoder architecture that integrates both duration-guided and alignment-free models through inference-time vector-field fusion. This hybrid approach effectively addresses the stability-naturalness trade-off prevalent in existing TTS systems. The use of FLUX-based speech-prompt embeddings for robust cross-lingual voice cloning is a significant advancement, allowing the model to maintain speaker identity across languages without relying on prompt transcripts. Additionally, the integration of a modified PeriodWave vocoder with super-resolution capabilities to synthesize high-quality audio at 48 kHz from low-rate mel features is a noteworthy enhancement.
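Inference-time vector-field fusion can be illustrated with a short sketch: at every step of the flow-matching ODE solve, the velocity predicted by the duration-guided decoder is blended with the velocity from the alignment-free decoder before the state is updated. Everything below (function names, the fixed fusion weight, the Euler solver, and the toy velocity fields) is an assumption for illustration, not PFluxTTS's implementation.

    # Illustrative sketch only: blend two velocity fields during ODE sampling.
    import torch

    def fused_euler_sampler(v_duration, v_alignment_free, x0, cond, steps=32, alpha=0.5):
        """Integrate dx/dt = alpha * v_dur(x, t) + (1 - alpha) * v_free(x, t)
        from t=0 to t=1 with a simple Euler scheme."""
        x = x0
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((x.size(0),), i * dt)
            v = alpha * v_duration(x, t, cond) + (1 - alpha) * v_alignment_free(x, t, cond)
            x = x + dt * v          # Euler step from t to t + dt
        return x

    # Toy velocity fields standing in for the two decoders.
    def v_dur(x, t, cond):  return cond - x            # pulls towards the conditioning
    def v_free(x, t, cond): return torch.tanh(cond - x)

    x0 = torch.randn(2, 100, 80)      # noise in mel space: (batch, frames, mels)
    cond = torch.zeros(2, 100, 80)    # stand-in for text/speaker conditioning
    mel = fused_euler_sampler(v_dur, v_free, x0, cond)
    print(mel.shape)                  # torch.Size([2, 100, 80])

In this framing, alpha trades off the stability of the duration-guided path against the flexibility of the alignment-free path, which is exactly the trade-off the dual-decoder design is meant to soften.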
The experimental evaluation is comprehensive, utilizing a variety of datasets that reflect real-world challenges in TTS, particularly in cross-lingual scenarios. The authors provide both subjective and objective metrics to assess performance, demonstrating that PFluxTTS outperforms several state-of-the-art systems in terms of naturalness and speaker similarity. The use of statistical significance tests to validate the results adds rigor to the findings. However, the reliance on a limited number of baselines may restrict the generalizability of the conclusions.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to fully replicate the results. The authors could improve reproducibility by providing access to their training data and model checkpoints.
One limitation of the study is the potential overfitting to the specific datasets used for training and evaluation, which may not represent the full diversity of real-world speech. Additionally, while the system shows robustness in challenging conditions, the performance on extremely noisy or low-quality inputs is not thoroughly explored. The authors also note that the model's performance may vary with different languages, which could limit its applicability in multilingual contexts.
The advancements presented in PFluxTTS have significant implications for applications in AI dubbing, virtual assistants, and accessibility technologies. By improving cross-lingual voice cloning and audio quality, the system can enhance user experience in multilingual environments, making technology more inclusive. Furthermore, the research contributes to the ongoing development of high-fidelity TTS systems, which can benefit various industries, including entertainment, education, and customer service.