The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wav2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
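To make the dual-stream design concrete, below is a minimal sketch of how two frozen encoder outputs (stand-ins for MERT and Wav2Vec2 features) could be fused by cross-attention and trained with a combined BCE-plus-triplet objective. The projection sizes, the use of multi-head attention as the cross-aggregation module, and the loss weighting are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamDetector(nn.Module):
    """Minimal sketch of a CLAM-style dual-stream detector.

    Two pre-trained encoders (stand-ins for MERT and Wav2Vec2) are assumed
    to produce frame-level features; a cross-attention block stands in for
    the learnable cross-aggregation module. Sizes are illustrative.
    """

    def __init__(self, dim_a=768, dim_b=768, dim=256):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, dim)   # e.g. MERT features
        self.proj_b = nn.Linear(dim_b, dim)   # e.g. Wav2Vec2 features
        self.cross = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 1)

    def forward(self, feats_a, feats_b):
        a = self.proj_a(feats_a)              # (B, T, dim)
        b = self.proj_b(feats_b)              # (B, T, dim)
        fused, _ = self.cross(a, b, b)        # stream A attends to stream B
        emb = torch.cat([fused.mean(1), b.mean(1)], dim=-1)
        return self.classifier(emb).squeeze(-1), emb

def dual_loss(logits, labels, anchor, positive, negative, margin=0.3, alpha=0.5):
    """BCE for real/fake classification plus a triplet term that pulls
    coherent stream pairings closer than artificially mismatched ones."""
    bce = F.binary_cross_entropy_with_logits(logits, labels.float())
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return bce + alpha * triplet

model = DualStreamDetector()
logits, emb = model(torch.randn(2, 50, 768), torch.randn(2, 50, 768))
```

In such a setup, the triplet term would take its anchor and positive from coherent vocal/instrumental pairings and its negative from an artificially mismatched pairing, mirroring the contrastive objective described in the abstract.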
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, Manipal University Jaipur, Netaji Subhas University of Technology
The main contribution of this paper is the introduction of a robust framework for detecting AI-generated music through a novel dual-stream architecture and a comprehensive benchmark dataset. This work significantly advances the field of audio forensics by addressing critical challenges in generalization and robustness against evolving generative models.
The paper introduces a novel dual-stream architecture, CLAM, which leverages two distinct pre-trained audio encoders to capture the nuances of vocal and instrumental elements in music. The methodology is well-structured, focusing on the contrastive learning approach to enhance the model's sensitivity to synthetic artifacts. The introduction of the Melody or Machine (MoM) benchmark is a significant advancement, addressing the limitations of existing datasets by providing a more diverse and challenging evaluation framework. The dual-loss objective, combining binary cross-entropy with a contrastive triplet loss, is a thoughtful design choice that enhances the model's robustness against out-of-distribution samples.
The experiments are comprehensive, demonstrating the efficacy of the proposed model against existing state-of-the-art methods. The results on the MoM benchmark, where the model achieves an F1 score of 0.925 and significantly outperforms previous models, underscore the technical impact of the research. The ablation studies provide a solid foundation for understanding the contributions of various components of the model, validating the architectural choices made.
The paper provides sufficient implementation details, including the training setup, model architecture, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly accessible code repository or demo URL limits the ease with which other researchers can replicate the results.
The primary limitation noted is the rapid pace of innovation in AI music generation, which may render the proposed detection methods obsolete as new models emerge. Additionally, the dataset's predominant focus on English songs may limit its applicability across diverse linguistic and cultural contexts.
The research has significant implications for the music industry, particularly in protecting intellectual property rights and maintaining artistic authenticity in the face of advancing AI technologies. The MoM dataset and CLAM model can aid content platforms and rights holders in identifying synthetic music, fostering trust in music distribution. However, there are ethical considerations regarding the potential misuse of the dataset for training more effective generative models.
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.
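As a point of reference for the Harmonic-Percussive decomposition that guides the codec's disentanglement, the sketch below uses librosa's median-filtering HPSS to split a toy signal into harmonic and percussive components. The specific separation algorithm and the synthetic test signal are assumptions for illustration, not details taken from the paper.

```python
import librosa
import numpy as np

# Toy signal: a steady tone (harmonic) with sparse clicks (percussive).
sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t)
y[::4000] += 0.8

# Median-filtering HPSS on the STFT: harmonic content is smooth along time,
# percussive content is smooth along frequency.
stft = librosa.stft(y)
stft_harm, stft_perc = librosa.decompose.hpss(stft)

y_harm = librosa.istft(stft_harm, length=len(y))
y_perc = librosa.istft(stft_perc, length=len(y))

# With default settings the soft masks sum to one, so the split is (nearly) additive;
# each component could then be tokenized by a separate codec branch.
print(np.allclose(y, y_harm + y_perc, atol=1e-3))
```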
Primary: Institut Polytechnique de Paris
All Institutions: Institut Polytechnique de Paris
The paper introduces a novel approach to bandwidth extension using a Harmonic-Percussive disentangled neural audio codec, demonstrating significant improvements in high-frequency reconstruction through a well-integrated transformer-based language model. This work not only advances the state of the art in audio processing but also opens avenues for further research in audio representation learning and codec design.
The paper presents a novel approach to bandwidth extension by introducing a Harmonic-Percussive disentangled neural audio codec (HP-codec) that separates high and low-frequency components and utilizes a transformer-based language model for token prediction. This dual-architecture design is innovative as it integrates codec structure directly into the generative modeling process, allowing for improved high-frequency reconstruction. The methodology is well-structured, leveraging existing techniques in audio processing while introducing significant enhancements in representation learning and model coupling.
The experimental setup is robust, utilizing multiple datasets including MUSDB18 and JAMENDO for training and testing. The authors compare their model against established baselines (Apollo and AudioSR), providing both objective metrics and subjective evaluations through MUSHRA tests. The results indicate that HP-codecX outperforms these baselines in reconstructing high-frequency content, demonstrating the effectiveness of the proposed approach. The comprehensive evaluation across different datasets adds credibility to the findings.
The authors emphasize reproducibility by detailing their experimental setup, training procedures, and the datasets used. They plan to release their implementation upon acceptance, which is a positive step towards ensuring that other researchers can replicate their results. However, the paper could benefit from providing more specific information about hyperparameters and training conditions.
The paper acknowledges several limitations, including the constraint of fixed sampling rates and the architectural coupling between the codec and language model. The reliance on a specific input-output mapping (16 kHz to 48 kHz) may limit the model's applicability in broader contexts. Additionally, the potential for artifacts in high-frequency reconstructions is noted, which could affect perceptual quality despite favorable listening test results.
The advancements in bandwidth extension have significant implications for audio processing applications, including telecommunications, music restoration, and speech enhancement. The proposed model's ability to improve high-frequency reconstruction could enhance user experiences in various audio-related technologies, making it a valuable contribution to the field.
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
Primary: Tencent Technology Co.Ltd
All Institutions: Tencent Technology Co.Ltd
The paper presents a novel multi-reward GRPO framework that significantly enhances the performance of single-codebook TTS LLMs by addressing key challenges in prosody and speaker similarity. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field of TTS synthesis, with the potential for broad applications in human-computer interaction.
The paper introduces a multi-reward Group Relative Policy Optimization (GRPO) framework that enhances the token generation policy of single-codebook TTS LLMs. The integration of multiple rule-based rewards (length penalty, entropy regularization, and prosody alignment) is a novel approach that addresses common issues in TTS systems, such as prosody instability and speaker drift. The use of an external reasoning LLM to predict pause structures for prosody alignment is particularly innovative, leveraging in-context learning to provide a human-preference-aligned supervisory signal. The methodology is well-structured, with clear definitions of the reward functions and their intended impacts on the model's performance.
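The following sketch illustrates, under simplifying assumptions, how the rule-based rewards described above might be combined and turned into group-relative advantages. The reward weights, the exact reward formulas, and the group size are illustrative choices, not the paper's specification.

```python
import numpy as np

def combined_reward(cer, spk_sim, n_tokens, ref_tokens, token_entropy, prosody_match,
                    w=(1.0, 1.0, 0.5, 0.2, 1.0)):
    """Weighted sum of rewards for one sampled utterance (weights are assumptions)."""
    r_intell = -cer                                    # lower character error rate is better
    r_sim = spk_sim                                    # similarity to the reference speaker
    r_len = -abs(n_tokens - ref_tokens) / ref_tokens   # length penalty for duration consistency
    r_ent = -token_entropy                             # entropy regularization for decoding stability
    r_pros = prosody_match                             # agreement with LLM-annotated pause structure
    return np.dot(w, [r_intell, r_sim, r_len, r_ent, r_pros])

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize rewards within a group of samples
    drawn for the same prompt, so no learned value function is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: a group of 4 candidate generations for one text prompt.
group = [combined_reward(0.05, 0.82, 210, 200, 2.1, 0.9),
         combined_reward(0.12, 0.78, 260, 200, 2.6, 0.4),
         combined_reward(0.03, 0.85, 195, 200, 2.0, 0.8),
         combined_reward(0.20, 0.70, 300, 200, 3.0, 0.2)]
print(group_relative_advantages(group))
```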
The experiments are comprehensive, utilizing a large bilingual corpus and various evaluation metrics (CER, SIM, MOS) to assess the effectiveness of the proposed framework. The results demonstrate significant improvements in prosodic stability, speaker similarity, and naturalness compared to existing models. The scalability analysis across different model sizes and data scales adds depth to the evaluation, showing that the proposed method is effective across a range of conditions. The ablation study further validates the contribution of each reward component, providing insights into their individual impacts on performance.
The paper provides detailed implementation details, including the architecture, training configurations, and data sources. However, the absence of a public code repository or demo URL limits the reproducibility of the results. While the methodology is well-explained, the lack of accessible resources may hinder other researchers from replicating the study.
One limitation of the study is the reliance on a specific reasoning LLM for prosody alignment, which may not generalize across all languages or dialects. Additionally, while the results are promising, the paper does not address potential computational costs associated with the proposed GRPO framework, particularly in terms of training time and resource requirements. The evaluation is primarily focused on objective metrics, and further subjective assessments could strengthen the findings.
The proposed framework has significant implications for the field of TTS synthesis, particularly in enhancing the naturalness and expressivity of synthesized speech. Improved prosody and speaker similarity can lead to more engaging and human-like interactions in applications such as virtual assistants, audiobooks, and language learning tools. The integration of reinforcement learning in TTS systems could pave the way for more adaptive and context-aware speech synthesis technologies.
Advanced deep learning architectures, particularly recurrent neural networks (RNNs), have been widely applied in audio, bioacoustic, and biomedical signal analysis, especially in data-scarce environments. While gated RNNs remain effective, they can be relatively over-parameterised and less training-efficient in some regimes, while linear RNNs tend to fall short in capturing the complexity inherent in bio-signals. To address these challenges, we propose the Parallel Delayed Memory Unit (PDMU), a delay-gated state-space module for short-term temporal credit assignment targeting audio and bioacoustic signals, which enhances short-term temporal state interactions and memory efficiency via a gated delay-line mechanism. Unlike previous Delayed Memory Units (DMU) that embed temporal dynamics into the delay-line architecture, the PDMU further compresses temporal information into vector representations using Legendre Memory Units (LMU). This design serves as a form of causal attention, allowing the model to dynamically adjust its reliance on past states and improve real-time learning performance. Notably, in low-information scenarios, the gating mechanism behaves similarly to skip connections by bypassing state decay and preserving early representations, thereby facilitating long-term memory retention. The PDMU is modular, supporting parallel training and sequential inference, and can be easily integrated into existing linear RNN frameworks. Furthermore, we introduce bidirectional, efficient, and spiking variants of the architecture, each offering additional gains in performance or energy efficiency. Experimental results on diverse audio and biomedical benchmarks demonstrate that the PDMU significantly enhances both memory capacity and overall model performance.
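The sketch below conveys the gist of a delay-gated memory cell: a fixed-length delay buffer plus a learned gate that can pass the delayed state through unchanged, mimicking the skip-connection-like behavior described above. It is a deliberate simplification; the actual PDMU compresses the buffered history with Legendre Memory Unit projections and supports parallel training, both of which are omitted here.

```python
import torch
import torch.nn as nn

class GatedDelayLine(nn.Module):
    """Simplified sketch of a delay-gated memory cell in the spirit of the PDMU.

    A fixed-length buffer holds the last `delay` hidden states; a learned gate
    decides how much of the delayed state to mix back in, which lets the cell
    bypass state decay (skip-connection-like) when the current input carries
    little information. The LMU compression of the buffer is omitted.
    """

    def __init__(self, dim, delay=8):
        super().__init__()
        self.delay = delay
        self.inp = nn.Linear(dim, dim)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                                    # x: (B, T, dim)
        B, T, D = x.shape
        buffer = torch.zeros(B, self.delay, D, device=x.device)
        h = torch.zeros(B, D, device=x.device)
        outputs = []
        for t in range(T):
            delayed = buffer[:, 0]                           # state from `delay` steps ago
            g = torch.sigmoid(self.gate(torch.cat([x[:, t], delayed], dim=-1)))
            h = g * delayed + (1 - g) * torch.tanh(self.inp(x[:, t]) + h)
            buffer = torch.cat([buffer[:, 1:], h.unsqueeze(1)], dim=1)
            outputs.append(h)
        return torch.stack(outputs, dim=1)                   # (B, T, dim)

cell = GatedDelayLine(dim=32)
print(cell(torch.randn(4, 20, 32)).shape)                    # torch.Size([4, 20, 32])
```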
Primary: Ghent University
All Institutions: Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Ghent University, Department of Electrical and Electronic Engineering
The main contribution of this paper is the introduction of the Parallel Delayed Memory Unit (PDMU), which enhances temporal modeling in audio and biomedical signal analysis through a novel delay-gated architecture. This work represents a significant advancement in the efficiency and effectiveness of RNNs for processing complex temporal data, with potential applications in real-time healthcare solutions and audio processing technologies.
The proposed Parallel Delayed Memory Unit (PDMU) introduces a novel architecture that effectively combines delay-gated mechanisms with Legendre Memory Units to enhance temporal modeling in audio and biomedical signal processing. The methodology is well-structured, leveraging existing frameworks while innovatively addressing the limitations of traditional RNNs and linear models. The introduction of various PDMU variants (bi-directional, efficient, and spiking) demonstrates a comprehensive approach to optimizing performance and energy efficiency, which is particularly relevant for real-time applications.
The experimental evaluation is robust, utilizing a diverse set of benchmarks across audio and biomedical domains. The results demonstrate significant performance improvements over existing models, particularly in low-information scenarios, which is a critical aspect of real-world applications. The ablation studies further validate the contributions of individual components of the PDMU, providing clear evidence of its effectiveness.
The paper includes sufficient implementation details, such as the use of the PyTorch library and specific training configurations, which enhance reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results.
While the PDMU shows promise, the paper does not extensively discuss potential limitations, such as the scalability of the model to larger datasets or its performance in highly variable real-world conditions. Additionally, the reliance on specific datasets may limit generalizability.
The PDMU has significant implications for fields requiring efficient processing of temporal data, particularly in healthcare and audio signal analysis. Its ability to enhance real-time learning and memory retention could lead to advancements in medical diagnostics and monitoring technologies.
Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ), and Finite Scalar Quantization (FSQ). However, these quantization techniques constrain the geometric structure of the latent space and make it harder to capture correlations between features, leading to inefficiencies in representation learning, codebook utilization, and token rate. In this paper we introduce Two Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids such as hexagonal, rhombic, or rectangular tilings and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, achieving low token rates and high codebook utilization while maintaining state-of-the-art reconstruction quality. Specifically, Q2D2 achieves competitive or superior performance across various objective and subjective reconstruction metrics in extensive experiments in the speech domain, compared to state-of-the-art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Q2D2, a geometry-aware quantization scheme that enhances audio codec performance by effectively capturing feature correlations through structured two-dimensional grids. This innovative approach not only improves reconstruction quality but also maintains high codebook utilization, positioning Q2D2 as a promising alternative to traditional quantization methods in audio processing.
The proposed Q2D2 quantization method introduces a novel approach to audio codec design by utilizing two-dimensional geometric structures for quantization. This method addresses limitations in existing quantization techniques, such as RVQ and FSQ, by capturing correlations between features more effectively. The methodology is well-structured, with clear explanations of the geometric tiling strategies and their implications for audio representation. The use of lightweight linear projections and Straight-Through Estimators (STE) enhances the differentiability and stability of the quantization process, making it suitable for end-to-end training.
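A minimal sketch of the pairwise grid quantization with a straight-through estimator is given below for the rectangular tiling. The grid range, the number of levels, and the token-indexing scheme are assumptions; a hexagonal or rhombic tiling would replace the per-axis rounding with a nearest-lattice-point search.

```python
import torch

def q2d2_rectangular(z, levels=16, lo=-1.0, hi=1.0):
    """Quantize pairs of features onto a levels x levels rectangular grid.

    z: (..., 2*k) tensor; consecutive feature pairs are snapped to the
    nearest grid point, giving an implicit codebook of levels**2 entries
    per pair. A straight-through estimator keeps the op differentiable.
    Ranges and level counts are illustrative assumptions.
    """
    pairs = z.reshape(*z.shape[:-1], -1, 2).clamp(lo, hi)
    step = (hi - lo) / (levels - 1)
    grid = torch.round((pairs - lo) / step) * step + lo        # nearest grid point per axis
    q = pairs + (grid - pairs).detach()                        # straight-through estimator
    codes = torch.round((grid - lo) / step)                    # integer grid coordinates
    token_ids = (codes[..., 0] * levels + codes[..., 1]).long()  # one token per feature pair
    return q.reshape(z.shape), token_ids

z = torch.randn(4, 8, 64)                  # (batch, time, feature) toy latents
zq, ids = q2d2_rectangular(z)
print(zq.shape, ids.shape, int(ids.max()))  # token ids live in [0, levels**2)
```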
The experimental evaluation is comprehensive, involving extensive datasets and multiple state-of-the-art (SOTA) models for comparison. The results demonstrate that Q2D2 achieves competitive or superior performance in reconstruction quality across various metrics, including UTMOS, PESQ, and STOI. The paper includes ablation studies that effectively highlight the impact of design choices, such as grid type and quantization levels, on performance. The thoroughness of the experiments lends credibility to the claims made regarding the advantages of Q2D2.
The paper provides detailed implementation and experimental setup information, which is crucial for reproducibility. However, the absence of a specific project or code repository limits the ability for others to fully replicate the results. The authors mention using the WavTokenizer framework, which is a positive aspect as it allows for some level of reproducibility if the framework is accessible.
One limitation is the lack of a clear primary institution and the absence of a demo or project URL, which could enhance the visibility and accessibility of the research. Additionally, while the paper focuses on speech reconstruction, the generalizability of Q2D2 to other audio domains remains to be explored in future work.
The introduction of Q2D2 has the potential to significantly impact the field of audio processing, particularly in applications requiring efficient audio compression without sacrificing quality. Its implications extend to areas such as speech synthesis, music generation, and multimodal systems that integrate audio with other modalities.
Voice communication in bandwidth-constrained environments (maritime, satellite, and tactical networks) remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (approximately 70 bps), sparse prosody transmission via TTS interpolation (less than 14 bps at 0.1-1 Hz), and amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS > 4.26). We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities, guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
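A quick back-of-the-envelope check of the bitrate budget and the quoted reduction factors, using only the figures from the abstract; the exact prosody rate and the one-bit-per-second allowance for the amortized speaker embedding are rough assumptions.

```python
# Rough bitrate budget implied by the abstract (all figures approximate).
text_bps = 70          # context-aware text encoding
prosody_bps = 10       # sparse prosody updates (< 14 bps at 0.1-1 Hz; 10 bps assumed here)
speaker_bps = 1        # speaker embedding sent once and amortized (assumption)

total_bps = text_bps + prosody_bps + speaker_bps
print(f"STCTS total: ~{total_bps} bps")                  # ~81 bps, i.e. "approximately 80 bps"

# Reduction factors versus conventional codecs at their quoted operating points.
print(f"vs Opus    (6 kbps): {6000 / 80:.0f}x")          # ~75x
print(f"vs EnCodec (1 kbps): {1000 / 80:.1f}x")          # ~12.5x, reported as 12x
```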
Primary: Fudan University
All Institutions: Fudan University, Tsinghua University
The main contribution of this paper is the STCTS framework, which achieves ultra-low bitrate speech communication through a novel approach of explicitly decomposing speech into linguistic, prosodic, and timbral components, significantly enhancing the efficiency and quality of voice transmission in constrained environments. This work represents a meaningful advancement in the field of audio processing and communication technologies, with potential applications in various critical domains.
The paper introduces STCTS, a novel framework for ultra-low bitrate speech communication, which decomposes speech into three components: linguistic content, prosody, and timbre. This explicit decomposition allows for tailored compression strategies that significantly reduce bandwidth usage while maintaining perceptual quality. The methodology is well-structured, leveraging existing technologies (STT, TTS) and introducing innovative strategies for prosody transmission and speaker embedding. The use of context-aware text encoding and sparse prosody transmission is particularly noteworthy, as it showcases a deep understanding of the temporal dynamics of speech components.
The experimental evaluation is robust, utilizing the LibriSpeech dataset to benchmark STCTS against established codecs like Opus and EnCodec. The reported results demonstrate a significant bitrate reduction while achieving high perceptual quality (NISQA MOS > 4.26). The discovery of a bimodal quality distribution concerning prosody sampling rates provides valuable insights for future configurations. However, the paper could benefit from more extensive user studies to assess real-world performance in diverse communication scenarios.
The authors provide an open-source implementation of their system, which is a strong point for reproducibility. The detailed description of the system architecture, configuration options, and the availability of the source code facilitate replication of the experiments. However, the paper lacks comprehensive benchmarking infrastructure details that could aid other researchers in reproducing the results precisely.
One limitation is the reliance on specific datasets for evaluation, which may not fully capture the variability of real-world speech communication in bandwidth-constrained environments. Additionally, the paper does not address potential challenges in adapting the system to different languages or dialects, which could affect its generalizability. The performance under extreme network conditions or with varying speaker characteristics also requires further exploration.
The STCTS framework has significant implications for voice communication in bandwidth-constrained environments, such as maritime, satellite, and tactical networks. By enabling natural and expressive communication at ultra-low bitrates, it addresses critical needs in various fields, including emergency response, remote work, and IoT applications. The modular architecture also supports future advancements in speech technology, making it a versatile tool for diverse applications.
Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for automatic lung-disease detection using respiratory audio signals. The proposed system integrates two complementary representations: a spectral-temporal encoder based on a CNN-BiLSTM Attention architecture, and a handcrafted acoustic-feature encoder capturing physiologically meaningful descriptors such as MFCCs, spectral centroid, spectral bandwidth, and zero-crossing rate. These branches are combined through late-stage fusion to leverage both data-driven learning and domain-informed acoustic cues. The model is trained and evaluated on the Asthma Detection Dataset Version 2 using rigorous preprocessing, including resampling, normalization, noise filtering, data augmentation, and patient-level stratified partitioning. The model achieves strong generalization with 91.21% accuracy, a 0.899 macro F1-score, and a 0.9866 macro ROC-AUC, outperforming all ablated variants. An ablation study confirms the importance of temporal modeling, attention mechanisms, and multimodal fusion. The framework incorporates Grad-CAM, Integrated Gradients, and SHAP, generating interpretable spectral, temporal, and feature-level explanations aligned with known acoustic biomarkers to build clinical transparency. The findings demonstrate the framework's potential for telemedicine, point-of-care diagnostics, and real-world respiratory screening.
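For illustration, the handcrafted-feature branch could be computed with standard librosa descriptors as sketched below; summarizing each descriptor by its mean and standard deviation, and the MFCC count of 13, are assumptions rather than the paper's stated configuration.

```python
import librosa
import numpy as np

def handcrafted_features(y, sr, n_mfcc=13):
    """Frame-level descriptors (MFCCs, spectral centroid, spectral bandwidth,
    zero-crossing rate) summarized by mean and standard deviation; the summary
    statistics and MFCC count are illustrative assumptions."""
    feats = np.vstack([
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
    ])
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])

# Toy one-second signal standing in for a respiratory recording.
y = np.random.default_rng(0).standard_normal(16000).astype(np.float32)
print(handcrafted_features(y, 16000).shape)   # (32,) = 16 descriptors x {mean, std}
```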
Primary: Albukhary International University
All Institutions: Albukhary International University
This study presents a novel explainable multimodal deep learning framework for automatic lung disease detection from respiratory audio signals, addressing critical challenges in traditional auscultation methods. The integration of deep learning with handcrafted features and explainable AI techniques represents a significant advancement in the field, with the potential to improve clinical outcomes through enhanced diagnostic accuracy and transparency.
The proposed methodology integrates a hybrid deep learning architecture combining CNN, BiLSTM, and attention mechanisms with handcrafted acoustic features, which is innovative in the context of respiratory sound analysis. The late-stage fusion approach effectively leverages both data-driven and domain-informed representations, enhancing the model's robustness and interpretability. The incorporation of explainable AI techniques such as Grad-CAM, Integrated Gradients, and SHAP adds significant value by providing clinical transparency, which is often lacking in deep learning applications in healthcare.
The experiments are well-structured, utilizing a publicly available dataset with a comprehensive evaluation strategy that includes accuracy, F1-score, and ROC-AUC metrics. The ablation study is particularly noteworthy, as it rigorously tests the contributions of various components of the model, confirming the importance of multimodal fusion and attention mechanisms. The reported results demonstrate strong generalization capabilities across different respiratory conditions, indicating the model's practical applicability.
The paper outlines a clear training strategy, including hyperparameter settings, data preprocessing steps, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ability for others to replicate the study fully.
While the study presents a robust framework, it relies on a single dataset, which may limit the generalizability of the findings. The model's performance on underrepresented classes, such as Bronchial sounds, suggests that further refinement may be necessary to improve classification accuracy across all categories. Additionally, the lack of a demo or project URL restricts practical engagement with the research.
The framework has significant implications for telemedicine and point-of-care diagnostics, potentially improving early detection and management of respiratory diseases. By enhancing the interpretability of AI models in clinical settings, this work contributes to building trust in automated diagnostic systems, which is crucial for their acceptance in healthcare.
Recent advances in Speech Large Language Models (Speech LLMs) have led to great progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, particularly in terms of their ability to comprehend latent intentions and implicit emotions in real-world spoken language, remains underexplored. To this end, we introduce the Human-level Perception in Spoken Speech Understanding (HPSU), a new benchmark for fully evaluating the human-level perceptual and understanding capabilities of Speech LLMs. HPSU comprises over 20,000 expert-validated spoken language understanding samples in English and Chinese. It establishes a comprehensive evaluation framework by encompassing a spectrum of tasks, ranging from basic speaker attribute recognition to complex inference of latent intentions and implicit emotions. To address the issues of data scarcity and high cost of manual annotation in real-world scenarios, we developed a semi-automatic annotation process. This process fuses audio, textual, and visual information to enable precise speech understanding and labeling, thus enhancing both annotation efficiency and quality. We systematically evaluate various open-source and proprietary Speech LLMs. The results demonstrate that even top-performing models still fall considerably short of human capabilities in understanding genuine spoken interactions. Consequently, HPSU will be useful for guiding the development of Speech LLMs toward human-level perception and cognition.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the HPSU benchmark, which systematically evaluates the human-level perceptual and understanding capabilities of Speech LLMs in real-world spoken language contexts. By quantifying how far current models fall short of human perception, the benchmark exposes their limitations and can guide future research toward more sophisticated speech understanding.
The paper presents a well-structured methodology for constructing the HPSU benchmark, incorporating a semi-automatic annotation process that integrates audio, text, and visual modalities. This innovative approach addresses the challenges of data scarcity and high costs associated with manual annotation, enhancing both efficiency and quality. The hierarchical taxonomy of tasks and the adversarial induction protocol for robustness testing are commendable features that significantly improve the evaluation framework.
The experimental evaluation is comprehensive, involving 13 leading models and a detailed analysis of their performance across various tasks. The results highlight the significant gap between human capabilities and those of current Speech LLMs, particularly in complex reasoning tasks. The use of a human baseline and random guessing as benchmarks provides a clear context for interpreting model performance.
The paper provides sufficient details about the datasets, annotation process, and evaluation metrics, which would allow other researchers to replicate the study. However, the lack of specific details regarding the training and evaluation of the models limits full reproducibility.
The paper acknowledges limitations related to the performance of Speech LLMs, particularly their struggles with complex semantic reasoning and susceptibility to misleading prompts. Additionally, the reliance on specific datasets may introduce biases that affect generalizability.
The HPSU benchmark has the potential to significantly influence research in speech understanding by providing a rigorous evaluation framework that encourages the development of models capable of human-level perception. This could lead to advancements in applications such as human-computer interaction, sentiment analysis, and multilingual communication.
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated calibration and fusion techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, and that the Fuse-then-Calibrate ordering generally outperforms calibrating individual models before fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
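To illustrate the Fuse-then-Calibrate ordering at the probability level, the sketch below averages frame-level speaker-activity probabilities from several systems and then fits a single Platt-style calibrator on the fused log-odds. The averaging fusion, the pooled (rather than per-speaker or powerset) calibration, and the use of scikit-learn's LogisticRegression are simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_then_calibrate(probs_list, labels):
    """Probability-level fusion followed by a single calibration step.

    probs_list: list of (frames, speakers) arrays of per-frame speaker
    activity probabilities from different EEND systems (assumed time-aligned).
    Fusion here is a simple average; the calibrator is Platt scaling
    implemented as logistic regression on the fused log-odds.
    """
    fused = np.mean(probs_list, axis=0).clip(1e-6, 1 - 1e-6)
    logits = np.log(fused / (1 - fused))
    scaler = LogisticRegression()
    scaler.fit(logits.reshape(-1, 1), labels.reshape(-1))
    calibrated = scaler.predict_proba(logits.reshape(-1, 1))[:, 1]
    return calibrated.reshape(fused.shape)

# Toy example: two systems, 1000 frames, 2 speakers.
rng = np.random.default_rng(0)
labels = (rng.random((1000, 2)) > 0.5).astype(int)
sys_a = np.clip(labels * 0.4 + rng.random((1000, 2)) * 0.6, 0, 1)
sys_b = np.clip(labels * 0.4 + rng.random((1000, 2)) * 0.6, 0, 1)
print(fuse_then_calibrate([sys_a, sys_b], labels).shape)
```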
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper provides a comprehensive framework for calibrating and fusing EEND systems at the probability level, marking a significant contribution to the field of speaker diarization. The innovative methodology and thorough experimental evaluation demonstrate its potential to enhance the reliability and effectiveness of neural diarization systems.
The paper presents a novel framework for calibrating and fusing End-to-End Neural Diarization (EEND) models at the probability level, which is a significant advancement over existing methods that primarily operate on hard decisions. The authors explore two output formulations (multilabel and powerset) and their effects on calibration and fusion, providing a systematic approach that leverages model uncertainty. The methodology is well-structured, with clear definitions of calibration strategies and fusion methods, including both unsupervised and supervised techniques.
The experimental evaluation is thorough, utilizing the CallHome two-speaker benchmark to demonstrate the effectiveness of the proposed methods. The results show substantial improvements in Diarization Error Rate (DER) and calibration quality, with detailed comparisons against existing methods like DOVER-Lap. The experiments are well-designed, covering various configurations and providing insights into the impact of calibration and fusion strategies.
The paper includes a link to the publicly available code repository, which enhances reproducibility. The implementation details are sufficiently described, allowing other researchers to replicate the experiments. However, the paper could benefit from more explicit details on hyperparameter settings and training procedures.
One limitation is the focus on two-speaker scenarios, which may not generalize to more complex multi-speaker environments. Additionally, while the paper discusses the importance of calibration, it does not explore the potential trade-offs between calibration quality and computational efficiency in depth.
The proposed framework has significant implications for improving speaker diarization systems, particularly in applications where accurate speaker identification is critical, such as in transcription services, meeting analysis, and audio indexing. By enhancing the reliability of confidence scores, this work can lead to better performance in downstream tasks that rely on speaker diarization.
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfolding of Residual Entropy), a novel framework that guides multi-stage quantization using a pre-trained speech enhancement model. The first quantization stage reconstructs low-entropy, denoised speech embeddings, while subsequent stages encode residual high-entropy components. This design improves training stability significantly. Experiments demonstrate that PURE consistently outperforms conventional RVQ-based codecs in reconstruction and downstream speech language model-based text-to-speech, particularly under noisy training conditions.
Primary: CMU
All Institutions: CMU, SJTU
The main contribution of this paper is the introduction of PURE Codec, a novel framework that enhances the stability and efficiency of speech codecs through progressive unfolding of residual entropy guided by a pre-trained speech enhancement model. This work significantly advances the field of neural speech coding by addressing key challenges in training stability and reconstruction quality, making it a valuable contribution to audio processing research.
The proposed PURE Codec introduces a novel approach to residual vector quantization (RVQ) by incorporating enhancement-guided supervision, which anchors the quantization process to low-entropy, denoised speech embeddings. This multi-stage quantization framework effectively stabilizes training and improves reconstruction quality, particularly in challenging noisy environments. The methodology is well-structured, detailing the integration of a pre-trained speech enhancement model and a stochastic scheduling mechanism that balances the use of enhanced and original embeddings during training.
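The core training signal can be sketched as follows: the first quantizer stage is pulled toward the enhanced (denoised) embedding while later stages quantize the remaining residual, with an overall reconstruction term on the original embedding. The VQ implementation, the number of stages, and the loss weighting are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StageVQ(nn.Module):
    """A single vector-quantization stage with a straight-through estimator."""
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):                                   # x: (B, T, dim)
        flat = x.reshape(-1, x.size(-1))
        d = torch.cdist(flat, self.codebook.weight)         # distances to codewords
        idx = d.argmin(dim=-1).reshape(x.shape[:-1])
        q = self.codebook(idx)
        return x + (q - x).detach(), idx                    # straight-through estimator

def enhancement_guided_rvq(noisy_emb, enhanced_emb, stages):
    """First stage targets the denoised embedding; later stages refine the residual."""
    q1, _ = stages[0](noisy_emb)
    loss = F.mse_loss(q1, enhanced_emb.detach())            # anchor stage 1 to low-entropy content
    recon, residual = q1, noisy_emb - q1.detach()
    for stage in stages[1:]:
        q, _ = stage(residual)
        recon = recon + q
        residual = residual - q.detach()
    loss = loss + F.mse_loss(recon, noisy_emb)              # overall reconstruction term
    return recon, loss

stages = nn.ModuleList([StageVQ() for _ in range(4)])
noisy = torch.randn(2, 100, 256)
enhanced = torch.randn(2, 100, 256)                         # stand-in for the enhancement model's output
recon, loss = enhancement_guided_rvq(noisy, enhanced, stages)
print(recon.shape, float(loss))
```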
The experiments are comprehensive, utilizing multiple datasets to evaluate the codec's performance under various conditions. The results demonstrate that PURE Codec consistently outperforms conventional RVQ-based codecs across several metrics, including signal-to-distortion ratio (SDR) and perceptual evaluation metrics. The ablation studies provide valuable insights into the impact of different design choices, reinforcing the robustness of the proposed method.
The paper provides a clear description of the training process, including the two-stage training strategy and the specific hyperparameters used. The codebase is shared on GitHub, enhancing the reproducibility of the experiments. However, the reliance on specific enhancement models may limit the generalizability of the findings.
A notable limitation is that the PURE Codec is heavily dependent on speech-specific enhancement models, which may not be applicable to general audio processing tasks. Additionally, while the training stability is improved, the paper does not extensively discuss potential drawbacks or scenarios where the method might underperform.
The advancements in speech codec technology have significant implications for real-time communication, mobile applications, and speech-driven generative models. By improving the efficiency and quality of speech compression, this work could enhance user experiences in various applications, from telephony to virtual assistants.
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper establishes a comprehensive framework for calibrating and fusing EEND systems at the probability level, significantly advancing the state of speaker diarization. The methodology is innovative, addressing critical gaps in existing approaches and demonstrating substantial improvements in performance through rigorous experimentation.
The paper presents a novel framework for calibrating and fusing End-to-End Neural Diarization (EEND) models, which is a significant advancement in the field. It introduces two output formulations (multilabel and powerset) and explores their implications for calibration and fusion. The methodology is well-structured, with clear definitions of calibration strategies and fusion techniques, including both unsupervised and supervised methods. The use of Platt scaling for calibration and the exploration of different fusion strategies demonstrate a comprehensive approach to addressing the limitations of existing methods.
The experiments are thorough, utilizing the CallHome two-speaker benchmark to validate the proposed methods. The results indicate substantial improvements in Diarization Error Rate (DER) and calibration quality, with detailed comparisons across various configurations. The paper effectively illustrates the impact of calibration and fusion on model performance, providing a robust analysis of the results. However, the reliance on a single benchmark may limit the generalizability of the findings.
The authors provide a GitHub repository for their calibration and fusion framework, which enhances reproducibility. The paper includes detailed implementation details, experimental setups, and evaluation metrics, allowing other researchers to replicate the study. However, the absence of a demo URL limits the accessibility of the results for broader audiences.
One limitation is the focus on a specific benchmark, which may not capture the full range of challenges present in real-world scenarios with more speakers or varied acoustic conditions. Additionally, while the paper discusses the importance of calibration, it does not explore alternative calibration methods beyond Platt scaling, which could provide further insights.
The proposed framework has significant implications for improving speaker diarization systems, particularly in applications involving multi-speaker audio. By enhancing the reliability of confidence scores and enabling better model fusion, this work could lead to advancements in various domains, including automatic speech recognition and audio analysis. The findings encourage further exploration of probabilistic outputs in machine learning, potentially influencing future research directions.
Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronics are inefficient or limited, yet their systematic design has remained largely unexplored. Here we introduce a framework for designing and simulating acoustic neural networks, which perform computation through the propagation of sound waves. Using a digital-twin approach, we train conventional neural network architectures under physically motivated constraints including non-negative signals and weights, the absence of bias terms, and nonlinearities compatible with intensity-based, non-negative acoustic signals. Our work provides a general framework for acoustic neural networks that connects learnable network components directly to physically measurable acoustic properties, enabling the systematic design of realizable acoustic computing systems. We demonstrate that constrained recurrent and hierarchical architectures can perform accurate speech classification, and we propose the SincHSRNN, a hybrid model that combines learnable acoustic bandpass filters with hierarchical temporal processing. The SincHSRNN achieves up to 95% accuracy on the AudioMNIST dataset while remaining compatible with passive acoustic components. Beyond computational performance, the learned parameters correspond to measurable material and geometric properties such as attenuation and transmission. Our results establish general design principles for physically realizable acoustic neural networks and outline a pathway toward low-power, wave-based neural computing.
Primary: RWTH Aachen University
All Institutions: RWTH Aachen University, DWI -- Leibniz Institute for Interactive Materials, Institute of Theoretical Physics, Center for Soft Nanoscience, University of Münster
The paper establishes a framework for designing and simulating acoustic neural networks, demonstrating that neural computation can be achieved through the physics of sound. This work not only advances the theoretical understanding of acoustic computing but also lays the groundwork for practical implementations in low-power, wave-based neural processing.
The paper introduces a novel framework for designing and simulating acoustic neural networks that leverage the physical properties of sound waves for computation. The authors employ a digital-twin approach, which allows for the systematic design of neural architectures constrained by physical realizability. The methodology is well-structured, beginning with the foundational concepts of acoustic neural networks and progressing through the development of constrained recurrent architectures, culminating in the SincHSRNN model. The constraints imposed on the network (non-negative weights and activations, absence of bias terms) are well-justified and aligned with the physical characteristics of acoustic systems. The proposed architectures are rigorously defined, and the transition from RNNs to more complex hierarchical models demonstrates a clear progression in sophistication while maintaining physical feasibility.
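To make the physical constraints concrete, the following is a hypothetical PyTorch sketch of a recurrent cell with non-negative weights (via a softplus reparameterization), no bias terms, and a non-negative activation; it illustrates the style of constraint described above rather than the paper's exact architecture.

```python
# Hypothetical sketch of a physically constrained recurrent cell:
# non-negative weights, no bias, intensity-compatible activation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonNegativeRNNCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # Unconstrained parameters; softplus maps them to non-negative weights.
        self.w_ih = nn.Parameter(torch.randn(hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(hidden_size, hidden_size) * 0.1)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        w_ih = F.softplus(self.w_ih)   # >= 0, interpretable as transmission factors
        w_hh = F.softplus(self.w_hh)
        # No bias term; ReLU keeps the state non-negative like an acoustic intensity.
        return F.relu(x @ w_ih.T + h @ w_hh.T)

# Usage: run a sequence of non-negative features through the cell.
cell = NonNegativeRNNCell(input_size=40, hidden_size=64)
x_seq = torch.rand(16, 100, 40)              # batch of non-negative inputs
h = torch.zeros(16, 64)
for t in range(x_seq.shape[1]):
    h = cell(x_seq[:, t], h)
```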
The experimental evaluation is robust, utilizing the AudioMNIST dataset to assess the performance of various network architectures. The authors provide comprehensive results, including training and test accuracies across different configurations of RNNs, HSRNNs, and SincHSRNNs. The results indicate that the proposed models can achieve competitive performance, with the SincHSRNN reaching up to 95% accuracy. However, the experiments are primarily focused on a single dataset, which may limit the generalizability of the findings. The evaluation of model performance under constrained conditions provides valuable insights into the trade-offs between physical constraints and computational efficacy.
The paper includes detailed descriptions of the training procedures, hyperparameters, and model architectures, which enhances reproducibility. However, the absence of a publicly available code repository or supplementary materials limits the ability for independent verification of results. The authors mention that supplementary materials are available but do not provide a direct link, which could hinder broader accessibility.
One limitation of the study is the reliance on a single dataset (AudioMNIST), which may not fully capture the complexities of real-world audio processing tasks. Additionally, the constrained architectures exhibit sensitivity to initialization and weight scaling, which could affect training stability and performance. The paper also does not explore the potential for active elements in acoustic systems, which could enhance the capabilities of the proposed networks.
The implications of this research are significant, particularly in the context of low-power computing and analog processing in environments where traditional electronics are less effective. The development of acoustic neural networks could lead to advancements in applications such as speech recognition, smart hearing aids, and other acoustic processing tasks that benefit from energy-efficient solutions. The findings also contribute to the growing field of neuromorphic computing, positioning acoustic systems as viable alternatives to optical and electronic approaches.
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of deep generative models for data augmentation in marine mammal call detection, including Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75). A hybrid strategy combining generative-based synthesis with traditional methods achieved the best overall performance with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.
Primary: Simon Fraser University
All Institutions: Simon Fraser University, Dalhousie University
The main contribution of this paper is the introduction of a hybrid augmentation strategy that leverages deep generative models to enhance the detection of Southern Resident Killer Whale vocalizations, addressing critical challenges in marine bioacoustics. This work significantly advances the field by demonstrating the effectiveness of combining traditional and generative augmentation methods, paving the way for improved conservation efforts.
The paper presents a robust methodology that integrates deep generative models (Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models) for augmenting a limited dataset of Southern Resident Killer Whale vocalizations. The hybrid approach combines these generative techniques with traditional augmentation methods, which is innovative in the context of bioacoustics. The authors provide a clear rationale for their choices and demonstrate a systematic evaluation of the methods, which is commendable. However, the methodology could benefit from a more detailed discussion on the selection criteria for generative models and the specific hyperparameters used.
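As an illustration of how such a hybrid pool could be assembled, the sketch below combines simple time-shifting and masking with samples drawn from a generative model; the `generator_sampler` callable stands in for a trained VAE/GAN/diffusion sampler, and all parameter values are hypothetical placeholders rather than the paper's settings.

```python
# Illustrative sketch of a hybrid augmentation pool: traditional transforms
# (time-shifting, masking) mixed with synthetic calls from a generative model.
import numpy as np

def time_shift(call: np.ndarray, max_shift: int) -> np.ndarray:
    """Circularly shift a waveform by a random offset."""
    return np.roll(call, np.random.randint(-max_shift, max_shift + 1))

def mask_segment(call: np.ndarray, max_len: int) -> np.ndarray:
    """Zero out a random contiguous segment, simulating partial masking."""
    out = call.copy()
    start = np.random.randint(0, max(1, len(out) - max_len))
    out[start:start + max_len] = 0.0
    return out

def build_hybrid_pool(real_calls, generator_sampler, n_synthetic: int):
    """Combine traditionally augmented real calls with generated samples."""
    augmented = [time_shift(c, max_shift=800) for c in real_calls]
    augmented += [mask_segment(c, max_len=400) for c in real_calls]
    synthetic = [generator_sampler() for _ in range(n_synthetic)]
    return augmented + synthetic

# Example with placeholder data and a dummy sampler:
real = [np.random.randn(16000) for _ in range(4)]
pool = build_hybrid_pool(real, generator_sampler=lambda: np.random.randn(16000),
                         n_synthetic=8)
```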
The experiments are well-structured, comparing multiple augmentation strategies and their impact on classification performance. The use of two distinct datasets for training and testing enhances the validity of the results. The reported metrics (recall and F1-score) provide a clear picture of model performance. However, the paper could improve by including more statistical analyses to support the significance of the results and by discussing the potential variability in performance across different acoustic environments.
The authors provide a GitHub repository with code and documentation, which is a strong point for reproducibility. The detailed description of model architectures and training procedures allows for replication. However, the paper lacks specific details on the computational resources used, which could affect reproducibility for others attempting to replicate the study.
One limitation is the reliance on a relatively small annotated dataset, which may affect the generalizability of the findings. Additionally, while the paper discusses the potential for overfitting with generative models, it does not provide extensive analysis on how the models were validated against this risk. The authors also acknowledge the challenges of background noise in marine environments but do not explore potential solutions or mitigations in depth.
The research has significant implications for marine conservation efforts, particularly for endangered species like the Southern Resident Killer Whale. By improving the accuracy of automated detection systems, the study can enhance monitoring and conservation strategies. Furthermore, the exploration of deep generative models in bioacoustics opens avenues for future research in other areas of wildlife monitoring and environmental sound analysis.
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained with annotated human singing and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.
Primary: Mohamed bin Zayed University of Artificial Intelligence
All Institutions: Carnegie Mellon University, Mohamed bin Zayed University of Artificial Intelligence, Renmin University of China, University of Southern California
This paper introduces CartoonSing, a pioneering framework for Non-Human Singing Generation, significantly advancing the capabilities of singing voice synthesis and conversion by integrating non-human timbres into the synthesis process. The comprehensive evaluation and innovative methodology position this work as a valuable contribution to the field of audio machine learning.
The methodology presented in this paper is innovative, introducing a two-stage framework that effectively bridges the gap between human and non-human singing voice synthesis and conversion. The authors address significant challenges, such as the lack of non-human singing data and the absence of symbolic alignment, by utilizing a combination of self-supervised learning features and a timbre-aware vocoder. This approach not only allows for the generation of non-human singing voices but also maintains musical coherence and intelligibility, which is a notable advancement in the field.
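A rough sketch of the two-stage idea, with a content/score encoder feeding a timbre-conditioned vocoder, is given below; every class, method name, and dimension is an illustrative placeholder and not CartoonSing's actual architecture.

```python
# Hypothetical sketch: a score/content encoder produces timbre-agnostic
# representations, and a timbre-aware vocoder renders them with a target
# (possibly non-human) timbre embedding.
import torch
import torch.nn as nn

class ScoreEncoder(nn.Module):
    """Maps score/content features to frame-level, timbre-agnostic representations."""
    def __init__(self, in_dim: int = 80, hid: int = 256):
        super().__init__()
        self.net = nn.GRU(in_dim, hid, batch_first=True)

    def forward(self, score_feats: torch.Tensor) -> torch.Tensor:   # (B, T, in_dim)
        out, _ = self.net(score_feats)
        return out                                                   # (B, T, hid)

class TimbreAwareVocoder(nn.Module):
    """Renders content frames conditioned on a timbre embedding."""
    def __init__(self, hid: int = 256, timbre_dim: int = 128, hop: int = 256):
        super().__init__()
        self.proj = nn.Linear(hid + timbre_dim, hop)

    def forward(self, content: torch.Tensor, timbre: torch.Tensor) -> torch.Tensor:
        timbre = timbre.unsqueeze(1).expand(-1, content.size(1), -1)
        frames = self.proj(torch.cat([content, timbre], dim=-1))     # (B, T, hop)
        return frames.flatten(1)                                     # crude waveform proxy

# Inference-style usage with random placeholder inputs:
encoder, vocoder = ScoreEncoder(), TimbreAwareVocoder()
waveform = vocoder(encoder(torch.randn(1, 200, 80)), torch.randn(1, 128))
```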
The experimental setup is robust, utilizing a diverse set of datasets for both training and evaluation. The authors conduct comprehensive evaluations using both objective and subjective metrics, demonstrating the effectiveness of their approach compared to existing systems. The results indicate that the proposed method achieves superior timbre similarity and maintains audio quality, which is critical for practical applications in creative domains.
The paper emphasizes reproducibility by committing to release source code, training scripts, and detailed hyperparameter settings. This transparency is crucial for enabling other researchers to replicate the findings and build upon the work. The authors also provide a clear description of the datasets and processing methods used, which further supports reproducibility.
While the paper presents a significant advancement, it acknowledges the inherent limitations in synthesizing non-human voices, particularly regarding the clarity of consonantal articulation. The trade-off between timbre similarity and intelligibility is a critical challenge that the authors highlight, suggesting that further research is needed to improve this aspect.
The implications of this work are substantial, particularly for creative industries such as video game development, film, and music production, where non-human vocalizations are increasingly sought after. The ability to generate diverse and musically coherent non-human singing voices could open new avenues for artistic expression and innovation in audio synthesis.
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.
Primary: Institut Polytechnique de Paris
All Institutions: Institut Polytechnique de Paris
The paper introduces a novel approach to bandwidth extension using a Harmonic-Percussive disentangled neural audio codec, demonstrating significant improvements in high-frequency reconstruction through a well-integrated transformer-based language model. This work not only advances the state of the art in audio processing but also opens avenues for further research in audio representation learning and codec design.
The paper presents a novel approach to bandwidth extension by introducing a Harmonic-Percussive disentangled neural audio codec (HP-codec) that separates high and low-frequency components and utilizes a transformer-based language model for token prediction. This dual-architecture design is innovative as it integrates codec structure directly into the generative modeling process, allowing for improved high-frequency reconstruction. The methodology is well-structured, leveraging existing techniques in audio processing while introducing significant enhancements in representation learning and model coupling.
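The harmonic-percussive decomposition that guides the codec's disentanglement can be illustrated with a short sketch using librosa's HPSS; the `encode_branch` tokenizer is a hypothetical placeholder for the codec branches and is not the HP-codec itself.

```python
# Minimal sketch of a harmonic-percussive split of a low-band input signal,
# with a placeholder tokenizer standing in for the codec branches.
import librosa
import numpy as np

def harmonic_percussive_views(wav: np.ndarray):
    """Return harmonic and percussive components of a waveform."""
    harmonic, percussive = librosa.effects.hpss(wav)
    return harmonic, percussive

# Example: decompose a test tone plus clicks, then hand each view to a
# (hypothetical) codec branch that produces discrete tokens.
sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
wav = 0.5 * np.sin(2 * np.pi * 220 * t)
wav[::4000] += 0.8                                  # add percussive-like clicks
h, p = harmonic_percussive_views(wav)

def encode_branch(x: np.ndarray) -> np.ndarray:     # placeholder tokenizer
    return np.digitize(x[:256], np.linspace(-1, 1, 64))

harmonic_tokens, percussive_tokens = encode_branch(h), encode_branch(p)
```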
The experimental setup is robust, utilizing multiple datasets including MUSDB18 and JAMENDO for training and testing. The authors compare their model against established baselines (Apollo and AudioSR), providing both objective metrics and subjective evaluations through MUSHRA tests. The results indicate that HP-codecX outperforms these baselines in reconstructing high-frequency content, demonstrating the effectiveness of the proposed approach. The comprehensive evaluation across different datasets adds credibility to the findings.
The authors emphasize reproducibility by detailing their experimental setup, training procedures, and the datasets used. They plan to release their implementation upon acceptance, which is a positive step towards ensuring that other researchers can replicate their results. However, the paper could benefit from providing more specific information about hyperparameters and training conditions.
The paper acknowledges several limitations, including the constraint of fixed sampling rates and the architectural coupling between the codec and language model. The reliance on a specific input-output mapping (16 kHz to 48 kHz) may limit the model's applicability in broader contexts. Additionally, the potential for artifacts in high-frequency reconstructions is noted, which could affect perceptual quality despite favorable listening test results.
The advancements in bandwidth extension have significant implications for audio processing applications, including telecommunications, music restoration, and speech enhancement. The proposed model's ability to improve high-frequency reconstruction could enhance user experiences in various audio-related technologies, making it a valuable contribution to the field.
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
Primary: Tencent Technology Co.Ltd
All Institutions: Tencent Technology Co.Ltd
The paper presents a novel multi-reward GRPO framework that significantly enhances the performance of single-codebook TTS LLMs by addressing key challenges in prosody and speaker similarity. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field of TTS synthesis, with the potential for broad applications in human-computer interaction.
The paper introduces a multi-reward Group Relative Policy Optimization (GRPO) framework that enhances the token generation policy of single-codebook TTS LLMs. The integration of multiple rule-based rewards (length penalty, entropy regularization, and prosody alignment) is a novel approach that addresses common issues in TTS systems, such as prosody instability and speaker drift. The use of an external reasoning LLM to predict pause structures for prosody alignment is particularly innovative, leveraging in-context learning to provide a human-preference-aligned supervisory signal. The methodology is well-structured, with clear definitions of the reward functions and their intended impacts on the model's performance.
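To clarify how multiple rule-based rewards might be combined with group-relative normalization, a small sketch is given below; the reward weights, terms, and numbers are assumed for illustration and do not reproduce the paper's exact formulation.

```python
# Illustrative sketch of combining rule-based rewards and computing
# group-relative advantages in the GRPO style.
import numpy as np

def combined_reward(cer, spk_sim, dur_ratio, token_entropy, prosody_score,
                    w=(1.0, 1.0, 0.5, 0.2, 1.0)):
    """Weighted sum of intelligibility, similarity, length, entropy, and prosody terms."""
    r_intel = -cer                                   # lower CER is better
    r_sim = spk_sim                                  # speaker similarity in [0, 1]
    r_len = -abs(np.log(dur_ratio))                  # penalize duration mismatch
    r_ent = -token_entropy                           # discourage unstable decoding
    r_pros = prosody_score                           # LLM-annotated pause agreement
    return np.dot(w, [r_intel, r_sim, r_len, r_ent, r_pros])

def group_relative_advantages(rewards):
    """GRPO-style advantage: standardize rewards within a sampled group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: a group of 4 sampled utterances for one prompt.
group = [combined_reward(0.05, 0.82, 1.02, 2.1, 0.7),
         combined_reward(0.12, 0.78, 0.85, 2.9, 0.4),
         combined_reward(0.03, 0.85, 1.00, 2.0, 0.8),
         combined_reward(0.20, 0.70, 1.30, 3.5, 0.3)]
advantages = group_relative_advantages(group)
```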
The experiments are comprehensive, utilizing a large bilingual corpus and various evaluation metrics (CER, SIM, MOS) to assess the effectiveness of the proposed framework. The results demonstrate significant improvements in prosodic stability, speaker similarity, and naturalness compared to existing models. The scalability analysis across different model sizes and data scales adds depth to the evaluation, showing that the proposed method is effective across a range of conditions. The ablation study further validates the contribution of each reward component, providing insights into their individual impacts on performance.
The paper provides detailed implementation details, including the architecture, training configurations, and data sources. However, the absence of a public code repository or demo URL limits the reproducibility of the results. While the methodology is well-explained, the lack of accessible resources may hinder other researchers from replicating the study.
One limitation of the study is the reliance on a specific reasoning LLM for prosody alignment, which may not generalize across all languages or dialects. Additionally, while the results are promising, the paper does not address potential computational costs associated with the proposed GRPO framework, particularly in terms of training time and resource requirements. The evaluation is primarily focused on objective metrics, and further subjective assessments could strengthen the findings.
The proposed framework has significant implications for the field of TTS synthesis, particularly in enhancing the naturalness and expressivity of synthesized speech. Improved prosody and speaker similarity can lead to more engaging and human-like interactions in applications such as virtual assistants, audiobooks, and language learning tools. The integration of reinforcement learning in TTS systems could pave the way for more adaptive and context-aware speech synthesis technologies.
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, corresponding to relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
Primary: University of Texas at Austin
All Institutions: University of Texas at Austin, Amazon
RosettaSpeech presents a novel framework for zero-shot speech-to-speech translation utilizing monolingual data, significantly advancing the field by addressing the critical issue of data scarcity. The comprehensive methodology and strong experimental results underscore its potential to transform speech translation technologies for underrepresented languages.
The methodology presented in RosettaSpeech is innovative, as it introduces a zero-shot speech-to-speech translation framework that leverages monolingual speech-text data and machine translation supervision. By decoupling the need for parallel speech corpora and utilizing text as an intermediate bridge, the authors effectively address a significant bottleneck in the field. The model architecture, which combines speech modeling with a large language model (LLM) backbone and multi-head projection layers, is well-conceived and demonstrates a thoughtful integration of existing technologies. However, the reliance on NMT-generated pseudo-parallel data raises questions about the potential for noise and inaccuracies in the training process.
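As a rough illustration of the pseudo-parallel idea described above, the sketch below builds training triples from monolingual (speech-unit, transcript) pairs and an NMT translation function; `translate`, `TrainingTriple`, and the data layout are hypothetical stand-ins, not the authors' pipeline.

```python
# Hedged sketch: monolingual (speech, text) pairs augmented with NMT
# translations to form pseudo-parallel training triples.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class TrainingTriple:
    src_units: List[int]      # discrete units of the source-language speech
    bridge_text: str          # NMT translation used as the textual bridge
    src_text: str             # original transcript supervising the bridge

def build_pseudo_parallel(
    mono_corpus: List[Tuple[List[int], str]],
    translate: Callable[[str], str],
) -> List[TrainingTriple]:
    data = []
    for units, transcript in mono_corpus:
        data.append(TrainingTriple(units, translate(transcript), transcript))
    return data

# Example with toy placeholders:
corpus = [([12, 7, 44], "guten morgen"), ([3, 3, 9], "wie geht es dir")]
triples = build_pseudo_parallel(corpus, translate=lambda s: f"<en> {s}")
```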
The experimental evaluation is robust, with the authors providing comprehensive results on standard benchmarks, including the CVSS-C test set. The reported ASR-BLEU scores indicate substantial improvements over existing systems, showcasing the effectiveness of the proposed method. The ablation studies conducted further validate the necessity of the joint training approach and the benefits of fine-tuning, providing a clear understanding of the model's capabilities. However, the experiments are primarily focused on a limited set of high-resource languages, which may not fully represent the model's performance across a broader linguistic landscape.
The paper includes detailed implementation details, including training procedures, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository or demo limits the ability for external validation of the results. The authors should consider releasing their code to facilitate further research and experimentation.
The paper acknowledges several limitations, including the focus on a narrow set of high-resource languages and the challenges associated with extending the framework to low-resource languages. Additionally, the potential for noise in NMT-generated targets is a concern that could affect the quality of the final translations. Future work should address these limitations to broaden the applicability of the framework.
The implications of RosettaSpeech are significant, as it provides a scalable solution for speech-to-speech translation in languages that lack parallel speech corpora. By enabling high-quality translation for a wider array of languages, this work has the potential to enhance communication across linguistic barriers and contribute to global accessibility. The framework's design could inspire further research into efficient translation methods that leverage abundant text data.
Deepfake (DF) audio detectors still struggle to generalize to out-of-distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views to capture long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and In-the-Wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR provides a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SONAR, a frequency-guided framework for audio deepfake detection that effectively addresses spectral bias by disentangling low- and high-frequency audio components. This innovative approach not only improves detection performance but also accelerates model convergence, setting a new standard in the field of audio forensics.
The methodology presented in the paper is innovative, leveraging a dual-path framework to disentangle low-frequency content from high-frequency residuals in audio signals. The use of learnable spectral residual modules (SRM) and a Jensen-Shannon divergence loss to align real and fake audio embeddings is a significant advancement over existing methods. The frequency cross-attention mechanism enhances the model's ability to capture long- and short-range dependencies effectively. However, the complexity of the architecture may pose challenges for implementation and understanding.
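Two of the ingredients described above can be sketched compactly: a learnable, value-constrained high-pass filter and a Jensen-Shannon divergence between two embedding views. The shapes, constraints, and names below are illustrative assumptions rather than SONAR's released code.

```python
# Hypothetical sketch: value-constrained high-pass 1-D filter and a
# Jensen-Shannon divergence term between two embedding views.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedHighPass(nn.Module):
    def __init__(self, kernel_size: int = 5, clip: float = 2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1, 1, kernel_size) * 0.1)
        self.clip = clip

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, 1, T)
        w = self.weight.clamp(-self.clip, self.clip)           # value constraint
        w = w - w.mean(dim=-1, keepdim=True)                    # zero-sum => high-pass-like
        return F.conv1d(x, w, padding=w.shape[-1] // 2)

def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric JS divergence between two softmax-normalized embeddings."""
    p, q = F.softmax(p_logits, dim=-1), F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * (a.clamp_min(1e-8).log() - b.clamp_min(1e-8).log())).sum(-1)
    return 0.5 * (kl(p, m) + kl(q, m)).mean()

# Usage: filter a waveform batch and compare two embedding views.
hp = ConstrainedHighPass()
residual = hp(torch.randn(4, 1, 16000))
loss = js_divergence(torch.randn(4, 256), torch.randn(4, 256))
```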
The experiments are robust, utilizing well-established benchmarks such as ASVspoof 2021 and In-the-Wild datasets. The paper demonstrates state-of-the-art performance and rapid convergence, achieving results that significantly outperform previous methods. The evaluation metrics are clearly defined, and the results are presented in a manner that allows for easy comparison with existing techniques. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The authors have taken steps to ensure reproducibility, including the use of publicly available datasets and detailed descriptions of their experimental setup. They mention that the code will be released upon acceptance, which is a positive aspect. However, the lack of specific URLs for the code repository or demo limits immediate accessibility for other researchers.
One limitation is the potential for overfitting due to the complexity of the model, especially when training on smaller datasets. Additionally, the reliance on high-frequency artifacts may not generalize well across all types of audio deepfakes, particularly those that may not exhibit clear high-frequency discrepancies. The paper does not address how the model performs in scenarios where high-frequency artifacts are less pronounced.
The implications of this work are significant, as deepfake audio detection is increasingly critical in various domains, including security, media integrity, and misinformation prevention. The proposed method could enhance the reliability of audio content verification systems, thereby contributing to the broader fight against misinformation and fraud in digital media.
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is to watermark it so that it can be easily distinguished from genuine audio. Because those seeking to misuse AI-generated audio may try to remove audio watermarks, studying effective watermark removal techniques is critical for objectively evaluating the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that can remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both the temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against the state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it transfers to out-of-distribution samples with minimal degradation in performance.
Primary: University of Toronto
All Institutions: University of Toronto
The main contribution of this paper is the introduction of HarmonicAttack, an efficient audio watermark removal method that demonstrates improved performance over existing techniques. This research addresses critical security challenges posed by AI-generated audio, providing a foundation for future work in audio watermarking and security.
The proposed methodology, HarmonicAttack, utilizes a dual-path convolutional autoencoder that operates in both temporal and frequency domains, which is a notable innovation in the context of audio watermark removal. The integration of GAN-style training enhances the model's ability to separate watermarks from original audio effectively. However, the paper could benefit from a more detailed explanation of the architecture and training process, including hyperparameter choices and the rationale behind them.
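A speculative sketch of such a dual-path design, with one branch on the raw waveform and one on the magnitude spectrogram, is shown below; the layer sizes, fusion rule, and the omitted GAN discriminator are placeholders and not the authors' implementation.

```python
# Speculative sketch of a dual-path autoencoder operating on both the
# waveform and its magnitude spectrogram.
import torch
import torch.nn as nn

class DualPathAutoencoder(nn.Module):
    def __init__(self, n_fft: int = 512, hop: int = 128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        # Temporal branch: 1-D conv encoder/decoder on the raw waveform.
        self.time_net = nn.Sequential(
            nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
            nn.Conv1d(16, 1, 9, padding=4),
        )
        # Frequency branch: 2-D conv producing a mask over the magnitude spectrogram.
        self.freq_net = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 1, 3, padding=1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:         # wav: (B, T)
        t_out = self.time_net(wav.unsqueeze(1)).squeeze(1)         # temporal estimate
        win = torch.hann_window(self.n_fft, device=wav.device)
        spec = torch.stft(wav, self.n_fft, self.hop, window=win, return_complex=True)
        mask = torch.sigmoid(self.freq_net(spec.abs().unsqueeze(1))).squeeze(1)
        f_out = torch.istft(spec * mask, self.n_fft, self.hop,
                            window=win, length=wav.shape[-1])      # spectral estimate
        return 0.5 * (t_out + f_out)                               # fused audio estimate

# Usage: a generator step would pair this with a discriminator (GAN-style training).
model = DualPathAutoencoder()
clean_estimate = model(torch.randn(2, 16000))
```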
The experimental evaluation is robust, comparing HarmonicAttack against established watermarking schemes such as AudioSeal, WavMark, and Silentcipher. The results indicate superior watermark removal capabilities and near real-time performance, which are significant advancements. However, the paper lacks detailed metrics on the performance comparisons, such as exact numerical values or visualizations of the results, which would strengthen the claims made.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. For the findings to be validated by the community, it is essential to include a clear description of the dataset used, the training process, and ideally, a link to a code repository.
One limitation is the reliance on the ability to generate watermarks from the targeted scheme, which may not be feasible in all scenarios. Additionally, while the model shows promise in transferring to out-of-distribution samples, the extent of this transferability and its implications on real-world applications remain unclear.
The implications of this research are significant, particularly in the context of combating misinformation and voice-cloning fraud. By improving watermark removal techniques, the study contributes to the ongoing dialogue on audio security and the ethical use of AI-generated content. The findings could influence future watermarking strategies and security measures in audio applications.