The rapid evolution of end-to-end AI music generation poses an escalating threat to artistic authenticity and copyright, demanding detection methods that can keep pace. While foundational, existing models like SpecTTTra falter when faced with the diverse and rapidly advancing ecosystem of new generators, exhibiting significant performance drops on out-of-distribution (OOD) content. This generalization failure highlights a critical gap: the need for more challenging benchmarks and more robust detection architectures. To address this, we first introduce Melody or Machine (MoM), a new large-scale benchmark of over 130,000 songs (6,665 hours). MoM is the most diverse dataset to date, built with a mix of open and closed-source models and a curated OOD test set designed specifically to foster the development of truly generalizable detectors. Alongside this benchmark, we introduce CLAM, a novel dual-stream detection architecture. We hypothesize that subtle, machine-induced inconsistencies between vocal and instrumental elements, often imperceptible in a mixed signal, offer a powerful tell-tale sign of synthesis. CLAM is designed to test this hypothesis by employing two distinct pre-trained audio encoders (MERT and Wave2Vec2) to create parallel representations of the audio. These representations are fused by a learnable cross-aggregation module that models their inter-dependencies. The model is trained with a dual-loss objective: a standard binary cross-entropy loss for classification, complemented by a contrastive triplet loss which trains the model to distinguish between coherent and artificially mismatched stream pairings, enhancing its sensitivity to synthetic artifacts without presuming a simple feature alignment. CLAM establishes a new state-of-the-art in synthetic music forensics. It achieves an F1 score of 0.925 on our challenging MoM benchmark.
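As a concrete reference for the architecture described above, here is a minimal sketch of a dual-stream detector with a learnable cross-aggregation fusion and a binary classification head. The placeholder convolutional stacks stand in for the frozen MERT and Wave2Vec2 encoders, and all layer sizes are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a CLAM-style dual-stream detector (illustrative, not the authors' code).
import torch
import torch.nn as nn

class StreamEncoder(nn.Module):
    """Stand-in for a frozen pre-trained audio encoder (e.g., MERT or Wave2Vec2)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
            nn.Conv1d(dim, dim, kernel_size=3, stride=2), nn.GELU(),
        )
    def forward(self, wav):                                   # wav: (B, T)
        return self.net(wav.unsqueeze(1)).transpose(1, 2)     # (B, frames, dim)

class CrossAggregation(nn.Module):
    """Learnable cross-attention module that fuses the two streams."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, a, b):
        fused, _ = self.attn(query=a, key=b, value=b)          # stream a attends to stream b
        return self.norm(a + fused).mean(dim=1)                # pooled joint embedding (B, dim)

class DualStreamDetector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.enc_a, self.enc_b = StreamEncoder(dim), StreamEncoder(dim)
        self.fuse = CrossAggregation(dim)
        self.head = nn.Linear(dim, 1)
    def forward(self, wav):
        a, b = self.enc_a(wav), self.enc_b(wav)
        z = self.fuse(a, b)                                    # joint representation
        return self.head(z).squeeze(-1), z                     # logit + embedding for the triplet loss

model = DualStreamDetector()
wav = torch.randn(4, 16000)                                    # 1 s of 16 kHz audio, batch of 4
labels = torch.tensor([1., 0., 1., 0.])                        # 1 = synthetic, 0 = real (dummy)
logit, z = model(wav)
bce = nn.functional.binary_cross_entropy_with_logits(logit, labels)
```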
Primary: Indraprastha Institute of Information Technology Delhi
All Institutions: Indraprastha Institute of Information Technology Delhi, Manipal University Jaipur, Netaji Subhas University of Technology
The main contribution of this paper is the introduction of a robust framework for detecting AI-generated music through a novel dual-stream architecture and a comprehensive benchmark dataset. This work significantly advances the field of audio forensics by addressing critical challenges in generalization and robustness against evolving generative models.
The paper introduces a novel dual-stream architecture, CLAM, which leverages two distinct pre-trained audio encoders to capture the nuances of vocal and instrumental elements in music. The methodology is well-structured, focusing on the contrastive learning approach to enhance the model's sensitivity to synthetic artifacts. The introduction of the Melody or Machine (MoM) benchmark is a significant advancement, addressing the limitations of existing datasets by providing a more diverse and challenging evaluation framework. The dual-loss objective, combining binary cross-entropy with a contrastive triplet loss, is a thoughtful design choice that enhances the model's robustness against out-of-distribution samples.
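To make the dual-loss objective concrete, the following continues the sketch above (reusing `model`, `wav`, and `bce`): a triplet margin loss contrasts coherent stream pairings from the same clip against artificially mismatched pairings built by shuffling one stream across the batch. The pairing scheme, margin, and loss weight are assumptions for illustration, not the paper's exact construction.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.3)                     # margin is illustrative

def triplet_on_pairings(fuse, a, b):
    """a, b: per-stream frame embeddings (B, frames, dim) from the two encoders of one batch."""
    anchor   = fuse(a, b)                                      # coherent pairing of a clip's own streams
    positive = fuse(b, a)                                      # same clip, roles of the streams swapped
    negative = fuse(a, torch.roll(b, shifts=1, dims=0))        # mismatched: stream b from another clip
    return triplet(anchor, positive, negative)

# Reusing `model`, `wav`, and `bce` from the previous sketch:
a, b = model.enc_a(wav), model.enc_b(wav)
loss = bce + 0.5 * triplet_on_pairings(model.fuse, a, b)       # dual-loss objective (weight illustrative)
```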
The experiments are comprehensive, demonstrating the efficacy of the proposed model against existing state-of-the-art methods. The results on the MoM benchmark, achieving an F1 score of 0.925, significantly outperforming previous models, underscore the technical impact of the research. The ablation studies provide a solid foundation for understanding the contributions of various components of the model, validating the architectural choices made.
The paper provides sufficient implementation details, including the training setup, model architecture, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly accessible code repository or demo URL limits the ease with which other researchers can replicate the results.
The primary limitation noted is the rapid pace of innovation in AI music generation, which may render the proposed detection methods obsolete as new models emerge. Additionally, the dataset's predominant focus on English songs may limit its applicability across diverse linguistic and cultural contexts.
The research has significant implications for the music industry, particularly in protecting intellectual property rights and maintaining artistic authenticity in the face of advancing AI technologies. The MoM dataset and CLAM model can aid content platforms and rights holders in identifying synthetic music, fostering trust in music distribution. However, there are ethical considerations regarding the potential misuse of the dataset for training more effective generative models.
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a \textit{global and stationary} spectral basis. This assumption is fundamentally misaligned with real-world signals whose frequency characteristics vary significantly across space, exhibiting local high-frequency textures, smooth regions, and frequency drift phenomena. We propose \textbf{Neural Spectral Transport Representation (NSTR)}, the first INR framework that \textbf{explicitly models a spatially varying local frequency field}. NSTR introduces a learnable \emph{frequency transport equation}, a PDE that governs how local spectral compositions evolve across space. Given a learnable local spectrum field $S(x)$ and a frequency transport network $F_\theta$ enforcing $\nabla S(x) \approx F_\theta(x, S(x))$, NSTR reconstructs signals by spatially modulating a compact set of global sinusoidal bases. This formulation enables strong local adaptivity and offers a new level of interpretability by visualizing frequency flows. Experiments on 2D image regression, audio reconstruction, and implicit 3D geometry show that NSTR achieves significantly better accuracy-parameter trade-offs than SIREN, Fourier-feature MLPs, and Instant-NGP. NSTR requires fewer global frequencies, converges faster, and naturally explains signal structure through spectral transport fields. We believe NSTR opens a new direction in INR research by introducing explicit modeling of the space-varying spectrum.
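To make the formulation concrete, here is a toy sketch of the reconstruction rule and the transport penalty from the abstract: a local spectrum field $S(x)$ modulates a small set of global sinusoidal bases, and a residual term pushes $\nabla S(x)$ toward $F_\theta(x, S(x))$. Network widths, the number of bases, and the output head are assumptions.

```python
import torch
import torch.nn as nn

K, D_IN = 16, 2                                   # K global sinusoidal bases over 2D coordinates (illustrative)

class NSTRToy(nn.Module):
    def __init__(self):
        super().__init__()
        self.freqs = nn.Parameter(torch.randn(K, D_IN) * 3.0)      # global frequencies
        self.spectrum = nn.Sequential(                              # S(x): local spectral weights
            nn.Linear(D_IN, 64), nn.ReLU(), nn.Linear(64, K))
        self.transport = nn.Sequential(                             # F_theta(x, S(x))
            nn.Linear(D_IN + K, 64), nn.ReLU(), nn.Linear(64, K * D_IN))
        self.head = nn.Linear(2 * K, 1)                             # maps modulated bases to the signal

    def forward(self, x):                                           # x: (N, 2) coordinates
        S = self.spectrum(x)                                        # (N, K) local spectrum field
        phase = x @ self.freqs.t()                                  # (N, K)
        bases = torch.cat([torch.sin(phase), torch.cos(phase)], dim=-1)
        modulated = bases * S.repeat(1, 2)                          # spatial modulation of global bases
        return self.head(modulated).squeeze(-1), S

def transport_residual(model, x):
    """Penalty pushing the spatial gradient of S(x) toward F_theta(x, S(x))."""
    x = x.detach().clone().requires_grad_(True)
    _, S = model(x)
    grads = []
    for k in range(K):                                              # per-component gradient of S w.r.t. x
        g, = torch.autograd.grad(S[:, k].sum(), x, create_graph=True)
        grads.append(g)                                             # each (N, 2)
    gradS = torch.stack(grads, dim=1)                               # (N, K, 2)
    f_pred = model.transport(torch.cat([x, S], dim=-1)).view(-1, K, 2)
    return (gradS - f_pred).pow(2).mean()

coords = torch.rand(1024, 2)
target = torch.sin(6.0 * coords[:, 0]) * torch.cos(3.0 * coords[:, 1])   # dummy 2D signal
model = NSTRToy()
pred, _ = model(coords)
loss = (pred - target).pow(2).mean() + 0.1 * transport_residual(model, coords)
```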
Primary: unknown
All Institutions: unknown
The paper presents NSTR, a novel framework for modeling spatially varying frequency fields in implicit neural representations, which enhances expressivity, stability, and interpretability in signal reconstruction. The innovative use of a learnable PDE for frequency transport represents a significant advancement in the field, addressing key limitations of existing INR methodologies.
The proposed methodology introduces a novel framework, NSTR, which explicitly models spatially varying frequency fields through a learnable frequency transport PDE. This approach effectively decouples global frequency content from local spectral variation, allowing for adaptive representation of signals. The use of a PDE to govern the evolution of the local spectrum is particularly innovative, as it introduces a structured constraint that enhances interpretability and stability in the representation learning process. The parameterization of the local spectrum field using a coarse grid and a lightweight MLP is efficient, addressing the limitations of traditional INRs that rely on fixed global bases.
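The review notes that the local spectrum field is parameterized by a coarse grid refined by a lightweight MLP. A minimal sketch of one such parameterization, assuming bilinear interpolation of a learnable coarse grid; grid resolution, channel count, and MLP width are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseGridSpectrum(nn.Module):
    """Local spectrum field S(x) from a coarse learnable grid plus a small MLP (assumed parameterization)."""
    def __init__(self, K=16, grid_res=32):
        super().__init__()
        self.grid = nn.Parameter(torch.zeros(1, K, grid_res, grid_res))     # coarse K x R x R grid
        self.mlp = nn.Sequential(nn.Linear(K + 2, 64), nn.ReLU(), nn.Linear(64, K))

    def forward(self, x):                           # x: (N, 2) coordinates in [0, 1]^2
        g = (x * 2.0 - 1.0).view(1, -1, 1, 2)       # grid_sample expects coordinates in [-1, 1]
        coarse = F.grid_sample(self.grid, g, align_corners=True)            # (1, K, N, 1)
        coarse = coarse.squeeze(-1).squeeze(0).t()                           # (N, K)
        return coarse + self.mlp(torch.cat([coarse, x], dim=-1))             # MLP refinement

S = CoarseGridSpectrum()(torch.rand(512, 2))        # (512, 16) local spectral weights
```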
The experiments conducted across diverse tasks, including 2D image regression, audio waveform reconstruction, and implicit 3D geometry, demonstrate the effectiveness of NSTR in achieving superior accuracy-parameter trade-offs compared to existing methods like SIREN and Fourier-feature MLPs. The evaluation metrics used are appropriate for the tasks, and the results indicate significant improvements in fidelity and convergence speed. However, the paper could benefit from additional quantitative comparisons and visualizations to further substantiate its claims.
While the paper provides a detailed description of the architecture and training setup, it lacks specific implementation details such as code availability or links to datasets used for experiments. This hinders reproducibility, as independent researchers may struggle to replicate the results without access to the exact configurations and data.
One limitation is the lack of real-world application examples, as the experiments are primarily conducted on standard datasets. Additionally, the paper does not address potential computational overhead associated with the learnable PDE, which may impact scalability in more complex scenarios. The reliance on a fixed number of global frequencies may also limit the adaptability of the model in highly variable signal contexts.
The introduction of NSTR has the potential to significantly advance the field of implicit neural representations by providing a more flexible and interpretable framework for modeling complex signals. Its applications could extend to various domains, including graphics, audio processing, and scientific simulations, where understanding local frequency variations is crucial. The ability to visualize frequency flows could also enhance interpretability in machine learning models, fostering trust and understanding in AI systems.
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
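A minimal sketch of the signal-processing core described here: far-field steering delays for a single microphone ring and a frequency-domain delay-and-sum beamformer. The geometry, sample rate, and sign convention are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

C = 343.0                                      # speed of sound (m/s)
FS = 16000                                     # sample rate (Hz), illustrative

def circular_array(n_mics=8, radius=0.05):
    """Planar concentric-circle geometry reduced to a single ring for illustration."""
    ang = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack([radius * np.cos(ang), radius * np.sin(ang), np.zeros(n_mics)], axis=1)

def steering_delays(mic_xyz, azimuth, elevation):
    """Far-field delays (s) for a plane wave from (azimuth, elevation), relative to the array center."""
    u = np.array([np.cos(elevation) * np.cos(azimuth),
                  np.cos(elevation) * np.sin(azimuth),
                  np.sin(elevation)])
    return mic_xyz @ u / C

def delay_and_sum(frames, delays, fs=FS):
    """frames: (n_mics, n_samples). Align channels with phase shifts in the frequency domain and average.
    The sign of the phase shift depends on how the delays are defined; this is one common convention."""
    n = frames.shape[1]
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectra = np.fft.rfft(frames, axis=1)
    spectra *= np.exp(2j * np.pi * freqs[None, :] * delays[:, None])
    return np.fft.irfft(spectra.mean(axis=0), n=n)

mics = circular_array()
y = delay_and_sum(np.random.randn(8, 1024), steering_delays(mics, azimuth=0.6, elevation=0.2))
```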
Primary: Universidad Carlos III de Madrid
All Institutions: Universidad Carlos III de Madrid, Universidad de Valencia
This work presents a compact, energy-efficient embedded system that integrates visual depth estimation with acoustic beamforming for real-time directional audio capture. The combination of deep learning and advanced signal processing techniques demonstrates a meaningful contribution to the field of audio processing and machine learning, particularly in dynamic environments.
The paper presents a novel integration of deep learning-based object tracking with acoustic beamforming, utilizing a compact MEMS microphone array and an NVIDIA Jetson Orin Nano for real-time processing. The methodology effectively combines stereo vision for depth estimation and a frequency-domain delay-and-sum beamformer, demonstrating a well-structured approach to achieving low-latency audio capture in dynamic environments. The choice of YOLOv11 for object detection and the optimization strategies for real-time performance are commendable, showcasing a thoughtful balance between computational efficiency and accuracy.
The experimental evaluation is robust, with tests conducted in both anechoic and dynamic environments to assess the system's performance under varying conditions. The use of signal-to-interference ratio (SIR) as a metric for performance evaluation is appropriate, and the results indicate significant improvements in SIR with the proposed system. However, the paper could benefit from more detailed statistical analysis and comparisons with baseline methods to further substantiate the claims of performance enhancement.
While the paper provides a comprehensive description of the system architecture and experimental setup, it lacks specific implementation details that would aid in reproducibility. Key parameters for the algorithms used, as well as the datasets employed for training and testing, should be explicitly stated to enable others to replicate the study effectively.
The paper acknowledges some limitations, such as the potential variability in performance due to environmental factors and the reliance on specific hardware configurations. Additionally, the omission of a dedicated multi-object tracking algorithm may limit the system's effectiveness in scenarios with closely spaced sound sources.
The proposed system has significant implications for applications in teleconferencing, smart home devices, and assistive technologies, where precise sound localization and directional audio capture are critical. The integration of visual and acoustic modalities opens avenues for further research in multimodal perception systems, potentially enhancing human-computer interaction and situational awareness in various domains.
Voice communication in bandwidth-constrained environments--maritime, satellite, and tactical networks--remains prohibitively expensive. Traditional codecs struggle below 1 kbps, while existing semantic approaches (STT-TTS) sacrifice prosody and speaker identity. We present STCTS, a generative semantic compression framework enabling natural voice communication at approximately 80 bps. STCTS explicitly decomposes speech into linguistic content, prosodic expression, and speaker timbre, applying tailored compression: context-aware text encoding (approximately 70 bps), sparse prosody transmission via TTS interpolation (less than 14 bps at 0.1-1 Hz), and amortized speaker embedding. Evaluations on LibriSpeech demonstrate a 75x bitrate reduction versus Opus (6 kbps) and 12x versus EnCodec (1 kbps), while maintaining perceptual quality (NISQA MOS greater than 4.26). We also discover a bimodal quality distribution with prosody sampling rate: sparse and dense updates both achieve high quality, while mid-range rates degrade due to perceptual discontinuities--guiding optimal configuration design. Beyond efficiency, our modular architecture supports privacy-preserving encryption, human-interpretable transmission, and flexible deployment on edge devices, offering a robust solution for ultra-low bandwidth scenarios.
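A back-of-envelope check of the reported bitrate budget and reduction factors, using the figures from the abstract; the split between prosody and amortized speaker-embedding bits is an assumed illustration.

```python
# Illustrative bitrate accounting for the ~80 bps operating point reported in the abstract.
text_bps    = 70          # context-aware text encoding (approximate, from the abstract)
prosody_bps = 8           # sparse prosody updates, below the < 14 bps bound at 0.1-1 Hz (assumed value)
speaker_bps = 2           # speaker embedding amortized over the session (assumed value)
total_bps   = text_bps + prosody_bps + speaker_bps          # ~80 bps

print(f"STCTS total        : ~{total_bps} bps")
print(f"vs Opus (6 kbps)   : {6000 // total_bps}x reduction")    # ~75x
print(f"vs EnCodec (1 kbps): {1000 // total_bps}x reduction")    # ~12x
```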
Primary: Fudan University
All Institutions: Fudan University, Tsinghua University
The main contribution of this paper is the STCTS framework, which achieves ultra-low bitrate speech communication through a novel approach of explicitly decomposing speech into linguistic, prosodic, and timbral components, significantly enhancing the efficiency and quality of voice transmission in constrained environments. This work represents a meaningful advancement in the field of audio processing and communication technologies, with potential applications in various critical domains.
The paper introduces STCTS, a novel framework for ultra-low bitrate speech communication, which decomposes speech into three components: linguistic content, prosody, and timbre. This explicit decomposition allows for tailored compression strategies that significantly reduce bandwidth usage while maintaining perceptual quality. The methodology is well-structured, leveraging existing technologies (STT, TTS) and introducing innovative strategies for prosody transmission and speaker embedding. The use of context-aware text encoding and sparse prosody transmission is particularly noteworthy, as it showcases a deep understanding of the temporal dynamics of speech components.
The experimental evaluation is robust, utilizing the LibriSpeech dataset to benchmark STCTS against established codecs like Opus and EnCodec. The reported results demonstrate a significant bitrate reduction while achieving high perceptual quality (NISQA MOS > 4.26). The discovery of a bimodal quality distribution concerning prosody sampling rates provides valuable insights for future configurations. However, the paper could benefit from more extensive user studies to assess real-world performance in diverse communication scenarios.
The authors provide an open-source implementation of their system, which is a strong point for reproducibility. The detailed description of the system architecture, configuration options, and the availability of the source code facilitate replication of the experiments. However, the paper lacks comprehensive benchmarking infrastructure details that could aid other researchers in reproducing the results precisely.
One limitation is the reliance on specific datasets for evaluation, which may not fully capture the variability of real-world speech communication in bandwidth-constrained environments. Additionally, the paper does not address potential challenges in adapting the system to different languages or dialects, which could affect its generalizability. The performance under extreme network conditions or with varying speaker characteristics also requires further exploration.
The STCTS framework has significant implications for voice communication in bandwidth-constrained environments, such as maritime, satellite, and tactical networks. By enabling natural and expressive communication at ultra-low bitrates, it addresses critical needs in various fields, including emergency response, remote work, and IoT applications. The modular architecture also supports future advancements in speech technology, making it a versatile tool for diverse applications.
Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning framework for automatic lung-disease detection using respiratory audio signals. The proposed system integrates two complementary representations: a spectral-temporal encoder based on a CNN-BiLSTM-Attention architecture, and a handcrafted acoustic-feature encoder capturing physiologically meaningful descriptors such as MFCCs, spectral centroid, spectral bandwidth, and zero-crossing rate. These branches are combined through late-stage fusion to leverage both data-driven learning and domain-informed acoustic cues. The model is trained and evaluated on the Asthma Detection Dataset Version 2 using rigorous preprocessing, including resampling, normalization, noise filtering, data augmentation, and patient-level stratified partitioning. The model achieved strong generalization with 91.21% accuracy, a 0.899 macro F1-score, and a 0.9866 macro ROC-AUC, outperforming all ablated variants. An ablation study confirms the importance of temporal modeling, attention mechanisms, and multimodal fusion. The framework incorporates Grad-CAM, Integrated Gradients, and SHAP, generating interpretable spectral, temporal, and feature-level explanations aligned with known acoustic biomarkers to build clinical transparency. The findings demonstrate the framework's potential for telemedicine, point-of-care diagnostics, and real-world respiratory screening.
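As a concrete reference for the handcrafted branch, here is a minimal sketch of extracting the named descriptors with librosa and concatenating them with a deep embedding for late fusion; window defaults and the pooling scheme are assumptions.

```python
import numpy as np
import librosa

def handcrafted_features(y, sr):
    """Frame-level descriptors named in the abstract, pooled into a fixed-length clip vector."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)              # (13, frames)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)         # (1, frames)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)       # (1, frames)
    zcr = librosa.feature.zero_crossing_rate(y)                      # (1, frames)
    feats = np.vstack([mfcc, centroid, bandwidth, zcr])              # (16, frames)
    return np.concatenate([feats.mean(axis=1), feats.std(axis=1)])   # (32,)

def late_fusion(deep_embedding, handcrafted_vector):
    """Late-stage fusion: concatenate both representations before the final classifier."""
    return np.concatenate([deep_embedding, handcrafted_vector])

y, sr = np.random.randn(16000 * 5).astype(np.float32), 16000         # dummy 5 s respiratory clip
fused = late_fusion(np.random.randn(128), handcrafted_features(y, sr))
```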
Primary: Albukhary International University
All Institutions: Albukhary International University
This study presents a novel explainable multimodal deep learning framework for automatic lung disease detection from respiratory audio signals, addressing critical challenges in traditional auscultation methods. The integration of deep learning with handcrafted features and explainable AI techniques represents a significant advancement in the field, with the potential to improve clinical outcomes through enhanced diagnostic accuracy and transparency.
The proposed methodology integrates a hybrid deep learning architecture combining CNN, BiLSTM, and attention mechanisms with handcrafted acoustic features, which is innovative in the context of respiratory sound analysis. The late-stage fusion approach effectively leverages both data-driven and domain-informed representations, enhancing the model's robustness and interpretability. The incorporation of explainable AI techniques such as Grad-CAM, Integrated Gradients, and SHAP adds significant value by providing clinical transparency, which is often lacking in deep learning applications in healthcare.
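One way to produce the Integrated Gradients attributions mentioned here is Captum; the paper does not state which implementation it uses, and the classifier and input shapes below are placeholders.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Placeholder spectrogram classifier with 4 output classes (illustrative, not the paper's model).
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 128, 4))
model.eval()

spec = torch.randn(1, 1, 64, 128, requires_grad=True)        # (batch, channel, mel bins, frames)
pred_class = model(spec).argmax(dim=1).item()

ig = IntegratedGradients(model)
attributions = ig.attribute(spec, target=pred_class, n_steps=64)   # same shape as the input spectrogram
```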
The experiments are well-structured, utilizing a publicly available dataset with a comprehensive evaluation strategy that includes accuracy, F1-score, and ROC-AUC metrics. The ablation study is particularly noteworthy, as it rigorously tests the contributions of various components of the model, confirming the importance of multimodal fusion and attention mechanisms. The reported results demonstrate strong generalization capabilities across different respiratory conditions, indicating the model's practical applicability.
The paper outlines a clear training strategy, including hyperparameter settings, data preprocessing steps, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible code repository limits the ability for others to replicate the study fully.
While the study presents a robust framework, it relies on a single dataset, which may limit the generalizability of the findings. The model's performance on underrepresented classes, such as Bronchial sounds, suggests that further refinement may be necessary to improve classification accuracy across all categories. Additionally, the lack of a demo or project URL restricts practical engagement with the research.
The framework has significant implications for telemedicine and point-of-care diagnostics, potentially improving early detection and management of respiratory diseases. By enhancing the interpretability of AI models in clinical settings, this work contributes to building trust in automated diagnostic systems, which is crucial for their acceptance in healthcare.
Recent advances in Speech Large Language Models (Speech LLMs) have led to great progress in speech understanding tasks such as Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER). However, whether these models can achieve human-level auditory perception, particularly in terms of their ability to comprehend latent intentions and implicit emotions in real-world spoken language, remains underexplored. To this end, we introduce the Human-level Perception in Spoken Speech Understanding (HPSU), a new benchmark for fully evaluating the human-level perceptual and understanding capabilities of Speech LLMs. HPSU comprises over 20,000 expert-validated spoken language understanding samples in English and Chinese. It establishes a comprehensive evaluation framework by encompassing a spectrum of tasks, ranging from basic speaker attribute recognition to complex inference of latent intentions and implicit emotions. To address the issues of data scarcity and high cost of manual annotation in real-world scenarios, we developed a semi-automatic annotation process. This process fuses audio, textual, and visual information to enable precise speech understanding and labeling, thus enhancing both annotation efficiency and quality. We systematically evaluate various open-source and proprietary Speech LLMs. The results demonstrate that even top-performing models still fall considerably short of human capabilities in understanding genuine spoken interactions. Consequently, HPSU will be useful for guiding the development of Speech LLMs toward human-level perception and cognition.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the HPSU benchmark, which systematically evaluates the human-level perceptual and understanding capabilities of Speech LLMs in real-world spoken language contexts. This comprehensive analysis of the technical contributions, methodology, and significance to the field underscores the importance of addressing the limitations of current models and guiding future research towards achieving more sophisticated speech understanding.
The paper presents a well-structured methodology for constructing the HPSU benchmark, incorporating a semi-automatic annotation process that integrates audio, text, and visual modalities. This innovative approach addresses the challenges of data scarcity and high costs associated with manual annotation, enhancing both efficiency and quality. The hierarchical taxonomy of tasks and the adversarial induction protocol for robustness testing are commendable features that significantly improve the evaluation framework.
The experimental evaluation is comprehensive, involving 13 leading models and a detailed analysis of their performance across various tasks. The results highlight the significant gap between human capabilities and those of current Speech LLMs, particularly in complex reasoning tasks. The use of a human baseline and random guessing as benchmarks provides a clear context for interpreting model performance.
The paper provides sufficient details about the datasets, annotation process, and evaluation metrics, which would allow other researchers to replicate the study. However, the lack of specific details regarding the training and evaluation of the models limits full reproducibility.
The paper acknowledges limitations related to the performance of Speech LLMs, particularly their struggles with complex semantic reasoning and susceptibility to misleading prompts. Additionally, the reliance on specific datasets may introduce biases that affect generalizability.
The HPSU benchmark has the potential to significantly influence research in speech understanding by providing a rigorous evaluation framework that encourages the development of models capable of human-level perception. This could lead to advancements in applications such as human-computer interaction, sentiment analysis, and multilingual communication.
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated calibration and fusion techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, and that the Fuse-then-Calibrate ordering generally outperforms calibrating individual models before fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
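A minimal sketch of probability-level fusion as advocated here: frame-wise speaker-activity probabilities from several EEND systems are averaged and thresholded. Speaker-permutation alignment across systems is assumed to be already resolved, and the weights and threshold are illustrative.

```python
import numpy as np

def fuse_probabilities(model_probs, weights=None, threshold=0.5):
    """model_probs: list of (frames, speakers) arrays of per-frame activity probabilities,
    assumed to be time-aligned and permutation-aligned across systems."""
    probs = np.stack(model_probs)                        # (models, frames, speakers)
    w = np.ones(len(model_probs)) if weights is None else np.asarray(weights, float)
    fused = np.tensordot(w / w.sum(), probs, axes=1)     # weighted average -> (frames, speakers)
    decisions = fused > threshold                        # hard speaker-activity decisions
    return fused, decisions

p1 = np.random.rand(1000, 2)                             # dummy outputs of two 2-speaker EEND systems
p2 = np.random.rand(1000, 2)
fused, decisions = fuse_probabilities([p1, p2])
```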
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper provides a comprehensive framework for calibrating and fusing EEND systems at the probability level, marking a significant contribution to the field of speaker diarization. The innovative methodology and thorough experimental evaluation demonstrate its potential to enhance the reliability and effectiveness of neural diarization systems.
The paper presents a novel framework for calibrating and fusing End-to-End Neural Diarization (EEND) models at the probability level, which is a significant advancement over existing methods that primarily operate on hard decisions. The authors explore two output formulations (multilabel and powerset) and their effects on calibration and fusion, providing a systematic approach that leverages model uncertainty. The methodology is well-structured, with clear definitions of calibration strategies and fusion methods, including both unsupervised and supervised techniques.
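To illustrate the two output formulations, here is a small conversion between per-speaker (multilabel) probabilities and the four-class powerset space for two speakers; the independence assumption used in the forward direction is a simplification, since a powerset model would predict the joint posterior directly.

```python
import numpy as np

def multilabel_to_powerset(p):                 # p: (frames, 2) with P(speaker k active)
    """Assumes the two activities are conditionally independent (a simplification)."""
    p0, p1 = p[:, 0], p[:, 1]
    powerset = np.stack([(1 - p0) * (1 - p1),  # silence
                         p0 * (1 - p1),        # only speaker 0
                         (1 - p0) * p1,        # only speaker 1
                         p0 * p1], axis=1)     # overlapped speech
    return powerset / powerset.sum(axis=1, keepdims=True)

def powerset_to_multilabel(q):                 # q: (frames, 4) powerset posteriors
    """Marginalize the joint classes back to per-speaker activity probabilities."""
    return np.stack([q[:, 1] + q[:, 3], q[:, 2] + q[:, 3]], axis=1)
```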
The experimental evaluation is thorough, utilizing the CallHome two-speaker benchmark to demonstrate the effectiveness of the proposed methods. The results show substantial improvements in Diarization Error Rate (DER) and calibration quality, with detailed comparisons against existing methods like DOVER-Lap. The experiments are well-designed, covering various configurations and providing insights into the impact of calibration and fusion strategies.
The paper includes a link to the publicly available code repository, which enhances reproducibility. The implementation details are sufficiently described, allowing other researchers to replicate the experiments. However, the paper could benefit from more explicit details on hyperparameter settings and training procedures.
One limitation is the focus on two-speaker scenarios, which may not generalize to more complex multi-speaker environments. Additionally, while the paper discusses the importance of calibration, it does not explore the potential trade-offs between calibration quality and computational efficiency in depth.
The proposed framework has significant implications for improving speaker diarization systems, particularly in applications where accurate speaker identification is critical, such as in transcription services, meeting analysis, and audio indexing. By enhancing the reliability of confidence scores, this work can lead to better performance in downstream tasks that rely on speaker diarization.
Neural speech codecs have achieved strong performance in low-bitrate compression, but residual vector quantization (RVQ) often suffers from unstable training and ineffective decomposition, limiting reconstruction quality and efficiency. We propose PURE Codec (Progressive Unfolding of Residual Entropy), a novel framework that guides multi-stage quantization using a pre-trained speech enhancement model. The first quantization stage reconstructs low-entropy, denoised speech embeddings, while subsequent stages encode residual high-entropy components. This design improves training stability significantly. Experiments demonstrate that PURE consistently outperforms conventional RVQ-based codecs in reconstruction and downstream speech language model-based text-to-speech, particularly under noisy training conditions.
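A toy sketch of the enhancement-guided residual quantization idea: the first stage is pulled toward the denoised (low-entropy) embedding while later stages quantize the remaining residual. Codebook sizes, the straight-through estimator, and the exact form of the guidance loss are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VQStage(nn.Module):
    """One vector-quantization stage with straight-through gradients."""
    def __init__(self, dim=128, codebook_size=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
    def forward(self, x):                                           # x: (B, T, dim)
        d = (x.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1) # (B, T, codebook_size)
        q = self.codebook(d.argmin(dim=-1))                         # nearest codewords
        return x + (q - x).detach(), q                              # straight-through output, raw codes

def pure_style_quantize(noisy_emb, enhanced_emb, stages):
    """Stage 1 is supervised toward the denoised embedding; later stages encode the residual."""
    q1_st, q1 = stages[0](noisy_emb)
    guide_loss = F.mse_loss(q1, enhanced_emb.detach())              # anchor stage 1 to low-entropy content
    out, residual = q1_st, noisy_emb - q1_st
    for stage in stages[1:]:
        q_st, _ = stage(residual)
        out = out + q_st
        residual = residual - q_st
    return out, guide_loss

stages = nn.ModuleList([VQStage() for _ in range(4)])
noisy, enhanced = torch.randn(2, 50, 128), torch.randn(2, 50, 128)   # dummy embeddings
codes, guide_loss = pure_style_quantize(noisy, enhanced, stages)
```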
Primary: CMU
All Institutions: CMU, SJTU
The main contribution of this paper is the introduction of PURE Codec, a novel framework that enhances the stability and efficiency of speech codecs through progressive unfolding of residual entropy guided by a pre-trained speech enhancement model. This work significantly advances the field of neural speech coding by addressing key challenges in training stability and reconstruction quality, making it a valuable contribution to audio processing research.
The proposed PURE Codec introduces a novel approach to residual vector quantization (RVQ) by incorporating enhancement-guided supervision, which anchors the quantization process to low-entropy, denoised speech embeddings. This multi-stage quantization framework effectively stabilizes training and improves reconstruction quality, particularly in challenging noisy environments. The methodology is well-structured, detailing the integration of a pre-trained speech enhancement model and a stochastic scheduling mechanism that balances the use of enhanced and original embeddings during training.
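A possible form of the stochastic schedule mentioned here, mixing enhanced and original embeddings as the stage-1 target; the mixing probability and its annealing are assumptions, not the paper's exact schedule.

```python
import torch

def stage1_target(original_emb, enhanced_emb, step, total_steps):
    """Per-example choice of the stage-1 supervision target; embeddings are (B, T, dim)."""
    p_enhanced = max(0.1, 1.0 - step / total_steps)        # start mostly enhanced, anneal toward original
    use_enhanced = torch.rand(original_emb.size(0)) < p_enhanced
    mask = use_enhanced.view(-1, 1, 1).to(original_emb.dtype)
    return mask * enhanced_emb + (1.0 - mask) * original_emb
```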
The experiments are comprehensive, utilizing multiple datasets to evaluate the codec's performance under various conditions. The results demonstrate that PURE Codec consistently outperforms conventional RVQ-based codecs across several metrics, including signal-to-distortion ratio (SDR) and perceptual evaluation metrics. The ablation studies provide valuable insights into the impact of different design choices, reinforcing the robustness of the proposed method.
The paper provides a clear description of the training process, including the two-stage training strategy and the specific hyperparameters used. The codebase is shared on GitHub, enhancing the reproducibility of the experiments. However, the reliance on specific enhancement models may limit the generalizability of the findings.
A notable limitation is that the PURE Codec is heavily dependent on speech-specific enhancement models, which may not be applicable to general audio processing tasks. Additionally, while the training stability is improved, the paper does not extensively discuss potential drawbacks or scenarios where the method might underperform.
The advancements in speech codec technology have significant implications for real-time communication, mobile applications, and speech-driven generative models. By improving the efficiency and quality of speech compression, this work could enhance user experiences in various applications, from telephony to virtual assistants.
End-to-End Neural Diarization (EEND) systems produce frame-level probabilistic speaker activity estimates, yet since evaluation focuses primarily on Diarization Error Rate (DER), the reliability and calibration of these confidence scores have been largely neglected. When fusing multiple diarization systems, DOVER-Lap remains the only established approach, operating at the segment level with hard decisions. We propose working with continuous probability outputs, which enables more sophisticated fusion and calibration techniques that can leverage model uncertainty and complementary strengths across different architectures. This paper presents the first comprehensive framework for calibrating and fusing EEND models at the probability level. We investigate two output formulations (multilabel and powerset representations) and their impact on calibration and fusion effectiveness. Through extensive experiments on the CallHome two-speaker benchmark, we demonstrate that proper calibration provides substantial improvements even for individual models (up to 19% relative DER reduction), in some cases mitigating the absence of domain adaptation. We reveal that joint calibration in powerset space consistently outperforms independent per-speaker calibration, that fusion substantially improves over individual models, and that the Fuse-then-Calibrate ordering generally outperforms both calibrating before fusion and uncalibrated fusion while requiring calibration of only a single combined model. Our best configuration outperforms DOVER-Lap in terms of DER while providing reliable confidence estimates essential for downstream applications. This work proposes best practices for probability-level fusion of EEND systems and demonstrates the advantages of leveraging soft outputs over hard decisions.
Primary: IEEE Publication Technology Group
All Institutions: IEEE Publication Technology Group
This paper establishes a comprehensive framework for calibrating and fusing EEND systems at the probability level, significantly advancing the state of speaker diarization. The methodology is innovative, addressing critical gaps in existing approaches and demonstrating substantial improvements in performance through rigorous experimentation.
The paper presents a novel framework for calibrating and fusing End-to-End Neural Diarization (EEND) models, which is a significant advancement in the field. It introduces two output formulations (multilabel and powerset) and explores their implications for calibration and fusion. The methodology is well-structured, with clear definitions of calibration strategies and fusion techniques, including both unsupervised and supervised methods. The use of Platt scaling for calibration and the exploration of different fusion strategies demonstrate a comprehensive approach to addressing the limitations of existing methods.
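A standard realization of the Platt scaling step discussed here, fitting a one-dimensional logistic regression on development-set scores; under the Fuse-then-Calibrate ordering this would be fit once on the fused system's outputs. The fitting protocol below is illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(dev_scores, dev_labels):
    """dev_scores: (frames,) uncalibrated scores for one stream; dev_labels: (frames,) in {0, 1}."""
    scaler = LogisticRegression()
    scaler.fit(dev_scores.reshape(-1, 1), dev_labels)
    return scaler

def apply_platt(scaler, scores):
    return scaler.predict_proba(scores.reshape(-1, 1))[:, 1]   # calibrated P(speaker active)

dev_scores, dev_labels = np.random.randn(5000), np.random.randint(0, 2, 5000)   # dummy dev data
calibrated = apply_platt(fit_platt(dev_scores, dev_labels), np.random.randn(1000))
```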
The experiments are thorough, utilizing the CallHome two-speaker benchmark to validate the proposed methods. The results indicate substantial improvements in Diarization Error Rate (DER) and calibration quality, with detailed comparisons across various configurations. The paper effectively illustrates the impact of calibration and fusion on model performance, providing a robust analysis of the results. However, the reliance on a single benchmark may limit the generalizability of the findings.
The authors provide a GitHub repository for their calibration and fusion framework, which enhances reproducibility. The paper includes detailed implementation details, experimental setups, and evaluation metrics, allowing other researchers to replicate the study. However, the absence of a demo URL limits the accessibility of the results for broader audiences.
One limitation is the focus on a specific benchmark, which may not capture the full range of challenges present in real-world scenarios with more speakers or varied acoustic conditions. Additionally, while the paper discusses the importance of calibration, it does not explore alternative calibration methods beyond Platt scaling, which could provide further insights.
The proposed framework has significant implications for improving speaker diarization systems, particularly in applications involving multi-speaker audio. By enhancing the reliability of confidence scores and enabling better model fusion, this work could lead to advancements in various domains, including automatic speech recognition and audio analysis. The findings encourage further exploration of probabilistic outputs in machine learning, potentially influencing future research directions.
Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronics are inefficient or limited, yet their systematic design has remained largely unexplored. Here we introduce a framework for designing and simulating acoustic neural networks, which perform computation through the propagation of sound waves. Using a digital-twin approach, we train conventional neural network architectures under physically motivated constraints including non-negative signals and weights, the absence of bias terms, and nonlinearities compatible with intensity-based, non-negative acoustic signals. Our work provides a general framework for acoustic neural networks that connects learnable network components directly to physically measurable acoustic properties, enabling the systematic design of realizable acoustic computing systems. We demonstrate that constrained recurrent and hierarchical architectures can perform accurate speech classification, and we propose the SincHSRNN, a hybrid model that combines learnable acoustic bandpass filters with hierarchical temporal processing. The SincHSRNN achieves up to 95% accuracy on the AudioMNIST dataset while remaining compatible with passive acoustic components. Beyond computational performance, the learned parameters correspond to measurable material and geometric properties such as attenuation and transmission. Our results establish general design principles for physically realizable acoustic neural networks and outline a pathway toward low-power, wave-based neural computing.
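A minimal sketch of the kind of physically motivated constraint described here: a linear layer with softplus-reparameterized non-negative weights, no bias term, and a saturating nonlinearity that keeps intensity-like signals non-negative. Layer sizes and the specific reparameterization are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonNegativeLinear(nn.Module):
    """Bias-free layer whose weights are constrained to be non-negative via softplus."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.raw_weight = nn.Parameter(torch.randn(out_features, in_features) * 0.1)
    @property
    def weight(self):                          # maps to attenuation/transmission-like factors >= 0
        return F.softplus(self.raw_weight)
    def forward(self, x):                      # x assumed non-negative (acoustic intensity)
        return F.linear(x, self.weight)        # no bias term

layer = NonNegativeLinear(40, 16)
x = torch.rand(8, 40)                          # non-negative inputs
y = layer(x)
y = y / (1.0 + y)                              # saturating, intensity-compatible nonlinearity (illustrative)
```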
Primary: RWTH Aachen University
All Institutions: RWTH Aachen University, DWI -- Leibniz Institute for Interactive Materials, Institute of Theoretical Physics, Center for Soft Nanoscience, University of MĂĽnster
The paper establishes a framework for designing and simulating acoustic neural networks, demonstrating that neural computation can be achieved through the physics of sound. This work not only advances the theoretical understanding of acoustic computing but also lays the groundwork for practical implementations in low-power, wave-based neural processing.
The paper introduces a novel framework for designing and simulating acoustic neural networks that leverage the physical properties of sound waves for computation. The authors employ a digital-twin approach, which allows for the systematic design of neural architectures constrained by physical realizability. The methodology is well-structured, beginning with the foundational concepts of acoustic neural networks and progressing through the development of constrained recurrent architectures, culminating in the SincHSRNN model. The constraints imposed on the network (non-negative weights and activations, absence of bias terms) are well-justified and aligned with the physical characteristics of acoustic systems. The proposed architectures are rigorously defined, and the transition from RNNs to more complex hierarchical models demonstrates a clear progression in sophistication while maintaining physical feasibility.
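As a concrete reference for the learnable acoustic bandpass filters in the SincHSRNN, here is a SincNet-style band-pass front end built from differences of windowed sinc low-pass filters; the cutoff parameterization, window choice, and filter count are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincBandpass(nn.Module):
    """Learnable band-pass filters: each filter is the difference of two windowed sinc low-passes."""
    def __init__(self, n_filters=16, kernel_size=129, fs=16000):
        super().__init__()
        self.fs, self.kernel_size = fs, kernel_size
        self.low_hz = nn.Parameter(torch.linspace(50.0, 4000.0, n_filters))     # lower cutoffs (Hz)
        self.band_hz = nn.Parameter(torch.full((n_filters,), 500.0))            # bandwidths (Hz)
        t = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("t", t / fs)                                       # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size, periodic=False))

    def forward(self, x):                                                       # x: (B, 1, samples)
        f1 = torch.abs(self.low_hz)
        f2 = torch.clamp(f1 + torch.abs(self.band_hz), max=self.fs / 2 - 1)
        def lowpass(fc):                                                        # windowed sinc low-pass
            arg = 2 * fc.unsqueeze(1) * self.t.unsqueeze(0)                     # (n_filters, kernel)
            return 2 * fc.unsqueeze(1) / self.fs * torch.sinc(arg)
        h = (lowpass(f2) - lowpass(f1)) * self.window                           # band-pass impulse responses
        return F.conv1d(x, h.unsqueeze(1), padding=self.kernel_size // 2)

bands = SincBandpass()(torch.randn(2, 1, 16000))                                # (2, 16, 16000)
```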
The experimental evaluation is robust, utilizing the AudioMNIST dataset to assess the performance of various network architectures. The authors provide comprehensive results, including training and test accuracies across different configurations of RNNs, HSRNNs, and SincHSRNNs. The results indicate that the proposed models can achieve competitive performance, with the SincHSRNN reaching up to 95% accuracy. However, the experiments are primarily focused on a single dataset, which may limit the generalizability of the findings. The evaluation of model performance under constrained conditions provides valuable insights into the trade-offs between physical constraints and computational efficacy.
The paper includes detailed descriptions of the training procedures, hyperparameters, and model architectures, which enhances reproducibility. However, the absence of a publicly available code repository or supplementary materials limits the ability for independent verification of results. The authors mention that supplementary materials are available but do not provide a direct link, which could hinder broader accessibility.
One limitation of the study is the reliance on a single dataset (AudioMNIST), which may not fully capture the complexities of real-world audio processing tasks. Additionally, the constrained architectures exhibit sensitivity to initialization and weight scaling, which could affect training stability and performance. The paper also does not explore the potential for active elements in acoustic systems, which could enhance the capabilities of the proposed networks.
The implications of this research are significant, particularly in the context of low-power computing and analog processing in environments where traditional electronics are less effective. The development of acoustic neural networks could lead to advancements in applications such as speech recognition, smart hearing aids, and other acoustic processing tasks that benefit from energy-efficient solutions. The findings also contribute to the growing field of neuromorphic computing, positioning acoustic systems as viable alternatives to optical and electronic approaches.
Automated detection and classification of marine mammal vocalizations is critical for conservation and management efforts but is hindered by limited annotated datasets and the acoustic complexity of real-world marine environments. Data augmentation has proven to be an effective strategy to address this limitation by increasing dataset diversity and improving model generalization without requiring additional field data. However, most augmentation techniques used to date rely on effective but relatively simple transformations, leaving open the question of whether deep generative models can provide additional benefits. In this study, we evaluate the potential of deep generative models for data augmentation in marine mammal call detection, including Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models. Using Southern Resident Killer Whale (Orcinus orca) vocalizations from two long-term hydrophone deployments in the Salish Sea, we compare these approaches against traditional augmentation methods such as time-shifting and vocalization masking. While all generative approaches improved classification performance relative to the baseline, diffusion-based augmentation yielded the highest recall (0.87) and overall F1-score (0.75). A hybrid strategy combining generative-based synthesis with traditional methods achieved the best overall performance with an F1-score of 0.81. We hope this study encourages further exploration of deep generative models as complementary augmentation strategies to advance acoustic monitoring of threatened marine mammal populations.
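For concreteness, here is a sketch of the traditional baseline augmentations named above, circular time-shifting and masking a random stretch of the call; the shift range and mask fraction are illustrative assumptions.

```python
import numpy as np

def time_shift(y, max_shift_s=0.5, sr=22050, rng=np.random):
    """Circularly shift the waveform by up to +/- max_shift_s seconds."""
    shift = rng.randint(-int(max_shift_s * sr), int(max_shift_s * sr) + 1)
    return np.roll(y, shift)

def vocalization_mask(y, max_frac=0.2, rng=np.random):
    """Zero out a random contiguous stretch covering up to max_frac of the clip."""
    n = len(y)
    width = rng.randint(1, int(max_frac * n) + 1)
    start = rng.randint(0, n - width)
    out = y.copy()
    out[start:start + width] = 0.0
    return out

clip = np.random.randn(3 * 22050).astype(np.float32)      # dummy 3 s hydrophone clip
augmented = vocalization_mask(time_shift(clip))
```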
Primary: Simon Fraser University
All Institutions: Simon Fraser University, Dalhousie University
The main contribution of this paper is the introduction of a hybrid augmentation strategy that leverages deep generative models to enhance the detection of Southern Resident Killer Whale vocalizations, addressing critical challenges in marine bioacoustics. This work significantly advances the field by demonstrating the effectiveness of combining traditional and generative augmentation methods, paving the way for improved conservation efforts.
The paper presents a robust methodology that integrates deep generative models (Variational Autoencoders, Generative Adversarial Networks, and Denoising Diffusion Probabilistic Models) for augmenting a limited dataset of Southern Resident Killer Whale vocalizations. The hybrid approach combines these generative techniques with traditional augmentation methods, which is innovative in the context of bioacoustics. The authors provide a clear rationale for their choices and demonstrate a systematic evaluation of the methods, which is commendable. However, the methodology could benefit from a more detailed discussion on the selection criteria for generative models and the specific hyperparameters used.
The experiments are well-structured, comparing multiple augmentation strategies and their impact on classification performance. The use of two distinct datasets for training and testing enhances the validity of the results. The reported metrics (recall and F1-score) provide a clear picture of model performance. However, the paper could improve by including more statistical analyses to support the significance of the results and by discussing the potential variability in performance across different acoustic environments.
The authors provide a GitHub repository with code and documentation, which is a strong point for reproducibility. The detailed description of model architectures and training procedures allows for replication. However, the paper lacks specific details on the computational resources used, which could affect reproducibility for others attempting to replicate the study.
One limitation is the reliance on a relatively small annotated dataset, which may affect the generalizability of the findings. Additionally, while the paper discusses the potential for overfitting with generative models, it does not provide extensive analysis on how the models were validated against this risk. The authors also acknowledge the challenges of background noise in marine environments but do not explore potential solutions or mitigations in depth.
The research has significant implications for marine conservation efforts, particularly for endangered species like the Southern Resident Killer Whale. By improving the accuracy of automated detection systems, the study can enhance monitoring and conservation strategies. Furthermore, the exploration of deep generative models in bioacoustics opens avenues for future research in other areas of wildlife monitoring and environmental sound analysis.
Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained with annotated human singing and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.
Primary: Mohamed bin Zayed University of Artificial Intelligence
All Institutions: Carnegie Mellon University, Mohamed bin Zayed University of Artificial Intelligence, Renmin University of China, University of Southern California
This paper introduces CartoonSing, a pioneering framework for Non-Human Singing Generation, significantly advancing the capabilities of singing voice synthesis and conversion by integrating non-human timbres into the synthesis process. The comprehensive evaluation and innovative methodology position this work as a valuable contribution to the field of audio machine learning.
The methodology presented in this paper is innovative, introducing a two-stage framework that effectively bridges the gap between human and non-human singing voice synthesis and conversion. The authors address significant challenges, such as the lack of non-human singing data and the absence of symbolic alignment, by utilizing a combination of self-supervised learning features and a timbre-aware vocoder. This approach not only allows for the generation of non-human singing voices but also maintains musical coherence and intelligibility, which is a notable advancement in the field.
The experimental setup is robust, utilizing a diverse set of datasets for both training and evaluation. The authors conduct comprehensive evaluations using both objective and subjective metrics, demonstrating the effectiveness of their approach compared to existing systems. The results indicate that the proposed method achieves superior timbre similarity and maintains audio quality, which is critical for practical applications in creative domains.
The paper emphasizes reproducibility by committing to release source code, training scripts, and detailed hyperparameter settings. This transparency is crucial for enabling other researchers to replicate the findings and build upon the work. The authors also provide a clear description of the datasets and processing methods used, which further supports reproducibility.
While the paper presents a significant advancement, it acknowledges the inherent limitations in synthesizing non-human voices, particularly regarding the clarity of consonantal articulation. The trade-off between timbre similarity and intelligibility is a critical challenge that the authors highlight, suggesting that further research is needed to improve this aspect.
The implications of this work are substantial, particularly for creative industries such as video game development, film, and music production, where non-human vocalizations are increasingly sought after. The ability to generate diverse and musically coherent non-human singing voices could open new avenues for artistic expression and innovation in audio synthesis.
Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks. In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.
Primary: Institut Polytechnique de Paris
All Institutions: Institut Polytechnique de Paris
The paper introduces a novel approach to bandwidth extension using a Harmonic-Percussive disentangled neural audio codec, demonstrating significant improvements in high-frequency reconstruction through a well-integrated transformer-based language model. This work not only advances the state of the art in audio processing but also opens avenues for further research in audio representation learning and codec design.
The paper presents a novel approach to bandwidth extension by introducing a Harmonic-Percussive disentangled neural audio codec (HP-codec) that separates high and low-frequency components and utilizes a transformer-based language model for token prediction. This dual-architecture design is innovative as it integrates codec structure directly into the generative modeling process, allowing for improved high-frequency reconstruction. The methodology is well-structured, leveraging existing techniques in audio processing while introducing significant enhancements in representation learning and model coupling.
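As a rough illustration of the harmonic-percussive split that guides the codec's disentanglement, the sketch below applies librosa's median-filtering HPSS to an STFT; the filename and settings are placeholders, and this shows only the signal decomposition, not the HP-codec itself.

```python
import librosa

# Load a (hypothetical) music excerpt at the low-band sampling rate.
y, sr = librosa.load("example.wav", sr=16000)

# Harmonic-percussive separation on the complex STFT.
stft = librosa.stft(y, n_fft=1024, hop_length=256)
harmonic_stft, percussive_stft = librosa.decompose.hpss(stft)

# Time-domain views of the two streams, which a disentangled codec could tokenize separately.
y_harmonic = librosa.istft(harmonic_stft, hop_length=256, length=len(y))
y_percussive = librosa.istft(percussive_stft, hop_length=256, length=len(y))
```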
The experimental setup is robust, utilizing multiple datasets including MUSDB18 and JAMENDO for training and testing. The authors compare their model against established baselines (Apollo and AudioSR), providing both objective metrics and subjective evaluations through MUSHRA tests. The results indicate that HP-codecX outperforms these baselines in reconstructing high-frequency content, demonstrating the effectiveness of the proposed approach. The comprehensive evaluation across different datasets adds credibility to the findings.
The authors emphasize reproducibility by detailing their experimental setup, training procedures, and the datasets used. They plan to release their implementation upon acceptance, which is a positive step towards ensuring that other researchers can replicate their results. However, the paper could benefit from providing more specific information about hyperparameters and training conditions.
The paper acknowledges several limitations, including the constraint of fixed sampling rates and the architectural coupling between the codec and language model. The reliance on a specific input-output mapping (16 kHz to 48 kHz) may limit the model's applicability in broader contexts. Additionally, the potential for artifacts in high-frequency reconstructions is noted, which could affect perceptual quality despite favorable listening test results.
The advancements in bandwidth extension have significant implications for audio processing applications, including telecommunications, music restoration, and speech enhancement. The proposed model's ability to improve high-frequency reconstruction could enhance user experiences in various audio-related technologies, making it a valuable contribution to the field.
Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
Primary: Tencent Technology Co.Ltd
All Institutions: Tencent Technology Co.Ltd
The paper presents a novel multi-reward GRPO framework that significantly enhances the performance of single-codebook TTS LLMs by addressing key challenges in prosody and speaker similarity. The comprehensive methodology and rigorous experimental evaluation contribute valuable insights to the field of TTS synthesis, with the potential for broad applications in human-computer interaction.
The paper introduces a multi-reward Group Relative Policy Optimization (GRPO) framework that enhances the token generation policy of single-codebook TTS LLMs. The integration of multiple rule-based rewards (length penalty, entropy regularization, and prosody alignment) is a novel approach that addresses common issues in TTS systems, such as prosody instability and speaker drift. The use of an external reasoning LLM to predict pause structures for prosody alignment is particularly innovative, leveraging in-context learning to provide a human-preference-aligned supervisory signal. The methodology is well-structured, with clear definitions of the reward functions and their intended impacts on the model's performance.
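A hedged sketch of how such rule-based rewards might be combined into a single scalar reward for GRPO is given below; the weights, tolerances, and functional forms are assumptions for illustration, not the paper's exact definitions.

```python
import numpy as np

def length_penalty(pred_len: int, ref_len: int, tolerance: float = 0.15) -> float:
    """Penalize durations that deviate from the reference by more than a tolerance."""
    ratio = abs(pred_len - ref_len) / max(ref_len, 1)
    return 1.0 if ratio <= tolerance else max(0.0, 1.0 - (ratio - tolerance))

def entropy_reward(token_probs: np.ndarray, target_entropy: float = 2.5) -> float:
    """Reward decoding whose average per-step entropy stays near a target value."""
    eps = 1e-9
    step_entropy = -(token_probs * np.log(token_probs + eps)).sum(axis=-1)
    return float(np.exp(-abs(step_entropy.mean() - target_entropy)))

def composite_reward(cer: float, speaker_sim: float, pred_len: int, ref_len: int,
                     token_probs: np.ndarray, prosody_score: float,
                     weights=(0.3, 0.3, 0.1, 0.1, 0.2)) -> float:
    """Weighted sum of intelligibility, similarity, and the rule-based rewards."""
    terms = [1.0 - cer, speaker_sim,
             length_penalty(pred_len, ref_len),
             entropy_reward(token_probs),
             prosody_score]
    return float(sum(w * t for w, t in zip(weights, terms)))

# Usage: one sampled utterance with 100 decoding steps over a 1024-token vocabulary.
probs = np.full((100, 1024), 1.0 / 1024)
r = composite_reward(cer=0.04, speaker_sim=0.82, pred_len=96, ref_len=100,
                     token_probs=probs, prosody_score=0.7)
```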
The experiments are comprehensive, utilizing a large bilingual corpus and various evaluation metrics (CER, SIM, MOS) to assess the effectiveness of the proposed framework. The results demonstrate significant improvements in prosodic stability, speaker similarity, and naturalness compared to existing models. The scalability analysis across different model sizes and data scales adds depth to the evaluation, showing that the proposed method is effective across a range of conditions. The ablation study further validates the contribution of each reward component, providing insights into their individual impacts on performance.
The paper provides detailed implementation details, including the architecture, training configurations, and data sources. However, the absence of a public code repository or demo URL limits the reproducibility of the results. While the methodology is well-explained, the lack of accessible resources may hinder other researchers from replicating the study.
One limitation of the study is the reliance on a specific reasoning LLM for prosody alignment, which may not generalize across all languages or dialects. Additionally, while the results are promising, the paper does not address potential computational costs associated with the proposed GRPO framework, particularly in terms of training time and resource requirements. The evaluation is primarily focused on objective metrics, and further subjective assessments could strengthen the findings.
The proposed framework has significant implications for the field of TTS synthesis, particularly in enhancing the naturalness and expressivity of synthesized speech. Improved prosody and speaker similarity can lead to more engaging and human-like interactions in applications such as virtual assistants, audiobooks, and language learning tools. The integration of reinforcement learning in TTS systems could pave the way for more adaptive and context-aware speech synthesis technologies.
The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English, corresponding to relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.
Primary: University of Texas at Austin
All Institutions: University of Texas at Austin, Amazon
RosettaSpeech presents a novel framework for zero-shot speech-to-speech translation utilizing monolingual data, significantly advancing the field by addressing the critical issue of data scarcity. The comprehensive methodology and strong experimental results underscore its potential to transform speech translation technologies for underrepresented languages.
The methodology presented in RosettaSpeech is innovative, as it introduces a zero-shot speech-to-speech translation framework that leverages monolingual speech-text data and machine translation supervision. By decoupling the need for parallel speech corpora and utilizing text as an intermediate bridge, the authors effectively address a significant bottleneck in the field. The model architecture, which combines speech modeling with a large language model (LLM) backbone and multi-head projection layers, is well-conceived and demonstrates a thoughtful integration of existing technologies. However, the reliance on NMT-generated pseudo-parallel data raises questions about the potential for noise and inaccuracies in the training process.
The experimental evaluation is robust, with the authors providing comprehensive results on standard benchmarks, including the CVSS-C test set. The reported ASR-BLEU scores indicate substantial improvements over existing systems, showcasing the effectiveness of the proposed method. The ablation studies conducted further validate the necessity of the joint training approach and the benefits of fine-tuning, providing a clear understanding of the model's capabilities. However, the experiments are primarily focused on a limited set of high-resource languages, which may not fully represent the model's performance across a broader linguistic landscape.
The paper includes detailed implementation details, including training procedures, dataset descriptions, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository or demo limits the ability for external validation of the results. The authors should consider releasing their code to facilitate further research and experimentation.
The paper acknowledges several limitations, including the focus on a narrow set of high-resource languages and the challenges associated with extending the framework to low-resource languages. Additionally, the potential for noise in NMT-generated targets is a concern that could affect the quality of the final translations. Future work should address these limitations to broaden the applicability of the framework.
The implications of RosettaSpeech are significant, as it provides a scalable solution for speech-to-speech translation in languages that lack parallel speech corpora. By enabling high-quality translation for a wider array of languages, this work has the potential to enhance communication across linguistic barriers and contribute to global accessibility. The framework's design could inspire further research into efficient translation methods that leverage abundant text data.
Deepfake (DF) audio detectors still struggle to generalize to out of distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while a cloned path, preceded by learnable, value-constrained SRM high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views for long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR unveils a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of SONAR, a frequency-guided framework for audio deepfake detection that effectively addresses spectral bias by disentangling low- and high-frequency audio components. This innovative approach not only improves detection performance but also accelerates model convergence, setting a new standard in the field of audio forensics.
The methodology presented in the paper is innovative, leveraging a dual-path framework to disentangle low-frequency content from high-frequency residuals in audio signals. The use of learnable spectral residual modules (SRM) and a Jensen-Shannon divergence loss to align real and fake audio embeddings is a significant advancement over existing methods. The frequency cross-attention mechanism enhances the model's ability to capture long- and short-range dependencies effectively. However, the complexity of the architecture may pose challenges for implementation and understanding.
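The sketch below illustrates the two ingredients in isolation, a value-constrained learnable high-pass filter and a Jensen-Shannon divergence between embedding distributions, under assumed shapes and constraints; it is not the SONAR architecture itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConstrainedHighPass(nn.Module):
    """Learnable 1-D filter whose taps are constrained to sum to zero (no DC response)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(1, 1, kernel_size) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Subtract the mean tap so the filter rejects low-frequency content.
        w = self.weight - self.weight.mean(dim=-1, keepdim=True)
        return F.conv1d(x, w, padding=self.weight.shape[-1] // 2)

def js_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric Jensen-Shannon divergence between two categorical distributions."""
    p = F.softmax(p_logits, dim=-1)
    q = F.softmax(q_logits, dim=-1)
    m = 0.5 * (p + q)
    kl_pm = (p * (p.clamp_min(1e-9).log() - m.clamp_min(1e-9).log())).sum(-1)
    kl_qm = (q * (q.clamp_min(1e-9).log() - m.clamp_min(1e-9).log())).sum(-1)
    return 0.5 * (kl_pm + kl_qm).mean()

# Usage: extract HF residuals from a batch of mono waveforms and compare two embedding sets.
x = torch.randn(4, 1, 16000)
residual = ConstrainedHighPass()(x)
loss = js_divergence(torch.randn(4, 128), torch.randn(4, 128))
```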
The experiments are robust, utilizing well-established benchmarks such as ASVspoof 2021 and In-the-Wild datasets. The paper demonstrates state-of-the-art performance and rapid convergence, achieving results that significantly outperform previous methods. The evaluation metrics are clearly defined, and the results are presented in a manner that allows for easy comparison with existing techniques. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The authors have taken steps to ensure reproducibility, including the use of publicly available datasets and detailed descriptions of their experimental setup. They mention that the code will be released upon acceptance, which is a positive aspect. However, the lack of specific URLs for the code repository or demo limits immediate accessibility for other researchers.
One limitation is the potential for overfitting due to the complexity of the model, especially when training on smaller datasets. Additionally, the reliance on high-frequency artifacts may not generalize well across all types of audio deepfakes, particularly those that may not exhibit clear high-frequency discrepancies. The paper does not address how the model performs in scenarios where high-frequency artifacts are less pronounced.
The implications of this work are significant, as deepfake audio detection is increasingly critical in various domains, including security, media integrity, and misinformation prevention. The proposed method could enhance the reliability of audio content verification systems, thereby contributing to the broader fight against misinformation and fraud in digital media.
The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. As those seeking to misuse AI-generated audio may thus seek to remove audio watermarks, studying effective watermark removal techniques is critical to being able to objectively evaluate the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.
Primary: University of Toronto
All Institutions: University of Toronto
The main contribution of this paper is the introduction of HarmonicAttack, an efficient audio watermark removal method that demonstrates improved performance over existing techniques. This research addresses critical security challenges posed by AI-generated audio, providing a foundation for future work in audio watermarking and security.
The proposed methodology, HarmonicAttack, utilizes a dual-path convolutional autoencoder that operates in both temporal and frequency domains, which is a notable innovation in the context of audio watermark removal. The integration of GAN-style training enhances the model's ability to separate watermarks from original audio effectively. However, the paper could benefit from a more detailed explanation of the architecture and training process, including hyperparameter choices and the rationale behind them.
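As an illustration of training an estimator in both temporal and frequency domains, the sketch below computes a dual-domain reconstruction loss (time-domain L1 plus STFT-magnitude L1); the exact objective and GAN components used by HarmonicAttack are not specified here, so this is an assumption-laden stand-in rather than the authors' loss.

```python
import torch
import torch.nn.functional as F

def dual_domain_loss(estimate: torch.Tensor, clean: torch.Tensor,
                     n_fft: int = 1024, hop: int = 256,
                     lambda_spec: float = 1.0) -> torch.Tensor:
    """L1 loss in the time domain plus L1 loss on STFT magnitudes."""
    time_term = F.l1_loss(estimate, clean)
    window = torch.hann_window(n_fft, device=estimate.device)
    spec_est = torch.stft(estimate, n_fft, hop_length=hop, window=window,
                          return_complex=True).abs()
    spec_ref = torch.stft(clean, n_fft, hop_length=hop, window=window,
                          return_complex=True).abs()
    spec_term = F.l1_loss(spec_est, spec_ref)
    return time_term + lambda_spec * spec_term

# Usage with a batch of mono waveforms shaped (batch, samples).
est, ref = torch.randn(2, 48000), torch.randn(2, 48000)
loss = dual_domain_loss(est, ref)
```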
The experimental evaluation is robust, comparing HarmonicAttack against established watermarking schemes such as AudioSeal, WavMark, and Silentcipher. The results indicate superior watermark removal capabilities and near real-time performance, which are significant advancements. However, the paper lacks detailed metrics on the performance comparisons, such as exact numerical values or visualizations of the results, which would strengthen the claims made.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. For the findings to be validated by the community, it is essential to include a clear description of the dataset used, the training process, and ideally, a link to a code repository.
One limitation is the reliance on the ability to generate watermarks from the targeted scheme, which may not be feasible in all scenarios. Additionally, while the model shows promise in transferring to out-of-distribution samples, the extent of this transferability and its implications on real-world applications remain unclear.
The implications of this research are significant, particularly in the context of combating misinformation and voice-cloning fraud. By improving watermark removal techniques, the study contributes to the ongoing dialogue on audio security and the ethical use of AI-generated content. The findings could influence future watermarking strategies and security measures in audio applications.
Extracting individual elements from music mixtures is a valuable tool for music production and practice. While neural networks optimized to mask or transform mixture spectrograms into the individual source(s) have been the leading approach, the source overlap and correlation in music signals pose an inherent challenge. Moreover, training these systems requires access to all sources in the mixture, which is often difficult to obtain. Attempts to address these challenges in a generative fashion exist; however, their separation performance and inference efficiency remain limited. In this work, we study the potential of diffusion models to help bridge this gap, focusing on generative singing voice separation that relies only on corresponding pairs of isolated vocals and mixtures for training. To align with creative workflows, we leverage latent diffusion: the system generates samples encoded in a compact latent space and subsequently decodes these into audio. This enables efficient optimization and faster inference. Our system is trained using only open data. We outperform existing generative separation systems and match the compared non-generative systems on a list of signal quality measures and on interference removal. We provide a noise robustness study on the latent encoder, offering insights on its potential for the task. We release a modular toolkit for further research on the topic.
Primary: Music.AI
All Institutions: Music.AI, Music Technology Group, Universitat Pompeu Fabra
The main contribution of this paper is the development of an efficient and effective generative model for singing voice separation using latent diffusion, which significantly advances the state of the art in music source separation. The combination of innovative methodology and rigorous evaluation positions this work as a valuable addition to the field of audio processing and machine learning.
The paper introduces a novel approach to singing voice separation using latent diffusion models (LDM), which operate in a compact latent space rather than directly in the audio domain. This method leverages the strengths of denoising diffusion probabilistic models (DDPM) while addressing the challenges of source overlap and the need for extensive training data. The use of a pre-trained neural audio codec (EnCodec) for generating latent representations is particularly innovative, as it allows for efficient training and inference. The methodology is well-structured, detailing the diffusion process, the architecture of the U-Net generator, and the conditioning mechanism that guides the separation process.
The authors conduct a thorough evaluation of their model against both generative and non-generative baselines using objective metrics such as log-spectral distance (LSD), Mel-spectrogram Mean Absolute Error (Mel-MAE), and perceptual evaluation metrics like PESQ. The results indicate that the proposed system outperforms existing generative models and matches the performance of non-generative systems on several metrics, demonstrating its effectiveness in real-world applications. Additionally, the perceptual tests provide valuable insights into the quality of the generated vocals, highlighting the importance of user-centered evaluations in audio processing.
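For readers unfamiliar with the metric, a minimal log-spectral distance implementation is sketched below; the paper's precise framing parameters and normalization may differ.

```python
import numpy as np

def log_spectral_distance(ref: np.ndarray, est: np.ndarray,
                          n_fft: int = 2048, hop: int = 512, eps: float = 1e-10) -> float:
    """Frame-wise RMS difference of log power spectra, averaged over frames."""
    def power_spec(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                           for i in range(n_frames)])
        return np.abs(np.fft.rfft(frames, axis=-1)) ** 2

    ref_db = 10.0 * np.log10(power_spec(ref) + eps)
    est_db = 10.0 * np.log10(power_spec(est) + eps)
    per_frame = np.sqrt(np.mean((ref_db - est_db) ** 2, axis=-1))
    return float(per_frame.mean())

# Usage with two equal-length mono signals.
ref, est = np.random.randn(48000), np.random.randn(48000)
print(log_spectral_distance(ref, est))
```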
The paper includes a modular Python toolkit released on GitHub, which facilitates reproducibility. The authors provide detailed information about the experimental setup, including the training process, data augmentation techniques, and the architecture of the model. However, the reliance on a specific pre-trained codec (EnCodec) may limit reproducibility for those without access to the same resources.
While the paper presents promising results, it acknowledges the presence of high-frequency artifacts and reconstruction errors that can affect the output quality. The authors suggest that fine-tuning the latent encoder with more vocal data could mitigate these issues. Additionally, the model's performance may vary based on the quality and diversity of the training data, which could limit its generalizability to other musical contexts.
The proposed method has significant implications for music production and education, as it provides a more accessible tool for vocal separation that can be utilized by musicians and audio engineers. By reducing the computational demands and training data requirements, this approach could democratize access to high-quality audio processing tools, fostering creativity and innovation in the music industry.
Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49 percentage points on highly detuned samples in terms of the raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of $4.32 \pm 0.15$, which is significantly higher than those of the widely used commercial APC tools, AutoTune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
Primary: Handong Global University
All Institutions: Handong Global University
The main contribution of this paper is the introduction of BERT-APC, a novel reference-free framework for automatic pitch correction that leverages musical context inference to improve pitch accuracy and maintain vocal expressiveness. This work represents a significant advancement in the field of audio processing and machine learning, addressing critical limitations of existing systems while providing a robust experimental evaluation of its effectiveness.
The methodology presented in BERT-APC is innovative, leveraging a music language model (MusicBERT) to address the limitations of existing Automatic Pitch Correction (APC) systems that rely on reference pitches. The framework consists of a three-stage process: a note segmentator, a stationary pitch predictor, and a context-aware note pitch predictor. The integration of a learnable data augmentation strategy to simulate realistic detuning patterns is a notable contribution, enhancing the robustness of the model. However, the paper could benefit from a more detailed explanation of the training process and hyperparameter tuning, as well as a clearer depiction of the model architecture.
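A toy version of note-level correction, which transposes a note's f0 contour by a constant semitone offset and thereby preserves vibrato and other deviations around the note centre, is sketched below; the threshold and interface are illustrative, not the BERT-APC algorithm.

```python
import numpy as np

def correct_note(f0_hz: np.ndarray, perceived_midi: float, target_midi: float,
                 threshold_cents: float = 30.0) -> np.ndarray:
    """Shift a note's f0 contour so its perceived pitch lands on the target note.

    The whole contour is transposed by a constant semitone offset, so vibrato and
    expressive bends relative to the note centre are preserved. Offsets below the
    threshold are left untouched as intentional expression.
    """
    offset_semitones = target_midi - perceived_midi
    if abs(offset_semitones) * 100.0 < threshold_cents:
        return f0_hz
    return f0_hz * (2.0 ** (offset_semitones / 12.0))

# Usage: a note sung ~40 cents sharp of A4 (MIDI 69), with light vibrato, snapped back.
f0 = 440.0 * 2 ** (0.4 / 12) * (1 + 0.01 * np.sin(np.linspace(0, 20, 200)))
corrected = correct_note(f0, perceived_midi=69.4, target_midi=69.0)
```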
The experiments conducted are robust, comparing BERT-APC against two recent singing voice transcription models and commercial tools, demonstrating significant improvements in pitch accuracy and expressive preservation. The use of Mean Opinion Score (MOS) tests to evaluate perceptual quality adds credibility to the findings. However, the dataset's diversity and the specific metrics used for evaluation could be elaborated upon to strengthen the experimental framework.
The paper provides substantial implementation details, including architecture specifications and training procedures, which are essential for reproducibility. However, the lack of a publicly available code repository limits the ease of reproduction for other researchers. Including a link to the code would greatly enhance the paper's reproducibility.
One limitation noted is the potential degradation of BERT-APC's performance on songs that deviate significantly from typical musical patterns. Additionally, while the model performs well on highly detuned samples, the paper does not address how it handles various genres or styles of music, which could affect generalizability.
The implications of this research are significant for the music production industry, particularly in enhancing vocal recordings without the need for reference pitches. This could democratize access to high-quality pitch correction tools for amateur musicians and content creators. The model's ability to preserve expressive nuances also opens avenues for more emotionally resonant music production.
Duo-Tok is a source-aware dual-codebook tokenizer for vocal-accompaniment music that targets the growing tension between reconstruction quality and language-model (LM) learnability in modern lyrics-to-song systems. Existing codecs either prioritize high-fidelity reconstruction with difficult-to-model acoustic tokens or compress aggressively into semantic tokens that are LM-friendly but lossy, and they rarely make the tokenizer itself aware of dual-track structure. Duo-Tok follows a four-stage, SSL-centered pipeline: we first pretrain a BEST-RQ-style encoder on large-scale audio, then stabilize and factorize the representation with Gaussian replacement noise and multi-task supervision, before freezing the encoder to learn SimVQ-based dual codebooks with hard routing for vocals and accompaniment, and finally training latent diffusion decoders on top of the discrete tokens. Duo-Tok at 0.75 kbps shifts the empirical reconstruction-generation Pareto frontier, achieving the best music-tagging AP and the lowest vocabulary-normalized LM perplexity among compared codecs while maintaining reconstruction quality comparable to state-of-the-art music tokenizers.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Duo-Tok, a novel dual-track semantic music tokenizer that effectively balances reconstruction quality and language model learnability for vocal-accompaniment generation. The proposed methodology and experimental results demonstrate a meaningful advancement in the field, although further work is needed to enhance reproducibility and explore broader applications.
The methodology presented in the paper is innovative, utilizing a dual-codebook approach that addresses the limitations of existing music tokenization methods. The four-stage pipeline is well-structured, beginning with pretraining on large-scale audio data, followed by representation stabilization and factorization, which enhances the model's robustness. The use of SimVQ-based dual codebooks for vocals and accompaniment is particularly noteworthy, as it allows for better semantic representation while maintaining high reconstruction quality. However, the paper could benefit from a more detailed explanation of the multi-task supervision and Gaussian replacement noise techniques, as these are critical to understanding the effectiveness of the proposed method.
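The sketch below shows plain nearest-neighbour quantization with two codebooks selected by a hard vocal/accompaniment route and a straight-through gradient; the actual SimVQ formulation and codebook sizes used in Duo-Tok may differ.

```python
import torch
import torch.nn as nn

class DualCodebookQuantizer(nn.Module):
    """Quantize frame embeddings with one of two codebooks, chosen by a hard source route."""
    def __init__(self, dim: int = 256, codebook_size: int = 1024):
        super().__init__()
        # Codebook 0: vocals, codebook 1: accompaniment.
        self.codebooks = nn.Parameter(torch.randn(2, codebook_size, dim))

    def forward(self, z: torch.Tensor, route: torch.Tensor):
        # z: (batch, frames, dim); route: (batch,) with values 0 or 1.
        books = self.codebooks[route]                       # (batch, codebook_size, dim)
        dists = torch.cdist(z, books)                       # (batch, frames, codebook_size)
        indices = dists.argmin(dim=-1)                      # discrete tokens
        quantized = torch.gather(
            books, 1, indices.unsqueeze(-1).expand(-1, -1, z.shape[-1]))
        # Straight-through estimator so gradients flow back to the encoder.
        return z + (quantized - z).detach(), indices

# Usage: two clips, the first routed to the vocal codebook, the second to accompaniment.
q = DualCodebookQuantizer()
tokens, ids = q(torch.randn(2, 100, 256), torch.tensor([0, 1]))
```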
The experiments conducted are comprehensive, comparing Duo-Tok against state-of-the-art codecs in terms of music tagging and language model perplexity. The results indicate a significant improvement in the empirical reconstruction-generation Pareto frontier, showcasing the effectiveness of the proposed approach. However, the paper lacks detailed descriptions of the datasets used, which could affect the reproducibility and generalizability of the results. Additionally, more extensive ablation studies could strengthen the claims regarding the advantages of the dual-codebook approach.
The paper does not provide sufficient implementation details or access to code repositories, which raises concerns about reproducibility. While the methodology is well-defined, the absence of a clear implementation guide or publicly available code makes it challenging for other researchers to replicate the results or build upon the work.
One limitation of the study is the lack of a thorough exploration of the trade-offs between reconstruction quality and learnability in different contexts. Additionally, the paper does not address potential scalability issues when applying the method to larger datasets or more complex music genres. The reliance on specific training data may also limit the applicability of the findings to diverse musical styles.
The implications of this research are significant for the fields of music generation and audio processing. By improving the efficiency and quality of vocal-accompaniment generation, Duo-Tok has the potential to enhance various applications, including music production, AI-assisted songwriting, and interactive music systems. The advancements in tokenization methods could also influence future research in related areas, such as audio synthesis and machine learning for creative tasks.
We introduce a novel method for designing attenuation filters in digital audio reverberation systems based on Feedback Delay Networks (FDNs). Our approach uses Second Order Sections (SOS) of Infinite Impulse Response (IIR) filters arranged as parametric equalizers (PEQ), enabling fine control over frequency-dependent reverberation decay. Unlike traditional graphic equalizer designs, which require numerous filters per delay line, we propose a scalable solution where the number of filters can be adjusted. The frequency, gain, and quality factor (Q) parameters are shared across delay lines, and only the gain is adjusted according to each line's delay length. This design not only reduces the number of optimization parameters but also remains fully differentiable and compatible with gradient-based learning frameworks. Leveraging principles of analog filter design, our method allows for efficient and accurate filter fitting using supervised learning. Our method delivers a flexible and differentiable design, achieving state-of-the-art performance while significantly reducing computational cost.
Primary: unknown
All Institutions: unknown
The paper presents a novel method for designing attenuation filters in digital audio reverberation systems, significantly improving the efficiency and performance of Feedback Delay Networks. The technical contributions are substantial, demonstrating a blend of digital signal processing and machine learning that could influence future research and applications in audio technology.
The proposed methodology leverages Second Order Sections of Infinite Impulse Response filters arranged as parametric equalizers, which is innovative in the context of Feedback Delay Networks. The design allows for a scalable and differentiable approach to filter design, optimizing parameters through gradient descent. This is a significant improvement over traditional methods that often struggle with differentiability and optimization complexity. The paper provides a clear mathematical foundation for the filter design and optimization process, demonstrating a solid understanding of both digital signal processing and machine learning principles.
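To illustrate the idea of sharing band parameters while scaling only the gain with delay length, the sketch below derives per-band attenuation from a target T60 and builds a peaking-equalizer second-order section using the standard RBJ audio-EQ-cookbook formulas; this is a conventional parameterization, not necessarily the authors' exact design.

```python
import numpy as np

def peaking_eq_sos(fc: float, gain_db: float, q: float, fs: float) -> np.ndarray:
    """RBJ audio-EQ-cookbook peaking filter as one second-order section [b0,b1,b2,a0,a1,a2]."""
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * np.pi * fc / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = np.array([1 + alpha * a, -2 * np.cos(w0), 1 - alpha * a])
    a_coef = np.array([1 + alpha / a, -2 * np.cos(w0), 1 - alpha / a])
    return np.concatenate([b, a_coef]) / a_coef[0]

def attenuation_gain_db(delay_samples: int, t60_sec: float, fs: float) -> float:
    """Per-delay-line attenuation needed at one band for a desired reverberation time."""
    return -60.0 * delay_samples / (fs * t60_sec)

# Shared band parameters (fc, Q); only the gain changes with delay length.
fs, fc, q_factor, t60 = 48000.0, 1000.0, 1.5, 2.0
for delay in (1021, 1523, 2011):
    sos = peaking_eq_sos(fc, attenuation_gain_db(delay, t60, fs), q_factor, fs)
    print(delay, np.round(sos, 5))
```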
The experiments are well-structured, utilizing a dataset of 1000 room impulse responses to validate the method's effectiveness. The evaluation metrics, including Mean Squared Error and Maximum Absolute Error, provide a robust framework for assessing performance. The results show that the proposed method achieves comparable accuracy to state-of-the-art approaches while using significantly fewer filters, which is a crucial aspect for real-time applications.
The implementation details are adequately described, and the availability of the code on GitHub enhances reproducibility. However, the paper could benefit from more explicit instructions on how to replicate the experiments, including specific configurations and parameter settings used during training.
One limitation is the reliance on a specific dataset of room impulse responses, which may not generalize to all acoustic environments. Additionally, while the method shows promise, further validation in real-world applications and across a broader range of scenarios would strengthen the findings. The paper also does not address potential challenges in integrating this method into existing audio processing pipelines.
The proposed method has significant implications for audio processing, particularly in real-time applications such as music production, virtual reality, and gaming. By reducing computational costs while maintaining high performance, this approach can enable more efficient audio effects processing in resource-constrained environments, potentially leading to broader adoption of advanced reverberation techniques in various audio applications.
Generation of dynamic, scalable multi-species bird soundscapes remains a significant challenge in computer music and algorithmic sound design. Birdsongs involve rapid frequency-modulated chirps, complex amplitude envelopes, distinctive acoustic patterns, overlapping calls, and dynamic inter-bird interactions, all of which require precise temporal and spatial control in 3D environments. Existing approaches, whether Digital Signal Processing (DSP)-based or data-driven, typically focus only on single species modeling, static call structures, or synthesis directly from recordings, and often suffer from noise, limited flexibility, or large data needs. To address these challenges, we present a novel, fully algorithm-driven framework that generates dynamic multi-species bird soundscapes using DSP-based chirp generation and 3D spatialization, without relying on recordings or training data. Our approach simulates multiple independently-moving birds per species along different moving 3D trajectories, supporting controllable chirp sequences, overlapping choruses, and realistic 3D motion in scalable soundscapes while preserving species-specific acoustic patterns. A visualization interface provides bird trajectories, spectrograms, activity timelines, and sound waves for analytical and creative purposes. Both visual and audio evaluations demonstrate the ability of the system to generate dense, immersive, and ecologically inspired soundscapes, highlighting its potential for computer music, interactive virtual environments, and computational bioacoustics research.
Primary: IntelliSky
All Institutions: IntelliSky, George Mason University, Stanford University
The paper introduces a novel framework for generating dynamic multi-species bird soundscapes using algorithmic methods, significantly advancing the field of computer music and ecological sound simulation. The comprehensive methodology and potential applications underscore its importance in both artistic and scientific domains.
The paper presents a robust and innovative methodology for generating dynamic multi-species bird soundscapes using a fully algorithmic approach. The use of Digital Signal Processing (DSP) techniques to synthesize chirps with species-specific characteristics, combined with 3D spatialization, is a significant advancement over existing methods that rely on recordings or machine learning. The framework is well-structured, detailing the stages of chirp generation, spatialization, and soundscape synthesis, with mathematical formulations provided for clarity. The integration of visualization tools for analysis further enhances the methodology's comprehensiveness.
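A minimal sketch of DSP-based chirp synthesis with an amplitude envelope is shown below; the frequency ranges, durations, and call structure are invented for illustration and do not correspond to any particular species model in the paper.

```python
import numpy as np
from scipy.signal import chirp
from scipy.signal.windows import tukey

def synth_chirp(f_start: float, f_end: float, duration: float,
                fs: int = 44100, method: str = "quadratic") -> np.ndarray:
    """One frequency-modulated bird-like chirp with a smooth amplitude envelope."""
    t = np.linspace(0.0, duration, int(fs * duration), endpoint=False)
    tone = chirp(t, f0=f_start, t1=duration, f1=f_end, method=method)
    envelope = tukey(len(t), alpha=0.6)  # soft attack and release
    return tone * envelope

def synth_call(fs: int = 44100) -> np.ndarray:
    """A short call made of a few descending chirps separated by silences."""
    gap = np.zeros(int(0.05 * fs))
    chirps = [synth_chirp(6000, 3500, 0.12, fs), synth_chirp(5500, 3000, 0.10, fs)]
    return np.concatenate([chirps[0], gap, chirps[1]])

call = synth_call()
```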
The experiments conducted demonstrate the system's capabilities effectively, showcasing the generation of diverse soundscapes with multiple bird species. The use of visualizations and audio evaluations to assess the quality of generated sounds is commendable, providing a clear understanding of the system's performance. However, the paper could benefit from more quantitative metrics or perceptual tests to validate the effectiveness of the soundscapes in comparison to real-world recordings.
The implementation is described in detail, with a clear outline of the framework stages and the mathematical models used. However, the lack of a publicly available code repository limits reproducibility, as other researchers may find it challenging to replicate the results without access to the underlying code.
While the approach is innovative, it may not fully capture the complexity of real-world bird interactions, as it relies on algorithmic generation without incorporating environmental factors or real-time interactivity. Additionally, the absence of a comparative analysis with existing methods in terms of sound quality and realism could be seen as a limitation.
The potential applications of this work are extensive, ranging from computer music and interactive virtual environments to ecological simulations and bioacoustics research. The ability to generate realistic and scalable bird soundscapes could enhance immersive experiences in various fields, including entertainment and environmental education.
This paper presents the design, implementation, and evolution of a comprehensive multimodal room-monitoring system that integrates synchronized video and audio processing for real-time activity recognition and anomaly detection. We describe two iterations of the system: an initial lightweight implementation using YOLOv8, ByteTrack, and the Audio Spectrogram Transformer (AST), and an advanced version that incorporates multi-model audio ensembles, hybrid object detection, bidirectional cross-modal attention, and multi-method anomaly detection. The evolution demonstrates significant improvements in accuracy, robustness, and industrial applicability. The advanced system combines three audio models (AST, Wav2Vec2, and HuBERT) for comprehensive audio understanding, dual object detectors (YOLO and DETR) for improved accuracy, and sophisticated fusion mechanisms for enhanced cross-modal learning. Experimental evaluation shows the system's effectiveness in general monitoring scenarios as well as specialized industrial safety applications, achieving real-time performance on standard hardware while maintaining high accuracy.
Primary: IIT Bombay
All Institutions: IIT Bombay
The paper presents a comprehensive evolution of a multimodal room monitoring system, demonstrating significant advancements in real-time anomaly detection through innovative methodologies and robust experimental evaluations. The integration of audio and video processing, along with sophisticated fusion techniques, positions this work as a valuable contribution to the field of machine learning and its applications in safety-critical environments.
The paper presents a well-structured methodology for a multimodal room monitoring system that integrates audio and video processing for real-time anomaly detection. The initial system employs YOLOv8 for object detection and the Audio Spectrogram Transformer (AST) for audio classification, while the advanced system enhances this with a multi-model audio ensemble, hybrid object detection, and bidirectional cross-modal attention. The use of a lightweight cross-modal transformer architecture for fusion is innovative, allowing for efficient real-time processing. The detailed architectural evolution and the introduction of multiple anomaly detection methods demonstrate a thorough understanding of the challenges in multimodal systems.
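A compact sketch of bidirectional cross-modal attention between audio and video token sequences is given below; the embedding dimension, pooling, and classifier head are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BidirectionalCrossModalFusion(nn.Module):
    """Audio attends to video and video attends to audio; pooled outputs are concatenated."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, 2)  # normal vs. anomalous

    def forward(self, audio_tokens: torch.Tensor, video_tokens: torch.Tensor):
        a_enriched, _ = self.audio_to_video(audio_tokens, video_tokens, video_tokens)
        v_enriched, _ = self.video_to_audio(video_tokens, audio_tokens, audio_tokens)
        fused = torch.cat([a_enriched.mean(dim=1), v_enriched.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Usage: 8 audio frames and 16 video frames per clip, batch of 2.
model = BidirectionalCrossModalFusion()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 16, 256))
```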
The experimental evaluation is comprehensive, showcasing the system's effectiveness in various scenarios, including industrial safety applications. The paper discusses the performance of the system in terms of accuracy and real-time processing capabilities, which are critical for practical applications. However, specific quantitative results and comparisons with baseline methods could enhance the evaluation's rigor.
The paper provides a detailed description of the system architecture, including preprocessing steps, model configurations, and fusion mechanisms, which aids reproducibility. However, the absence of a publicly accessible code repository or demo limits the ability for others to replicate the results. Including such resources would significantly enhance the reproducibility of the findings.
The paper acknowledges the increased computational requirements of the advanced system compared to the initial implementation. Additionally, while the system is designed for real-time performance, the trade-off between accuracy and efficiency is a concern that requires careful consideration in deployment scenarios. The reliance on multiple models may also complicate the system's integration into existing industrial setups.
The proposed system has significant implications for various fields, including industrial monitoring, smart homes, and healthcare. By effectively detecting anomalies in real-time, the system can enhance safety and operational efficiency in critical environments. The integration of multimodal data processing represents a step forward in developing intelligent monitoring systems that can adapt to complex real-world scenarios.
Advances in object tracking and acoustic beamforming are driving new capabilities in surveillance, human-computer interaction, and robotics. This work presents an embedded system that integrates deep learning-based tracking with beamforming to achieve precise sound source localization and directional audio capture in dynamic environments. The approach combines single-camera depth estimation and stereo vision to enable accurate 3D localization of moving objects. A planar concentric circular microphone array constructed with MEMS microphones provides a compact, energy-efficient platform supporting 2D beam steering across azimuth and elevation. Real-time tracking outputs continuously adapt the array's focus, synchronizing the acoustic response with the target's position. By uniting learned spatial awareness with dynamic steering, the system maintains robust performance in the presence of multiple or moving sources. Experimental evaluation demonstrates significant gains in signal-to-interference ratio, making the design well-suited for teleconferencing, smart home devices, and assistive technologies.
Primary: Universidad Carlos III de Madrid
All Institutions: Universidad Carlos III de Madrid, Universidad de Valencia
This work presents a compact, energy-efficient embedded system that integrates visual depth estimation with acoustic beamforming for real-time directional audio capture. The combination of deep learning and advanced signal processing techniques demonstrates a meaningful contribution to the field of audio processing and machine learning, particularly in dynamic environments.
The paper presents a novel integration of deep learning-based object tracking with acoustic beamforming, utilizing a compact MEMS microphone array and an NVIDIA Jetson Orin Nano for real-time processing. The methodology effectively combines stereo vision for depth estimation and a frequency-domain delay-and-sum beamformer, demonstrating a well-structured approach to achieving low-latency audio capture in dynamic environments. The choice of YOLOv11 for object detection and the optimization strategies for real-time performance are commendable, showcasing a thoughtful balance between computational efficiency and accuracy.
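As a rough illustration of the frequency-domain delay-and-sum idea described above, the sketch below steers a planar array toward a given azimuth/elevation by phase-aligning each channel; the far-field assumption, array radius, block length, and sample rate are all illustrative and do not reflect the authors' exact geometry or implementation.

```python
import numpy as np

def delay_and_sum(frames, mic_xy, azimuth, elevation, fs=16000, c=343.0):
    """Frequency-domain delay-and-sum beamformer (far-field sketch).

    frames : (num_mics, num_samples) time-domain block
    mic_xy : (num_mics, 2) planar microphone coordinates in metres
    azimuth, elevation : steering angles in radians
    """
    num_mics, n = frames.shape
    # Unit vector toward the source, projected onto the array plane.
    direction = np.array([np.cos(elevation) * np.cos(azimuth),
                          np.cos(elevation) * np.sin(azimuth)])
    delays = mic_xy @ direction / c                    # per-mic delay in seconds
    spectra = np.fft.rfft(frames, axis=1)              # (num_mics, n_bins)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)             # (n_bins,)
    # Compensate each channel's propagation delay, then average across mics.
    steering = np.exp(-2j * np.pi * freqs[None, :] * delays[:, None])
    aligned = spectra * steering
    return np.fft.irfft(aligned.mean(axis=0), n=n)

# Example: 8-mic circular layout (4 cm radius), steer to 30° azimuth, 0° elevation.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
mics = 0.04 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
block = np.random.randn(8, 1024)
enhanced = delay_and_sum(block, mics, np.deg2rad(30), 0.0)
```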
The experimental evaluation is robust, with tests conducted in both anechoic and dynamic environments to assess the system's performance under varying conditions. The use of signal-to-interference ratio (SIR) as a metric for performance evaluation is appropriate, and the results indicate significant improvements in SIR with the proposed system. However, the paper could benefit from more detailed statistical analysis and comparisons with baseline methods to further substantiate the claims of performance enhancement.
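For reference, the SIR metric mentioned above is conventionally the ratio of target power to interference power in dB; the helper below uses that textbook definition and is not tied to the paper's particular estimator.

```python
import numpy as np

def sir_db(target, interference, eps=1e-12):
    """Signal-to-interference ratio in dB for separated target/interference
    components (standard definition; the paper may compute it differently)."""
    p_target = np.mean(np.asarray(target) ** 2)
    p_interf = np.mean(np.asarray(interference) ** 2)
    return 10.0 * np.log10((p_target + eps) / (p_interf + eps))

print(sir_db(np.ones(100), 0.1 * np.ones(100)))  # 20.0 dB
```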
While the paper provides a comprehensive description of the system architecture and experimental setup, it lacks specific implementation details that would aid in reproducibility. Key parameters for the algorithms used, as well as the datasets employed for training and testing, should be explicitly stated to enable others to replicate the study effectively.
The paper acknowledges some limitations, such as the potential variability in performance due to environmental factors and the reliance on specific hardware configurations. Additionally, the omission of a dedicated multi-object tracking algorithm may limit the system's effectiveness in scenarios with closely spaced sound sources.
The proposed system has significant implications for applications in teleconferencing, smart home devices, and assistive technologies, where precise sound localization and directional audio capture are critical. The integration of visual and acoustic modalities opens avenues for further research in multimodal perception systems, potentially enhancing human-computer interaction and situational awareness in various domains.
Video-to-Audio (V2A) generation requires balancing four critical perceptual dimensions: semantic consistency, audio-visual temporal synchrony, aesthetic quality, and spatial accuracy; yet existing methods suffer from objective entanglement that conflates competing goals in single loss functions and lack human preference alignment. We introduce PrismAudio, the first framework to integrate Reinforcement Learning into V2A generation with specialized Chain-of-Thought (CoT) planning. Our approach decomposes monolithic reasoning into four specialized CoT modules (Semantic, Temporal, Aesthetic, and Spatial CoT), each paired with targeted reward functions. This CoT-reward correspondence enables multidimensional RL optimization that guides the model to jointly generate better reasoning across all perspectives, solving the objective entanglement problem while preserving interpretability. To make this optimization computationally practical, we propose Fast-GRPO, which employs hybrid ODE-SDE sampling that dramatically reduces the training overhead compared to existing GRPO implementations. We also introduce AudioCanvas, a rigorous benchmark that is more distributionally balanced and covers more realistically diverse and challenging scenarios than existing datasets, with 300 single-event classes and 501 multi-event samples. Experimental results demonstrate that PrismAudio achieves state-of-the-art performance across all four perceptual dimensions on both the in-domain VGGSound test set and out-of-domain AudioCanvas benchmark. The project page is available at https://PrismAudio-Project.github.io.
Primary: unknown
All Institutions: unknown
PrismAudio presents a novel framework for Video-to-Audio generation that effectively addresses the challenges of objective entanglement through the use of specialized Chain-of-Thought modules and Reinforcement Learning. The contributions made in methodology, dataset creation, and experimental validation mark a significant step forward in the field of audio generation, although the paper would benefit from improved reproducibility and a more thorough exploration of its limitations.
The methodology presented in PrismAudio is innovative, particularly in its integration of Reinforcement Learning (RL) with specialized Chain-of-Thought (CoT) modules. The decomposition of the V2A generation task into four distinct CoT modules (Semantic, Temporal, Aesthetic, and Spatial) is a significant advancement that addresses the issue of objective entanglement in existing models. The targeted reward functions associated with each module allow for a more nuanced optimization process, which is a notable improvement over traditional single loss function approaches. Furthermore, the introduction of Fast-GRPO for efficient training is commendable, as it enhances the practicality of the proposed framework.
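The exact reward functions are not reproduced here; the sketch below only illustrates the general pattern that the CoT-reward correspondence implies, namely scoring each sampled generation along the four perceptual dimensions and converting the weighted sum into a group-normalized (GRPO-style) advantage. The dimension names are taken from the abstract, while the equal weighting and normalization constant are assumptions.

```python
import numpy as np

def grouped_advantages(rewards_per_dim, weights=None):
    """Combine per-dimension rewards for K sampled generations and turn them
    into group-normalized advantages. Equal weights are assumed for illustration.

    rewards_per_dim : dict mapping dimension name -> array of shape (K,)
    """
    dims = list(rewards_per_dim)
    weights = weights or {d: 1.0 / len(dims) for d in dims}
    total = sum(weights[d] * np.asarray(rewards_per_dim[d], dtype=float) for d in dims)
    # Group-relative normalization: each sample is compared against its own group.
    return (total - total.mean()) / (total.std() + 1e-8)

# Example: 4 sampled audio clips scored on the four perceptual dimensions.
scores = {
    "semantic":  np.array([0.8, 0.6, 0.7, 0.9]),
    "temporal":  np.array([0.5, 0.9, 0.6, 0.7]),
    "aesthetic": np.array([0.7, 0.7, 0.8, 0.6]),
    "spatial":   np.array([0.6, 0.5, 0.9, 0.8]),
}
print(grouped_advantages(scores))
```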
The experimental results are robust, showcasing state-of-the-art performance across multiple perceptual dimensions on both the in-domain VGGSound test set and the out-of-domain AudioCanvas benchmark. The creation of the AudioCanvas dataset itself is a valuable contribution, as it provides a more balanced and diverse set of scenarios for evaluating V2A generation models. However, the paper could benefit from a more detailed analysis of the experimental setup and the specific metrics used to assess performance.
The paper lacks sufficient detail regarding the implementation, which could hinder reproducibility. While it mentions the use of hybrid ODE-SDE sampling, further specifics on the architecture, hyperparameters, and training procedures would be beneficial for other researchers looking to replicate the results. The absence of a publicly available code repository is a significant drawback in this regard.
One limitation is the potential complexity introduced by the multi-dimensional optimization process, which may require careful tuning of the reward functions to achieve optimal performance. Additionally, while the paper claims state-of-the-art results, it does not sufficiently address how the model performs in edge cases or under varying conditions that may not be represented in the training data.
The implications of this research are substantial, as it could pave the way for more sophisticated audio generation systems that can be applied in various fields, including film, gaming, and virtual reality. By improving the alignment of generated audio with visual content, this work has the potential to enhance user experience in multimedia applications.
The main contribution of this paper is the introduction of PrismAudio, a novel framework that integrates Reinforcement Learning with specialized Chain-of-Thought planning to improve video-to-audio generation across multiple perceptual dimensions. This work is significant as it addresses critical limitations in existing methods, providing a more interpretable and effective approach to V2A generation, with promising experimental results that suggest a strong potential for real-world applications.
The methodology presented in PrismAudio is innovative, particularly in its integration of Reinforcement Learning (RL) with specialized Chain-of-Thought (CoT) modules. By decomposing the V2A generation task into distinct perceptual dimensions—semantic, temporal, aesthetic, and spatial—the authors effectively address the issue of objective entanglement that hampers existing methods. The introduction of Fast-GRPO for computational efficiency is a significant contribution, as it enhances the practicality of their approach. However, the paper could benefit from a more detailed explanation of the CoT modules and how they interact within the RL framework.
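Fast-GRPO's hybrid ODE-SDE sampling is described only at a high level; the sketch below shows the general shape such a scheme can take, using stochastic (SDE-like) updates for the early denoising steps and switching to deterministic (ODE) Euler steps afterwards. The switch point, noise schedule, noise scale, and denoiser interface are all illustrative assumptions rather than the paper's algorithm.

```python
import torch

def hybrid_sample(denoiser, x, sigmas, stochastic_steps=10):
    """Euler-style sampler: stochastic updates for the first `stochastic_steps`
    steps, deterministic updates afterwards. Interfaces are illustrative."""
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        d = (x - denoiser(x, sigma)) / sigma        # direction from the x0 estimate
        x = x + d * (sigma_next - sigma)            # deterministic (ODE) Euler step
        if i < stochastic_steps and sigma_next > 0:
            # Early phase: re-inject noise, making the step SDE-like.
            x = x + torch.randn_like(x) * (sigma_next * 0.5)
    return x

# Toy usage with a stand-in denoiser on a 1-second, 16 kHz mono "latent".
sigmas = torch.linspace(1.0, 0.0, 30)
dummy_denoiser = lambda x, s: torch.zeros_like(x)   # placeholder for the real model
sample = hybrid_sample(dummy_denoiser, torch.randn(1, 16000), sigmas)
```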
The experimental results are compelling, demonstrating state-of-the-art performance on both the VGGSound and AudioCanvas benchmarks. The introduction of AudioCanvas as a more balanced and diverse dataset is a notable advancement, as it allows for a more rigorous evaluation of the proposed method. However, the paper could improve by providing more comprehensive comparisons with other state-of-the-art methods and discussing the implications of the results in greater detail.
The paper mentions the availability of a project page, which is a positive step towards reproducibility. However, it lacks detailed implementation specifics, such as hyperparameters, training procedures, and code availability, which are crucial for other researchers to replicate the results. Providing a clear link to the code repository would enhance the reproducibility of the findings.
One limitation of the study is the potential complexity of the proposed framework, which may hinder its adoption in practical applications. Additionally, while the model achieves high performance across multiple dimensions, the paper does not sufficiently address how it performs in real-world scenarios or with varying input quality. There is also a lack of discussion on the computational resources required for training and inference, which could be a barrier for wider use.
The implications of this work are significant, as it opens new avenues for audio generation in multimedia applications, including film, gaming, and virtual reality. By improving the alignment of generated audio with visual content, PrismAudio could enhance user experience in these domains. However, ethical considerations regarding the use of AI-generated audio in media should be addressed, particularly concerning authenticity and potential misuse.
Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored. We introduce the Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding across both textual (ABC notation) and visual (PDF) modalities. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others, organized into four progressive levels of comprehension: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Through extensive zero-shot and fine-tuned evaluations of more than 15 state-of-the-art (SOTA) models, we reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness. Fine-tuning markedly improves performance in both modalities while preserving general knowledge, establishing MSU-Bench as a rigorous foundation for future research at the intersection of Artificial Intelligence (AI), musicology, and multimodal reasoning.
Primary: unknown
All Institutions: unknown
The paper presents the Musical Score Understanding Benchmark (MSU-Bench), a pioneering effort to evaluate large language models' comprehension of musical scores, highlighting significant gaps in current models' abilities and establishing a foundation for future research in AI and music. The combination of textual and visual modalities, along with a structured assessment framework, marks a notable contribution to the field, although further details on methodology and reproducibility could enhance its impact.
The paper introduces a novel benchmark, MSU-Bench, which is a significant advancement in evaluating LLMs and VLMs in the context of musical score understanding. The methodology is well-structured, with a clear delineation of comprehension levels that allows for a comprehensive assessment of models. The approach of combining both textual and visual modalities is innovative, addressing a gap in existing research. However, the paper could benefit from a more detailed explanation of the criteria for selecting the QA pairs and the rationale behind the progressive levels of comprehension.
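For readers thinking about how to consume such a benchmark, a minimal sketch of how the four-level QA structure might be represented and scored is given below; the field names, exact-match judging, and schema are assumptions rather than the released format.

```python
from dataclasses import dataclass
from collections import defaultdict

LEVELS = ["Onset Information", "Notation & Note", "Chord & Harmony", "Texture & Form"]

@dataclass
class ScoreQA:
    """One benchmark item (illustrative fields, not the released schema)."""
    piece: str       # e.g. "Chopin, Nocturne Op. 9 No. 2"
    modality: str    # "abc" or "pdf"
    level: str       # one of LEVELS
    question: str
    answer: str

def levelwise_accuracy(items, predictions):
    """Exact-match accuracy per comprehension level (a simplification of
    whatever judging protocol the benchmark actually specifies)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for item, pred in zip(items, predictions):
        totals[item.level] += 1
        hits[item.level] += int(pred.strip().lower() == item.answer.strip().lower())
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}
```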
The experiments are robust, involving a wide range of state-of-the-art models and thorough evaluations in both zero-shot and fine-tuned settings. The results highlight significant modality gaps and the challenges of achieving multilevel correctness, providing valuable insights into the capabilities of current models. However, the paper could improve by including more quantitative metrics and comparisons with baseline models to strengthen the findings.
The paper lacks detailed implementation specifics, such as the exact configurations used for the fine-tuning of models and the preprocessing steps for the musical scores. This omission may hinder reproducibility efforts by other researchers. Including a supplementary material or a dedicated section with these details would enhance the paper's reproducibility.
One limitation is the reliance on a specific set of composers, which may not fully represent the diversity of musical styles and complexities. Additionally, the benchmark may not account for the nuances of musical interpretation, which could affect the generalizability of the results. The paper also does not address potential biases in the dataset or the models used.
The establishment of MSU-Bench has the potential to significantly impact the fields of AI and musicology by providing a standardized framework for evaluating musical understanding in AI systems. This could lead to advancements in music generation, analysis, and education tools, fostering greater interaction between AI and the arts. The research opens avenues for interdisciplinary collaboration and could inspire further exploration into multimodal AI applications.
Implicit Neural Representations (INRs) have emerged as a powerful paradigm for representing signals such as images, audio, and 3D scenes. However, existing INR frameworks -- including MLPs with Fourier features, SIREN, and multiresolution hash grids -- implicitly assume a \textit{global and stationary} spectral basis. This assumption is fundamentally misaligned with real-world signals whose frequency characteristics vary significantly across space, exhibiting local high-frequency textures, smooth regions, and frequency drift phenomena. We propose \textbf{Neural Spectral Transport Representation (NSTR)}, the first INR framework that \textbf{explicitly models a spatially varying local frequency field}. NSTR introduces a learnable \emph{frequency transport equation}, a PDE that governs how local spectral compositions evolve across space. Given a learnable local spectrum field $S(x)$ and a frequency transport network $F_\theta$ enforcing $\nabla S(x) \approx F_\theta(x, S(x))$, NSTR reconstructs signals by spatially modulating a compact set of global sinusoidal bases. This formulation enables strong local adaptivity and offers a new level of interpretability via visualizing frequency flows. Experiments on 2D image regression, audio reconstruction, and implicit 3D geometry show that NSTR achieves significantly better accuracy-parameter trade-offs than SIREN, Fourier-feature MLPs, and Instant-NGP. NSTR requires fewer global frequencies, converges faster, and naturally explains signal structure through spectral transport fields. We believe NSTR opens a new direction in INR research by introducing explicit modeling of the space-varying spectrum.
Primary: unknown
All Institutions: unknown
The paper presents NSTR, a novel framework for modeling spatially varying frequency fields in implicit neural representations, which enhances expressivity, stability, and interpretability in signal reconstruction. The innovative use of a learnable PDE for frequency transport represents a significant advancement in the field, addressing key limitations of existing INR methodologies.
The proposed methodology introduces a novel framework, NSTR, which explicitly models spatially varying frequency fields through a learnable frequency transport PDE. This approach effectively decouples global frequency content from local spectral variation, allowing for adaptive representation of signals. The use of a PDE to govern the evolution of the local spectrum is particularly innovative, as it introduces a structured constraint that enhances interpretability and stability in the representation learning process. The parameterization of the local spectrum field using a coarse grid and a lightweight MLP is efficient, addressing the limitations of traditional INRs that rely on fixed global bases.
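Reading the abstract's formulation literally, a minimal 1D sketch of the two ingredients, a local spectrum field gating a few global sinusoidal bases and a penalty on the transport residual $\nabla S(x) - F_\theta(x, S(x))$, might look as follows. The network sizes, number of bases, and loss weighting are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyNSTR1D(nn.Module):
    """1D sketch: a local spectrum field S(x) gates K global sinusoidal bases,
    and a transport network F_theta predicts dS/dx (illustrative sizes)."""

    def __init__(self, num_bases: int = 16):
        super().__init__()
        self.omegas = nn.Parameter(torch.linspace(1.0, 64.0, num_bases))  # global frequencies
        self.spectrum = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, num_bases))
        self.transport = nn.Sequential(nn.Linear(1 + num_bases, 64), nn.ReLU(),
                                       nn.Linear(64, num_bases))
        self.head = nn.Linear(num_bases, 1)

    def forward(self, x):                        # x: (N, 1) coordinates
        s = self.spectrum(x)                     # local spectrum field S(x), (N, K)
        basis = torch.sin(x * self.omegas)       # (N, K) global sinusoidal bases
        return self.head(s * basis), s           # spatially modulated reconstruction

def transport_residual(model, x):
    """Squared residual of the constraint dS/dx ≈ F_theta(x, S(x)),
    computed per basis component with autograd (slow but explicit)."""
    x = x.clone().requires_grad_(True)
    _, s = model(x)                                            # (N, K)
    pred = model.transport(torch.cat([x, s], dim=-1))          # F_theta(x, S(x))
    grads = [torch.autograd.grad(s[:, k].sum(), x, create_graph=True)[0]
             for k in range(s.shape[1])]                       # each (N, 1)
    grad_s = torch.cat(grads, dim=-1)                          # (N, K)
    return ((grad_s - pred) ** 2).mean()

# Toy usage: fit a sinusoid on coordinates in [0, 1] with an assumed 0.1 loss weight.
model = TinyNSTR1D()
coords = torch.rand(128, 1)
recon, _ = model(coords)
loss = torch.nn.functional.mse_loss(recon, torch.sin(8 * coords)) \
       + 0.1 * transport_residual(model, coords)
```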
The experiments conducted across diverse tasks, including 2D image regression, audio waveform reconstruction, and implicit 3D geometry, demonstrate the effectiveness of NSTR in achieving superior accuracy-parameter trade-offs compared to existing methods like SIREN and Fourier-feature MLPs. The evaluation metrics used are appropriate for the tasks, and the results indicate significant improvements in fidelity and convergence speed. However, the paper could benefit from additional quantitative comparisons and visualizations to further substantiate its claims.
While the paper provides a detailed description of the architecture and training setup, it lacks specific implementation details such as code availability or links to datasets used for experiments. This hinders reproducibility, as independent researchers may struggle to replicate the results without access to the exact configurations and data.
One limitation is the lack of real-world application examples, as the experiments are primarily conducted on standard datasets. Additionally, the paper does not address potential computational overhead associated with the learnable PDE, which may impact scalability in more complex scenarios. The reliance on a fixed number of global frequencies may also limit the adaptability of the model in highly variable signal contexts.
The introduction of NSTR has the potential to significantly advance the field of implicit neural representations by providing a more flexible and interpretable framework for modeling complex signals. Its applications could extend to various domains, including graphics, audio processing, and scientific simulations, where understanding local frequency variations is crucial. The ability to visualize frequency flows could also enhance interpretability in machine learning models, fostering trust and understanding in AI systems.
Text-to-speech (TTS) and text-to-music (TTM) models face significant limitations in instruction-based control. TTS systems usually depend on reference audio for timbre, offer only limited text-level attribute control, and rarely support dialogue generation. TTM systems are constrained by input conditioning requirements that depend on expert knowledge annotations. The high heterogeneity of these input control conditions makes them difficult to model jointly with speech synthesis. Despite sharing common acoustic modeling characteristics, these two tasks have long been developed independently, leaving open the challenge of achieving unified modeling through natural language instructions. We introduce InstructAudio, a unified framework that enables instruction-based (natural language descriptions) control of acoustic attributes including timbre (gender, age), paralinguistic (emotion, style, accent), and musical (genre, instrument, rhythm, atmosphere). It supports expressive speech, music, and dialogue generation in English and Chinese. The model employs joint and single diffusion transformer layers with a standardized instruction-phoneme input format, trained on 50K hours of speech and 20K hours of music data, enabling multi-task learning and cross-modal alignment. Fig. 1 visualizes performance comparisons with mainstream TTS and TTM models, demonstrating that InstructAudio achieves optimal results on most metrics. To the best of our knowledge, InstructAudio represents the first instruction-controlled framework unifying speech and music generation. Audio samples are available at: https://qiangchunyu.github.io/InstructAudio/
Primary: Institute of Automation, Chinese Academy of Sciences
All Institutions: Institute of Automation, Chinese Academy of Sciences, Tianjin University
InstructAudio represents a significant advancement in unified audio generation, combining speech and music synthesis under a single instruction-controlled framework. This innovative approach not only enhances the flexibility of audio generation but also sets a foundation for future research in multimodal AI systems.
The methodology presented in InstructAudio is robust, employing a multimodal diffusion transformer architecture that effectively integrates both speech and music generation tasks. The authors introduce a standardized instruction-phoneme input format that allows for unified control over various acoustic attributes through natural language descriptions. This approach is innovative, as it addresses the limitations of existing models that require reference audio for timbre control, thus enabling a more flexible and user-friendly interaction with the model. The use of joint and single diffusion transformer layers is well-justified, and the training on a substantial dataset of 50K hours of speech and 20K hours of music enhances the model's capacity for multi-task learning and cross-modal alignment.
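The "standardized instruction-phoneme input format" is described only at a high level; the sketch below shows one plausible way to pack a natural-language instruction and a phoneme sequence into a single token stream with separator tokens and a modality-type mask. The special-token ids and vocabulary are assumptions, not the paper's tokenization.

```python
def build_instruction_phoneme_input(instruction_ids, phoneme_ids,
                                    bos_id=0, sep_id=1, eos_id=2):
    """Pack instruction tokens and phoneme tokens into one sequence with a
    type mask (0 = instruction, 1 = phoneme). Purely illustrative; the paper's
    actual tokenization and special tokens are not specified here."""
    tokens = [bos_id] + list(instruction_ids) + [sep_id] + list(phoneme_ids) + [eos_id]
    type_ids = [0] * (len(instruction_ids) + 2) + [1] * (len(phoneme_ids) + 1)
    return tokens, type_ids

# Example with made-up vocabulary ids.
tokens, type_ids = build_instruction_phoneme_input(
    instruction_ids=[101, 102, 103],   # e.g. "a calm female voice, gentle style"
    phoneme_ids=[501, 502, 503, 504],  # phonemes of the target text
)
assert len(tokens) == len(type_ids)
```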
The experimental evaluation is thorough, comparing InstructAudio against state-of-the-art models in both TTS and TTM tasks. The authors provide a comprehensive set of metrics, including objective measures like Word Error Rate (WER) and subjective evaluations such as Mean Opinion Scores (MOS). The results demonstrate that InstructAudio achieves superior performance in instruction-based TTS tasks while maintaining competitive capabilities in music generation. However, the paper could benefit from additional clarity in the presentation of results, particularly in the tables and figures, to enhance the reader's understanding of the comparative performance.
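Since WER is one of the headline objective metrics, a small reference implementation of the standard Levenshtein-based definition is included below; it is not tied to the paper's evaluation toolkit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length,
    computed by dynamic programming over words (standard definition)."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # 0.333...
```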
The paper provides a detailed account of the architecture, training process, and datasets used, which supports reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. The authors mention a significant dataset and specific training configurations, but sharing the code and model weights would greatly enhance reproducibility.
The paper acknowledges some limitations, such as the inherent information loss associated with the text-only control mechanism, which can lead to one-to-many mapping ambiguities and potentially lower audio quality compared to reference audio-based methods. Additionally, the constraint of generating short audio clips for music may limit the model's applicability in scenarios requiring longer compositions. These limitations are important to consider for future work.
The potential applications of InstructAudio are significant, spanning various domains such as entertainment, education, and accessibility. By enabling unified control over speech and music generation through natural language instructions, this framework could facilitate more intuitive interactions with AI systems in creative fields. The ability to generate expressive speech and music could also enhance user experiences in virtual environments and assistive technologies.
Audio classifiers frequently face domain shift, where models trained on one dataset lose accuracy on data recorded in acoustically different conditions. Previous Test-Time Adaptation (TTA) research in speech and sound analysis often evaluates models under fixed or mismatched noise settings that fail to mimic real-world variability. To overcome these limitations, this paper presents DHAuDS (Dynamic and Heterogeneous Audio Domain Shift), a benchmark designed to assess TTA approaches under more realistic and diverse acoustic shifts. DHAuDS comprises four standardized benchmarks: UrbanSound8K-C, SpeechCommandsV2-C, VocalSound-C, and ReefSet-C, each constructed with dynamic corruption severity levels and heterogeneous noise types to simulate authentic audio degradation scenarios. The framework defines 14 evaluation criteria for each benchmark (8 for UrbanSound8K-C), resulting in 50 unrepeated criteria (124 experiments) that collectively enable fair, reproducible, and cross-domain comparison of TTA algorithms. Through the inclusion of dynamic and mixed-domain noise settings, DHAuDS offers a consistent and publicly reproducible testbed to support ongoing studies in robust and adaptive audio modeling.
Primary: ImanYi Liao
All Institutions: ImanYi Liao
The main contribution of this paper is the introduction of the DHAuDS benchmark, which provides a comprehensive and realistic framework for evaluating test-time adaptation in audio classification. This benchmark addresses critical gaps in existing methodologies and sets a new standard for future research in the field.
The methodology presented in the paper is robust, introducing the DHAuDS benchmark which effectively addresses the limitations of existing TTA approaches by incorporating dynamic and heterogeneous noise types. The framework's design allows for a comprehensive evaluation of TTA methods across multiple audio domains, significantly enhancing the realism of the testing conditions. The detailed categorization of noise types and the implementation of variable corruption levels reflect a deep understanding of real-world audio challenges.
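To illustrate what "dynamic corruption severity levels and heterogeneous noise types" can mean in practice, the sketch below corrupts a clip with a randomly chosen noise family at a randomly drawn severity; the noise families, severity-to-SNR mapping, and filter choice are assumptions, not the benchmark's released recipe.

```python
import numpy as np

RNG = np.random.default_rng(0)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture reaches the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + noise

def dynamic_corruption(clean, severity_levels=(20, 15, 10, 5, 0)):
    """Apply one randomly chosen corruption at a randomly drawn severity.
    Illustrative noise families: additive white noise or a crude low-pass."""
    snr_db = float(RNG.choice(severity_levels))
    if RNG.random() < 0.5:
        return add_noise_at_snr(clean, RNG.standard_normal(clean.shape), snr_db)
    # Crude low-pass "muffling": moving-average filter, heavier at lower SNR.
    k = int(2 + (20 - snr_db))
    kernel = np.ones(k) / k
    return np.convolve(clean, kernel, mode="same")

corrupted = dynamic_corruption(RNG.standard_normal(16000))  # 1 s of dummy 16 kHz audio
```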
The experiments conducted are thorough, with a total of 124 individual evaluations across four distinct benchmarks. The use of multiple models (HuBERT, AMAuT, and CoNMix++) provides a comparative analysis that highlights the effectiveness of the proposed benchmark. The results indicate that TTA consistently improves performance, although the extent of improvement varies by dataset and corruption type, which is a critical insight for future research.
The authors have taken steps to ensure reproducibility by publicly releasing the benchmark datasets and evaluation sets. The use of different random seeds for generating corrupted sets further supports reproducibility. However, the paper could benefit from more detailed implementation instructions to facilitate easier replication of the experiments by other researchers.
The paper acknowledges limitations, particularly the restricted TTA performance on the UrbanSound8K dataset and the evaluation of only the base version of HuBERT due to GPU constraints. Additionally, the narrow comparative scope with limited existing TTA baselines may affect the generalizability of the findings.
The DHAuDS benchmark has the potential to significantly influence future research in audio classification and TTA by providing a standardized framework that can be utilized to develop more robust audio models. Its implications extend to various applications, including speech recognition, environmental sound classification, and bioacoustic monitoring.