We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing. One set of weights handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition. Our architecture features two core designs: (1) Layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, The Hong Kong Polytechnic University, City University of Hong Kong, The Hong Kong University of Science and Technology, Tsinghua University, Huawei Research Hong Kong
UNISON presents a unified framework for sound generation and editing through deep LLM fusion, significantly advancing the state of multimodal audio processing. The methodology effectively combines diverse tasks into a single model while demonstrating competitive performance against specialized systems, showcasing the potential for more efficient and scalable audio generation solutions.
The proposed methodology in UNISON is innovative, featuring a unified architecture that integrates multiple audio generation and editing tasks into a single framework. The use of layer-wise deep LLM fusion for semantic conditioning is a significant advancement over existing models that typically rely on single-layer conditioning. The architecture's ability to handle diverse tasks with a shared latent space and a single set of weights demonstrates a thoughtful approach to reducing complexity and enhancing cross-task knowledge transfer. The online multi-task data synthesis pipeline and curriculum training further contribute to the robustness of the training process, ensuring stability and effectiveness in learning.
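The depth-matched conditioning described above can be illustrated with a small sketch of how one LLM layer might be paired with each MM-DiT block. The function name `sample_layer_indices` and the even-spacing rule are assumptions for illustration; the paper says only that layers are "uniformly sampled", and the learned projections are omitted here.

```python
def sample_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer index per DiT block, evenly spaced from shallow to deep.

    Hypothetical helper: the exact sampling rule used by UNISON is not given
    in the source, so even spacing over [0, num_llm_layers - 1] is assumed.
    """
    if num_dit_blocks < 1 or num_llm_layers < num_dit_blocks:
        raise ValueError("need 1 <= num_dit_blocks <= num_llm_layers")
    if num_dit_blocks == 1:
        # With a single block, fall back to the deepest layer.
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]


# Block k of the diffusion transformer would read the hidden state of
# LLM layer pairing[k] (after a learned projection, not shown).
pairing = sample_layer_indices(num_llm_layers=32, num_dit_blocks=4)
```

Under this assumed rule, shallow diffusion blocks see shallow LLM features and deep blocks see deep ones, which is one plausible reading of "depth-matched".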
The experimental evaluation is comprehensive, covering a wide range of benchmarks across text-to-audio, text-to-speech, zero-shot cloning, and audio editing tasks. The results show that UNISON performs competitively against task-specialist models, achieving superior performance in several metrics such as FAD, CLAP, and WER. The ablation studies provide valuable insights into the importance of the proposed architectural choices, confirming the effectiveness of the layer-wise deep LLM fusion and the necessity of a multi-task training approach. The use of both objective and subjective metrics enhances the credibility of the findings.
The paper describes the implementation in detail, including model configurations, training data composition, and hyperparameters, which are essential for reproducibility. However, the absence of a publicly available code repository limits others' ability to replicate the results fully. The authors could enhance reproducibility by releasing their code and trained models.
The paper acknowledges limitations related to the VAE reconstruction quality, particularly for speech, which may affect the overall output quality. Additionally, the synthetic training data for editing tasks may not fully capture the complexities of real-world audio scenes, potentially impacting the model's performance in practical applications. The current model's language support is limited to English and Chinese, which may restrict its applicability in multilingual contexts.
UNISON has the potential to significantly impact the fields of audio generation and editing by providing a unified framework that simplifies the deployment of audio systems. Its ability to handle multiple tasks with a single model could lead to advancements in applications such as virtual assistants, content creation, and audio post-production. The integration of LLMs into audio processing also opens avenues for more intelligent and context-aware audio systems.
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: \textbf{DeepStack cross-layer feature injection}, which exposes the decoder to acoustic information from multiple encoder depths, and \textbf{time markers}, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
Primary: OpenMOSS Team
All Institutions: OpenMOSS Team
MOSS-Audio presents a unified audio-language model that achieves state-of-the-art performance across various audio understanding tasks, leveraging innovative architectural choices and a comprehensive training methodology. The technical contributions, particularly in temporal grounding and feature injection, position it as a significant advancement in the field of audio processing and multimodal AI systems.
The methodology presented in MOSS-Audio is robust and innovative, employing a modular architecture that integrates a dedicated audio encoder, modality adapter, and a large language model. The introduction of DeepStack cross-layer feature injection is particularly notable, as it enhances the model's ability to capture multi-level acoustic features, which is crucial for understanding complex audio inputs. Additionally, the incorporation of explicit time markers for temporal grounding is a significant advancement, allowing the model to handle timestamped transcription and time-aware question answering more effectively. The event-preserving audio annotation pipeline is also a strong methodological contribution, ensuring that the model is trained on coherent audio segments rather than arbitrary cuts, which is essential for real-world audio understanding tasks.
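The time-marker idea can be sketched as follows: with the encoder producing 12.5 frames per second, a marker is placed in the token stream at a fixed interval so the decoder sees explicit timestamps. The marker format `<t=...>` and the 2-second interval are illustrative assumptions, not details taken from the paper.

```python
FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract


def insert_time_markers(audio_tokens, interval_s=2.0, frame_rate_hz=FRAME_RATE_HZ):
    """Interleave timestamp markers into a sequence of audio tokens.

    Hypothetical sketch: marker spelling and spacing are assumptions.
    """
    frames_per_marker = round(interval_s * frame_rate_hz)
    if frames_per_marker < 1:
        raise ValueError("interval too short for the given frame rate")
    out = []
    for i, tok in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Frame i starts at i / frame_rate seconds.
            out.append(f"<t={i / frame_rate_hz:.1f}>")
        out.append(tok)
    return out


# Four seconds of audio (50 frames) gets markers at 0.0 s and 2.0 s.
stream = insert_time_markers([f"a{i}" for i in range(50)])
```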
The empirical results demonstrate that MOSS-Audio achieves state-of-the-art performance across a variety of benchmarks, including general audio understanding, speech captioning, ASR, and timestamped ASR. The model's performance is particularly impressive given its relatively compact size compared to other state-of-the-art models, indicating efficient scaling. The evaluation methodology is thorough, utilizing a range of benchmarks and metrics that provide a comprehensive assessment of the model's capabilities. However, the paper could benefit from more detailed comparisons against a wider array of existing models to further contextualize its performance.
The paper provides a detailed description of the architecture, training pipeline, and evaluation metrics, which supports reproducibility. The abstract states that 4B and 8B variants are released in both Instruct and Thinking configurations, but it does not mention a release of training code or data, which limits independent reproduction of the training results. Future work should consider releasing the training code to facilitate independent validation and experimentation by the research community.
While MOSS-Audio shows strong performance, the paper does not address potential limitations related to the model's generalization to unseen audio types or its performance in noisy environments. Additionally, the reliance on large-scale annotated datasets may raise concerns about the model's applicability in low-resource settings. The paper could also explore the computational costs associated with training and deploying such models, which can be significant.
MOSS-Audio has the potential to significantly impact various applications, including voice assistants, automated transcription services, and audio analysis tools in diverse fields such as healthcare, entertainment, and security. By providing a unified framework for audio understanding, the model can enhance user interactions with technology and improve accessibility for individuals with hearing impairments. The advancements in temporal reasoning and audio-grounded reasoning could also lead to more sophisticated AI systems capable of understanding and responding to complex audio cues in real-time.
We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis, a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
Primary: Alibaba Group
All Institutions: Alibaba Group, Independent Researcher
The paper introduces UniVocal, a pioneering framework for Speech-Singing Code-Switching synthesis, utilizing innovative methodologies to enhance vocal synthesis capabilities. The comprehensive approach to addressing data scarcity and improving acoustic modeling positions this work as a significant advancement in the field of audio synthesis, with potential applications across various domains.
The methodology presented in the paper is robust and innovative, introducing a unified framework for Speech-Singing Code-Switching (SCS) synthesis. The two-stage curriculum learning strategy is a significant contribution, allowing the model to progressively learn complex vocal transitions driven by text semantics. The integration of a refined cent token and Chain-of-Thought (CoT) generation enhances prosodic and melodic control, addressing limitations in existing models that rely on semantic tokenizers. The scalable data synthesis pipeline for generating diverse code-switching data is another noteworthy aspect, demonstrating a practical approach to overcoming data scarcity in this domain.
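For readers unfamiliar with cents, the following sketch shows the standard conversion from fundamental frequency to cents and a simple quantization into discrete pitch tokens. Only the cent formula (1200 times the base-2 log of a frequency ratio) is standard; the 440 Hz reference, the bin width, and the function names are assumptions, and the paper's "refined" cent token may be constructed differently.

```python
import math


def hz_to_cents(f0_hz: float, ref_hz: float = 440.0) -> float:
    """Pitch distance from ref_hz in cents (100 cents = one semitone)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def cent_token(f0_hz: float, ref_hz: float = 440.0, bin_cents: float = 50.0):
    """Quantize F0 into an integer pitch bin; None marks an unvoiced frame.

    Hypothetical tokenization: bin width and reference pitch are assumed.
    """
    if f0_hz <= 0.0:
        return None
    return round(hz_to_cents(f0_hz, ref_hz) / bin_cents)


# One octave above the reference is 1200 cents, i.e. 12 semitone-wide bins.
octave_up = cent_token(880.0, bin_cents=100.0)
```

A token of this kind carries melody and intonation explicitly, which is the kind of acoustic detail the abstract says semantic tokenizers tend to lose.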
The experimental evaluation is thorough, with results demonstrating state-of-the-art performance on the newly introduced SCSBench benchmark. The paper provides a comprehensive analysis of the model's capabilities across different tasks, including empathetic speech generation and singing. The use of both objective metrics (e.g., WER, SIM, UTMOS) and subjective evaluations (e.g., human ratings for empathy and musicality) strengthens the validity of the results. The ablation studies effectively highlight the contributions of specific components, such as the refined cent token and the curriculum learning strategy.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation methodologies, which support reproducibility. The authors provide links to the code and datasets, facilitating further research and validation of their findings. However, the complexity of the model and the need for substantial computational resources may pose challenges for complete replication.
The paper acknowledges limitations related to the quality of synthetic singing data and the potential for artifacts in generated audio. Additionally, the reliance on explicit semantic triggers for robust generalization in real-world scenarios indicates that the model may struggle with purely implicit transitions. These limitations suggest areas for future improvement, particularly in enhancing the model's adaptability to diverse contexts.
The implications of this research extend to various applications in audio synthesis, including entertainment, education, and assistive technologies. The ability to seamlessly switch between speech and singing could enhance user experiences in interactive media, storytelling, and educational tools. However, the potential for misuse in generating deepfakes raises ethical considerations that must be addressed in future developments.
Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurrence and post-surgical remission over the years. RRP demands continuous voice monitoring that existing cross-sectional corpora cannot support. We introduce the first longitudinal voice dataset for RRP, comprising recordings from 26 patients with up to ten years of follow-up. Each session pairs sustained vowels with sentence-level utterances, which are annotated by otolaryngologists and confirmed synchronously with laryngoscopy. Building on this resource, we establish a systematic benchmark spanning handcrafted features, end-to-end deep networks, self-supervised pretrained models, and recent audio large language models, all evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses further confirm that the cross-sectional discriminative signal reflects laryngoscopic disease state rather than stable speaker attributes. This work lays a foundation for rare longitudinal pathological voice tasks in low-resource clinical settings.
Primary: National Taiwan University
All Institutions: National Taiwan University, National Taiwan Normal University, Academia Sinica, Massachusetts Institute of Technology, Far Eastern Memorial Hospital, Yuan Ze University, University of Southern California, Taipei Municipal Zhongshan Girls High School
The paper introduces RRP-Voice, the first longitudinal voice corpus for Recurrent Respiratory Papillomatosis, providing a critical resource for advancing voice pathology detection. The comprehensive benchmarking and longitudinal analysis contribute significantly to the field, addressing gaps in existing research and offering a foundation for future studies in rare disease diagnostics.
The methodology is robust, introducing a longitudinal dataset that addresses a significant gap in the study of rare laryngeal diseases. The systematic benchmarking across various representation families, including handcrafted features and modern deep learning approaches, demonstrates a comprehensive approach to evaluating voice pathology detection. The use of synchronous laryngoscopic labels adds credibility to the dataset and ensures that the results are clinically relevant.
The experimental setup is thorough, employing a well-structured cross-validation approach that preserves session integrity. The results show clear distinctions between different methods, particularly highlighting the effectiveness of self-supervised models over traditional supervised baselines. The longitudinal analysis provides valuable insights into the dynamics of voice pathology, which is often overlooked in cross-sectional studies.
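One way to read "session-level cross-validation with patient-level audit" is sketched below: recordings from the same session always land in the same fold, and a separate check reports which patients appear on both sides of a split. The record layout, the round-robin fold assignment, and both function names are assumptions; the paper's exact protocol is not described in this summary.

```python
def session_level_folds(records, k):
    """Split recordings into k test folds, keeping each session intact.

    records: list of (patient_id, session_id) tuples, one per recording.
    Returns k lists of record indices. Round-robin assignment is assumed.
    """
    sessions = sorted(set(records))
    fold_of = {sess: i % k for i, sess in enumerate(sessions)}
    folds = [[] for _ in range(k)]
    for idx, rec in enumerate(records):
        folds[fold_of[rec]].append(idx)
    return folds


def patient_overlap(records, test_idx):
    """Audit step: patients with recordings in both train and test."""
    test = set(test_idx)
    test_patients = {records[i][0] for i in test}
    train_patients = {records[i][0] for i in range(len(records)) if i not in test}
    return test_patients & train_patients


records = [("p1", "s1"), ("p1", "s1"), ("p1", "s2"), ("p2", "s1")]
folds = session_level_folds(records, k=2)
shared = patient_overlap(records, folds[1])
```

The audit matters because a session-level split can still place the same patient in train and test, so a classifier might exploit speaker identity rather than disease state.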
The paper provides sufficient details on the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible dataset or code repository limits other researchers' ability to replicate the findings directly.
The dataset is relatively small, with only 26 patients contributing to the recordings, which may limit the generalizability of the findings. Additionally, the study focuses primarily on cross-sectional classification without delving deeply into longitudinal predictive modeling, which could be a valuable extension.
This work has significant implications for clinical practices in monitoring rare laryngeal diseases, potentially leading to improved patient outcomes through better diagnostic tools. The introduction of a longitudinal dataset also sets a precedent for future research in low-resource clinical settings, encouraging the exploration of other rare diseases using similar methodologies.
We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director, selecting the most appropriate transition for each narrative shift. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.
Primary: Jen Music AI
All Institutions: Jen Music AI
JenBridge represents a significant advancement in the field of adaptive long-form video soundtracking, integrating innovative methodologies and robust evaluation frameworks to address the challenges of coherence and narrative continuity in AI-generated music. The comprehensive analysis of its technical contributions and the proposed benchmarks positions this work as a valuable resource for future research and applications in creative AI.
The methodology presented in JenBridge is robust, employing a two-stage training paradigm that integrates a Transformer-based generative model with a flow-matching objective. The segmentation of video into semantically coherent clips allows for a more manageable approach to music generation, while the dual conditioning mechanism (text and visual) enhances cross-modal alignment. The introduction of an LLM agent as a director for adaptive transitions is particularly innovative, showcasing a sophisticated understanding of narrative coherence in soundtracking. The modularity and interpretability of the framework are significant strengths, allowing for creative control over the soundtracking process.
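As background on the flow-matching objective mentioned above, here is a minimal sketch of how a training pair is formed under the common linear-interpolation path: a noise sample is blended with a data sample at time t, and the model is trained to regress the constant velocity between them. This is a generic formulation assumed for illustration; the summary does not say which flow-matching variant JenBridge uses, and the function names are hypothetical.

```python
def flow_matching_pair(x0, x1, t):
    """Build (x_t, target velocity) for a linear path from noise x0 to data x1.

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(pred_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(pred_velocity, target_velocity)) / n


# Toy two-dimensional latent: halfway between noise and data.
x_t, v = flow_matching_pair([0.0, 2.0], [1.0, 4.0], t=0.5)
perfect = flow_matching_loss(v, v)
```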
The experimental evaluation is thorough, utilizing both objective and subjective metrics to assess performance on the newly proposed LVS Benchmark. The results demonstrate a clear superiority of JenBridge over existing methods, particularly in transition naturalness and overall coherence. The use of a user study adds a valuable layer of validation to the findings, reinforcing the model's effectiveness in real-world applications. The ablation studies further substantiate the importance of each component in the framework, providing insight into the contributions of various design choices.
The authors have committed to making the inference code and the LVS Benchmark publicly available, which is a positive step towards ensuring reproducibility. However, the foundational text-to-music model's weights will not be released due to licensing constraints, which may limit full reproducibility of the results. The detailed descriptions of training procedures and methodologies give a clearer understanding of the implementation and partly offset this gap.
The paper acknowledges limitations related to the quality of the video-music training datasets, which could affect the final output's fidelity. Additionally, the LLM agent's current scope is limited to local decision-making without a global narrative understanding, which could lead to suboptimal musical choices in complex scenes. The model also does not account for original audio elements, such as dialogue, which may impact the overall coherence of the soundtrack.
JenBridge has the potential to significantly impact the field of automated soundtracking, providing tools that can enhance the creative processes of video producers and filmmakers. By bridging the gap between automated generation and professional-quality production, it opens avenues for new applications in multimedia content creation. The ethical considerations regarding data usage and the intent to empower human creators rather than replace them are commendable and reflect a responsible approach to AI development.
Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: Our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech--text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model to inherit these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
Primary: Nagoya University
All Institutions: Nagoya University, Beihang University, TARVO, Inc.
The main contribution of this paper is the development of a novel representation learning framework that integrates speech and text for enhancing electrolaryngeal speech. This work significantly advances the field of voice conversion by addressing the inherent challenges faced by laryngectomees, providing a practical and extensible solution that can greatly improve assistive communication technologies.
The paper proposes a novel seq2seq-based framework for electrolaryngeal (EL) speech enhancement by integrating speech and text representations. The methodology is well-structured, comprising a two-stage approach that includes representation integration and reconstruction training. The introduction of three distinct fusion strategies (middle-, input-, and hybrid-level) is innovative, allowing for enhanced learning of speech-text representations. The use of auxiliary text information to improve mapping quality is a significant advancement over traditional methods that rely solely on speech representations. The incorporation of an autoencoder-style reconstruction strategy is also a notable contribution, as it maintains model simplicity while improving performance.
The experimental evaluation is robust, utilizing multiple small-scale EL2SP datasets that reflect real-world challenges in data collection for laryngectomees. The results demonstrate consistent improvements over baseline models, indicating the effectiveness of the proposed methods. The use of both objective metrics (MCD, CER, F0 RMSE) and subjective evaluations (MOS for naturalness and intelligibility) provides a comprehensive assessment of the system's performance. The statistical significance of the results strengthens the claims made by the authors regarding the superiority of their approach.
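Of the objective metrics listed, mel-cepstral distortion (MCD) is the least self-explanatory, so a small sketch of its usual definition follows. It assumes the reference and converted frames are already time-aligned (for example by dynamic time warping) and that the 0th (energy) coefficient is excluded; whether the paper follows exactly these conventions is not stated in this summary.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Average MCD in dB over aligned frames of mel-cepstral coefficients.

    Per frame: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), with d >= 1.
    Frames are assumed to be pre-aligned and of equal dimensionality.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        # Skip index 0, which carries overall energy rather than spectral shape.
        sq = sum((a - b) ** 2 for a, b in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(sq)
    return total / len(ref_frames)


identical = mel_cepstral_distortion([[1.0, 0.2, 0.3]], [[9.0, 0.2, 0.3]])
unit_gap = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]])
```

Lower values indicate that the converted speech is spectrally closer to the target.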
The paper provides sufficient details regarding the implementation of the proposed systems, including the architecture, training procedures, and evaluation metrics. The use of established frameworks like ESPnet and the clear description of datasets and experimental protocols enhance reproducibility. However, the absence of publicly available code or detailed hyperparameter settings may hinder full reproducibility for some researchers.
One limitation of the study is the reliance on small-scale datasets, which may not fully capture the variability present in natural speech. Additionally, while the proposed methods show improvements, the performance may still not reach the level of natural human speech, indicating room for further enhancement. Although the final EL2SP model is described as inheriting the integrated representations without added complexity, the training pipeline, with its pretrained modules and multiple fusion strategies, is more involved, which may pose challenges in practical implementations.
The proposed methods have significant implications for assistive communication technologies, particularly for individuals who rely on electrolaryngeal devices. By improving the naturalness and intelligibility of EL speech, the research can enhance communication quality for laryngectomees, thus potentially improving their quality of life. The integration of speech and text representations may also inspire further research in multimodal speech processing and voice conversion applications.
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
Primary: Shanghai University
All Institutions: Shanghai University, XJTLU Entrepreneur College (Taicang), Osaka University
The main contribution of this paper is the introduction of C2GA, a class-controllable generative augmentation framework that effectively addresses the challenges of data scarcity and class imbalance in respiratory sound classification. This innovative approach combines advanced generative modeling techniques with a focus on clinical relevance, significantly enhancing the performance of machine learning models in a critical healthcare domain.
The proposed C2GA framework introduces a novel approach to generative data augmentation for respiratory sound classification by leveraging a conditional Vector-Quantized Variational Autoencoder (VQ-VAE) and a Transformer-based autoregressive model. This two-stage method effectively constructs a semantically rich discrete latent space and generates high-fidelity Mel-spectrograms that maintain class semantics, addressing the limitations of existing augmentation techniques that often distort critical features. The methodology is well-structured, with clear descriptions of each stage, and emphasizes the importance of class conditioning and temporal dynamics in generating clinically relevant audio samples.
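To make the discrete latent space concrete, the following minimal sketch shows the nearest-codebook lookup that a VQ-VAE uses to turn continuous frame features into token indices. The toy codebook, the two-dimensional features, and the function names `nearest_code` and `quantize` are illustrative assumptions rather than the authors' implementation; the class-prototype fusion and the Transformer prior are not shown.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(frames, codebook):
    """Map each continuous frame feature to a discrete token index."""
    return [nearest_code(frame, codebook) for frame in frames]


# Hypothetical 3-entry codebook and three encoder output frames.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]]
tokens = quantize(frames, codebook)
print(tokens)
```

An autoregressive prior of the kind described would then be trained on such token sequences rather than on the raw spectrogram frames.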
The experimental evaluation is robust, utilizing two distinct respiratory sound datasets that reflect real-world challenges such as noise and class imbalance. The authors provide comprehensive results demonstrating significant improvements in classification performance, particularly for minority classes, with clear metrics (accuracy, recall, F1-score) that validate the effectiveness of the C2GA framework compared to traditional and state-of-the-art methods. The ablation studies further reinforce the contributions of individual components within the framework, showcasing the importance of each element in achieving the reported gains.
The paper includes detailed implementation details, including architecture specifications, training procedures, and hyperparameter settings, which enhance reproducibility. However, the absence of a publicly available code repository limits the ability for others to replicate the results directly.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of respiratory sounds encountered in clinical practice. Additionally, while the framework shows promise, its performance in extremely noisy environments or with highly imbalanced datasets may require further validation. The lack of a demo or project URL also hinders accessibility for interested researchers.
The C2GA framework has significant implications for clinical practice, particularly in improving the robustness and accuracy of automated respiratory sound classification systems. By enhancing the ability to detect subtle pathological features in noisy and imbalanced datasets, this research could lead to better diagnostic tools and improved patient outcomes in respiratory health monitoring.
Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
Primary: Texas A&M University
All Institutions: Texas A&M University, Massachusetts Institute of Technology
The paper presents a novel parameter-efficient dual-encoder framework for underwater acoustic classification that leverages both waveform and spectrogram representations. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and underwater acoustics.
The proposed dual-encoder architecture is innovative in its approach to simultaneously process both waveform and spectrogram representations, addressing the limitations of single-representation models in underwater acoustic classification. The introduction of the Choquet integral for decision-level fusion is a significant methodological advancement, allowing for non-linear interactions between features. The use of parameter-efficient fine-tuning techniques further enhances the model's adaptability to domain-specific challenges while minimizing computational costs. The soft-sort gating mechanism is a clever solution to the differentiability issue associated with the Choquet integral, enabling end-to-end training.
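For readers unfamiliar with the Choquet integral, the sketch below computes the discrete form for a pair of branch scores. It uses a hard sort for clarity, whereas the paper is described as using a soft-sort gate to keep the operation differentiable. The fuzzy-measure values, the source names `wave` and `spec`, and the function name `choquet_integral` are illustrative assumptions, not values or code from the paper.

```python
def choquet_integral(scores, measure):
    """Discrete Choquet integral of per-source scores.

    scores:  dict mapping source name -> score.
    measure: dict mapping frozenset of source names -> fuzzy measure value,
             with the full set mapped to 1.0.
    """
    # Visit sources from highest to lowest score.
    ordered = sorted(scores, key=scores.get, reverse=True)
    total = 0.0
    for i, name in enumerate(ordered):
        # Score of the next-ranked source, or 0 after the last one.
        next_score = scores[ordered[i + 1]] if i + 1 < len(ordered) else 0.0
        # Coalition of the i+1 highest-scoring sources.
        coalition = frozenset(ordered[: i + 1])
        total += (scores[name] - next_score) * measure[coalition]
    return total


# Hypothetical fuzzy measure over the two branches.
measure = {
    frozenset({"wave"}): 0.3,
    frozenset({"spec"}): 0.6,
    frozenset({"wave", "spec"}): 1.0,
}
fused = choquet_integral({"wave": 0.8, "spec": 0.4}, measure)
print(fused)
```

Because the two singleton measures need not sum to one, the fused score depends on which branch ranks higher, which is the non-linear interaction that a plain weighted average cannot express.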
The experiments conducted on the DeepShip and ShipsEar datasets provide a solid empirical foundation for the proposed methodology. The results demonstrate clear improvements over single-encoder baselines, validating the effectiveness of the dual-encoder architecture and the Choquet integral fusion mechanism. However, the paper could benefit from additional comparative analysis against more recent state-of-the-art methods in underwater acoustic classification to strengthen claims of superiority.
The paper outlines the architecture and methodology in sufficient detail, but lacks specific implementation details such as hyperparameter settings and training procedures. Providing a code repository or supplementary material would enhance reproducibility and allow other researchers to validate the findings.
One limitation is the reliance on two specific datasets, which may not fully represent the diversity of underwater acoustic environments. Additionally, while the Choquet integral provides interpretability, the complexity of the model may pose challenges in understanding the learned fuzzy measures without further analysis.
The proposed framework has significant implications for various oceanic applications, including maritime security and environmental monitoring. By improving classification accuracy in complex underwater environments, this research contributes to advancements in autonomous underwater vehicles and acoustic monitoring systems, potentially enhancing our understanding of marine ecosystems.
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Soul AI Lab, Moonstep AI
SoulX-Transcriber presents a novel end-to-end framework for multi-speaker transcription that effectively integrates speaker diarization and ASR, demonstrating strong performance across various benchmarks. The innovative methodology, comprehensive evaluation, and potential for real-world applications position this work as a significant contribution to the field of machine learning and audio processing.
The methodology presented in SoulX-Transcriber is robust, employing a two-stage training framework that effectively combines speaker diarization and automatic speech recognition within a unified model. The use of speaker-aware multi-task continuous pre-training followed by supervised fine-tuning is particularly innovative, enhancing speaker representation and transcription accuracy. The complementary data engineering pipeline, which includes both pseudo-labeled and simulated data, addresses the challenges of acquiring high-quality training data in multi-speaker scenarios. This dual approach allows for better generalization and robustness in real-world applications.
The experimental evaluation is comprehensive, utilizing multiple public benchmarks (AliMeeting, AISHELL-4, AMI) to demonstrate the model's effectiveness across different scenarios. The results indicate significant improvements in key metrics such as Diarization Error Rate (DER) and Word Error Rate (WER), showcasing the model's capability to handle both short-form and long-form audio. The inclusion of internal benchmarks further strengthens the evaluation by testing generalization across diverse conversational contexts.
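As a reference point for the WER metric mentioned above, here is a minimal sketch of the standard definition: word-level edit distance divided by the reference length. It assumes whitespace tokenization and a single speaker stream; the function name `word_error_rate` and the example sentences are illustrative, and the report's own scoring (including any speaker attribution) may differ.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over six words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```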
The paper provides sufficient details regarding the training data, model architecture, and evaluation metrics, which supports reproducibility. However, the reliance on proprietary datasets and the complexity of the data generation pipeline may pose challenges for independent replication. The availability of the project URL and demo enhances the potential for others to reproduce the results.
One limitation is the potential for label noise in the pseudo-labeled data, which could affect the model's performance in high-precision tasks. Additionally, while the model shows strong performance in Mandarin-centric datasets, its adaptability to other languages or dialects may require further validation. The complexity of the model and training process could also limit accessibility for practitioners without extensive resources.
The SoulX-Transcriber framework has significant implications for industries relying on accurate multi-speaker transcription, such as customer service, meeting documentation, and media production. Its ability to handle complex conversational dynamics can enhance communication efficiency and accessibility. Furthermore, the integration of LLMs with audio processing opens avenues for future research in multimodal AI applications.
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce \textbf{SpeechEditBench}, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code will be released upon acceptance.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Leibniz Research Center, Huawei
The main contribution of this paper is the introduction of SpeechEditBench, a comprehensive benchmark for instruction-guided speech editing that addresses the lack of unified evaluation frameworks in the field. This work significantly enhances the understanding of model capabilities and limitations, paving the way for future advancements in speech editing technologies.
The paper introduces a novel benchmark, SpeechEditBench, which systematically evaluates instruction-guided speech editing across multiple attributes. The methodology is robust, employing an anchor-based evaluation protocol that differentiates between target success and preservation success, which is critical for understanding the capabilities of Speech LLMs. The inclusion of both atomic and compositional editing tasks enhances the benchmark's comprehensiveness, allowing for nuanced assessments of model performance.
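The relationship between the three metrics can be sketched as follows. This assumes that each edited sample receives two boolean judgments (target attribute changed as instructed, untargeted attributes preserved) and that joint success is their per-sample conjunction; that reading, along with the function name `edit_success_rates`, is an assumption and not a definition taken from the paper.

```python
def edit_success_rates(results):
    """Aggregate per-sample (target_ok, preserved_ok) boolean pairs
    into target, preservation, and joint success rates."""
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    target = sum(1 for t, _ in results if t) / n
    preservation = sum(1 for _, p in results if p) / n
    # Joint success (assumed): both conditions hold on the same sample.
    joint = sum(1 for t, p in results if t and p) / n
    return {"target": target, "preservation": preservation, "joint": joint}


# Hypothetical judgments for four edited utterances.
rates = edit_success_rates([(True, True), (True, False), (False, True), (True, True)])
print(rates)
```

Under this reading, joint success can never exceed either of the other two rates, which is consistent with the finding that it is the hardest metric to score well on.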
The experimental evaluation is thorough, involving eight Speech LLMs and specialized systems across various tasks. The results are well-documented, revealing significant insights into model performance, particularly the challenges of compositional editing and preservation of untargeted attributes. The findings are significant, indicating that no single model excels across all tasks, which highlights the fragmentation in current capabilities.
The paper mentions that data and code will be released upon acceptance, which is a positive step towards reproducibility. However, specific implementation details and the exact nature of the datasets used could be better elaborated to facilitate independent verification of results.
The benchmark currently only supports English and Chinese, limiting its applicability to a broader range of languages. The reliance on automatic metrics for evaluation may not fully capture the subjective quality of the edits, and the focus on single-turn instructions excludes multi-turn editing scenarios, which are crucial for real-world applications.
The development of SpeechEditBench has the potential to significantly advance the field of speech editing and LLMs by providing a structured framework for evaluation. This can lead to improved models that better understand and execute complex speech editing tasks, with implications for applications in content creation, accessibility, and interactive voice technologies. The findings regarding model performance fragmentation can guide future research directions and model development.
Speakers in dialogue continuously adapt their communicative behavior across acoustic, lexical, and semantic dimensions, a phenomenon known as conversational entrainment. Modeling this process requires representations that capture the global structure of interaction, yet prior approaches fail to disentangle dyad-specific patterns from speaker-specific traits, limiting their ability to capture true conversational adaptation. We address this with the Dyadic Distance Matrix (DDM), which encodes all pairwise similarities between the turns of two speakers over an entire conversation, capturing long-range cross-speaker dependencies. This raises a key question: does the DDM represent genuine interaction, or merely reflect individual speaker characteristics? We propose the speaker-switch test, a principled control in which one speaker's turns are replaced with those from an unrelated speaker drawn from a different conversation. This preserves turn-level statistics while disrupting the original dyadic coadaptation. The ability to distinguish real from switched DDMs thus directly evaluates whether the representation encodes interaction-specific structure. Across four embedding types and classifiers including ResNet-50 on the CANDOR corpus, real DDMs are consistently distinguishable from their switched counterparts. Comparisons with LibriSpeech show higher discriminability in read speech, highlighting the role of prosodic variability in naturalistic conversations. GradCAM analysis further reveals distinct structural signatures driving classification. These results establish the speaker-switch test as a robust diagnostic for validating representations of dyadic conversational interaction.
Primary: Indian Institute of Technology Guwahati
All Institutions: Indian Institute of Technology Guwahati
The paper presents a systematic framework for evaluating dyadic interactions in conversational speech through the introduction of the Dyadic Distance Matrix and the speaker-switch test. This work significantly contributes to the understanding of conversational dynamics and has the potential to improve the design of more responsive and context-aware dialogue systems.
The paper introduces the Dyadic Distance Matrix (DDM) as a novel representation for capturing dyadic interactions in conversational speech. The methodology is well-structured, employing a speaker-switch test to validate the DDM's ability to distinguish genuine conversational dynamics from speaker-specific traits. The use of multiple embedding types and classifiers, including ResNet-50, enhances the robustness of the approach. The systematic evaluation across different modalities and the cross-corpus analysis provide a comprehensive understanding of the model's performance in varied contexts.
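A minimal sketch of the two constructions follows, assuming each turn is already represented by a non-zero embedding vector and that cosine distance is the pairwise measure. The distance choice, the truncation of the unrelated speaker's turns to the original turn count, and the function names are illustrative assumptions rather than the paper's exact procedure.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def dyadic_distance_matrix(turns_a, turns_b):
    """Rows index speaker A's turns, columns index speaker B's turns."""
    return [[cosine_distance(a, b) for b in turns_b] for a in turns_a]


def speaker_switched_matrix(turns_a, turns_b, unrelated_turns):
    """Control condition: swap speaker B for a speaker from another
    conversation while keeping the matrix shape unchanged."""
    if len(unrelated_turns) < len(turns_b):
        raise ValueError("unrelated speaker has too few turns")
    return dyadic_distance_matrix(turns_a, unrelated_turns[: len(turns_b)])


# Toy two-dimensional turn embeddings.
turns_a = [[1.0, 0.0], [0.0, 1.0]]
turns_b = [[1.0, 0.0], [1.0, 1.0]]
unrelated = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
real = dyadic_distance_matrix(turns_a, turns_b)
switched = speaker_switched_matrix(turns_a, turns_b, unrelated)
```

A classifier is then asked to tell `real` matrices from `switched` ones; above-chance accuracy suggests the matrix carries dyad-specific structure.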
The experiments are thorough, utilizing the CANDOR corpus and LibriSpeech to assess the effectiveness of the DDM in capturing interaction-specific structures. The classification results demonstrate strong discriminability between real and switched DDMs, particularly with semantic embeddings. The GradCAM analysis adds interpretability, revealing the structural features that contribute to classification decisions. The results are statistically significant and provide valuable insights into the nature of conversational entrainment.
The paper provides sufficient details on the methodology, including data preprocessing, model architectures, and evaluation metrics, which facilitates reproducibility. However, the lack of publicly available code or datasets limits the ease with which others can replicate the findings.
One limitation is the reliance on the CANDOR corpus, which may not generalize to all conversational contexts. Additionally, while the speaker-switch test is a robust evaluation method, it may not capture all nuances of dyadic interaction. The paper could also benefit from a more extensive discussion on the implications of the findings for practical applications in dialogue systems.
The findings have significant implications for advancing conversational AI and dialogue systems by providing a framework to better understand and model dyadic interactions. The ability to distinguish genuine conversational dynamics could enhance applications in areas such as automated dialogue systems, sentiment analysis, and social robotics.
Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.
Primary: SooHwan
All Institutions: SooHwan, Mark
The main contribution of this paper is the introduction of SiamCTC, a novel framework that leverages monotonic temporal alignment to enhance speech representation learning, demonstrating significant improvements over existing self-supervised methods. The comprehensive analysis of the technical contributions, methodology, and experimental results underscores its significance in the field of audio processing and speech technology.
The proposed SiamCTC framework innovatively combines Siamese networks with Connectionist Temporal Classification (CTC) to address the challenge of temporal alignment in speech representation learning. By allowing flexible, monotonic alignments rather than strict frame-wise correspondences, it effectively captures linguistic invariance across varying speaking styles and rates. The integration of multiple loss components (CTC loss, KL divergence loss, and Temporal InfoNCE loss) is well-justified and enhances the robustness of the learned representations. The methodology is sound, with a clear rationale for each component and its contribution to the overall objective.
The experiments are comprehensive, utilizing the LibriSpeech dataset, which is a standard benchmark in the field. The results demonstrate significant improvements in phoneme error rates (PER) compared to existing models like HuBERT and WavLM. The ablation studies effectively highlight the importance of each loss component, providing clear evidence of the framework's efficacy. However, the paper could benefit from additional metrics and comparisons with more recent models to further validate its performance.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameter settings, which should facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. Including a link to a GitHub repository or similar would enhance reproducibility.
The authors acknowledge the sensitivity of the model to hyperparameters, particularly regarding augmentation strategies and temperature settings. This sensitivity could hinder the model's generalizability across different datasets or applications. Additionally, while the framework shows promise, training from scratch rather than fine-tuning pre-trained models may yield different results, which remains unexplored in this work.
The SiamCTC framework has the potential to significantly advance the field of self-supervised speech representation learning, particularly in applications requiring robustness to variations in speaking styles and rates, such as automatic speech recognition and speaker verification. Its flexible alignment approach could also inspire further research into more adaptive models in related domains.
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary `AI-or-human' paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate ``AI Music Tracking'': the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of the HAIM dataset and a novel tracking framework that enables granular analysis of AI involvement in music production. This work significantly advances the field by moving beyond binary classifications to a more detailed understanding of hybrid human-AI collaborations in music, thereby addressing a critical gap in current detection methodologies.
The methodology presented in this paper is robust, introducing the HAIM dataset that categorizes music tracks based on the roles of AI and human contributions across various stages of music production. The multi-faceted taxonomy and diverse data sourcing are commendable, as they address the limitations of existing binary classification systems. The use of a modified Fusion Segment Transformer (MuQ-FST) for multilabel tracking is innovative, allowing for a more nuanced understanding of AI involvement in music production.
The experiments conducted are thorough, evaluating multiple existing detection systems against the HAIM dataset. The results highlight the systemic flaws in current detectors when faced with hybrid scenarios, demonstrating the effectiveness of the proposed approach. The performance metrics are well-defined, and the results are presented clearly, showcasing the advantages of the new benchmark.
The paper provides sufficient detail regarding the dataset creation, model architecture, and training procedures, which supports reproducibility. However, the lack of specific URLs for code or data repositories limits the ease of access for other researchers wishing to replicate the study.
The paper acknowledges several limitations, including the imbalance in category sizes, potential overfitting of the model to specific templates, and the need for more diverse mixing and mastering styles. Additionally, the complexity of human roles in music production is not fully captured, which may hinder the model's ability to generalize across different scenarios.
The implications of this research are significant, as it addresses the growing intersection of AI and music production, providing tools for better understanding and tracking AI contributions. This has potential applications in copyright law, music production, and the development of more sophisticated AI detection systems.
Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodologies. First, leveraging the speech recordings of the large-scale audio-visual dataset, KAN-AV, we propose a revised evaluation protocol that controls for various confounders and adopts a family-disjoint train--test split to address open-set KV. Second, we analyze the close connection between speaker verification and KV, showing that genealogical similarity of speaker pairs plays opposite roles in the two tasks. Third, we tackle KV using three neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet) combined with various back-ends. In zero-shot KV including same-speaker target trials, ReDimNet achieves the lowest equal error rate (EER) of $20.8\%$; however, performance degrades to $39.7\%$ under strict kin trials, where same-speaker target trials are excluded. Our best trainable back-end, which applies asymmetric processing of the embedding pair to mitigate age-difference effects, obtains an EER of $32.0\%$ ($18.6\%$ with speaker target trials included). These results highlight the difficulty of KV while showing that speaker embeddings encode familial cues, offering a promising foundation for voice-based kinship analysis.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
The paper establishes a foundational basis for voice-based kinship verification, contributing significantly to the field by addressing methodological gaps and proposing innovative solutions. The comprehensive analysis of kinship cues in voice, coupled with rigorous experimental validation, positions this work as a meaningful advancement in audio-based machine learning research.
The paper introduces a novel approach to kinship verification (KV) using voice, leveraging a large-scale audio-visual dataset (KAN-AV) and proposing a revised evaluation protocol that addresses confounding factors. The authors articulate a clear distinction between speaker verification (SV) and KV, emphasizing the unique challenges posed by familial voice similarities. They employ three advanced neural speaker embedding extractors and develop a lightweight asymmetric processing backend to mitigate age-difference effects, showcasing a thoughtful methodology that integrates both theoretical and practical considerations.
The experiments are robust, utilizing a well-curated dataset and a family-disjoint train-test split to evaluate generalization to unseen families. The results demonstrate the effectiveness of the proposed methods, with detailed performance metrics, including equal error rates (EER) for various configurations. The benchmarking against existing methods provides a solid foundation for assessing the contributions of the proposed approaches.
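Since the results are reported as equal error rates, a minimal sketch of how an EER can be approximated from trial scores may help. This is a generic threshold sweep, not the authors' evaluation code; the function name `equal_error_rate` and the convention that higher scores mean "more likely kin" are assumptions made for illustration.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumes higher scores indicate a target (kin) trial.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance rate: non-target trials scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    kin = [0.9, 0.8, 0.7, 0.4]
    non_kin = [0.6, 0.3, 0.2, 0.1]
    print(f"EER = {equal_error_rate(kin, non_kin):.3f}")
```

With few trials the sweep is coarse; evaluation toolkits usually interpolate between operating points instead.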
The paper provides sufficient detail regarding the experimental setup, including the data filtering process and the evaluation protocol. However, the absence of a publicly accessible code repository limits reproducibility. Clear descriptions of the models and training conditions are provided, but without code, independent verification of results may be challenging.
The study acknowledges the inherent difficulties in KV, particularly in strict kin trials where performance degrades significantly. The reliance on a specific dataset (KAN-AV) may limit the generalizability of the findings to other contexts or populations. Additionally, while the proposed methods show promise, further validation on diverse datasets would strengthen the claims.
The implications of this research extend to various fields, including forensics, where non-invasive kinship verification could complement traditional DNA profiling methods. The findings may also influence future work in speaker verification and voice analysis, potentially leading to advancements in applications such as familial identification in multimedia content.
The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain, this compensation is done on a sample-by-sample basis. In the frequency domain, short time segments need to be used in which the Doppler effect is assumed to be approximately constant, and a discrete Fourier transform is computed on each segment. In contrast, the authors developed an inverse 2.5D localization method for uniformly moving single-frequency sources that works in the spectral domain and allows for the use of longer windows. This was achieved by modifying the 2.5D forward model to directly compute the effect of the motion at the static observer position. The method requires neither modification of the measured signal nor quasi-stationarity of the measurements within the window used. Unfortunately, this approach is not directly suitable for broad-band stochastic sources, and in the present work we investigate how the statistical properties of a uniformly moving stochastic source change when observed by a static observer. Using a 2.5D setting, the relation between the power spectral density of the moving source and the Loève spectrum, which is a generalization of the cross-spectral density at the static receivers, was derived. Based on simulated data with speeds up to 100 m\,s$^{-1}$, the work presented here provides a proof of concept for a method based on multi-taper estimates of the Loève spectrum to localize moving broad-band stochastic sources. Currently, the method requires a stationary source signal and a spectral density that is flat within a certain range around the frequency of interest. Also, correlations between sources are currently not considered.
Primary: Acoustics Research Institute, Austrian Academy of Sciences
All Institutions: Acoustics Research Institute, Austrian Academy of Sciences
The paper presents a novel approach to localizing moving broadband noise sources using the Loève spectrum and a 2.5D framework, contributing significantly to the field of acoustic signal processing. The methodology is innovative, addressing key challenges in the localization of stochastic sources, and the experimental validation supports its potential applicability in real-world scenarios.
The paper introduces a novel inverse 2.5D localization method that operates in the spectral domain, allowing for longer window sizes and avoiding the need for signal modification. The authors derive a relationship between the power spectral density of moving sources and the Loève spectrum, which is a significant theoretical contribution. The methodology is well-structured, leveraging multi-taper estimates for spectral analysis, and effectively addresses the challenges associated with localizing moving stochastic sources. However, the method's assumptions, such as requiring a stationary source signal and a flat spectral density, may limit its applicability in more complex real-world scenarios.
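For readers unfamiliar with the term, the dual-frequency spectrum named after Loève is commonly defined as below. This is the standard textbook form, not the paper's derivation; the symbols $X_m$, $f_1$, $f_2$ and the receiver indices $m$, $n$ are notation chosen here for illustration.

$$
\gamma_{mn}(f_1, f_2) = \mathbb{E}\!\left[ X_m(f_1)\, X_n^{*}(f_2) \right],
$$

where $X_m(f)$ denotes the Fourier transform of the signal at receiver $m$. For a stationary signal the energy concentrates on the diagonal $f_1 = f_2$, where the quantity reduces to the usual cross-spectral density. A moving source makes the received signal non-stationary, so energy also appears off the diagonal, which is presumably why the dual-frequency form is needed in this setting.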
The experiments utilize simulated data to validate the proposed localization method, demonstrating its effectiveness in distinguishing moving sources at high speeds. The results are presented clearly, showcasing the correlation between theoretical and estimated spectra across various conditions. The use of a 64-channel microphone array in one of the experiments adds practical relevance to the findings. However, the reliance on simulations may not fully capture the complexities of real-world acoustic environments.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the experimental setup. While the methodology is theoretically sound, the absence of a publicly available implementation limits the ability of other researchers to replicate the results.
The method currently requires assumptions that may not hold in all scenarios, such as the need for a stationary source signal and flat spectral density. Additionally, the impact of correlations between sources is not considered, which could affect the localization accuracy in practical applications. The reliance on simulated data also raises questions about the method's robustness in real-world conditions.
The proposed localization method has potential applications in various fields, including transportation noise monitoring, environmental acoustics, and sound source localization in urban settings. By improving the accuracy of moving source localization, this work could contribute to advancements in noise control strategies and urban planning.
Speech denoising is an often necessary step not only for human listening, but also for downstream processing by systems lacking robustness to noisy, real-world acoustic conditions. Unfortunately, denoising is a problem where conventional in-domain supervised training is not trivial, as the training targets cannot be annotated by humans: producing a clean version of a naturally-noisy speech recording is itself the task to solve. Supervised training is typically performed through the artificial addition of noise to clean speech recordings, which can only be sourced from controlled domains, a significant limitation due to the poor out-of-domain generalization of neural networks. An alternative is noisy target training (NyTT), which simply replaces the clean speech with in-domain noisy recordings, with the hope that learning to remove the artificial noise will extend to the natural. Though having shown promising results, NyTT's training objective is not minimized by clean speech estimates. We show that by estimating the artificial noise in addition to the naturally-noisy speech, the undesirable optimum can actually be exploited: the residual noise in the speech estimate can be canceled by the noise estimate via simple subtraction. Crucially, the optimum is fully compatible with conventional artificial mixtures, enabling joint training using both types of data with consistent optimization targets, opening the door to improved domain adaptability. The effectiveness of our approach is demonstrated through WHAM! and CHiME-3-based benchmarks.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of a novel approach to speech denoising that effectively exploits noise inseparability through Differential Noise Filtering, significantly improving performance in weakly-supervised settings. The technical contributions and methodology are well-articulated, showcasing a promising advancement in the field of audio processing.
The proposed methodology introduces Differential Noise Filtering (DNF), which innovatively combines noisy target training with conventional supervised training. This dual-output approach allows for the estimation of both the noisy speech and the noise itself, enabling effective noise cancellation through subtraction. The methodology is well-grounded in theoretical analysis, leveraging scale-invariance principles and providing a clear framework for joint training with synthetic data. The integration of these concepts is a notable strength of the paper.
The experiments conducted on the WHAM! and CHiME-3 datasets provide robust evidence of the effectiveness of the proposed method. The reported improvements in SI-SDR and DNSMOS metrics demonstrate the practical applicability of the DNF approach. However, the paper could benefit from a more extensive comparison with state-of-the-art methods and a clearer presentation of results in tabular form.
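As a reminder of what the SI-SDR figures measure, the sketch below computes the scale-invariant signal-to-distortion ratio in its usual form: the estimate is projected onto the reference, and the energy of that projection is compared with the energy of the remainder. It is a generic illustration in plain Python, not the paper's evaluation code, and the function name `si_sdr` is chosen here.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length sample lists."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Optimal scaling of the reference that best explains the estimate.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(x * x for x in residual)
    if residual_energy == 0.0:
        return math.inf  # The estimate is an exact scaled copy of the reference.
    return 10.0 * math.log10(target_energy / residual_energy)


if __name__ == "__main__":
    clean = [1.0, 0.0, -1.0, 0.0]
    denoised = [1.0, 0.1, -1.0, 0.0]
    print(f"SI-SDR = {si_sdr(denoised, clean):.2f} dB")
```

Because of the projection step, rescaling the estimate leaves the score unchanged, which is the property the "scale-invariant" name refers to.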
While the paper outlines the model architecture and training configurations, it lacks specific implementation details and code availability, which may hinder reproducibility. Clearer documentation or a supplementary repository would enhance this aspect.
The paper acknowledges limitations in performance when compared to fully supervised methods, particularly in high-noise scenarios. Additionally, the potential for increased WER due to the cleaner outputs produced by DNF is a notable drawback. The reliance on the quality of the noisy data also poses challenges for generalization.
The proposed method has significant implications for real-world applications in speech processing, particularly in environments where clean speech data is scarce. By improving the robustness of speech denoising systems, this work could enhance communication technologies, assistive devices, and various AI-driven audio applications.
Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.
Primary: PSL University
All Institutions: PSL University, Laboratoire d'Informatique et Systèmes, Université Aix-Marseille
The main contribution of this paper is the development of a context-aware model for distinguishing child-directed speech from adult-directed speech in long-form recordings, significantly enhancing the scalability and accuracy of analyses in children's language environments. The work represents a meaningful advancement in the intersection of machine learning and developmental linguistics, with potential applications in both research and practical settings.
The methodology employed in this paper is robust, utilizing self-supervised learning models specifically tailored for child-directed speech detection. The authors effectively leverage a multilingual dataset and incorporate contextual information into their models, which is a significant advancement over previous isolated utterance approaches. The context-aware fine-tuning strategy is particularly noteworthy, as it addresses the limitations of existing models by enhancing the input with surrounding audio, thereby improving classification performance. The use of multiple self-supervised models and a clear delineation of the addressee classification problem showcases a well-structured approach.
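One simple way to attach surrounding audio to an utterance is to widen its time span on both sides and clip the result to the recording. The sketch below illustrates only that idea: the function name `with_context`, the symmetric widening, and the parameter `context_s` are assumptions, since the review does not state how much context the authors use or how it is fed to the model.

```python
def with_context(start, end, context_s, recording_len):
    """Widen an utterance span (in seconds) by context_s on each side.

    The widened span is clipped so it never leaves the recording.
    """
    if not 0 <= start < end <= recording_len:
        raise ValueError("utterance must lie inside the recording")
    return max(0.0, start - context_s), min(recording_len, end + context_s)


if __name__ == "__main__":
    # A 1.5 s utterance near the start of a 10 s recording, with 2 s of context.
    print(with_context(1.0, 2.5, 2.0, 10.0))
```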
The experimental evaluation is comprehensive, involving a well-defined dataset comprising recordings from diverse languages and sociocultural contexts. The results demonstrate a substantial improvement in classification accuracy through the incorporation of contextual information, with an impressive absolute gain of 13.8% in average F1-score. The paper also contrasts the performance of various models, including both in-domain and out-of-domain pre-trained models, providing a thorough analysis of their effectiveness. However, the lack of a detailed comparison with other state-of-the-art methods could limit the contextual understanding of their results.
The implementation details provided are thorough, including specifics on model training, evaluation metrics, and the computational resources utilized. The authors have made their code available on GitHub, which enhances the reproducibility of their work. However, additional details on hyperparameter tuning and the specific configurations of the models could further aid in replicating the results.
One limitation noted in the paper is the reliance on automatic segmentation, which can introduce errors that propagate through the classification pipeline. Additionally, while the multilingual dataset is a strength, the authors acknowledge that their models may still be limited in their ability to generalize across all languages and sociocultural contexts. The computational cost associated with context-aware fine-tuning is also a concern, as it may hinder practical applications.
This research has significant implications for the field of developmental science, as it enables large-scale analysis of children's language environments without the need for extensive manual annotation. The ability to automatically detect child-directed speech could facilitate studies on language acquisition and development across diverse populations. Furthermore, by releasing their model and code, the authors contribute to the advancement of the field, promoting further research and development in this area.
Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
Primary: Instituto de Ciências Matemáticas e de Computação, University of São Paulo
All Institutions: Instituto de Ciências Matemáticas e de Computação, University of São Paulo
The paper presents MelT, a novel GEMM-native audio frontend that significantly improves the efficiency of audio feature extraction by reformulating the conventional STFT and Mel aggregation into a single-stage process. This approach not only enhances computational performance but also reduces energy consumption, making it a valuable contribution to the field of audio processing in machine learning.
The methodology presented in the paper is innovative in reformulating the conventional STFT and Mel aggregation process into a single-stage GEMM-native framework. The authors leverage the mathematical foundation of the Non-Uniform Discrete Fourier Transform (NDFT) to directly compute Mel-spaced projections, which is a significant departure from traditional methods. The approach is well-justified, with clear explanations of how it avoids the inefficiencies of multi-stage processing. The integration of dense matrix multiplication into the audio frontend design is particularly noteworthy, as it aligns with modern hardware capabilities.
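The single-stage idea can be sketched as follows: precompute one complex basis row per Mel-spaced frequency, then obtain all features with one dense matrix product against the time-domain frames. Everything specific here is an assumption made for illustration and not taken from the paper: the HTK-style Mel formula, the Hann window folded into the basis, the power output, and the names `mel_ndft_basis` and `project`. The product is written in plain Python so the block runs without dependencies; in practice it would be a single matmul call on the accelerator.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_ndft_basis(n_bins, frame_len, sample_rate, f_min=0.0, f_max=None):
    """Return (freqs, basis); basis has n_bins rows of frame_len complex weights.

    Requires n_bins >= 2. Frequencies are equally spaced on the Mel scale.
    """
    f_max = sample_rate / 2 if f_max is None else f_max
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    freqs = [mel_to_hz(m_lo + (m_hi - m_lo) * k / (n_bins - 1))
             for k in range(n_bins)]
    # Periodic Hann window, folded into the basis so no separate pass is needed.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    basis = [[window[n] * cmath.exp(-2j * math.pi * f * n / sample_rate)
              for n in range(frame_len)] for f in freqs]
    return freqs, basis


def project(basis, frames):
    """Dense product frames x basis^T, returned as power per (frame, bin)."""
    return [[abs(sum(b * x for b, x in zip(row, frame))) ** 2 for row in basis]
            for frame in frames]


if __name__ == "__main__":
    freqs, basis = mel_ndft_basis(n_bins=16, frame_len=256, sample_rate=8000)
    tone = [math.cos(2 * math.pi * freqs[8] * n / 8000) for n in range(256)]
    feats = project(basis, [tone])
    print("strongest bin:", max(range(16), key=lambda k: feats[0][k]))
```

The basis depends only on the frame length, sample rate, and bin count, so it is built once and reused for every frame, which is what lets the whole frontend collapse into one dense operation.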
The experiments are robust, involving multiple hardware platforms (NVIDIA H100, V100, Apple M4 Pro, and A18 Pro) and demonstrating significant speedups and energy reductions. The benchmarks are comprehensive, covering various input durations and providing detailed latency and energy consumption metrics. The downstream task validation on VoxCeleb1 and SPIRA COVID-19 detection further strengthens the findings, showing that the new method maintains competitive performance with traditional approaches.
The paper provides a GitHub repository with source code, benchmark scripts, and configuration files, which enhances reproducibility. The detailed descriptions of experimental setups, including hardware configurations and statistical methodologies, allow other researchers to replicate the experiments effectively.
One limitation discussed is the scaling behavior of the proposed method, which shows diminishing returns as the number of Mel bins increases. The authors acknowledge that while the method is advantageous in the compact-bin regime, it may not perform as well in scenarios requiring a larger number of Mel bins. Additionally, the paper does not explore the potential for further optimization or adaptation to other audio processing tasks beyond the evaluated benchmarks.
The proposed MelT framework has significant implications for the efficiency of audio processing in machine learning applications, particularly in environments where computational resources are limited. By aligning audio feature extraction with the capabilities of modern accelerators, this work could lead to more efficient real-time audio applications, including speech recognition and classification tasks. The findings may inspire further research into hardware-optimized audio processing techniques, potentially influencing future designs of audio frontends in deep learning systems.
Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.
Primary: Ilmenau University of Technology
All Institutions: Ilmenau University of Technology
The paper presents a novel lightweight slot-attention framework for MI-MPE, contributing significantly to the field by addressing the challenges of source decomposition and pitch estimation in complex audio mixtures. The methodology and experimental results indicate a promising direction for future research in music information retrieval.
The paper proposes a lightweight slot-attention framework for multi-instrument multi-pitch estimation (MI-MPE), which innovatively uses permutation-invariant Hungarian matching to allow for flexible output semantics. The methodology is well-structured, introducing a self-supervised timbre encoder and a polyphony branch to enhance the model's capabilities. The use of an unordered set of pitch maps is particularly noteworthy, as it addresses the challenges of fixed output semantics in traditional models. However, the complexity of the model's architecture may pose challenges for practical implementation and deployment.
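Permutation-invariant matching between slots and reference sources can be illustrated with a small sketch. To stay dependency-free it enumerates all assignments by brute force, which is a stand-in for the Hungarian algorithm: both return a minimum-cost assignment, but the Hungarian algorithm does so in polynomial time (for example via SciPy's `linear_sum_assignment`). The mean-squared-error cost and the names `pairwise_cost` and `match_slots` are assumptions, not the paper's loss.

```python
from itertools import permutations


def pairwise_cost(pred, target):
    """Mean squared error between two flattened pitch maps."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def match_slots(slot_preds, targets):
    """Return (assignment, loss), where assignment[j] is the slot for target j.

    The number of slots may exceed the number of targets; extra slots
    simply stay unmatched, so the slot count acts as an upper bound.
    """
    best = None
    for perm in permutations(range(len(slot_preds)), len(targets)):
        cost = sum(pairwise_cost(slot_preds[s], targets[j])
                   for j, s in enumerate(perm))
        if best is None or cost < best[1]:
            best = (perm, cost)
    return best


if __name__ == "__main__":
    slots = [[0, 0, 1], [1, 0, 0], [0, 1, 0]]  # three slots
    sources = [[1, 0, 0], [0, 0, 1]]           # two active sources
    print(match_slots(slots, sources))
```

Because the loss is taken over the best assignment, no slot is tied to a fixed instrument during training.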
The experiments are comprehensive, systematically evaluating the proposed model across various configurations and datasets, including URMP and mshoxxDB. The results indicate that the slot-based approach, particularly with the incorporation of timbre and polyphony supervision, shows promise in improving source decomposition and pitch estimation. However, the performance on stem-level predictions remains inconsistent, highlighting the need for further refinement in the model's design and training.
The paper provides a detailed description of the methodology, including the architecture, training protocols, and datasets used. However, the lack of publicly available code or a demo URL limits reproducibility. Clearer documentation or a supplementary material section could enhance the ability of other researchers to replicate the study.
The paper acknowledges that while the slot-based architecture shows potential, source assignment remains a significant challenge. The coupling between auxiliary objectives and slot decomposition is identified as a limitation, suggesting that further research is needed to disentangle these components. Additionally, the performance variability across different datasets indicates that the model may not generalize well to all types of music.
The proposed framework has the potential to advance the field of music information retrieval by enabling more accurate and flexible multi-pitch estimation in complex audio mixtures. This could have applications in automatic music transcription, music analysis, and even real-time audio processing systems. The lightweight nature of the model also suggests it could be deployed in resource-constrained environments, broadening its accessibility.
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
Primary: Columbia University
All Institutions: Columbia University
The main contribution of this paper is the introduction of Sympatheia, a voice-native framework for emotionally aligned speech dialogue that integrates implicit and explicit affect conditioning. This work represents a significant advancement in the development of empathetic voice assistants, providing a comprehensive approach to generating emotionally appropriate responses in spoken dialogue systems. The combination of a novel dataset, robust methodology, and thorough evaluation underscores its importance in the field of machine learning and audio processing.
The methodology presented in this paper is robust and innovative, combining implicit affect inference from user speech with explicit valence-arousal (VA) conditioning. The authors construct a novel dataset (Sympatheia-18k) that allows for the training of a speech-to-speech dialogue system capable of generating emotionally appropriate responses. The use of continuous VA coordinates as a conditioning mechanism is a significant advancement over traditional discrete emotion categories, allowing for more nuanced emotional responses. The integration of multimodal emotion sensing modules adds further depth to the system, making it adaptable to various input types. The architecture follows a well-established speech-language model (GLM-4-Voice) but enhances it with emotional conditioning, which is a thoughtful approach to improving empathetic dialogue systems.
The experimental evaluation is comprehensive, utilizing both automated and human assessments to evaluate the empathetic response quality of the Sympatheia system. The authors employ a variety of metrics, including empathy scores from an audio-capable LLM and a human Emotion Mean Opinion Score (MOS) study, which provides a well-rounded view of the model's performance. The results indicate that Sympatheia significantly outperforms baseline models in generating emotionally appropriate responses, validating the effectiveness of the proposed methods. The use of both emotional and neutral splits in the dataset allows for a thorough examination of the model's capabilities across different emotional contexts.
The paper provides detailed implementation details, including training configurations and dataset generation processes, which enhance reproducibility. The availability of the project code and dataset on GitHub and Hugging Face respectively further supports the ability of other researchers to replicate the study. However, the reliance on synthetic data for training may introduce variability that could affect reproducibility in real-world applications.
The paper acknowledges several limitations, including the synthetic nature of the training data, which may not fully capture the complexity of real-world conversations. Additionally, the fixed VA anchors used for emotional conditioning may not universally apply across different cultures or individual expressions of emotion. The authors also note that the current evaluation primarily relies on automated assessments, which may miss nuanced failures in empathy and appropriateness.
The potential applications of Sympatheia are significant, particularly in assistive technologies, education, and mental health support, where emotionally aware interactions can enhance user experience. However, the deployment of such systems raises ethical considerations regarding privacy and the potential for misuse in manipulative contexts. The authors emphasize the need for safeguards and responsible deployment practices to mitigate these risks.
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high-dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density of the relevant components in the representation while using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.
Primary: Norwegian University of Science and Technology
All Institutions: Norwegian University of Science and Technology, Tsinghua University
This paper presents a novel framework for using continuous normalizing flows in out-of-distribution detection, significantly advancing the understanding and application of generative models in high-dimensional data analysis. The methodology is innovative, addressing key challenges in the field, and the experimental results demonstrate its effectiveness in a practical application.
The paper introduces a novel Lagrangian sub-flow (LSF) framework for out-of-distribution (OOD) detection using continuous normalizing flows (CNFs). The methodology is well-grounded in fluid dynamics principles, allowing for localized analysis of high-dimensional data while maintaining global context. The approach effectively addresses the "likelihood paradox" by isolating relevant components in the data representation, which is a significant advancement in the field of generative models. The proposed geometric diagnostic signals and metrics for phoneme-level mispronunciation detection are innovative and provide a fresh perspective on OOD detection.
The experiments are robust, utilizing a real-world dataset (CMU Kids) for zero-shot phoneme-level mispronunciation detection. The results demonstrate the superiority of the proposed metrics over traditional likelihood-based methods, highlighting the effectiveness of the LSF framework. The evaluation metrics, including ROC-AUC, are appropriate for the task, although further validation across diverse datasets would strengthen the findings.
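As a reminder of what the ROC-AUC figure measures here, the following is a minimal sketch of the rank-based definition: the probability that a randomly chosen mispronounced phoneme receives a higher anomaly score than a randomly chosen correct one. The scores and labels are hypothetical toy values, not results from the paper, and the function name `roc_auc` is chosen for illustration.

```python
def roc_auc(scores, labels):
    """Rank-based ROC-AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical per-phoneme anomaly scores (higher = more likely mispronounced).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]  # 1 = mispronounced, 0 = correct
auc = roc_auc(scores, labels)  # 8 of the 9 positive/negative pairs are ranked correctly
```

Because the metric depends only on the ranking of scores, it allows likelihood-based and geometry-based detectors to be compared without choosing a decision threshold.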
The paper provides sufficient details on the experimental setup, including model training and evaluation processes. However, the lack of publicly available code or a demo limits reproducibility. Clear descriptions of the methods and metrics used contribute positively, but access to implementation details would enhance reproducibility.
The study is primarily focused on a specific application in speech synthesis, which may limit the generalizability of the findings. The authors acknowledge the need for further validation across other domains, indicating that the framework's applicability is yet to be fully explored. Additionally, the complexity of the proposed methods may pose challenges for practical implementation in real-time systems.
The proposed framework has the potential to significantly improve OOD detection in various applications beyond speech synthesis, such as computer vision and medical imaging. By enhancing the ability to detect mispronunciations and other anomalies, this work could lead to advancements in automated speech recognition and generative modeling, ultimately benefiting user experience and system reliability.
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of QuAP, a prototype system that integrates content-based audio retrieval and procedural sound generation, thereby addressing the fragmentation in current sound design workflows. This work represents a significant advancement in audio processing, combining innovative methodologies with practical applications, and highlights the importance of user-centered design in the development of creative tools.
The methodology employed in the development of QuAP is robust, integrating a hybrid retrieval system with procedural audio synthesis and an intelligent parameter assistant. The use of MobileNet for audio embeddings and the feature-driven bottleneck framework for optimizing synthesis parameters demonstrates a thoughtful approach to addressing the challenges in sound design workflows. However, the paper could benefit from a more detailed description of the implementation specifics and the exact parameters used in the optimization process.
The experimental evaluation is well-structured, utilizing a MUSHRA subjective evaluation to assess the quality of the synthesized audio and an ablation study to compare encoder architectures. The results indicate statistically significant improvements in sound quality for most models, which supports the effectiveness of the proposed system. However, the relatively small sample size in the user evaluation (16 participants) may limit the generalizability of the findings.
While the paper provides a project URL and mentions the use of established datasets and frameworks, it lacks detailed implementation instructions or code availability, which could hinder reproducibility. More explicit documentation on the setup and execution of experiments would enhance this aspect.
The study acknowledges limitations, particularly in the synthesis quality of certain models (e.g., Rocket and Jet) and the narrow scope of sound categories supported by QuAP. The reliance on subjective evaluations may also introduce biases, and the tool's performance in real-world scenarios remains to be fully validated.
QuAP has the potential to significantly impact sound design practices by streamlining workflows and enhancing creative exploration. By unifying retrieval and synthesis, it could facilitate more efficient sound design processes across various industries, including film, gaming, and music production. The focus on maintaining creative agency while providing intelligent assistance is particularly relevant in the context of increasing automation in creative fields.
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
Primary: Athens University of Economics and Business
All Institutions: Athens University of Economics and Business, Orfium, Hellenic Mediterranean University, National Center for Scientific Research "Demokritos"
The main contribution of this paper is the introduction of a framework for deterministic attribute modulation in symbolic music generation through activation steering, which enhances interpretability and control without the need for retraining. This work is significant as it bridges the gap between complex generative models and user-driven control, paving the way for more interactive and user-friendly music generation systems.
The paper presents a novel approach to activation steering in the Multitrack Music Transformer (MMT) by utilizing the Difference-in-Means (DiffMean) methodology to isolate latent directions for musical attributes. The introduction of a Dual Steering framework using Gram-Schmidt Orthogonalization is a significant advancement in addressing feature entanglement, allowing for independent control of attributes like Pitch and Duration. The methodology is well-structured, leveraging existing theories in mechanistic interpretability while innovatively applying them to symbolic music generation.
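The two ingredients named above can be sketched on toy vectors. This is an illustrative sketch under stated assumptions, not the authors' implementation: the activations are made-up 3-dimensional lists standing in for residual-stream states, the helper names (`diff_mean`, `orthogonalize`, `steer`) are hypothetical, and the choice to orthogonalize the Duration direction against the Pitch direction (rather than the reverse) is an assumption.

```python
def mean_vec(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_mean(pos_acts, neg_acts):
    """Difference-in-Means direction: mean(high-attribute) - mean(low-attribute)."""
    return [a - b for a, b in zip(mean_vec(pos_acts), mean_vec(neg_acts))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(v, u):
    """One Gram-Schmidt step: remove from v its projection onto u."""
    scale = dot(v, u) / dot(u, u)
    return [a - scale * b for a, b in zip(v, u)]


def steer(hidden, directions, alphas):
    """Add each steering direction, scaled by its strength, to a hidden state."""
    out = list(hidden)
    for d, alpha in zip(directions, alphas):
        out = [x + alpha * y for x, y in zip(out, d)]
    return out


# Toy activations (assumed values) for contrastive prompt sets.
high_pitch = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
low_pitch = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
long_dur = [[1.0, 3.0, 0.0], [3.0, 3.0, 0.0]]
short_dur = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

pitch_dir = diff_mean(high_pitch, low_pitch)   # [2.0, 0.0, 0.0]
dur_dir = diff_mean(long_dur, short_dur)       # [1.0, 2.0, 0.0], overlaps with pitch
dur_orth = orthogonalize(dur_dir, pitch_dir)   # [0.0, 2.0, 0.0], overlap removed

steered = steer([0.0, 0.0, 0.0], [pitch_dir, dur_orth], [1.0, 0.5])
```

In the toy example, naive addition of `dur_dir` would also move the state along the pitch axis; after the orthogonalization step the Duration vector no longer changes the component along `pitch_dir`, which is the kind of decoupling the Dual Steering framework is described as targeting.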
The experimental setup is robust, with clear definitions of the steering vectors and comprehensive evaluations across both unconditional and conditional generation paradigms. The use of statistical measures such as Pearson correlation coefficients and R² values provides a solid quantitative basis for the effectiveness of the steering methods. The results demonstrate a high degree of success in achieving the intended attribute shifts, with detailed analysis of steering dynamics across various layers of the transformer architecture.
The paper includes sufficient detail regarding the model architecture, data representation, and experimental procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the experiments. The URL provided for audio examples is a positive aspect, but a more comprehensive project URL would bolster reproducibility further.
One limitation is the reliance on a single dataset (SOD), which may affect the generalizability of the findings. Additionally, while the paper addresses conceptual interference, the methods for dual steering may still encounter challenges in more complex musical contexts or with additional attributes. The paper could also benefit from a discussion on the computational efficiency of the proposed methods in real-time applications.
This research has the potential to significantly impact the field of music generation and AI-driven creative tools, providing musicians and composers with more precise control over generated outputs. The findings could be applied in various applications, including algorithmic composition, interactive music systems, and educational tools for music theory. The focus on mechanistic interpretability also contributes to the broader discourse on transparency and explainability in AI systems.
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectrally similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.
Primary: Fudan University
All Institutions: Fudan University
The MindVoice framework represents a significant advancement in reconstructing intelligible speech from non-invasive neural signals, utilizing a novel dual-stream architecture that effectively leverages pretrained models to address the challenges posed by noisy and incomplete neural recordings. This work has the potential to impact both the fields of auditory neuroscience and speech technology significantly.
The proposed MindVoice framework introduces a dual-stream architecture that separates semantic and acoustic reconstruction, leveraging pretrained models to enhance the intelligibility of reconstructed speech from non-invasive neural signals. This approach is innovative as it addresses the inherent noise and spatial blurring of neural recordings by disentangling the reconstruction process into two complementary pathways. The use of pretrained models for both semantic and acoustic attributes is a significant methodological advancement, allowing the model to compensate for the incomplete information present in neural signals. The architecture's design is well-justified, and the integration of various neural network components, including CNNs and Transformers, is appropriate for the task.
The authors conduct extensive experiments on two datasets (Brennan EEG and Gwilliams MEG), demonstrating that MindVoice outperforms existing baselines across multiple metrics, including semantic accuracy and speech quality. The evaluation metrics employed, such as HuBERT representation similarity and BERTScore-F1, are robust and relevant for assessing the intelligibility and quality of reconstructed speech. The results indicate a clear improvement over previous methods, validating the effectiveness of the proposed framework. However, the paper could benefit from more detailed comparisons with additional baselines and a broader range of evaluation metrics.
The implementation details are provided, including the architecture, training parameters, and preprocessing steps. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing the code and models to facilitate further research and validation by the community.
The study acknowledges limitations, including the model's tendency to produce generative hallucinations when neural signals do not provide sufficient information. The focus on semantic and timbre similarity may compromise fine-grained temporal fidelity, which is critical for certain applications. Additionally, the framework's applicability is currently limited to non-invasive neural signals related to auditory perception, leaving open questions about its performance on other types of neural signals.
The research has significant implications for the development of non-invasive speech brain-computer interfaces, potentially enabling communication for individuals with speech impairments. It also contributes to our understanding of auditory processing in the brain, paving the way for future studies in auditory neuroscience. The framework's ability to reconstruct intelligible speech from neural signals could lead to advancements in assistive technologies and enhance our understanding of human cognition.
Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propose a new self-supervised approach to learning prosody representations that incorporates speaker disentanglement strategies. We evaluate our encoder on three tasks to probe representation capabilities, including pitch reconstruction and detection of different prosodic events. Our encoder outperforms raw prosody and HuBERT-base baselines, achieving strong speaker disentanglement without adverse impact on prosody-related downstream tasks.
Primary: University of Washington
All Institutions: University of Washington
The main contribution of this paper is the development of a self-supervised prosody encoder that successfully disentangles speaker characteristics while preserving prosodic information, addressing critical privacy concerns in speech processing. The technical contributions and innovative methodology position this work as a meaningful advancement in the field of audio processing, with potential applications in privacy-sensitive speech technologies.
The methodology presented in this paper is robust, leveraging self-supervised learning to create a prosody encoder that effectively disentangles speaker characteristics from prosodic features. The use of glottal source estimation as input is innovative, and the combination of adversarial training with speaker normalization is a thoughtful approach to mitigate privacy concerns while maintaining prosody representation quality. The architecture builds on existing models like HuBERT and ProsodyBERT, but introduces significant enhancements, particularly in the context of privacy-preserving applications.
The experimental evaluation is comprehensive, utilizing multiple tasks to assess the encoder's performance, including pitch reconstruction and prosodic event detection. The results demonstrate clear improvements over baseline models, indicating that the proposed methods effectively enhance prosody modeling without compromising speaker disentanglement. The use of extensive datasets, such as the GigaSpeech corpus, strengthens the validity of the findings.
The paper provides detailed implementation information, including the training setup and the specific datasets used. However, the reliance on pseudo-labels for speaker normalization may affect reproducibility, as the effectiveness of the disentanglement strategies could vary with different labeling approaches. The GitHub repository linked in the paper aids in reproducibility, but the absence of publicly available code for some related works limits comparative evaluations.
The paper acknowledges limitations, including the use of pseudo-labels instead of ground-truth speaker labels, which may hinder the effectiveness of the proposed methods. Additionally, the focus on local prosodic events could limit the generalizability of the findings to more complex paralinguistic tasks. The model's non-causal nature also restricts its application in real-time scenarios.
The implications of this research are significant, particularly in the context of privacy-preserving speech technologies. By effectively disentangling speaker information from prosodic features, the proposed encoder can contribute to safer speech processing applications, such as AI assistants and voice synthesis systems, where user privacy is paramount. The approach could also inspire further research into privacy-preserving techniques across various domains of machine learning.
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper provides a unified taxonomy and empirical evaluation of jailbreak attacks and defenses for LALMs, contributing significantly to the understanding of vulnerabilities in audio-based models. The comprehensive approach and findings underscore the importance of considering multiple dimensions of safety and usability in the design of LALMs.
The paper presents a comprehensive taxonomy of jailbreak attacks and defenses in Large Audio Language Models (LALMs), categorizing them into semantic, acoustic, signal, and embedding-layer attacks, as well as guard-based, training-free, and training-based defenses. The methodology is robust, combining a structured survey with empirical evaluations across ten open-source LALMs, which allows for a fair comparison of various attack and defense strategies. The authors also introduce a cost-aware evaluation framework that considers not just attack success rates but also benign refusal and latency, which is a significant improvement over previous works that focused solely on success rates.
The experiments are well-structured, utilizing a controlled dataset from JailbreakBench with 100 harmful and 100 benign requests, allowing for a clear assessment of the effectiveness of various attacks and defenses. The results indicate that different attack strategies yield varying success rates, with the Acoustic Best-of-N attack demonstrating the highest vulnerability. The empirical evaluation of defenses reveals a trade-off between robustness and usability, highlighting the complexity of ensuring safety in LALMs.
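The three quantities reported together (attack success rate, benign refusal, latency) can be computed from per-request records as in the sketch below. It is a simplified illustration: the `Trial` record and `summarize` helper are hypothetical, the data are toy values, and "not refused" is used as a stand-in for a successful attack, whereas the paper's actual success judgment may rely on a separate judge model.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    harmful: bool      # True for a harmful request, False for a benign one
    refused: bool      # did the model refuse to answer?
    latency_s: float   # end-to-end time for the request, in seconds


def summarize(trials):
    """Return attack success rate, benign refusal rate, and mean latency."""
    harmful = [t for t in trials if t.harmful]
    benign = [t for t in trials if not t.harmful]
    return {
        # Simplification (assumption): any non-refused harmful request counts as a success.
        "attack_success_rate": sum(not t.refused for t in harmful) / len(harmful),
        "benign_refusal_rate": sum(t.refused for t in benign) / len(benign),
        "mean_latency_s": sum(t.latency_s for t in trials) / len(trials),
    }


# Toy records: 4 harmful requests (1 gets through), 4 benign requests (1 wrongly refused).
trials = [
    Trial(True, True, 1.0), Trial(True, True, 2.0),
    Trial(True, True, 1.0), Trial(True, False, 2.0),
    Trial(False, False, 1.0), Trial(False, False, 2.0),
    Trial(False, False, 1.0), Trial(False, True, 2.0),
]
report = summarize(trials)
```

Reporting the benign refusal rate next to the attack success rate makes the robustness/usability trade-off visible: a defense that refuses everything would reach zero attack success while scoring a benign refusal rate of 1.0.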
The paper provides detailed descriptions of the experimental setup, including the datasets used, the models evaluated, and the specific attack and defense methods employed. However, the reliance on specific hardware and configurations may limit the reproducibility of results in different environments. The authors do not provide code or data access, which could further hinder reproducibility.
The authors acknowledge several limitations, including the restricted model coverage to ten open-source LALMs and the controlled nature of the dataset, which may not fully represent real-world scenarios. Additionally, the evaluation metrics used may not capture all aspects of deployment, such as user satisfaction with benign responses. The paper also does not explore all possible attack and defense categories outlined in the taxonomy.
The findings of this paper have significant implications for the development of safe and robust LALMs, particularly in applications involving voice assistants and interactive systems. The emphasis on cost-aware evaluation and the identification of vulnerabilities across different modalities can guide future research in creating more resilient audio systems. The work also raises awareness about the potential for misuse of LALMs in bypassing safety mechanisms, highlighting the need for ongoing research into equitable and effective safety measures.
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
Primary: University of Southern California
All Institutions: University of Southern California, The Ohio State University, University of California, Los Angeles, Harvard University, Boston University, University of Miami
The main contribution of this paper is the introduction of the ChildVox benchmark, which systematically evaluates a wide range of child-centered audio and speech tasks, significantly advancing the field of child communication research. The comprehensive methodology, rigorous experimental design, and acknowledgment of limitations highlight the paper's significance and potential impact on future research and applications in audio processing for children.
The methodology presented in the paper is robust, as it introduces the ChildVox benchmark, which encompasses a wide range of child-centered audio and speech tasks. The integration of over 20 sub-tasks across 17 datasets is a significant advancement, allowing for a comprehensive evaluation of various audio and speech foundation models. The approach to define "voice" in children broadly, including physiological sounds and non-linguistic vocalizations, is innovative and necessary for understanding child communication. The evaluation of multiple model architectures, including self-supervised and ASR-oriented models, provides a well-rounded perspective on the capabilities of current technologies in this domain.
The experiments are thorough, with a clear structure that includes a variety of tasks and datasets. The benchmark results demonstrate that ChildVox provides high-performance models for recognizing a wide range of acoustic signals from children. The paper effectively compares the performance of different models on specific tasks, highlighting the strengths and weaknesses of each. The use of Macro-F1 scores for classification tasks and WER for ASR tasks is appropriate, ensuring that the evaluation metrics are relevant to the goals of the benchmark.
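For readers less familiar with the two metrics, the sketch below gives minimal reference definitions: Macro-F1 as the unweighted mean of per-class F1 scores, and WER as word-level edit distance divided by reference length. The class names and sentences are hypothetical examples, not items from the ChildVox datasets, and the benchmark's own scoring scripts may apply text normalization that is not shown here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical vocalization labels and a hypothetical ASR transcript.
f1 = macro_f1(["cry", "cry", "babble", "laugh"],
              ["cry", "babble", "babble", "laugh"])
error_rate = wer("the cat sat on the mat", "the cat sit on mat")  # 1 substitution + 1 deletion
```

Macro averaging matters for this kind of benchmark because child vocalization classes are often imbalanced, and a plain accuracy score would be dominated by the most frequent class.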
The paper provides detailed information about the datasets, experimental setup, and model training parameters, which enhances reproducibility. However, the lack of publicly available code or models limits other researchers' ability to fully replicate the results. The authors mention plans to release models under a Responsible AI License, which is a positive step towards improving reproducibility in the future.
The paper acknowledges several limitations, including the focus on English-language recordings, which may restrict generalizability to other languages and dialects. Additionally, the subjective nature of some tasks, such as affective vocalization classification, may introduce variability in annotation reliability. The authors also note that the benchmark does not cover all recent advancements in audio foundation models, which could limit its comprehensiveness.
The ChildVox benchmark has significant implications for research in child development, speech therapy, and early childhood education. By providing a structured framework for evaluating child-centered audio processing, it can facilitate advancements in understanding children's communication and support the development of tools for monitoring and enhancing language skills. The potential applications in clinical settings for tracking speech production and language development are particularly noteworthy.
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, University of Surrey
The main contribution of this paper is the introduction of COMET, a novel framework for analyzing and mitigating the modality gap in audio-text multimodal contrastive embeddings, which significantly enhances the performance of zero-shot audio captioning tasks. The comprehensive analysis and innovative methodology position this work as a meaningful advancement in the field of multimodal machine learning.
The paper introduces a novel framework, COMET, utilizing Partial Least Squares Singular Value Decomposition (PLS-SVD) to analyze and mitigate the modality gap between audio and text embeddings in CLAP models. The methodology is well-structured, offering a fresh perspective on the decomposition of multimodal embeddings into interpretable concepts. The spectral truncation method proposed is innovative, allowing for effective dimensionality reduction while maintaining performance, which is a significant contribution to the field of multimodal contrastive learning.
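As a rough illustration of the idea, the sketch below applies a plain PLS-SVD to paired audio and text embeddings: it takes the SVD of their cross-covariance matrix and keeps only the first k axis pairs. The function names, the toy data, the centring step, and the hand-picked k are assumptions made for illustration; they are not taken from the paper, and NumPy is assumed to be available.

```python
import numpy as np

def pls_svd(audio, text):
    """SVD of the cross-covariance between centred, paired embeddings."""
    a_mean, t_mean = audio.mean(axis=0), text.mean(axis=0)
    cross_cov = (audio - a_mean).T @ (text - t_mean) / (len(audio) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return u, s, vt.T, a_mean, t_mean

def project(emb, mean, axes, k):
    """Keep the first k axes (spectral truncation) and L2-normalise rows."""
    z = (emb - mean) @ axes[:, :k]
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Toy paired embeddings (hypothetical): 4 shared "concept" dimensions,
# 12 modality-specific noise dimensions, and a constant offset on the
# text side standing in for a mean shift.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 4))
audio = np.hstack([shared, 0.1 * rng.normal(size=(200, 12))])
text = np.hstack([shared, 0.1 * rng.normal(size=(200, 12))]) + 0.5

u, s, v, a_mean, t_mean = pls_svd(audio, text)
k = 4  # number of retained axes, chosen by hand for this toy example
za = project(audio, a_mean, u, k)
zt = project(text, t_mean, v, k)
similarity = za @ zt.T  # cosine similarity in the truncated space
```

In this toy setup the singular values separate the shared axes from the noise axes, which is the kind of spectrum a truncation rule would inspect; how the paper actually selects the retained axes is not stated in this summary.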
The experiments are comprehensive, utilizing standard datasets like Clotho and AudioCaps for evaluation. The results demonstrate that the proposed PLSHead method achieves comparable or improved performance over the original embeddings, validating the effectiveness of the approach. The paper provides detailed metrics for retrieval tasks, showcasing the robustness of the method across different scenarios, including in-domain and cross-domain evaluations.
The paper does not provide explicit implementation details or released code, which could hinder reproducibility. While the methodology is clearly described, the absence of a publicly available codebase or demo limits the ability of other researchers to replicate the findings.
One limitation is the reliance on existing CLAP models, which may introduce biases based on their training data. Additionally, while the proposed methods show promise, the paper does not explore the potential impacts of varying the number of retained dimensions in the spectral truncation, which could affect generalization in different contexts.
The findings have significant implications for audio understanding and generation tasks, particularly in zero-shot scenarios. By effectively bridging the modality gap, the proposed methods could enhance the performance of multimodal applications, making them more accessible and efficient. This work could pave the way for future research in multimodal learning and its applications in real-world scenarios.
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
Primary: KAIST
All Institutions: KAIST, Google DeepMind
The main contribution of this paper is the systematic evaluation of decoding strategies for DLM-based ASR, revealing that static and dynamic thresholding significantly enhance accuracy and speed compared to fixed-number decoding. This work provides a crucial step towards optimizing ASR systems, particularly in leveraging the unique properties of DLMs for improved performance.
The paper presents a systematic evaluation of decoding strategies for DLM-based ASR, comparing fixed-number, static threshold, and dynamic threshold approaches. The methodology is well-structured, utilizing Negative Log-Likelihood (NLL) as a measure of uncertainty, which is a novel approach in this context. The authors effectively analyze the performance of each strategy in terms of accuracy and speed, providing a clear rationale for their findings. However, the reliance on a single baseline model (Whisper-LLaDA) may limit the generalizability of the results.
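The three schemes differ only in how many masked positions they commit per round. The toy sketch below makes that concrete under loud assumptions: the confidences are a fixed hand-written list (a real DLM would recompute them every round), the threshold fallback and the geometric decay used for the dynamic threshold are illustrative choices, and none of the function names come from the paper.

```python
def select_fixed(conf, masked, n):
    """Fixed-number: commit the n most confident masked positions."""
    return sorted(masked, key=lambda i: conf[i], reverse=True)[:n]

def select_threshold(conf, masked, tau):
    """Threshold: commit every masked position with confidence >= tau."""
    chosen = [i for i in masked if conf[i] >= tau]
    # Assumption: fall back to the single best token so decoding always advances.
    return chosen or [max(masked, key=lambda i: conf[i])]

def count_rounds(conf, select):
    """Run rounds until nothing is masked; select(conf, masked, round_idx) picks positions."""
    masked, rounds = set(range(len(conf))), 0
    while masked:
        masked -= set(select(conf, masked, rounds))
        rounds += 1
    return rounds

# Hypothetical per-token confidences: most tokens are easy, two are hard.
conf = [0.99, 0.97, 0.95, 0.98, 0.60, 0.96, 0.40, 0.99]

fixed_rounds = count_rounds(conf, lambda c, m, r: select_fixed(c, m, 2))
static_rounds = count_rounds(conf, lambda c, m, r: select_threshold(c, m, 0.9))
# Assumed dynamic rule: the threshold decays geometrically with the round index.
dynamic_rounds = count_rounds(conf, lambda c, m, r: select_threshold(c, m, 0.9 * 0.4 ** r))
```

On this toy input the fixed-number scheme needs 4 rounds, the static threshold 3, and the decaying threshold 2, because the six easy tokens are committed in the first threshold round. This only illustrates the round-count mechanics; it says nothing about accuracy.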
The experiments are comprehensive, utilizing the LibriSpeech dataset and focusing on various hyperparameters for each decoding strategy. The evaluation metrics, including Word Error Rate (WER) and Real-Time Factor (RTF), are appropriate for assessing the performance of ASR systems. The results indicate that threshold-based strategies significantly outperform fixed-number schemes, which is a valuable contribution to the field. However, the paper could benefit from additional experiments on diverse datasets to validate the findings further.
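For readers less familiar with the two metrics, a minimal sketch of their standard definitions follows: WER is the word-level edit distance divided by the number of reference words, and RTF is processing time divided by audio duration. The function names and example strings are illustrative, and this is not the paper's evaluation code (which may apply text normalisation not shown here).

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

def rtf(processing_seconds, audio_seconds):
    """Real-Time Factor: values below 1 mean faster than real time."""
    return processing_seconds / audio_seconds

example_wer = wer("the cat sat on the mat", "the cat sat on mat")  # one deletion
example_rtf = rtf(2.0, 10.0)
```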
The paper provides sufficient details on the experimental setup, including the training process and evaluation metrics. However, the absence of code or a project URL limits reproducibility. Future work should include sharing the implementation to facilitate validation by other researchers.
The study is limited to clean read English speech from the LibriSpeech test-clean set, which may not fully represent the challenges of noisy or spontaneous speech. Additionally, the findings may not generalize to multilingual ASR systems, as the confidence distribution could vary significantly across different languages and contexts.
The findings have significant implications for the development of more efficient ASR systems, particularly in applications requiring real-time processing. By demonstrating the effectiveness of threshold-based decoding strategies, this work could influence future research directions in ASR and related fields, potentially leading to advancements in speech technology and accessibility.
Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.
Primary: Faculdade de Engenharia Elétrica e Computação (FEEC)
All Institutions: Faculdade de Engenharia Elétrica e Computação (FEEC), CNPq, UFRJ, UNICAMP
This paper presents a novel workflow for accent classification in Brazilian Portuguese, demonstrating that localized acoustic features can effectively capture dialectal variance without the need for sociolinguistic labels. The methodology and results contribute meaningfully to the field, showcasing the potential for improved speech processing techniques that are both interpretable and computationally efficient.
The methodology is innovative in its approach to accent classification by utilizing a purely audio-driven pipeline that relies on acoustic labels rather than sociolinguistic labels. The use of ZIPA for phoneme-based forced alignment to isolate accent markers is a significant methodological advancement. The authors effectively demonstrate the extraction of localized features that outperform general-purpose architectures, which is a novel contribution to the field of speech processing. The detailed description of the feature extraction process and the classification tasks is commendable, although the reliance on manual annotation may introduce bias.
The experimental evaluation is thorough, employing a variety of classifiers and a well-structured cross-validation protocol to assess the performance of the proposed features against established SSL models. The results indicate that the proposed method achieves competitive accuracy, which is a strong validation of the approach. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods and a clearer presentation of results in tables.
The paper provides sufficient detail regarding the methods and datasets used, which aids reproducibility. However, the lack of publicly available code or datasets limits independent verification of the results. The authors mention a companion webpage that could provide additional resources, but it needs to be explicitly linked.
The study acknowledges that the accent markers used are not exhaustive for all Brazilian Portuguese accents, indicating a limitation in generalizability. The reliance on manual annotation for training data may also introduce biases that affect the model's performance. Additionally, the paper does not address potential challenges in real-world applications, such as variability in speaker accents and environmental noise.
The work has significant implications for the field of speech recognition and sociolinguistics, particularly in regions with diverse dialects like Brazil. By demonstrating that reliable accent classification can be achieved without sociolinguistic labels, the research opens avenues for more inclusive and accessible speech technologies. This could enhance applications in automatic speech recognition, language learning, and sociophonetic research.
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaohongshu Inc
The paper presents HoliTok, a continuous holistic tokenization model that effectively bridges the gap between speech generation and understanding tasks. Its innovative approach and strong experimental results position it as a significant contribution to the field of audio machine learning.
The proposed HoliTok model introduces a novel continuous tokenization approach that effectively balances the requirements of learnability and decodability for unified speech generation and understanding. The progressive training strategy enhances the model's ability to preserve signal fidelity while incorporating semantic information, which is a significant advancement over existing tokenization methods. The architecture's integration of a variational autoencoder with a temporal bottleneck and a downstream-aware supervision network is a thoughtful design choice that addresses the limitations of traditional tokenizers.
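To give a sense of how compact the stated representation is, the short calculation below works out the temporal downsampling implied by the figures in the abstract (48 kHz input, 25 Hz latents, 128 dimensions). It assumes single-channel audio, which the summary does not state, and the variable names are illustrative.

```python
sample_rate_hz = 48_000   # input waveform rate (from the abstract)
frame_rate_hz = 25        # latent frame rate (from the abstract)
latent_dim = 128          # latent dimensionality (from the abstract)

# Waveform samples summarised by one latent frame (assuming mono audio).
samples_per_frame = sample_rate_hz // frame_rate_hz

# Scalar values per second in the latent sequence.
latent_values_per_second = frame_rate_hz * latent_dim

# Ratio of waveform samples to latent scalars per second.
values_ratio = sample_rate_hz / latent_values_per_second

# Sequence length a language model would see for a 10-second utterance.
frames_for_10s = 10 * frame_rate_hz
```

Each latent frame therefore covers 1920 waveform samples (40 ms), and a 10-second utterance becomes a sequence of 250 frames.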
The experiments conducted demonstrate the model's competitive performance in reconstruction fidelity, speech synthesis, and unified generation-understanding tasks. The evaluation metrics used, including PESQ, STOI, and WER, provide a robust framework for assessing the quality of the generated outputs. The results indicate that HoliTok not only outperforms existing methods but also maintains a compact latent representation, which is crucial for practical applications in speech technology.
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of detailed hyperparameter settings and specific training configurations in the main text may pose challenges for full replication. The availability of the code on GitHub is a positive aspect that aids in reproducibility efforts.
The study primarily focuses on speech generation and understanding, leaving out broader audio applications such as environmental sounds and music. The evaluation is limited to a specific architecture (AR+DiT), which may not capture the full potential of the proposed tokenizer across various modeling paradigms. Future work should explore these areas to validate the generalizability of the approach.
The advancements presented in this paper have the potential to significantly enhance speech synthesis and recognition technologies, making them more efficient and effective. The model's ability to serve as a unified interface for both tasks could lead to improvements in applications such as virtual assistants, automated transcription services, and interactive voice response systems. The implications for accessibility and user interaction with technology are substantial, as improved speech models can facilitate better communication for individuals with speech impairments.
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Google DeepMind, Meta Superintelligence Labs
The main contribution of this work is the introduction of MELD, a joint optimization framework for speech language modeling that effectively integrates discrete latent variables to enhance TTS and STT performance. This approach represents a significant advancement in the field, addressing key limitations of existing methods and paving the way for future research in multimodal speech processing.
The paper presents a novel approach to speech language modeling by introducing MELD, which integrates discrete latent variables into the autoregressive modeling of mel-spectrograms. This joint optimization of the encoder and autoregressive model addresses limitations of previous two-stage methods, particularly in preserving task-relevant information. The methodology is well-structured, leveraging variational inference to optimize a lower bound on the log likelihood, and effectively incorporates both TTS and STT tasks within a single framework. The use of discrete latent variables to suppress silence generation is a significant innovation, enhancing the model's performance over existing methods.
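For orientation, the generic variational lower bound that such latent variable models optimize can be written as follows. This is the textbook form only; the notation ($x$ for the mel-spectrogram, $z$ for the discrete latent sequence, $q_\phi$ for the encoder, $p_\theta$ for the autoregressive model) is an assumption, and the paper's exact factorization and conditioning on text are not given in this summary.

```latex
\log p_\theta(x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big),
\qquad
\mathbb{E}_{q_\phi(z \mid x)}\big[f(z)\big] = \sum_{z} q_\phi(z \mid x)\, f(z)
\quad \text{for discrete } z .
```

Because the encoder $q_\phi$ and the model $p_\theta$ appear in the same objective, gradients from the downstream likelihood reach the encoder, which is the sense in which the two are optimized jointly rather than in two stages.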
The experiments are comprehensive, utilizing the 960-hour subset of the LibriSpeech dataset for training and evaluation. The authors compare MELD against several baselines, including codec-based models and other mel-spectrogram-based approaches, demonstrating clear improvements in both TTS and STT tasks. The evaluation metrics include both subjective (MOS, speaker similarity) and objective (WER) assessments, providing a well-rounded view of the model's performance. The results indicate that MELD outperforms its competitors, particularly in reducing silence and improving word error rates.
The paper provides detailed implementation specifics, including model architecture, training configurations, and evaluation protocols. However, the authors acknowledge challenges in reproducing results from related work (e.g., MELLE), which may affect the perceived reliability of their comparisons. The use of specific datasets and training strategies is well-documented, but the lack of a public code repository or demo limits reproducibility.
The authors note several limitations, including the difficulty in making fair comparisons between codec-based and mel-spectrogram-based methods due to differences in representation mapping. Additionally, while the joint optimization framework is promising, the paper does not explore its application to other speech tasks beyond TTS and STT. The potential for overfitting or collapsing solutions in the discrete latent space is also mentioned, although not observed in their experiments.
The proposed model has significant implications for real-world applications in speech synthesis and recognition, particularly in enhancing the quality and efficiency of TTS systems. The ability to jointly model TTS and STT tasks could streamline workflows in various applications, such as virtual assistants and automated transcription services. However, ethical considerations regarding the misuse of speech generation technologies, such as voice cloning, must be addressed to ensure responsible use.
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, Wonkwang University
The main contribution of this paper is the introduction of a causality-inspired multimodal federated domain generalization framework for respiratory sound classification, which effectively mitigates stethoscope-induced biases and enhances model robustness across heterogeneous devices. The technical contributions are substantial, offering a new lens through which to view the challenges of audio classification in medical contexts, thereby advancing the field significantly.
The proposed methodology introduces a novel federated domain generalization framework specifically tailored for respiratory sound classification, addressing the critical issue of inter-stethoscope variability. The integration of a causality-inspired device style intervention network, counterfactual text augmentation, and gradient alignment represents a significant advancement in the field, as it not only tackles the entanglement of device style and disease content but also enhances the robustness of the model across heterogeneous devices. The approach is well-structured, leveraging causal inference principles to inform data augmentation strategies, which is a fresh perspective in the context of audio classification.
The experimental setup is robust, utilizing two well-defined datasets (ICBHI and SPRSound) and employing leave-one-device-out validation to rigorously assess the model's performance. The results demonstrate that the proposed method consistently outperforms conventional data augmentation and federated learning baselines, indicating its effectiveness in improving cross-device generalization. The ablation studies further substantiate the contributions of each component of the framework, providing clear evidence for the importance of the causality-inspired interventions.
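The leave-one-device-out protocol can be sketched as below: each device is held out in turn as the unseen test domain, and the remaining devices act as federated clients. The record layout, device names, and labels are hypothetical placeholders, not the actual ICBHI or SPRSound metadata.

```python
def leave_one_device_out(records):
    """Yield (held_out_device, clients, test_records) for every device.

    `clients` maps each remaining device to its records, mirroring a
    federated setup in which every client owns one device type.
    """
    devices = sorted({r["device"] for r in records})
    for held_out in devices:
        clients = {d: [r for r in records if r["device"] == d]
                   for d in devices if d != held_out}
        test_records = [r for r in records if r["device"] == held_out]
        yield held_out, clients, test_records

# Hypothetical recordings; device names and labels are placeholders.
records = [
    {"id": 0, "device": "dev_a", "label": "normal"},
    {"id": 1, "device": "dev_a", "label": "wheeze"},
    {"id": 2, "device": "dev_b", "label": "crackle"},
    {"id": 3, "device": "dev_b", "label": "normal"},
    {"id": 4, "device": "dev_c", "label": "wheeze"},
]

folds = list(leave_one_device_out(records))
```

Reporting the average over folds then measures generalization to a device that no client saw during training.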
While the paper mentions that code will be released upon publication, the absence of a current project URL limits immediate reproducibility. The methodology is described in sufficient detail to allow for replication, but access to the code and datasets would be essential for full verification of results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of respiratory sound recordings across different clinical settings. Additionally, the paper acknowledges the need for future work to address privacy concerns and computational efficiency in federated learning settings, which are critical for real-world applications.
The framework has significant potential implications for telemedicine and automated pulmonary disease detection, particularly in enhancing the reliability of AI-driven diagnostics across various healthcare environments. By addressing device-induced biases, the work contributes to the broader goal of equitable healthcare access and improved patient outcomes.
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of CoRe-KD, a structured complete-view distillation framework that significantly enhances the robustness of conversational multimodal emotion recognition under incomplete observations. The methodology effectively addresses key challenges in the field, and the experimental results validate its effectiveness, marking a meaningful advancement in multimodal learning.
The proposed CoRe-KD framework innovatively addresses the challenges of multimodal emotion recognition (MER) under incomplete observations. It introduces two key components: Complete-view State Anchoring (CSA) and Nonverbal Conflict Exposure (NCE), which enhance the robustness of emotion recognition by aligning incomplete-view predictions with structured references from a complete-view teacher. The methodology is well-structured, leveraging knowledge distillation effectively while avoiding the pitfalls of input reconstruction, which is a common issue in existing methods. The use of Gaussian-inspired states for modality fusion is a notable technical contribution that adds precision to the alignment process.
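The general shape of such an anchoring objective can be sketched as a prediction-level term plus a state-level term. The version below uses a temperature-scaled KL divergence between teacher and student predictions and a mean squared error between their fused states; the temperature, the weight, the plain-vector states, and all function names are assumptions for illustration. It does not model the Gaussian-inspired states or the NCE component described in the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(a, b):
    """Mean squared error between two equally sized state vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anchoring_loss(teacher_logits, student_logits, teacher_state, student_state,
                   temperature=2.0, state_weight=1.0):
    """Prediction-level KL plus state-level MSE (both weights are assumed)."""
    pred_term = kl_divergence(softmax(teacher_logits, temperature),
                              softmax(student_logits, temperature))
    return pred_term + state_weight * mse(teacher_state, student_state)

# Complete-view teacher vs. a student that saw an incomplete view (toy numbers).
teacher_logits, teacher_state = [2.0, 0.5, -1.0], [0.3, -0.2, 0.8]
student_logits, student_state = [1.0, 1.0, -0.5], [0.1, 0.0, 0.5]

loss = anchoring_loss(teacher_logits, student_logits, teacher_state, student_state)
matched = anchoring_loss(teacher_logits, teacher_logits, teacher_state, teacher_state)
```

The loss vanishes only when the student reproduces both the teacher's predictions and its states, which is the sense in which the student is anchored to the complete-view reference rather than asked to reconstruct missing inputs.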
The experiments are comprehensive, utilizing established datasets (IEMOCAP, MELD, and CMU-MOSEI) to validate the effectiveness of CoRe-KD under both fixed- and random-missing protocols. The results demonstrate consistent improvements in accuracy and F1 scores compared to various baselines, indicating the robustness of the proposed method. The inclusion of ablation studies further strengthens the findings by elucidating the contributions of each component within the framework.
The paper provides detailed implementation specifics, including training protocols, hyperparameters, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results.
One significant limitation is that CoRe-KD requires complete multimodal observations for training the teacher model, which may not be feasible in all real-world scenarios. Additionally, the NCE module relies on controlled conflict views that might not comprehensively cover all possible real-world misalignments or corruptions in multimodal data.
The advancements in robust conversational MER have implications for various applications, including human-computer interaction, sentiment analysis, and affective computing. By improving the reliability of emotion recognition systems in the presence of missing or unreliable modalities, this work could enhance user experience in applications such as virtual assistants, mental health monitoring, and interactive entertainment.
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality. This fosters the assumption that low-WER tokens inherently preserve the information necessary for intelligible acoustic synthesis. We argue this is fundamentally deceptive. While high-frequency tokens succeed in generation tasks due to implicit information leakage, isolating pure semantic information at ultra-low frame rates strips away the fine-grained articulation and micro-dynamics essential for ODE-based generation. Empirically validating this requires extreme compression without sacrificing WER -- a methodological bottleneck, as standard fixed-stride downsampling arbitrarily truncates phonetic boundaries. To overcome this, we develop a dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, achieving ultra-low frame rates with exceptionally low WER. Using these isolated "pure" semantic tokens, we expose the WER trap: when conditioning generative models -- even with oracle duration alignments -- the reconstructed speech suffers from severe articulation blur and is rendered acoustically unintelligible. Our findings demonstrate that semantic categorization rewarded by low WER is inherently orthogonal to the continuous phonetic trajectories required for synthesis, shattering the illusion of the unified token and advocating for explicitly decoupled speech representations.
Primary: The University of New South Wales
All Institutions: The University of New South Wales, Nanyang Technological University
The paper exposes a fundamental flaw in the assumption that low-WER tokens can universally serve both speech understanding and generation. It rigorously demonstrates that while these tokens may excel in comprehension tasks, they fail to preserve the necessary micro-dynamics for intelligible speech synthesis, advocating for decoupled representations in future speech models.
The paper presents a novel dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, addressing the methodological bottleneck of fixed-stride downsampling that corrupts phonetic boundaries. This approach is innovative as it allows for extreme compression while maintaining low WER, enabling a rigorous evaluation of the unified token hypothesis through the Dual-Probing Protocol. The methodology is well-structured, leveraging existing frameworks while introducing significant improvements in tokenization for speech synthesis.
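The boundary problem with fixed-stride downsampling is easy to see on a toy sequence. The sketch below contrasts mean pooling over fixed windows with mean pooling over segments whose boundaries are given; the one-dimensional frame values, the boundary list, and the function names are hypothetical, and the paper's tokenizer learns its alignment rather than receiving boundaries as input.

```python
def mean(values):
    return sum(values) / len(values)

def fixed_stride_pool(frames, stride):
    """Average consecutive windows of `stride` frames, ignoring content."""
    return [mean(frames[i:i + stride]) for i in range(0, len(frames), stride)]

def boundary_pool(frames, boundaries):
    """Average frames inside each [start, end) segment."""
    return [mean(frames[start:end])
            for start, end in zip(boundaries, boundaries[1:])]

# Three hypothetical units with frame values 1, 5 and 9 and uneven durations.
frames = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]
boundaries = [0, 3, 5, 9]  # assumed known segment edges

fixed = fixed_stride_pool(frames, 4)
aligned = boundary_pool(frames, boundaries)
```

With a stride of 4 the first two windows straddle segment edges and yield mixed values (2.0 and 8.0), while boundary-aligned pooling returns one clean value per unit (1.0, 5.0, 9.0) at a comparable output length.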
The experiments are comprehensive, utilizing large-scale multilingual datasets and employing a dual-probing protocol to assess both discriminative understanding and generative viability. The results demonstrate that while the dynamic tokens achieve high performance in understanding tasks, they fail in generating intelligible speech, effectively illustrating the WER trap. The evaluation metrics, including CER and AVQA accuracy, are appropriate and provide a clear picture of the model's performance.
The paper provides detailed architectural specifications, hyperparameter configurations, and training methodologies, which enhance reproducibility. However, the absence of a public code repository limits the ease with which others can replicate the results. The thoroughness of the experimental setup and the clear delineation of methods contribute positively to reproducibility.
The study acknowledges its limitations, particularly that the generative probe employs a single synthesis paradigm, which may not generalize across different architectures. Additionally, the focus on Mandarin as the sole language for evaluation may restrict the applicability of findings to other languages with different phonetic structures. The paper also notes that while it identifies a critical flaw in the unified token approach, it does not propose a concrete solution for decoupled representations.
The findings have significant implications for the development of speech language models, challenging the prevailing assumption that a single token can suffice for both understanding and generation. This work advocates for a separation of semantic and acoustic representations, which could lead to more effective and intelligible speech synthesis systems. The insights gained from this research could influence future designs in multimodal AI systems, particularly in improving the quality of synthesized speech.
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
Primary: Cochin University of Science and Technology (CUSAT)
All Institutions: Cochin University of Science and Technology (CUSAT)
The paper presents CAFNet, a novel architecture for audio deepfake detection that effectively addresses the challenges of ternary classification and temporal localization of half-truth audio. The methodology is sound, and the experimental results demonstrate significant advancements over existing models, particularly in a multilingual context.
The proposed CAFNet architecture is innovative in its approach to jointly address the challenges of ternary classification and temporal boundary localization for half-truth audio deepfake detection. The use of cross-attentive feature fusion and depthwise-separable convolutions enhances the model's ability to process multiple acoustic features effectively. The integration of BiLSTM for boundary prediction is a well-justified choice, given the temporal nature of the task. However, the paper could benefit from a more detailed discussion on the design choices for the architecture and the rationale behind the specific feature sets used.
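For context on how depthwise-separable convolutions help keep a model this small, the standard parameter-count comparison can be sketched as follows. The layer sizes are hypothetical and are not taken from CAFNet; only the two counting formulas are standard.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration (assumption).
c_in, c_out, k = 64, 128, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(std, dws, round(std / dws, 1))
```

For these illustrative sizes the separable layer needs roughly an eighth of the weights of the standard layer, which is the kind of saving that makes sub-million-parameter detectors feasible.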
The experiments are robust, utilizing a comprehensive dataset (MLADDC) that covers a diverse range of languages and audio conditions. The performance metrics reported, including accuracy, AUC, and MAE for boundary localization, are convincing and demonstrate the effectiveness of CAFNet compared to existing models. The cross-dataset generalization study adds significant value, revealing critical insights into the limitations of current training paradigms in deepfake detection.
The authors provide sufficient details regarding the implementation, including hyperparameters, training protocols, and the architecture of CAFNet. The availability of code and trained models on GitHub enhances reproducibility. However, the paper lacks detailed information on the specific datasets used for training and evaluation, which could hinder full reproducibility.
One notable limitation is the model's performance on the real class, where a significant number of half-truth samples are misclassified as real. This indicates that while the model excels in detecting fully fake and half-truth audio, it struggles with distinguishing genuine audio, which is crucial for practical applications. Additionally, the study highlights the challenge of catastrophic forgetting during domain adaptation, suggesting that the current approach may not be robust across different datasets.
The findings of this research have significant implications for audio forensics and the detection of manipulated media, especially in contexts where misinformation can have serious consequences. The ability to localize manipulations within audio clips enhances the forensic value of detection systems, making them more actionable for users. As deepfake technology continues to evolve, advancements in detection methods like CAFNet will be critical in maintaining trust in audio communications.
Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.
Primary: PEE/COPPE, UFRJ
All Institutions: PEE/COPPE, UFRJ, Carlos Chagas Filho Foundation for Research Support in the State of Rio de Janeiro, National Council for Scientific and Technological Development, CAPES
This paper presents FiPA-SR, a GAN-based model for audio bandwidth extension, demonstrating significant improvements in reconstruction quality and computational efficiency. The innovative use of FiLM layers to adaptively handle multiple bandwidths marks a notable advancement in the field of audio super-resolution.
The methodology is robust, leveraging a GAN-based architecture with FiLM layers to adaptively handle different bandwidths. The use of perceptual metrics and a well-defined training procedure enhances the model's ability to generalize across various input configurations. The innovative approach of combining upsampling with conditional modulation through FiLM layers is a significant advancement over previous models.
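As a reminder of what FiLM conditioning does, a minimal sketch is given below: each feature channel is scaled and shifted by parameters that depend on the conditioning signal, here the input sampling rate. The lookup table and its values are hypothetical stand-ins; in a trained model such as the one described, the scale and shift would be produced by a learned network.

```python
def film(features, gamma, beta):
    """Apply FiLM: scale and shift each channel of `features`.

    features: list of channels, each a list of time steps
    gamma, beta: one scale and one shift per channel
    """
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical conditioning table: input sampling rate in kHz -> (gamma, beta).
# These numbers are made up for illustration (assumption).
FILM_PARAMS = {
    8: ([1.5, 0.5], [0.1, -0.1]),
    20: ([1.2, 0.8], [0.0, 0.0]),
    32: ([1.0, 1.0], [0.0, 0.0]),
}

features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 channels, 3 time steps
gamma, beta = FILM_PARAMS[8]
modulated = film(features, gamma, beta)
```

Because the conditioning only changes per-channel scale and shift, one set of convolutional weights can serve all supported input bandwidths.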
The experiments are thorough, utilizing the MUSDB dataset and comparing against state-of-the-art models. The use of objective metrics like Log-Spectral Distance and ViSQOL provides a solid foundation for evaluating performance. However, the paper could benefit from more qualitative assessments, such as user studies or listening tests, to complement the objective metrics.
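For readers unfamiliar with Log-Spectral Distance, one common formulation is sketched below: the root-mean-square difference of log power spectra per frame, averaged over frames. Definitions vary between papers (for example in log base and scaling), so this is an illustrative variant and not necessarily the exact one used in the evaluation; the function name and the `eps` guard are assumptions.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Mean over frames of the RMS log10 power-spectrum difference.

    ref_power, est_power: lists of frames, each a list of power values
    per frequency bin. `eps` guards against taking the log of zero.
    """
    per_frame = []
    for ref, est in zip(ref_power, est_power):
        squared = [(math.log10(r + eps) - math.log10(e + eps)) ** 2
                   for r, e in zip(ref, est)]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


# Two frames, three frequency bins each (toy values).
reference = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0]]
estimate = [[10.0 * p for p in frame] for frame in reference]
```

Lower values indicate a closer spectral match; an estimate whose power is uniformly ten times the reference sits at a distance of about 1 under this variant.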
The paper provides sufficient details regarding the architecture, training setup, and evaluation metrics, which should enable other researchers to replicate the results. However, the absence of a publicly available code repository limits accessibility.
The study is limited to specific bandwidth configurations and does not explore the model's performance across a broader range of frequencies. Additionally, while the results are promising, the reliance on objective metrics alone may not fully capture perceptual audio quality.
The proposed model has significant implications for audio processing applications, particularly in telecommunications and music production, where bandwidth limitations are prevalent. The ability to reconstruct high-frequency content efficiently could enhance audio quality in various consumer and professional settings.
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaomi Inc.
Dasheng AudioGen represents a substantial advancement in unified audio generation, combining multiple audio types into coherent scenes from textual descriptions. The innovative methodology and comprehensive evaluation contribute significantly to the field, setting a new standard for future research in audio generation.
The paper introduces a novel framework, Dasheng AudioGen, which effectively integrates multiple audio generation tasks into a single model using structured multi-view captions and a unified semantic-acoustic representation. This approach addresses the fragmentation in audio generation by allowing for coherent mixed-audio scene generation from text, which is a significant advancement in the field. The methodology is well-structured, leveraging a flow-matching DiT architecture and a unique conditioning framework that enhances control over audio components. The use of high-dimensional latent spaces for audio representation is particularly innovative, as it allows for better modeling of overlapping audio elements.
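The flow-matching objective mentioned above can be sketched in its simplest form. The review does not specify the probability path used, so the linear interpolation path below is an assumption, and the three-element vectors merely stand in for latent audio frames.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Return the interpolated point x_t and the target velocity.

    x0: noise sample, x1: data sample (lists of floats), t in [0, 1].
    Linear path (assumption): x_t = (1 - t) * x0 + t * x1, velocity = x1 - x0.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error, a typical regression loss on the velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


random.seed(0)
x1 = [0.5, -1.0, 2.0]                      # stand-in for a latent audio frame
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
t = random.random()
x_t, v_target = flow_matching_pair(x0, x1, t)
```

During training, a network would receive `x_t`, `t`, and the text conditioning, and be regressed onto `v_target` with a loss such as `mse`; sampling then integrates the predicted velocity from noise to data.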
The experiments conducted are comprehensive, utilizing a large-scale dataset (ACAVCaps) and a robust evaluation pipeline that includes both objective and subjective metrics. The results demonstrate that Dasheng AudioGen outperforms existing specialized models in mixed-audio generation while maintaining competitive performance in single-type tasks. The introduction of the MECAT benchmark for mixed-audio evaluation is a valuable contribution, providing a new standard for assessing model performance in this area.
The paper mentions limitations in reproducibility due to reliance on a private dataset, which may hinder others from replicating the results. However, the detailed methodology and experimental setup provide a clear path for future researchers to build upon this work. The authors should consider releasing their dataset or providing a public version to enhance reproducibility.
Key limitations include the model's restriction to generating 10-second audio clips and the lack of advanced speaker control in TTS applications. Additionally, the performance in terms of speech intelligibility lags behind specialized TTS systems, indicating room for improvement. The reliance on a private dataset also poses challenges for reproducibility and broader accessibility.
The implications of this work are significant, as it paves the way for more integrated audio generation systems that can produce realistic and contextually coherent audio scenes. This could have applications in various fields, including film production, gaming, virtual reality, and assistive technologies. The ability to generate complex audio scenes from simple text prompts could also enhance user experiences in interactive media.
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. End-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.
Primary: Future Living Lab, Alibaba
All Institutions: Future Living Lab, Alibaba
The paper presents VoiceGiraffe, a pioneering benchmark for evaluating hour-scale audio understanding in LALMs, addressing critical gaps in existing evaluation protocols. The comprehensive methodology and experimental results underscore the pressing need for advancements in long-context audio processing and reasoning, positioning this work as a significant contribution to the field.
The paper introduces a novel benchmark, VoiceGiraffe, designed specifically for evaluating long-context audio-language models (LALMs) in realistic scenarios. The methodology is robust, employing a dual-level taxonomy for question generation that captures both single-hop and multi-hop reasoning tasks. The data curation process is thorough, involving a multi-stage pipeline that includes voice activity detection, hierarchical captioning, and collaborative verification by human annotators. This rigorous approach ensures high-quality data for evaluation, addressing the limitations of existing benchmarks that rely on short clips or concatenated segments.
The experimental evaluation is comprehensive, benchmarking a wide range of LALMs against human performance across various tasks and inference paradigms. The results reveal significant challenges in long-context understanding, with only one proprietary model surpassing human performance. The findings highlight the limitations of current models in memory persistence and reasoning capabilities, providing valuable insights into areas for future research. The use of multiple inference settings (E2E, cascaded caption aggregation, and reasoning-enhanced cascading) allows for a nuanced understanding of model performance.
While the paper outlines a detailed methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the study or build upon the findings.
The primary limitations include the lack of a publicly available dataset or benchmark for other researchers to use, which could hinder wider adoption and validation of the proposed methods. Additionally, the paper acknowledges that even human annotators found the tasks challenging, indicating that the benchmark may be too difficult for current models. There is also a potential bias in language performance, as the models exhibited varying capabilities across English and Chinese inputs.
The introduction of VoiceGiraffe has the potential to significantly advance the field of audio-language understanding by providing a rigorous evaluation framework that addresses real-world challenges. This benchmark can guide future research towards developing models with improved long-context reasoning and memory capabilities, which are essential for applications in audio assistants, automated transcription, and multimedia content analysis.
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, failing to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
Primary: Zhejiang University
All Institutions: Zhejiang University, Bytedance
The paper presents a comprehensive benchmarking framework for long-form speech generation, addressing critical gaps in existing evaluation methodologies. Its innovative approach, rigorous methodology, and extensive experimental validation contribute significantly to the advancement of the field, providing a valuable resource for future research.
The paper introduces SwanBench-Speech, a comprehensive benchmark for evaluating long-form speech generation models. It effectively addresses the limitations of existing evaluation methods by proposing a multi-dimensional framework that includes seven disentangled metrics across three core challenges: acoustics, semantics, and expressiveness. The methodology is well-structured, with a clear focus on real-world applications and the incorporation of human-aligned metrics, which enhances the relevance of the evaluation. The use of diverse scenarios and a rigorous data collection process further strengthens the methodology.
The experiments are extensive, involving over 20 models evaluated across 1,101 samples in 17 scenarios. The results provide valuable insights into the performance gaps of current models compared to human recordings, particularly in expressiveness and consistency. The use of both objective metrics and human evaluations adds robustness to the findings. However, while the experiments are thorough, the paper could benefit from more detailed statistical analyses to quantify the significance of the results.
The paper provides a clear description of the data collection and evaluation processes, along with the metrics used. The open-sourcing of the benchmark and the availability of evaluation scripts enhance reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other systems.
The study acknowledges limitations, including a narrow linguistic scope (only Chinese and English) and a lack of robustness in assessing emotional and stylistic transitions. Additionally, the dataset's speaker diversity is limited, which may introduce bias in evaluations. Future work should address these gaps to enhance the benchmark's applicability.
This work has significant implications for the field of speech synthesis, particularly in enhancing the evaluation of long-form speech generation systems. By establishing a standardized benchmark, it paves the way for future research and development in this area, potentially leading to more immersive and expressive speech synthesis applications. The focus on real-world scenarios and human-aligned metrics also suggests potential applications in education, entertainment, and customer service.
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. Meanwhile, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Codes and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, The Chinese University of Hong Kong
This paper presents EigeNet, a novel geometry-informed multi-modal learning framework that significantly advances few-shot novel view RIR prediction through innovative architectural designs and empirical validation. The comprehensive approach to integrating geometric features with acoustic modeling represents a meaningful contribution to the field of spatial audio rendering.
The proposed methodology introduces a Cross-view Alternate-attention Transformer (CVAT) that effectively captures both local intra-view and global cross-view relationships, addressing the challenges of few-shot Room Impulse Response (RIR) prediction. The integration of a geometry-informed modulation block enhances the model's ability to leverage geometric features, which is a significant advancement over existing methods. The auxiliary loss for multi-task learning further strengthens the model's performance by promoting generalizability across different architectures.
The experiments are robust, utilizing both simulated and real-world datasets, and demonstrate state-of-the-art performance across various metrics. The ablation studies provide clear evidence of the contributions of each component, validating the effectiveness of the proposed architecture. The quantitative results indicate substantial improvements over baseline methods, particularly in sparse reference scenarios.
The paper provides sufficient implementation details, including architecture specifications and training configurations, which should facilitate reproducibility. The availability of code and checkpoints on GitHub enhances this aspect, although specific hyperparameters and training procedures could be elaborated further for clarity.
While the model shows impressive performance, it may still be limited by the quality of the input data and the assumptions made regarding room geometry. The reliance on geometric features may not generalize well to all acoustic environments, particularly those with complex or unconventional geometries.
The advancements in few-shot learning for RIR prediction have significant implications for immersive audio applications in AR/VR and spatial audio rendering, potentially enhancing user experiences in virtual environments. The methodology could inspire further research into integrating geometric and acoustic modeling in other domains.
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, ModelBest Inc.
The paper presents LoSATok, a unified low-dimensional tokenizer that enhances audio understanding and generation by effectively compressing high-dimensional semantic representations while preserving essential acoustic details. The methodology and results demonstrate its potential to significantly impact the field of audio processing and generation.
The paper introduces a novel low-dimensional audio tokenizer, LoSATok, which effectively compresses high-dimensional semantic representations while maintaining semantic richness and acoustic details. The methodology includes the Semantic Bottleneck (SemBo) for dimensionality reduction, and a dual-level semantic supervision strategy that enhances the learning process. The proposed time-relation loss is a significant innovation that ensures temporal consistency in the representations. Overall, the methodology is well-structured and addresses a critical gap in current audio modeling approaches.
The experiments are comprehensive, covering various audio tasks across speech, music, and general audio domains. The results demonstrate that LoSATok achieves competitive performance in understanding tasks and outperforms existing models in generation tasks, particularly in terms of efficiency and quality. The use of objective metrics (e.g., FAD, CLAP) alongside subjective evaluations strengthens the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art methods in a broader range of tasks.
The paper provides a GitHub repository with the code, which is essential for reproducibility. However, specific implementation details, such as hyperparameter choices and training setups, could be more clearly outlined to facilitate replication by other researchers.
The authors acknowledge that LoSATok sacrifices some reconstruction fidelity for improved semantic organization and generative performance. Additionally, while it shows promise in understanding tasks, it does not fully reach the performance of high-dimensional semantic representations. Future work is needed to optimize the balance between semantics, acoustics, and generation.
The proposed tokenizer has significant implications for audio understanding and generation, potentially enhancing applications in speech recognition, music generation, and audio synthesis. By enabling more efficient models, it could lead to advancements in real-time audio processing and interactive applications. The research also opens avenues for further exploration of low-dimensional representations in multimodal contexts.
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
Primary: Renmin University of China
All Institutions: Renmin University of China
The main contribution of this paper is the introduction of PlanAudio, a unified framework for generating complex audio compositions from free-form text prompts, which significantly advances the state-of-the-art in audio synthesis by integrating semantic understanding with acoustic generation. The methodology is innovative, the experiments are rigorous, and the potential applications are broad, marking a meaningful contribution to the field of machine learning and audio generation.
The proposed methodology, PlanAudio, introduces a novel framework for generating unified audio from free-form text prompts, leveraging an autoregressive LLM architecture and a semantic latent Chain-of-Thought (CoT) mechanism. This approach is innovative as it avoids traditional text encoders and explicit text rewriting, which are common in existing models. The integration of semantic planning in the latent space before audio synthesis is a significant advancement, allowing for better alignment between high-level semantics and low-level audio generation. The methodology is well-structured, with clear phases for semantic planning and acoustic generation, which enhances the model's ability to produce coherent audio outputs.
The experiments are comprehensive, evaluating PlanAudio across multiple scenarios (sound, speech, and composite) using both objective metrics (FAD, KL divergence, WER) and subjective assessments (human ratings on acoustic quality, temporal correctness, etc.). The results demonstrate that PlanAudio outperforms existing pipeline and unified models, showcasing its versatility and effectiveness. The creation of PlanAudio-Bench as a specialized benchmark for composite audio scenarios adds value to the evaluation process, providing a structured way to assess the model's performance in real-world applications.
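Since WER is one of the objective metrics named above, a minimal sketch of how it is conventionally computed may help readers interpret the speech results. It uses the standard definition (word-level edit distance divided by the number of reference words); the lowercasing and whitespace tokenization are assumptions made here for illustration, and the paper's own text normalization is not described in this review.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference length."""
    ref = reference.lower().split()  # assumed normalization: lowercase + split
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


print(wer("a b c d", "a x c"))  # one substitution + one deletion over 4 words -> 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.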
The paper describes its implementation in detail, including the datasets used, training procedures, and evaluation metrics. However, the lack of a publicly available demo or project URL limits the reproducibility of the results. While the methodology is clearly described, access to the code and trained models would enhance the ability of other researchers to replicate the findings.
One limitation is the potential for the model to struggle with highly complex prompts that require intricate audio interactions, as indicated by the slight performance drop in speech generation compared to specialized models. Additionally, the reliance on the quality of the training data and the inherent challenges in synthesizing audio from free-form text prompts may introduce variability in performance across different contexts.
The implications of this research are significant for various applications, including content creation, game development, and assistive technologies for individuals with speech impairments. By enabling the generation of coherent audio from natural language prompts, this work could facilitate new forms of human-computer interaction and enhance multimedia experiences.
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
Primary: Daydream
All Institutions: Daydream
The main contribution of this paper is the introduction of DEMON, a real-time diffusion engine that allows for interactive control of audio generation, significantly enhancing the responsiveness and flexibility of music production tools. The technical contributions are robust, addressing key challenges in real-time audio processing and demonstrating a clear advancement in the field of machine learning for audio.
The methodology presented in the paper is innovative, leveraging a real-time diffusion engine that transforms the denoising process into a playable musical instrument. The authors introduce several mechanisms that enhance the responsiveness and control of audio generation, including per-slot heterogeneous denoise scheduling, shared mutable per-step state, per-frame source blending, and a windowed VAE decode. These contributions are well-structured and address significant challenges in real-time audio generation, particularly in maintaining high throughput while allowing for fine-grained control over audio parameters.
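To make the idea of per-slot heterogeneous denoise scheduling concrete, the following toy sketch shows a ring of in-flight requests in which each slot carries its own timestep schedule. It is not DEMON's implementation: the class names, the linear schedule in make_schedule, and the one-admission-per-tick policy are all assumptions chosen for illustration. The point is only that a slider change affects newly admitted slots while in-flight slots keep the schedule they started with, so nothing has to be discarded.

```python
from dataclasses import dataclass
from typing import List, Tuple


def make_schedule(denoise: float, steps: int) -> List[float]:
    """Hypothetical linear schedule: `steps` timesteps descending from `denoise`."""
    return [denoise * (steps - i) / steps for i in range(steps)]


@dataclass
class Slot:
    request_id: int
    schedule: List[float]  # owned by this slot, fixed at admission time
    step: int = 0


class DenoiseRing:
    """Toy ring buffer: one request enters per tick, each finishes after `depth` ticks."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.slots: List[Slot] = []  # in-flight requests, oldest first
        self.next_id = 0

    def tick(self, denoise: float) -> Tuple[List[Tuple[int, float]], List[int]]:
        # Admit a new request using the slider value seen at admission.
        self.slots.append(Slot(self.next_id, make_schedule(denoise, self.depth)))
        self.next_id += 1
        # One batched solver step: every slot reads its *own* current timestep.
        used = [(s.request_id, s.schedule[s.step]) for s in self.slots]
        for s in self.slots:
            s.step += 1
        done = [s.request_id for s in self.slots if s.step == self.depth]
        self.slots = [s for s in self.slots if s.step < self.depth]
        return used, done


ring = DenoiseRing(depth=4)
for _ in range(3):
    ring.tick(denoise=1.0)
used, done = ring.tick(denoise=0.5)  # slider moved while three requests are in flight
print(used)  # requests 0-2 continue on their 1.0-based schedules; request 3 starts at 0.5
print(done)  # request 0 completes on this tick
```

With a single global schedule, the same slider move would change the timesteps of requests 0 to 2 mid-trajectory, which is why the abstract describes the upstream design as having to rebuild and discard the queue.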
The experimental evaluation is thorough, with a focus on latency, output quality, and responsiveness of parameter changes. The authors provide empirical results that substantiate their claims regarding the effectiveness of their proposed mechanisms, including quantitative comparisons with existing systems. The use of various audio sources and the detailed reporting of metrics such as CLAP and SNR demonstrate a rigorous approach to validating the system's performance.
The paper includes sufficient detail regarding the architecture and implementation of the DEMON system, including the use of TensorRT for acceleration and the specific configurations used for experiments. However, the absence of a detailed description of the datasets and the evaluation metrics used may pose challenges for complete reproducibility. The provided URLs for the project and demo enhance accessibility to the code and results.
One limitation of the paper is the reliance on a specific hardware setup (NVIDIA RTX 5090) for performance metrics, which may not generalize across different systems. Additionally, while the authors address the latency of their system, the practical implications of the onset latency in live performance contexts could be further explored. The paper does not discuss potential limitations in the quality of audio generated under varying conditions or the scalability of the system.
The work has significant implications for the fields of music generation and real-time audio processing, particularly for live performances. By enabling musicians to manipulate denoising parameters in real-time, DEMON opens up new avenues for creative expression and interaction with AI-generated music. The integration of machine learning into musical instruments could lead to innovative performance practices and new genres of music.
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the Audio-Mind framework, which enhances audio understanding through dynamic evidence acquisition and improved reasoning processes. This work is significant as it addresses key challenges in the field and proposes a method that could lead to more reliable audio question answering systems.
The proposed Audio-Mind framework introduces a novel approach to audio understanding by integrating a strong frontend with planner-guided tool use. This method allows for dynamic evidence acquisition, which is a significant improvement over existing audio-agent baselines. The framework's ability to preserve the frontend's judgment while addressing evidence gaps is a noteworthy contribution to the field, as it enhances the overall reasoning process in audio question answering.
The experiments conducted on MMAR and MSU-Bench demonstrate the effectiveness of Audio-Mind, achieving impressive accuracy scores of 80.4% and 82.8%, respectively. The matched-backbone comparison further validates the framework's design by highlighting the orchestration bottleneck in agentic decomposition under strong audio frontends. However, the paper lacks detailed descriptions of the datasets and evaluation metrics used, which could enhance the transparency and reproducibility of the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Without access to the framework or clear guidelines on how to replicate the experiments, it is challenging for other researchers to validate the findings.
One limitation is the potential complexity introduced by the planner-guided tool use, which may not generalize well to all audio understanding tasks. Additionally, the framework's reliance on strong frontends could limit its applicability in scenarios where such models are not available.
The Audio-Mind framework has the potential to significantly impact the field of audio understanding and question answering by providing a more reliable and auditable reasoning process. Its contributions could lead to advancements in audio-QA annotation and error analysis, making it a valuable tool for researchers and practitioners in the domain.
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
Primary: Beijing Technology and Business University
All Institutions: University of Sydney, Beijing Technology and Business University, Xidian University, Tongji University
The paper presents EventSpeech, a pioneering framework that utilizes neuromorphic events for expressive speech generation, significantly advancing the state of the art in multimodal speech synthesis. The innovative approach and robust experimental validation position this work as a substantial contribution to the field, addressing key limitations of existing methods and opening new avenues for research and application.
The proposed EventSpeech framework introduces a novel architecture that leverages neuromorphic events for speech generation, addressing the limitations of traditional RGB-based methods. The integration of a dedicated Event Encoder and a multi-scale Audio Encoder, along with a bidirectional alignment mechanism, demonstrates a sophisticated approach to synchronizing visual and auditory modalities. The methodology is well-structured, with a clear focus on addressing the Temporal Granularity Mismatch, and the use of specialized components like the Hierarchical Wavelet Contextualizer (HWC) enhances the model's ability to capture fine-grained emotional nuances in speech.
The paper presents extensive evaluations on the EVT-SPK benchmark, which is a significant contribution to the field as it includes both synthetic and real-world datasets. The results indicate that EventSpeech outperforms state-of-the-art methods across various metrics, showcasing its robustness in handling rapid articulation and subtle facial dynamics. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings.
The paper provides implementation details, including the training setup and optimization strategies, which are crucial for reproducibility. The abstract also points to a project page for code and a demo; how complete that release is will determine how easily other researchers can replicate the results.
The EVT-SPK benchmark's limited scale and the reliance on simulated events may restrict the model's generalization capabilities. Additionally, the paper acknowledges the challenges associated with capturing complex physical sensor noise in real-world scenarios, which could affect performance.
The introduction of neuromorphic events for speech generation has the potential to revolutionize multimodal speech synthesis, enabling more expressive and natural-sounding speech. This could have applications in various domains, including virtual assistants, entertainment, and accessibility technologies.
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder--LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
Primary: University of Southern California
All Institutions: University of Southern California
The main contribution of this paper is the introduction of VoxParadox, a benchmark that effectively isolates and evaluates the paralinguistic understanding of Audio LLMs, alongside innovative methods to enhance model performance in this domain. The work is significant as it addresses a critical gap in the capabilities of current Audio LLMs and proposes actionable solutions that could lead to more robust multimodal systems.
The paper introduces VoxParadox, a novel adversarial benchmark designed to evaluate the paralinguistic understanding of Audio LLMs by creating controlled linguistic-acoustic contradictions. The methodology is robust, employing a systematic approach to generate adversarial examples and utilizing layer-wise probing to diagnose model limitations. The proposed Prompt-Conditioned Layer Mixer (PCLM) is a significant innovation that adaptively combines information from multiple audio layers based on the input prompt, addressing identified bottlenecks in model performance.
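As a rough illustration of what "adaptively combines information from multiple audio layers based on the input prompt" can mean, the sketch below computes softmax weights over encoder layers from a prompt embedding and returns the weighted sum of per-layer features. The linear gate matrix and all function names are assumptions, not the paper's architecture. The second function is the standard DPO objective for a single preference pair, which the abstract says is used to prefer acoustically supported options; the beta value is an arbitrary placeholder.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_feats: List[List[float]],
               prompt_emb: List[float],
               gate: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Prompt-conditioned layer mixing (hypothetical linear gate).

    layer_feats: one feature vector per audio-encoder layer, shape [L][D]
    prompt_emb:  prompt embedding, shape [P]
    gate:        assumed learned matrix mapping the prompt to layer logits, shape [L][P]
    """
    logits = [sum(w * p for w, p in zip(row, prompt_emb)) for row in gate]
    weights = softmax(logits)
    dim = len(layer_feats[0])
    mixed = [sum(weights[l] * layer_feats[l][d] for l in range(len(layer_feats)))
             for d in range(dim)]
    return weights, mixed


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log sigmoid(beta * margin)."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


feats = [[1.0, 0.0], [0.0, 1.0]]  # two layers, two feature dimensions
weights, mixed = mix_layers(feats, prompt_emb=[1.0], gate=[[10.0], [0.0]])
print(weights, mixed)  # this prompt pushes nearly all weight onto layer 0
```

In the benchmark's setting, the "chosen" response would be the acoustically supported option and the "rejected" response the transcript-implied one.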
The experiments are comprehensive, evaluating a diverse set of Audio LLMs against the VoxParadox benchmark. The results demonstrate a clear performance gap in paralinguistic tasks, with models showing a tendency to rely on transcript-implied answers rather than acoustic evidence. The paper provides detailed metrics, including ground truth accuracy and adversarial-label agreement, which effectively illustrate the models' weaknesses and the improvements achieved through the proposed methods.
The paper includes sufficient detail regarding the experimental setup, data generation pipeline, and evaluation metrics, which supports reproducibility. However, the implementation specifics of the PCLM and DPO methods could benefit from additional clarity to ensure that other researchers can replicate the results accurately.
The authors acknowledge that PCLM is a post-hoc solution and that the degradation of paralinguistic information in deeper layers and at the encoder-LLM interface presents inherent limitations. Additionally, while VoxParadox serves as a controlled stress test, it may not fully capture the complexities of naturalistic speech scenarios. The reliance on TTS-generated audio also raises questions about the generalizability of the findings.
The research has significant implications for improving speech-based interfaces and accessibility technologies, enhancing the ability of Audio LLMs to interpret non-verbal cues accurately. However, the potential for misuse in profiling and surveillance contexts necessitates careful consideration of ethical implications and the establishment of safeguards in deployment.
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neural speech codec that operates entirely in the modified discrete cosine transform (MDCT) domain. CFMDCTCodec integrates a lightweight encoder-quantizer-decoder-style MDCT-spectral codec with a noise-prior-aware, conditional-flow-matching (CFM)-based MDCT-spectral enhancer. Within this framework, the codec serves as a base module that compactly discretizes the MDCT spectrum extracted from speech and produces an initial coarse reconstruction, while the enhancer further restores fine-grained spectral details. The enhancer improves the decoded MDCT spectrum by integrating a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver, under the guidance of an MDCT-derived magnitude-adaptive noise prior, aiming to emphasize perceptually significant high-energy regions while stabilizing low-energy and silent regions. Finally, the enhanced MDCT spectrum is reconstructed into the decoded speech using the inverse MDCT. When optimizing CFMDCTCodec, we adopt a unified non-adversarial training strategy that jointly combines reconstruction, quantization and CFM objectives. Both objective and subjective evaluations show that CFMDCTCodec outperforms competitive baselines in low-bitrate regimes, e.g., 0.65 kbps, while approaching the perceptual quality of large-scale codecs with significantly fewer parameters and computations.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Tsinghua University
The main contribution of this paper is the development of CFMDCTCodec, a low-bitrate neural speech codec that effectively enhances spectral quality through a novel conditional flow matching approach, demonstrating significant improvements in speech quality while maintaining low computational complexity. This work represents a meaningful advancement in the field of speech coding, particularly for applications requiring efficient bandwidth usage without compromising audio fidelity.
The proposed CFMDCTCodec introduces a novel architecture for low-bitrate speech coding that operates entirely in the MDCT domain, integrating a single-codebook quantization strategy with a noise-prior-aware conditional flow matching (CFM) enhancement mechanism. This approach effectively addresses the limitations of existing codecs by enhancing the spectral quality of decoded speech without increasing bitrate, utilizing a joint training strategy that simplifies the learning process. The methodology is well-structured, with clear descriptions of the encoder, decoder, and enhancer components, and the use of ordinary differential equations (ODE) for state evolution is particularly innovative.
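The role of the ODE solver in a flow-matching enhancer can be sketched with a fixed-step Euler integrator: starting from a sample of the noise prior, the state is moved along a velocity field from t = 0 to t = 1. The code below is a toy under stated assumptions, not the paper's enhancer: the velocity field is an analytic stand-in for the learned network (it is the exact field for a straight-line path to a known clean vector and ignores its conditioning input), the choice of Euler and the step count are assumptions, and the magnitude-adaptive prior is replaced by a fixed starting vector.

```python
from typing import Callable, List

Vector = List[float]
# Signature assumed for illustration: velocity(x, t, cond) -> dx/dt
VelocityField = Callable[[Vector, float, Vector], Vector]


def euler_sample(x0: Vector, cond: Vector,
                 velocity: VelocityField, steps: int = 8) -> Vector:
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(clean: Vector) -> VelocityField:
    """Stand-in for a trained network: exact field of the straight path to `clean`."""
    def velocity(x: Vector, t: float, cond: Vector) -> Vector:
        # `cond` (the coarse decoded spectrum) is unused in this toy field.
        return [(c - xi) / (1.0 - t) for xi, c in zip(x, clean)]
    return velocity


coarse = [0.8, 0.4, -0.2]    # coarse codec output, used as conditioning
clean = [1.0, 0.5, -0.25]    # target spectrum the toy field points toward
start = [0.3, -0.2, 0.0]     # stand-in for a draw from the noise prior
enhanced = euler_sample(start, coarse, make_toy_velocity(clean), steps=8)
print(enhanced)
```

In a trained system the field would be a network that sees only the current state, the time, and the coarse spectrum, and the number of solver steps trades quality against decode cost.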
The experimental setup is robust, utilizing two different speech corpora and multiple bitrate settings to evaluate the codec's performance. The paper provides both objective and subjective evaluation metrics, including MUSHRA tests and various objective measures (STOI, SI-SDR, etc.), which demonstrate the codec's superiority over competitive baselines. The results indicate significant improvements in speech quality at low bitrates, validating the effectiveness of the proposed enhancements.
The paper includes detailed descriptions of the experimental setup, including hyperparameters, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers.
One limitation is the reliance on a single-codebook quantization strategy, which may not capture the full diversity of speech signals as effectively as multi-codebook approaches. Additionally, while the results are promising, further testing across a wider range of speech datasets and real-world scenarios would strengthen the findings.
The CFMDCTCodec has significant potential applications in bandwidth-constrained environments such as satellite communications, teleconferencing, and mobile applications, where high-quality speech transmission is critical. Its lightweight design and efficient processing could facilitate broader adoption in various speech processing applications, contributing to advancements in telecommunications and accessibility technologies.
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
Primary: CUHK MMLab
All Institutions: CUHK MMLab, SJTU, NTU, McMaster, CityUHK, JUFE
The paper presents OmniInteract, a benchmark for evaluating omnimodal large language models in real-time audio-visual interactions, significantly advancing the assessment of AI capabilities in dynamic environments. The innovative methodology and comprehensive experimental evaluations highlight critical gaps in current models, paving the way for future research and development in this area.
The methodology introduces a novel interaction slot formulation that captures real-time, multimodal interactions in a continuous audio-visual stream. This approach is innovative as it shifts the evaluation paradigm from static question-answer pairs to dynamic, temporally grounded interactions, allowing for a more realistic assessment of model capabilities in real-time settings. The proposed metrics (IA-QTF1, IDS, NCCS) are well-defined and tailored to the unique challenges of streaming interactions, effectively measuring not just correctness but also timing and context management.
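The slot formulation can be pictured as a small data structure plus a scoring rule that rewards answers only when they are both correct and on time. The sketch below is a simplified stand-in and not the benchmark's IA-QTF1, whose exact definition is not given in this review: the field names, the exact-match correctness check, and the plain precision/recall F1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Slot:
    trigger_s: float              # time at which the query or sound event occurs
    window: Tuple[float, float]   # (start, end) in seconds for a valid response
    target: str                   # expected answer


@dataclass(frozen=True)
class Response:
    time_s: float
    text: str


def timed_f1(slots: List[Slot], responses: List[Response]) -> float:
    """F1 where a response counts only if it is correct AND inside the slot window."""
    used = set()
    matched = 0
    for slot in slots:
        lo, hi = slot.window
        for i, r in enumerate(responses):
            if i in used:
                continue
            on_time = lo <= r.time_s <= hi
            correct = r.text.strip().lower() == slot.target.strip().lower()
            if on_time and correct:
                used.add(i)
                matched += 1
                break
    precision = matched / len(responses) if responses else 0.0
    recall = matched / len(slots) if slots else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


slots = [Slot(1.0, (1.0, 4.0), "red"), Slot(10.0, (10.0, 13.0), "two")]
responses = [Response(2.5, "Red"),      # correct and on time
             Response(15.0, "two"),     # correct but late
             Response(20.0, "hello")]   # spurious output
print(timed_f1(slots, responses))
```

Late answers and spurious outputs both lower precision here, which mirrors the review's point that the metrics measure timing and invalid outputs as well as correctness.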
The experiments are comprehensive, evaluating multiple state-of-the-art omnimodal models under the new benchmark. The results reveal significant gaps in current models' abilities to handle real-time interactions, particularly in continuous task monitoring and nested query scenarios. The use of a diverse dataset of 250 videos with 1,430 response slots provides a solid foundation for the evaluations, although the performance scores indicate that there is considerable room for improvement in the models tested.
The paper mentions that the code and datasets will be made publicly accessible, which is crucial for reproducibility. However, details on the exact implementation of the models tested and the specific evaluation protocols could be elaborated upon to enhance reproducibility further.
The paper acknowledges limitations such as the narrow focus on specific interaction types and the reliance on synthesized speech for the 1QnA split. Additionally, the benchmark currently covers only Chinese and English scenarios, which may limit its applicability across different languages and cultures. The analysis is also limited to a small number of models, which may not represent the full landscape of omnimodal systems.
The introduction of OmniInteract has the potential to significantly advance the field of real-time human-AI interaction by providing a standardized benchmark for evaluating omnimodal models. This can lead to improved AI assistants that are more capable of understanding and responding to user queries in real-time, enhancing applications in accessibility, education, and everyday tasks. The focus on real-time interaction also raises important considerations regarding privacy and the ethical deployment of always-on systems.
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
Primary: Amap, Alibaba Group
All Institutions: Amap, Alibaba Group, The Chinese University of Hong Kong, Shenzhen
The main contribution of this paper is the introduction of PilotTTS, a lightweight TTS system that leverages rigorous data engineering and a compact architecture to match or outperform, on the Seed-TTS Eval benchmark, systems trained on significantly larger datasets. This work is significant as it addresses the barriers faced by resource-constrained teams in the field of speech synthesis, providing a practical solution that maintains high performance while promoting reproducibility and accessibility.
The methodology is robust, featuring a well-structured multi-stage data processing pipeline that enhances data quality and a compact autoregressive architecture that effectively decouples speaker identity from style. The use of Q-Former-based conditioning and cross-sample paired training is innovative and addresses common challenges in TTS systems.
The experiments are comprehensive, using the Seed-TTS Eval benchmark to report the lowest WER on test-en (1.50%), a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815). The inclusion of human evaluations for emotion control and paralinguistic synthesis adds depth to the assessment of the system's capabilities.
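For readers unfamiliar with the two intelligibility metrics, the sketch below shows the standard definitions of WER and CER as edit distance divided by reference length. It is a minimal illustration under stated assumptions: whitespace tokenization for WER and space-stripped characters for CER are simplifications chosen here, and the benchmark's own transcription and text-normalization steps are not reproduced.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance, spaces ignored."""
    ref = list(reference.replace(" ", ""))
    hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only; not benchmark data.
print(wer("the cat sat on mat", "the cat sit on mat"))
print(cer("你好世界", "你好世间"))
```

In practice the hypothesis text would come from an ASR system transcribing the synthesized audio, so the reported error rates also depend on that recognizer.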
The paper emphasizes reproducibility by providing a complete data processing pipeline built from publicly available tools, along with pretrained weights and code. This transparency enhances the likelihood of other researchers replicating the results.
The paper acknowledges limitations such as insufficient explicit style modeling and the constraints of single-codebook quantization, which may hinder performance in more complex scenarios. Additionally, the reliance on mel-spectrograms could introduce reconstruction artifacts.
The potential applications of PilotTTS are significant, particularly for resource-constrained teams seeking to develop competitive TTS systems. Its modular approach and open-source nature could democratize access to high-quality speech synthesis technology.
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, The University of Auckland, UNSW Sydney, KAIST
The paper provides a systematic investigation into the mechanisms underlying acoustic memory in long-context audio-language models, revealing critical insights into representational drift and attention dynamics that can inform future research and model design.
The methodology is robust, introducing the EnvMem framework to systematically analyze the retention of acoustic information in multi-turn interactions. The authors employ a combination of controlled experiments, linear probing, and attention analysis to dissect the representation and retrieval mechanisms in LALMs. The use of synthetic dialogues and a clear structure for the evaluation tasks enhances the clarity of the experimental design. However, the reliance on synthetic data may limit the generalizability of the findings to real-world scenarios.
The experiments are comprehensive, evaluating multiple LALMs across various context lengths. The results demonstrate a clear performance gap between semantic and acoustic memory, with detailed analyses of representational drift and attention allocation. The use of metrics like accuracy and relative degradation provides a solid basis for comparison, although the paper could benefit from additional qualitative assessments of model outputs.
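The review does not spell out how relative degradation is computed. One common reading, assumed here purely for illustration, is the fraction of short-context accuracy that is lost at a longer context; the function name and the example numbers below are hypothetical and not taken from the paper.

```python
def relative_degradation(baseline_acc: float, long_context_acc: float) -> float:
    """Fraction of baseline accuracy lost at a longer context (assumed definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - long_context_acc) / baseline_acc


# Made-up accuracies for two memory types, to show why the normalization matters:
# the same absolute drop weighs more when the baseline is lower.
semantic = relative_degradation(baseline_acc=0.90, long_context_acc=0.80)
acoustic = relative_degradation(baseline_acc=0.60, long_context_acc=0.50)
print(f"semantic: {semantic:.3f}, acoustic: {acoustic:.3f}")
```

Normalizing by the baseline makes models and task types with different starting accuracies comparable, which is presumably why such a metric is reported alongside raw accuracy.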
The paper provides detailed descriptions of the experimental setup, including dataset construction and evaluation protocols. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the EnvMem benchmark and associated models to facilitate further research in this area.
The primary limitation is the use of synthetic data, which may not capture the complexities of natural conversations. Additionally, the interventions are post-hoc and may not translate to practical solutions for improving acoustic memory in deployed models. The study also acknowledges potential ethical concerns regarding privacy and surveillance in real-world applications.
This research has significant implications for the development of more robust audio language models, particularly in applications requiring persistent awareness of environmental sounds. By highlighting the representational bottlenecks in LALMs, the findings can guide future training strategies and benchmark designs, ultimately improving the integration of acoustic memory in multimodal systems.
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we propose a training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in the training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
The main contribution of this paper is the introduction of MERIT, a framework that effectively disentangles musical dimensions for improved audio similarity assessment. This work significantly advances the state of music representation learning by providing a novel approach that enhances interpretability and user control in music similarity queries.
The methodology presented in MERIT is innovative, focusing on disentangled representations of music based on melody, rhythm, and timbre. The use of a frozen MERT backbone combined with a novel triplet construction strategy allows for effective training on isolated musical dimensions without manual labeling. The approach of leveraging generative models for creating training data is particularly noteworthy, as it addresses the challenge of entangled real-world audio data. The Circle Loss optimization technique further enhances the training process by focusing on hard negatives, which is a sound choice for improving representation quality.
The experiments are well-structured, utilizing both internal and external evaluations to assess the model's performance. The use of zero-shot probes on independent datasets demonstrates the generalizability of the learned representations. The results indicate strong factor-wise disentanglement, with high accuracy in distinguishing between the different musical dimensions. The human evaluation of triplet quality adds a valuable subjective perspective to the findings, reinforcing the model's effectiveness. Overall, the experimental design is robust and provides compelling evidence of the framework's capabilities.
The paper provides sufficient details regarding the architecture, training procedures, and datasets used, which supports reproducibility. The authors have made the code and pre-trained models publicly available, further facilitating replication of their results. However, the reliance on specific datasets like MoisesDB and the generative model JASCO may limit reproducibility if these resources are not accessible to all researchers.
Some limitations are acknowledged, such as the focus on only three musical dimensions (melody, rhythm, and timbre), which may overlook other important aspects like harmony and dynamics. Additionally, the operationalization of timbre at the instrument-class level may not capture within-class variations adequately. The authors also mention potential biases from the training data that could affect the model's performance in real-world scenarios.
The implications of MERIT are significant for music information retrieval, recommendation systems, and music analysis tools. By enabling users to query music based on specific dimensions, it enhances user control and interpretability, which can lead to more personalized music experiences. The framework could also inspire further research into disentangled representations in other domains, potentially influencing broader applications in audio processing and machine learning.