We present UNISON, a latent diffusion framework that unifies speech generation, sound generation, and audio editing within a single model. The same set of weights handles text-to-audio, text-to-speech, zero-shot speaker cloning, mixed speech-and-sound generation, scene-level audio editing, speech-in-scene editing, and timed temporal composition. Our architecture features two core designs: (1) layer-wise deep LLM fusion, which injects hidden states from uniformly sampled layers of a frozen MLLM into corresponding MM-DiT blocks via learned projections, providing depth-matched semantic conditioning that improves instruction following over single-layer baselines; and (2) a unified multi-task architecture where task identity is encoded solely by a channel-wise mask and source audio is provided through VAE-encoded channel concatenation. Training is stabilized by an online GPU-side multi-task data synthesis pipeline with task-homogeneous batching and a two-stage curriculum. With 621M--732M trainable parameters, UNISON achieves results competitive with or exceeding task-specialist models across evaluated domains, while being roughly $4\times$ smaller than comparable unified systems.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, The Hong Kong Polytechnic University, City University of Hong Kong, The Hong Kong University of Science and Technology, Tsinghua University, Huawei Research Hong Kong
UNISON presents a unified framework for speech generation, sound generation, and audio editing through deep LLM fusion, significantly advancing the state of multimodal audio processing. The methodology effectively combines diverse tasks into a single model while demonstrating competitive performance against specialized systems, showcasing the potential for more efficient and scalable audio generation solutions.
The proposed methodology in UNISON is innovative, featuring a unified architecture that integrates multiple audio generation and editing tasks into a single framework. The use of layer-wise deep LLM fusion for semantic conditioning is a significant advancement over existing models that typically rely on single-layer conditioning. The architecture's ability to handle diverse tasks with a shared latent space and a single set of weights demonstrates a thoughtful approach to reducing complexity and enhancing cross-task knowledge transfer. The online multi-task data synthesis pipeline and curriculum training further contribute to the robustness of the training process, ensuring stability and effectiveness in learning.
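The depth-matched pairing described above can be illustrated with a small sketch. It only shows one plausible way to pick evenly spaced LLM layers for a given number of diffusion blocks; the function name `uniform_layer_indices`, the inclusive first-to-last spacing, and the layer counts are assumptions for illustration, not details confirmed by the paper.

```python
def uniform_layer_indices(num_llm_layers: int, num_dit_blocks: int) -> list[int]:
    """Pick one LLM layer index per DiT block, evenly spaced over the LLM depth.

    Assumption: spacing runs from the first layer (index 0) to the last layer
    inclusive; the paper's exact sampling rule is not given in this summary.
    """
    if num_llm_layers < 1 or num_dit_blocks < 1:
        raise ValueError("layer and block counts must be positive")
    if num_dit_blocks == 1:
        # With a single block, fall back to the deepest layer.
        return [num_llm_layers - 1]
    step = (num_llm_layers - 1) / (num_dit_blocks - 1)
    return [round(i * step) for i in range(num_dit_blocks)]


# Hypothetical sizes: a 28-layer LLM feeding 4 diffusion blocks.
indices = uniform_layer_indices(28, 4)
print(indices)
```

In a full model, each selected hidden state would pass through its own learned projection before entering the matching block; that part is omitted here because its shape and placement are not specified in this summary.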
The experimental evaluation is comprehensive, covering a wide range of benchmarks across text-to-audio, text-to-speech, zero-shot cloning, and audio editing tasks. The results show that UNISON performs competitively against task-specialist models, achieving superior performance in several metrics such as FAD, CLAP, and WER. The ablation studies provide valuable insights into the importance of the proposed architectural choices, confirming the effectiveness of the layer-wise deep LLM fusion and the necessity of a multi-task training approach. The use of both objective and subjective metrics enhances the credibility of the findings.
The paper provides thorough implementation details, including model configurations, training data composition, and hyperparameters, which are essential for reproducibility. However, the absence of a publicly available code repository limits others' ability to replicate the results fully. The authors could enhance reproducibility by releasing their code and trained models.
The paper acknowledges limitations related to the VAE reconstruction quality, particularly for speech, which may affect the overall output quality. Additionally, the synthetic training data for editing tasks may not fully capture the complexities of real-world audio scenes, potentially impacting the model's performance in practical applications. The current model's language support is limited to English and Chinese, which may restrict its applicability in multilingual contexts.
UNISON has the potential to significantly impact the fields of audio generation and editing by providing a unified framework that simplifies the deployment of audio systems. Its ability to handle multiple tasks with a single model could lead to advancements in applications such as virtual assistants, content creation, and audio post-production. The integration of LLMs into audio processing also opens avenues for more intelligent and context-aware audio systems.
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
Primary: Beijing Technology and Business University
All Institutions: University of Sydney, Beijing Technology and Business University, Xidian University, Tongji University
The paper presents EventSpeech, a pioneering framework that utilizes neuromorphic events for expressive speech generation, significantly advancing the state of the art in multimodal speech synthesis. The innovative approach and robust experimental validation position this work as a substantial contribution to the field, addressing key limitations of existing methods and opening new avenues for research and application.
The proposed EventSpeech framework introduces a novel architecture that leverages neuromorphic events for speech generation, addressing the limitations of traditional RGB-based methods. The integration of a dedicated Event Encoder and a multi-scale Audio Encoder, along with a bidirectional alignment mechanism, demonstrates a sophisticated approach to synchronizing visual and auditory modalities. The methodology is well-structured, with a clear focus on addressing the Temporal Granularity Mismatch, and the use of specialized components like the Hierarchical Wavelet Contextualizer (HWC) enhances the model's ability to capture fine-grained emotional nuances in speech.
The paper presents extensive evaluations on the EVT-SPK benchmark, which is a significant contribution to the field as it includes both synthetic and real-world datasets. The results indicate that EventSpeech outperforms state-of-the-art methods across various metrics, showcasing its robustness in handling rapid articulation and subtle facial dynamics. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings.
The paper provides implementation details, including the training setup and optimization strategies, which are crucial for reproducibility. However, the absence of a public code repository, beyond the project page linked in the abstract, limits the ease with which other researchers can replicate the results.
The EVT-SPK benchmark's limited scale and the reliance on simulated events may restrict the model's generalization capabilities. Additionally, the paper acknowledges the challenges associated with capturing complex physical sensor noise in real-world scenarios, which could affect performance.
The introduction of neuromorphic events for speech generation has the potential to revolutionize multimodal speech synthesis, enabling more expressive and natural-sounding speech. This could have applications in various domains, including virtual assistants, entertainment, and accessibility technologies.
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark of 2,000 verified examples spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder--LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, raising Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox and from 37.74% to 54.78% on the MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
Primary: University of Southern California
All Institutions: University of Southern California
The main contribution of this paper is the introduction of VoxParadox, a benchmark that effectively isolates and evaluates the paralinguistic understanding of Audio LLMs, alongside innovative methods to enhance model performance in this domain. The work is significant as it addresses a critical gap in the capabilities of current Audio LLMs and proposes actionable solutions that could lead to more robust multimodal systems.
The paper introduces VoxParadox, a novel adversarial benchmark designed to evaluate the paralinguistic understanding of Audio LLMs by creating controlled linguistic-acoustic contradictions. The methodology is robust, employing a systematic approach to generate adversarial examples and utilizing layer-wise probing to diagnose model limitations. The proposed Prompt-Conditioned Layer Mixer (PCLM) is a significant innovation that adaptively combines information from multiple audio layers based on the input prompt, addressing identified bottlenecks in model performance.
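As a rough illustration of prompt-conditioned layer mixing, the sketch below scores each audio encoder layer against a prompt embedding, turns the scores into softmax weights, and returns the weighted sum of the per-layer features. The dot-product scoring, the `layer_keys` parameters, and all function names are assumptions made for this example; the paper's actual PCLM parameterization is not described in this summary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_feats, prompt_vec, layer_keys):
    """Combine per-layer audio features with prompt-dependent weights.

    layer_feats: one feature vector per audio encoder layer (equal length).
    prompt_vec:  embedding of the input prompt.
    layer_keys:  one key vector per layer (hypothetical learned parameters).
    """
    # Score each layer by the dot product of its key with the prompt embedding.
    scores = [sum(p * k for p, k in zip(prompt_vec, key)) for key in layer_keys]
    weights = softmax(scores)
    dim = len(layer_feats[0])
    mixed = [
        sum(w * feat[d] for w, feat in zip(weights, layer_feats))
        for d in range(dim)
    ]
    return mixed, weights


# Toy example: two layers, and a prompt that favors the first (shallower) layer.
feats = [[1.0, 0.0], [0.0, 1.0]]
mixed, weights = mix_layers(feats, [1.0, 0.0], [[2.0, 0.0], [0.0, 0.0]])
print(weights, mixed)
```

The point of the design is that a question about speaking style can put more weight on shallower layers, where the probing results suggest paralinguistic cues are better preserved.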
The experiments are comprehensive, evaluating a diverse set of Audio LLMs against the VoxParadox benchmark. The results demonstrate a clear performance gap in paralinguistic tasks, with models showing a tendency to rely on transcript-implied answers rather than acoustic evidence. The paper provides detailed metrics, including ground truth accuracy and adversarial-label agreement, which effectively illustrate the models' weaknesses and the improvements achieved through the proposed methods.
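The two metrics mentioned above can be sketched as simple match rates. This is a minimal illustration assuming each example has one prediction, one acoustically correct label, and one transcript-implied (adversarial) label; the function name and the toy labels are hypothetical and are not taken from the benchmark.

```python
def paradox_scores(predictions, acoustic_labels, text_implied_labels):
    """Return (ground-truth accuracy, adversarial-label agreement).

    Ground-truth accuracy: share of predictions matching the acoustic label.
    Adversarial agreement: share of predictions matching the label implied
    by the transcript, which is incorrect by construction.
    """
    n = len(predictions)
    if n == 0 or n != len(acoustic_labels) or n != len(text_implied_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    gt_hits = sum(p == a for p, a in zip(predictions, acoustic_labels))
    adv_hits = sum(p == t for p, t in zip(predictions, text_implied_labels))
    return gt_hits / n, adv_hits / n


# Toy data: the model follows the transcript on two of four examples.
preds = ["sad", "happy", "happy", "angry"]
acoustic = ["sad", "sad", "angry", "angry"]
implied = ["happy", "happy", "happy", "calm"]
print(paradox_scores(preds, acoustic, implied))
```

A model that ignores the audio would show low ground-truth accuracy together with high adversarial agreement, which is the failure pattern the review describes.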
The paper includes sufficient detail regarding the experimental setup, data generation pipeline, and evaluation metrics, which supports reproducibility. However, the implementation specifics of the PCLM and DPO methods could benefit from additional clarity to ensure that other researchers can replicate the results accurately.
The authors acknowledge that PCLM is a post-hoc solution and that the degradation of paralinguistic information in deeper layers and at the encoder-LLM interface presents inherent limitations. Additionally, while VoxParadox serves as a controlled stress test, it may not fully capture the complexities of naturalistic speech scenarios. The reliance on TTS-generated audio also raises questions about the generalizability of the findings.
The research has significant implications for improving speech-based interfaces and accessibility technologies, enhancing the ability of Audio LLMs to interpret non-verbal cues accurately. However, the potential for misuse in profiling and surveillance contexts necessitates careful consideration of ethical implications and the establishment of safeguards in deployment.
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
Primary: Columbia University
All Institutions: Columbia University
The main contribution of this paper is the introduction of Sympatheia, a voice-native framework for emotionally aligned speech dialogue that integrates implicit and explicit affect conditioning. This work represents a significant advancement in the development of empathetic voice assistants, providing a comprehensive approach to generating emotionally appropriate responses in spoken dialogue systems. The combination of a novel dataset, robust methodology, and thorough evaluation underscores its importance in the field of machine learning and audio processing.
The methodology presented in this paper is robust and innovative, combining implicit affect inference from user speech with explicit valence-arousal (VA) conditioning. The authors construct a novel dataset (Sympatheia-18k) that allows for the training of a speech-to-speech dialogue system capable of generating emotionally appropriate responses. The use of continuous VA coordinates as a conditioning mechanism is a significant advancement over traditional discrete emotion categories, allowing for more nuanced emotional responses. The integration of multimodal emotion sensing modules adds further depth to the system, making it adaptable to various input types. The architecture follows a well-established speech-language model (GLM-4-Voice) but enhances it with emotional conditioning, which is a thoughtful approach to improving empathetic dialogue systems.
The experimental evaluation is comprehensive, utilizing both automated and human assessments to evaluate the empathetic response quality of the Sympatheia system. The authors employ a variety of metrics, including empathy scores from an audio-capable LLM and a human Emotion Mean Opinion Score (MOS) study, which provides a well-rounded view of the model's performance. The results indicate that Sympatheia significantly outperforms baseline models in generating emotionally appropriate responses, validating the effectiveness of the proposed methods. The use of both emotional and neutral splits in the dataset allows for a thorough examination of the model's capabilities across different emotional contexts.
The paper provides thorough implementation details, including training configurations and dataset generation processes, which enhance reproducibility. The availability of the project code on GitHub and the dataset on Hugging Face further supports the ability of other researchers to replicate the study. However, the reliance on synthetic data for training may introduce variability that could affect reproducibility in real-world applications.
The paper acknowledges several limitations, including the synthetic nature of the training data, which may not fully capture the complexity of real-world conversations. Additionally, the fixed VA anchors used for emotional conditioning may not universally apply across different cultures or individual expressions of emotion. The authors also note that the current evaluation primarily relies on automated assessments, which may miss nuanced failures in empathy and appropriateness.
The potential applications of Sympatheia are significant, particularly in assistive technologies, education, and mental health support, where emotionally aware interactions can enhance user experience. However, the deployment of such systems raises ethical considerations regarding privacy and the potential for misuse in manipulative contexts. The authors emphasize the need for safeguards and responsible deployment practices to mitigate these risks.
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high-dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate the relevant components of the representation and estimate their density, using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritizes low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.
Primary: Norwegian University of Science and Technology
All Institutions: Norwegian University of Science and Technology, Tsinghua University
This paper presents a novel framework for using continuous normalizing flows in out-of-distribution detection, significantly advancing the understanding and application of generative models in high-dimensional data analysis. The methodology is innovative, addressing key challenges in the field, and the experimental results demonstrate its effectiveness in a practical application.
The paper introduces a novel Lagrangian sub-flow (LSF) framework for out-of-distribution (OOD) detection using continuous normalizing flows (CNFs). The methodology is well-grounded in fluid dynamics principles, allowing for localized analysis of high-dimensional data while maintaining global context. The approach effectively addresses the "likelihood paradox" by isolating relevant components in the data representation, which is a significant advancement in the field of generative models. The proposed geometric diagnostic signals and metrics for phoneme-level mispronunciation detection are innovative and provide a fresh perspective on OOD detection.
The experiments are robust, utilizing a real-world dataset (CMU Kids) for zero-shot phoneme-level mispronunciation detection. The results demonstrate the superiority of the proposed metrics over traditional likelihood-based methods, highlighting the effectiveness of the LSF framework. The evaluation metrics, including ROC-AUC, are appropriate for the task, although further validation across diverse datasets would strengthen the findings.
The paper provides sufficient details on the experimental setup, including model training and evaluation processes. However, the lack of publicly available code or a demo limits reproducibility. Clear descriptions of the methods and metrics used contribute positively, but access to implementation details would enhance reproducibility.
The study is primarily focused on a specific application in speech synthesis, which may limit the generalizability of the findings. The authors acknowledge the need for further validation across other domains, indicating that the framework's applicability is yet to be fully explored. Additionally, the complexity of the proposed methods may pose challenges for practical implementation in real-time systems.
The proposed framework has the potential to significantly improve OOD detection in various applications beyond speech synthesis, such as computer vision and medical imaging. By enhancing the ability to detect mispronunciations and other anomalies, this work could lead to advancements in automated speech recognition and generative modeling, ultimately benefiting user experience and system reliability.
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of QuAP, a prototype system that integrates content-based audio retrieval and procedural sound generation, thereby addressing the fragmentation in current sound design workflows. This work represents a significant advancement in audio processing, combining innovative methodologies with practical applications, and highlights the importance of user-centered design in the development of creative tools.
The methodology employed in the development of QuAP is robust, integrating a hybrid retrieval system with procedural audio synthesis and an intelligent parameter assistant. The use of MobileNet for audio embeddings and the feature-driven bottleneck framework for optimizing synthesis parameters demonstrates a thoughtful approach to addressing the challenges in sound design workflows. However, the paper could benefit from a more detailed description of the implementation specifics and the exact parameters used in the optimization process.
The experimental evaluation is well-structured, utilizing a MUSHRA subjective evaluation to assess the quality of the synthesized audio and an ablation study to compare encoder architectures. The results indicate statistically significant improvements in sound quality for most models, which supports the effectiveness of the proposed system. However, the relatively small sample size in the user evaluation (16 participants) may limit the generalizability of the findings.
While the paper provides a project URL and mentions the use of established datasets and frameworks, it lacks detailed implementation instructions or code availability, which could hinder reproducibility. More explicit documentation on the setup and execution of experiments would enhance this aspect.
The study acknowledges limitations, particularly in the synthesis quality of certain models (e.g., Rocket and Jet) and the narrow scope of sound categories supported by QuAP. The reliance on subjective evaluations may also introduce biases, and the tool's performance in real-world scenarios remains to be fully validated.
QuAP has the potential to significantly impact sound design practices by streamlining workflows and enhancing creative exploration. By unifying retrieval and synthesis, it could facilitate more efficient sound design processes across various industries, including film, gaming, and music production. The focus on maintaining creative agency while providing intelligent assistance is particularly relevant in the context of increasing automation in creative fields.
Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a gap remains in achieving fine-grained, interpretable control over discrete signal attributes. To bridge this gap, this paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation via inference-time activation steering, without retraining. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.
Primary: Athens University of Economics and Business
All Institutions: Athens University of Economics and Business, Orfium, Hellenic Mediterranean University, National Center for Scientific Research “Demokritos”
The main contribution of this paper is the introduction of a framework for deterministic attribute modulation in symbolic music generation through activation steering, which enhances interpretability and control without the need for retraining. This work is significant as it bridges the gap between complex generative models and user-driven control, paving the way for more interactive and user-friendly music generation systems.
The paper presents a novel approach to activation steering in the Multitrack Music Transformer (MMT) by utilizing the Difference-in-Means (DiffMean) methodology to isolate latent directions for musical attributes. The introduction of a Dual Steering framework using Gram-Schmidt Orthogonalization is a significant advancement in addressing feature entanglement, allowing for independent control of attributes like Pitch and Duration. The methodology is well-structured, leveraging existing theories in mechanistic interpretability while innovatively applying them to symbolic music generation.
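The two building blocks named above can be sketched in a few lines. A Difference-in-Means direction is the mean activation over examples with a high attribute value minus the mean over examples with a low value, and a Gram-Schmidt step removes from one direction its component along another. The function names, the toy two-dimensional activations, and the steering strengths below are assumptions for illustration only; they are not the paper's implementation.

```python
def mean_vec(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def diff_mean(pos_acts, neg_acts):
    """Difference-in-Means direction: mean(positive) minus mean(negative)."""
    return [p - q for p, q in zip(mean_vec(pos_acts), mean_vec(neg_acts))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(v, u):
    """Gram-Schmidt step: remove from v its projection onto u."""
    scale = dot(v, u) / dot(u, u)
    return [x - scale * y for x, y in zip(v, u)]


def steer(hidden, steering):
    """Add each (direction, strength) pair to a residual-stream vector."""
    out = list(hidden)
    for direction, alpha in steering:
        out = [h + alpha * d for h, d in zip(out, direction)]
    return out


# Toy activations (hypothetical): the duration direction overlaps with pitch.
pitch_dir = diff_mean([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
duration_dir = diff_mean([[1.0, 1.0], [3.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]])
duration_orth = orthogonalize(duration_dir, pitch_dir)
print(pitch_dir, duration_orth, dot(duration_orth, pitch_dir))
print(steer([1.0, 1.0], [(pitch_dir, 1.0), (duration_orth, 0.5)]))
```

After orthogonalization, adding the duration direction no longer moves the activation along the pitch direction, which is the intuition behind reduced interference compared with naive vector addition.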
The experimental setup is robust, with clear definitions of the steering vectors and comprehensive evaluations across both unconditional and conditional generation paradigms. The use of statistical measures such as Pearson correlation coefficients and R² values provides a solid quantitative basis for the effectiveness of the steering methods. The results demonstrate a high degree of success in achieving the intended attribute shifts, with detailed analysis of steering dynamics across various layers of the transformer architecture.
The paper includes sufficient detail regarding the model architecture, data representation, and experimental procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the experiments. The URL provided for audio examples is a positive aspect, but a more comprehensive project URL would bolster reproducibility further.
One limitation is the reliance on a single dataset (SOD), which may affect the generalizability of the findings. Additionally, while the paper addresses conceptual interference, the methods for dual steering may still encounter challenges in more complex musical contexts or with additional attributes. The paper could also benefit from a discussion on the computational efficiency of the proposed methods in real-time applications.
This research has the potential to significantly impact the field of music generation and AI-driven creative tools, providing musicians and composers with more precise control over generated outputs. The findings could be applied in various applications, including algorithmic composition, interactive music systems, and educational tools for music theory. The focus on mechanistic interpretability also contributes to the broader discourse on transparency and explainability in AI systems.
Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectrally similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising direction for auditory neuroscience research and non-invasive speech brain-computer interfaces.
Primary: Fudan University
All Institutions: Fudan University
The MindVoice framework represents a significant advancement in reconstructing intelligible speech from non-invasive neural signals, utilizing a novel dual-stream architecture that effectively leverages pretrained models to address the challenges posed by noisy and incomplete neural recordings. This work has the potential to impact both the fields of auditory neuroscience and speech technology significantly.
The proposed MindVoice framework introduces a dual-stream architecture that separates semantic and acoustic reconstruction, leveraging pretrained models to enhance the intelligibility of reconstructed speech from non-invasive neural signals. This approach is innovative as it addresses the inherent noise and spatial blurring of neural recordings by disentangling the reconstruction process into two complementary pathways. The use of pretrained models for both semantic and acoustic attributes is a significant methodological advancement, allowing the model to compensate for the incomplete information present in neural signals. The architecture's design is well-justified, and the integration of various neural network components, including CNNs and Transformers, is appropriate for the task.
The authors conduct extensive experiments on two datasets (Brennan EEG and Gwilliams MEG), demonstrating that MindVoice outperforms existing baselines across multiple metrics, including semantic accuracy and speech quality. The evaluation metrics employed, such as HuBERT representation similarity and BERTScore-F1, are robust and relevant for assessing the intelligibility and quality of reconstructed speech. The results indicate a clear improvement over previous methods, validating the effectiveness of the proposed framework. However, the paper could benefit from more detailed comparisons with additional baselines and a broader range of evaluation metrics.
The implementation details are provided, including the architecture, training parameters, and preprocessing steps. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing the code and models to facilitate further research and validation by the community.
The study acknowledges limitations, including the model's tendency to produce generative hallucinations when neural signals do not provide sufficient information. The focus on semantic and timbre similarity may compromise fine-grained temporal fidelity, which is critical for certain applications. Additionally, the framework's applicability is currently limited to non-invasive neural signals related to auditory perception, leaving open questions about its performance on other types of neural signals.
The research has significant implications for the development of non-invasive speech brain-computer interfaces, potentially enabling communication for individuals with speech impairments. It also contributes to our understanding of auditory processing in the brain, paving the way for future studies in auditory neuroscience. The framework's ability to reconstruct intelligible speech from neural signals could lead to advancements in assistive technologies and enhance our understanding of human cognition.
Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propose a new self-supervised approach to learning prosody representations that incorporates speaker disentanglement strategies. We evaluate our encoder on three tasks to probe representation capabilities, including pitch reconstruction and detection of different prosodic events. Our encoder outperforms raw prosody and HuBERT-base baselines, achieving strong speaker disentanglement without adverse impact on prosody-related downstream tasks.
Primary: University of Washington
All Institutions: University of Washington
The main contribution of this paper is the development of a self-supervised prosody encoder that successfully disentangles speaker characteristics while preserving prosodic information, addressing critical privacy concerns in speech processing. The technical contributions and innovative methodology position this work as a meaningful advancement in the field of audio processing, with potential applications in privacy-sensitive speech technologies.
The methodology presented in this paper is robust, leveraging self-supervised learning to create a prosody encoder that effectively disentangles speaker characteristics from prosodic features. The use of glottal source estimation as input is innovative, and the combination of adversarial training with speaker normalization is a thoughtful approach to mitigate privacy concerns while maintaining prosody representation quality. The architecture builds on existing models like HuBERT and ProsodyBERT, but introduces significant enhancements, particularly in the context of privacy-preserving applications.
The experimental evaluation is comprehensive, utilizing multiple tasks to assess the encoder's performance, including pitch reconstruction and prosodic event detection. The results demonstrate clear improvements over baseline models, indicating that the proposed methods effectively enhance prosody modeling without compromising speaker disentanglement. The use of extensive datasets, such as the GigaSpeech corpus, strengthens the validity of the findings.
The paper provides detailed implementation information, including the training setup and the specific datasets used. However, the reliance on pseudo-labels for speaker normalization may affect reproducibility, as the effectiveness of the disentanglement strategies could vary with different labeling approaches. The GitHub repository linked in the paper aids in reproducibility, but the absence of publicly available code for some related works limits comparative evaluations.
The paper acknowledges limitations, including the use of pseudo-labels instead of ground-truth speaker labels, which may hinder the effectiveness of the proposed methods. Additionally, the focus on local prosodic events could limit the generalizability of the findings to more complex paralinguistic tasks. The model's non-causal nature also restricts its application in real-time scenarios.
The implications of this research are significant, particularly in the context of privacy-preserving speech technologies. By effectively disentangling speaker information from prosodic features, the proposed encoder can contribute to safer speech processing applications, such as AI assistants and voice synthesis systems, where user privacy is paramount. The approach could also inspire further research into privacy-preserving techniques across various domains of machine learning.
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper provides a unified taxonomy and empirical evaluation of jailbreak attacks and defenses for LALMs, contributing significantly to the understanding of vulnerabilities in audio-based models. The comprehensive approach and findings underscore the importance of considering multiple dimensions of safety and usability in the design of LALMs.
The paper presents a comprehensive taxonomy of jailbreak attacks and defenses in Large Audio Language Models (LALMs), categorizing them into semantic, acoustic, signal, and embedding-layer attacks, as well as guard-based, training-free, and training-based defenses. The methodology is robust, combining a structured survey with empirical evaluations across ten open-source LALMs, which allows for a fair comparison of various attack and defense strategies. The authors also introduce a cost-aware evaluation framework that considers not just attack success rates but also benign refusal and latency, which is a significant improvement over previous works that focused solely on success rates.
The experiments are well-structured, utilizing a controlled dataset from JailbreakBench with 100 harmful and 100 benign requests, allowing for a clear assessment of the effectiveness of various attacks and defenses. The results indicate that different attack strategies yield varying success rates, with the Acoustic Best-of-N attack demonstrating the highest vulnerability. The empirical evaluation of defenses reveals a trade-off between robustness and usability, highlighting the complexity of ensuring safety in LALMs.
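The three quantities the evaluation reports (attack success rate, benign refusal, and latency) can be computed from per-request records. The sketch below is only an illustration of that bookkeeping: the `Trial` record, its field names, and the toy values are assumptions made here, not the paper's data format or results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trial:
    """One evaluated request (hypothetical record layout)."""
    harmful_request: bool  # True for a harmful prompt, False for a benign one
    refused: bool          # did the model or its guard refuse to answer?
    judged_unsafe: bool    # did a judge flag the response as unsafe?
    latency_s: float       # wall-clock seconds spent on this request


def summarize(trials: List[Trial]) -> Dict[str, float]:
    """Aggregate success rate, benign refusal rate, and mean latency."""
    harmful = [t for t in trials if t.harmful_request]
    benign = [t for t in trials if not t.harmful_request]
    return {
        # fraction of harmful requests that produced an unsafe response
        "attack_success_rate": sum(t.judged_unsafe for t in harmful) / len(harmful),
        # fraction of benign requests that were wrongly refused
        "benign_refusal_rate": sum(t.refused for t in benign) / len(benign),
        # average cost per request, attacks and benign traffic together
        "mean_latency_s": sum(t.latency_s for t in trials) / len(trials),
    }


# Made-up toy records, for illustration only.
toy_trials = [
    Trial(True, False, True, 1.0),
    Trial(True, True, False, 2.0),
    Trial(True, True, False, 1.0),
    Trial(True, False, True, 2.0),
    Trial(False, False, False, 0.5),
    Trial(False, True, False, 0.5),
    Trial(False, False, False, 0.5),
    Trial(False, False, False, 0.5),
]
summary = summarize(toy_trials)
print(summary)
```

Reporting the three numbers side by side is what exposes the trade-off described above: a defense that lowers the attack success rate but raises the benign refusal rate or the latency has only moved the cost elsewhere.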
The paper provides detailed descriptions of the experimental setup, including the datasets used, the models evaluated, and the specific attack and defense methods employed. However, the reliance on specific hardware and configurations may limit the reproducibility of results in different environments. The authors do not provide code or data access, which could further hinder reproducibility.
The authors acknowledge several limitations, including the restricted model coverage to ten open-source LALMs and the controlled nature of the dataset, which may not fully represent real-world scenarios. Additionally, the evaluation metrics used may not capture all aspects of deployment, such as user satisfaction with benign responses. The paper also does not explore all possible attack and defense categories outlined in the taxonomy.
The findings of this paper have significant implications for the development of safe and robust LALMs, particularly in applications involving voice assistants and interactive systems. The emphasis on cost-aware evaluation and the identification of vulnerabilities across different modalities can guide future research in creating more resilient audio systems. The work also raises awareness about the potential for misuse of LALMs in bypassing safety mechanisms, highlighting the need for ongoing research into equitable and effective safety measures.
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
Primary: University of Southern California
All Institutions: University of Southern California, The Ohio State University, University of California, Los Angeles, Harvard University, Boston University, University of Miami
The main contribution of this paper is the introduction of the ChildVox benchmark, which systematically evaluates a wide range of child-centered audio and speech tasks, significantly advancing the field of child communication research. The comprehensive methodology, rigorous experimental design, and acknowledgment of limitations highlight the paper's significance and potential impact on future research and applications in audio processing for children.
The methodology presented in the paper is robust, as it introduces the ChildVox benchmark, which encompasses a wide range of child-centered audio and speech tasks. The integration of over 20 sub-tasks across 17 datasets is a significant advancement, allowing for a comprehensive evaluation of various audio and speech foundation models. The broad definition of "voice" in children, which includes physiological sounds and non-linguistic vocalizations, is innovative and necessary for understanding child communication. The evaluation of multiple model architectures, including self-supervised and ASR-oriented models, provides a well-rounded perspective on the capabilities of current technologies in this domain.
The experiments are thorough, with a clear structure that includes a variety of tasks and datasets. The benchmark results demonstrate that ChildVox provides high-performance models for recognizing a wide range of acoustic signals from children. The paper effectively compares the performance of different models on specific tasks, highlighting the strengths and weaknesses of each. The use of Macro-F1 scores for classification tasks and WER for ASR tasks is appropriate, ensuring that the evaluation metrics are relevant to the goals of the benchmark.
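Macro-F1 suits classification tasks with uneven class sizes because every class contributes equally to the score regardless of how many examples it has. A minimal sketch of the metric follows; the class names and labels are invented for illustration and are not taken from the benchmark, and the label set is assumed to be the union of true and predicted labels.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 if the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical vocalization labels, for illustration only.
truth = ["cry", "cry", "babble", "laugh"]
predicted = ["cry", "babble", "babble", "laugh"]

score = macro_f1(truth, predicted)
accuracy = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
print(f"macro-F1 = {score:.4f}, accuracy = {accuracy:.2f}")
```

In this toy case the per-class F1 values are 2/3 (cry), 2/3 (babble), and 1 (laugh), so the macro average is 7/9, while plain accuracy is 3/4.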
The paper provides detailed information about the datasets, experimental setup, and model training parameters, which enhances reproducibility. However, the lack of publicly available code or models limits other researchers' ability to replicate the results fully. The authors mention plans to release models under a Responsible AI License, which is a positive step towards improving reproducibility in the future.
The paper acknowledges several limitations, including the focus on English-language recordings, which may restrict generalizability to other languages and dialects. Additionally, the subjective nature of some tasks, such as affective vocalization classification, may introduce variability in annotation reliability. The authors also note that the benchmark does not cover all recent advancements in audio foundation models, which could limit its comprehensiveness.
The ChildVox benchmark has significant implications for research in child development, speech therapy, and early childhood education. By providing a structured framework for evaluating child-centered audio processing, it can facilitate advancements in understanding children's communication and support the development of tools for monitoring and enhancing language skills. The potential applications in clinical settings for tracking speech production and language development are particularly noteworthy.
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially accounts for the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, University of Surrey
The main contribution of this paper is the introduction of COMET, a novel framework for analyzing and mitigating the modality gap in audio-text multimodal contrastive embeddings, which significantly enhances the performance of zero-shot audio captioning tasks. The comprehensive analysis and innovative methodology position this work as a meaningful advancement in the field of multimodal machine learning.
The paper introduces a novel framework, COMET, utilizing Partial Least Squares Singular Value Decomposition (PLS-SVD) to analyze and mitigate the modality gap between audio and text embeddings in CLAP models. The methodology is well-structured, offering a fresh perspective on the decomposition of multimodal embeddings into interpretable concepts. The spectral truncation method proposed is innovative, allowing for effective dimensionality reduction while maintaining performance, which is a significant contribution to the field of multimodal contrastive learning.
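To make the idea of PLS-SVD with spectral truncation concrete, here is a minimal NumPy sketch. It assumes the textbook formulation, in which paired audio and text embeddings are mean-centered, their cross-covariance matrix is factored by SVD, and only the top `k` singular directions are kept. The function names, the synthetic data, and the choice of `k` are assumptions for illustration; the paper's exact preprocessing and truncation rule may differ.

```python
import numpy as np


def fit_pls_svd(audio: np.ndarray, text: np.ndarray, k: int):
    """Fit PLS-SVD on paired embeddings of shape (n, d_audio) and (n, d_text).

    Returns both modality means, the top-k audio and text axes, and the
    full vector of singular values (sorted in descending order by NumPy).
    """
    mu_a, mu_t = audio.mean(axis=0), text.mean(axis=0)
    cross_cov = (audio - mu_a).T @ (text - mu_t) / (len(audio) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return mu_a, mu_t, u[:, :k], vt[:k].T, s


def project(x: np.ndarray, mean: np.ndarray, axes: np.ndarray) -> np.ndarray:
    """Center, project onto the retained axes, and L2-normalize each row."""
    z = (x - mean) @ axes
    return z / np.linalg.norm(z, axis=1, keepdims=True)


# Synthetic stand-in for CLAP embeddings: three shared concepts plus noise
# and a different constant offset per modality (a crude "modality gap").
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 3))
audio = shared @ rng.normal(size=(3, 16)) + 0.1 * rng.normal(size=(200, 16)) + 2.0
text = shared @ rng.normal(size=(3, 12)) + 0.1 * rng.normal(size=(200, 12)) - 2.0

mu_a, mu_t, axes_a, axes_t, s = fit_pls_svd(audio, text, k=3)
za, zt = project(audio, mu_a, axes_a), project(text, mu_t, axes_t)

# Mean cosine similarity between each audio item and its paired text item.
paired_cos = float(np.mean(np.sum(za * zt, axis=1)))
print("leading singular values:", np.round(s[:5], 3))
print("mean paired cosine after truncation:", round(paired_cos, 3))
```

Inspecting how quickly the singular values decay is one way to see whether a small subset of axes dominates the cross-modal similarity; the truncation step then discards the remaining directions along with the per-modality means.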
The experiments are comprehensive, utilizing standard datasets like Clotho and AudioCaps for evaluation. The results demonstrate that the proposed PLSHead method achieves comparable or improved performance over the original embeddings, validating the effectiveness of the approach. The paper provides detailed metrics for retrieval tasks, showcasing the robustness of the method across different scenarios, including in-domain and cross-domain evaluations.
The paper lacks explicit implementation details and does not release code, which could hinder reproducibility. While the methodology is clearly described, the absence of a publicly available codebase or demo limits other researchers' ability to replicate the findings.
One limitation is the reliance on existing CLAP models, which may introduce biases based on their training data. Additionally, while the proposed methods show promise, the paper does not explore the potential impacts of varying the number of retained dimensions in the spectral truncation, which could affect generalization in different contexts.
The findings have significant implications for audio understanding and generation tasks, particularly in zero-shot scenarios. By effectively bridging the modality gap, the proposed methods could enhance the performance of multimodal applications, making them more accessible and efficient. This work could pave the way for future research in multimodal learning and its applications in real-world scenarios.
While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.
Primary: KAIST
All Institutions: KAIST, Google DeepMind
The main contribution of this paper is the systematic evaluation of decoding strategies for DLM-based ASR, revealing that static and dynamic thresholding significantly enhance accuracy and speed compared to fixed-number decoding. This work provides a crucial step towards optimizing ASR systems, particularly in leveraging the unique properties of DLMs for improved performance.
The paper presents a systematic evaluation of decoding strategies for DLM-based ASR, comparing fixed-number, static threshold, and dynamic threshold approaches. The methodology is well-structured, utilizing Negative Log-Likelihood (NLL) as a measure of uncertainty, which is a novel approach in this context. The authors effectively analyze the performance of each strategy in terms of accuracy and speed, providing a clear rationale for their findings. However, the reliance on a single baseline model (Whisper-LLaDA) may limit the generalizability of the results.
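The difference between the three schemes is easiest to see in code. The sketch below is not the paper's implementation: it assumes a hypothetical `predict` callback that returns a candidate token and a confidence for every still-masked position, a made-up toy predictor with fixed confidences, and a linear decay schedule for the dynamic threshold. It also assumes a fallback that commits the single most confident token when nothing clears the threshold, so that decoding always terminates.

```python
from typing import Callable, Dict, List, Optional, Tuple

MASK = None  # marks a position that has not been decoded yet
Prediction = Dict[int, Tuple[str, float]]  # position -> (token, confidence)
Selector = Callable[[Prediction, int], List[int]]


def decode(length: int,
           predict: Callable[[List[Optional[str]]], Prediction],
           select: Selector,
           max_rounds: int = 64) -> Tuple[List[Optional[str]], int]:
    """Iteratively unmask a sequence; returns (tokens, rounds used)."""
    tokens: List[Optional[str]] = [MASK] * length
    rounds = 0
    while MASK in tokens and rounds < max_rounds:
        preds = predict(tokens)
        chosen = select(preds, rounds)
        if not chosen:  # assumed fallback: commit the most confident token
            chosen = [max(preds, key=lambda p: preds[p][1])]
        for pos in chosen:
            tokens[pos] = preds[pos][0]
        rounds += 1
    return tokens, rounds


def fixed_number(k: int) -> Selector:
    """Commit the k most confident masked positions in every round."""
    def select(preds: Prediction, _round: int) -> List[int]:
        ranked = sorted(preds, key=lambda p: preds[p][1], reverse=True)
        return ranked[:k]
    return select


def static_threshold(tau: float) -> Selector:
    """Commit every masked position whose confidence is at least tau."""
    def select(preds: Prediction, _round: int) -> List[int]:
        return [p for p, (_, conf) in preds.items() if conf >= tau]
    return select


def dynamic_threshold(tau_start: float, tau_end: float, decay_rounds: int) -> Selector:
    """Like static_threshold, but tau moves linearly from tau_start to tau_end."""
    def select(preds: Prediction, round_idx: int) -> List[int]:
        frac = min(round_idx / decay_rounds, 1.0)
        tau = tau_start + frac * (tau_end - tau_start)
        return [p for p, (_, conf) in preds.items() if conf >= tau]
    return select


# Toy stand-in for a DLM forward pass: most tokens are confident, two are hard.
REFERENCE = ["the", "cat", "sat", "on", "the", "mat"]
CONFIDENCE = [0.99, 0.96, 0.97, 0.98, 0.60, 0.40]  # made-up values


def toy_predict(tokens: List[Optional[str]]) -> Prediction:
    return {i: (REFERENCE[i], CONFIDENCE[i])
            for i, tok in enumerate(tokens) if tok is MASK}


fixed_tokens, fixed_rounds = decode(6, toy_predict, fixed_number(1))
static_tokens, static_rounds = decode(6, toy_predict, static_threshold(0.9))
dynamic_tokens, dynamic_rounds = decode(6, toy_predict, dynamic_threshold(0.95, 0.5, 2))
print(fixed_rounds, static_rounds, dynamic_rounds)  # prints: 6 3 3
```

In this toy setting the threshold-based selectors harvest the four confident tokens in the first round and spend the remaining rounds only on the two hard ones, which mirrors the intuition the review describes. The round counts say nothing about the paper's measured accuracy or speed.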
The experiments are comprehensive, utilizing the LibriSpeech dataset and focusing on various hyperparameters for each decoding strategy. The evaluation metrics, including Word Error Rate (WER) and Real-Time Factor (RTF), are appropriate for assessing the performance of ASR systems. The results indicate that threshold-based strategies significantly outperform fixed-number schemes, which is a valuable contribution to the field. However, the paper could benefit from additional experiments on diverse datasets to validate the findings further.
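For readers less familiar with the two metrics, the sketch below shows their standard definitions: WER is the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF is processing time divided by audio duration. The example sentences and timings are invented, and real evaluations usually normalize text (case, punctuation) before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean decoding runs faster than real time."""
    return processing_seconds / audio_seconds


# Invented example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
rtf = real_time_factor(processing_seconds=1.5, audio_seconds=6.0)
print(f"WER = {wer:.3f}, RTF = {rtf:.2f}")
```

Here two errors against six reference words give a WER of 2/6, and 1.5 s of compute for 6 s of audio gives an RTF of 0.25.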
The paper provides sufficient details on the experimental setup, including the training process and evaluation metrics. However, the absence of code or a project URL limits reproducibility. Future work should include sharing the implementation to facilitate validation by other researchers.
The study is limited to clean read English speech from the LibriSpeech test-clean set, which may not fully represent the challenges of noisy or spontaneous speech. Additionally, the findings may not generalize to multilingual ASR systems, as the confidence distribution could vary significantly across different languages and contexts.
The findings have significant implications for the development of more efficient ASR systems, particularly in applications requiring real-time processing. By demonstrating the effectiveness of threshold-based decoding strategies, this work could influence future research directions in ASR and related fields, potentially leading to advancements in speech technology and accessibility.
Regional accent classification in Brazilian Portuguese (pt-BR) is hampered by its need for reliable labels. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.
Primary: Faculdade de Engenharia Elétrica e Computação (FEEC)
All Institutions: Faculdade de Engenharia Elétrica e Computação (FEEC), CNPq, UFRJ, UNICAMP
This paper presents a novel workflow for accent classification in Brazilian Portuguese, demonstrating that localized acoustic features can effectively capture dialectal variance without the need for sociolinguistic labels. The methodology and results contribute meaningfully to the field, showcasing the potential for improved speech processing techniques that are both interpretable and computationally efficient.
The methodology is innovative in its approach to accent classification by utilizing a purely audio-driven pipeline that relies on acoustic labels rather than sociolinguistic labels. The use of ZIPA for phoneme-based forced alignment to isolate accent markers is a significant methodological advancement. The authors effectively demonstrate the extraction of localized features that outperform general-purpose architectures, which is a novel contribution to the field of speech processing. The detailed description of the feature extraction process and the classification tasks is commendable, although the reliance on manual annotation may introduce bias.
The experimental evaluation is thorough, employing a variety of classifiers and a well-structured cross-validation protocol to assess the performance of the proposed features against established SSL models. The results indicate that the proposed method achieves competitive accuracy, which is a strong validation of the approach. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods and a clearer presentation of results in tables.
The paper provides sufficient detail regarding the methods and datasets used, which aids in reproducibility. However, the lack of publicly available code or datasets limits independent verification of the results. The authors mention a companion webpage, which could potentially provide additional resources, but this needs to be explicitly linked.
The study acknowledges that the accent markers used are not exhaustive for all Brazilian Portuguese accents, indicating a limitation in generalizability. The reliance on manual annotation for training data may also introduce biases that affect the model's performance. Additionally, the paper does not address potential challenges in real-world applications, such as variability in speaker accents and environmental noise.
The work has significant implications for the field of speech recognition and sociolinguistics, particularly in regions with diverse dialects like Brazil. By demonstrating that reliable accent classification can be achieved without sociolinguistic labels, the research opens avenues for more inclusive and accessible speech technologies. This could enhance applications in automatic speech recognition, language learning, and sociophonetic research.
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaohongshu Inc
The paper presents HoliTok, a continuous holistic tokenization model that effectively bridges the gap between speech generation and understanding tasks. Its innovative approach and strong experimental results position it as a significant contribution to the field of audio machine learning.
The proposed HoliTok model introduces a novel continuous tokenization approach that effectively balances the requirements of learnability and decodability for unified speech generation and understanding. The progressive training strategy enhances the model's ability to preserve signal fidelity while incorporating semantic information, which is a significant advancement over existing tokenization methods. The architecture's integration of a variational autoencoder with a temporal bottleneck and a downstream-aware supervision network is a thoughtful design choice that addresses the limitations of traditional tokenizers.
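To make the "compact" claim concrete, the rates quoted in the abstract (48 kHz input, 25 Hz latents, 128 dimensions) can be turned into a quick back-of-the-envelope calculation. This is only arithmetic on those published figures; the helper name `latent_shape` is hypothetical and ignores any padding or context windows the real encoder may use.

```python
# Figures quoted in the abstract.
SAMPLE_RATE_HZ = 48_000   # input waveform sampling rate
FRAME_RATE_HZ = 25        # latent frames per second
LATENT_DIM = 128          # channels per latent frame


def latent_shape(duration_s: float) -> tuple[int, int]:
    """Return (frames, channels) for a clip of the given length (hypothetical helper)."""
    return (int(duration_s * FRAME_RATE_HZ), LATENT_DIM)


# Waveform samples summarized by each latent frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ
# Scalar values the latent sequence carries per second of audio.
values_per_second = FRAME_RATE_HZ * LATENT_DIM
# Ratio of raw samples to latent values per second.
value_reduction = SAMPLE_RATE_HZ / values_per_second

print(samples_per_frame, values_per_second, value_reduction)
print(latent_shape(10.0))
```

Each latent frame therefore covers 1920 waveform samples, and a 10-second clip becomes a 250-step sequence, which is the length an autoregressive model would actually have to handle.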
The experiments conducted demonstrate the model's competitive performance in reconstruction fidelity, speech synthesis, and unified generation-understanding tasks. The evaluation metrics used, including PESQ, STOI, and WER, provide a robust framework for assessing the quality of the generated outputs. The results indicate that HoliTok not only outperforms existing methods but also maintains a compact latent representation, which is crucial for practical applications in speech technology.
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of detailed hyperparameter settings and specific training configurations in the main text may pose challenges for full replication. The availability of the code on GitHub is a positive aspect that aids in reproducibility efforts.
The study primarily focuses on speech generation and understanding, leaving out broader audio applications such as environmental sounds and music. The evaluation is limited to a specific architecture (AR+DiT), which may not capture the full potential of the proposed tokenizer across various modeling paradigms. Future work should explore these areas to validate the generalizability of the approach.
The advancements presented in this paper have the potential to significantly enhance speech synthesis and recognition technologies, making them more efficient and effective. The model's ability to serve as a unified interface for both tasks could lead to improvements in applications such as virtual assistants, automated transcription services, and interactive voice response systems. The implications for accessibility and user interaction with technology are substantial, as improved speech models can facilitate better communication for individuals with speech impairments.
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Google DeepMind, Meta Superintelligence Labs
The main contribution of this work is the introduction of MELD, a joint optimization framework for speech language modeling that effectively integrates discrete latent variables to enhance TTS and STT performance. This approach represents a significant advancement in the field, addressing key limitations of existing methods and paving the way for future research in multimodal speech processing.
The paper presents a novel approach to speech language modeling by introducing MELD, which integrates discrete latent variables into the autoregressive modeling of mel-spectrograms. This joint optimization of the encoder and autoregressive model addresses limitations of previous two-stage methods, particularly in preserving task-relevant information. The methodology is well-structured, leveraging variational inference to optimize a lower bound on the log likelihood, and effectively incorporates both TTS and STT tasks within a single framework. The use of discrete latent variables to suppress silence generation is a significant innovation, enhancing the model's performance over existing methods.
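The lower bound referred to above is not spelled out in this summary. For orientation, the generic evidence lower bound for a discrete latent sequence $z$ and a mel spectrogram $x$ has the form below. The symbols $q_\phi$ (encoder) and $p_\theta$ (decoder and prior) are notation assumed here, and the paper's actual factorization, including how text conditioning enters, may differ.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)
```

Under this reading, joint optimization means the encoder $q_\phi$ and the autoregressive model are trained against the same objective, rather than the encoder being trained first and then frozen.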
The experiments are comprehensive, utilizing the 960-hour subset of the LibriSpeech dataset for training and evaluation. The authors compare MELD against several baselines, including codec-based models and other mel-spectrogram-based approaches, demonstrating clear improvements in both TTS and STT tasks. The evaluation metrics include both subjective (MOS, speaker similarity) and objective (WER) assessments, providing a well-rounded view of the model's performance. The results indicate that MELD outperforms its competitors, particularly in reducing silence and improving word error rates.
The paper provides detailed implementation specifics, including model architecture, training configurations, and evaluation protocols. However, the authors acknowledge challenges in reproducing results from related work (e.g., MELLE), which may affect the perceived reliability of their comparisons. The use of specific datasets and training strategies is well-documented, but the lack of a public code repository or demo limits reproducibility.
The authors note several limitations, including the difficulty in making fair comparisons between codec-based and mel-spectrogram-based methods due to differences in representation mapping. Additionally, while the joint optimization framework is promising, the paper does not explore its application to other speech tasks beyond TTS and STT. The potential for overfitting or collapsing solutions in the discrete latent space is also mentioned, although not observed in their experiments.
The proposed model has significant implications for real-world applications in speech synthesis and recognition, particularly in enhancing the quality and efficiency of TTS systems. The ability to jointly model TTS and STT tasks could streamline workflows in various applications, such as virtual assistants and automated transcription services. However, ethical considerations regarding the misuse of speech generation technologies, such as voice cloning, must be addressed to ensure responsible use.
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, Wonkwang University
The main contribution of this paper is the introduction of a causality-inspired multimodal federated domain generalization framework for respiratory sound classification, which effectively mitigates stethoscope-induced biases and enhances model robustness across heterogeneous devices. The technical contributions are substantial, offering a new lens through which to view the challenges of audio classification in medical contexts, thereby advancing the field significantly.
The proposed methodology introduces a novel federated domain generalization framework specifically tailored for respiratory sound classification, addressing the critical issue of inter-stethoscope variability. The integration of a causality-inspired device style intervention network, counterfactual text augmentation, and gradient alignment represents a significant advancement in the field, as it not only tackles the entanglement of device style and disease content but also enhances the robustness of the model across heterogeneous devices. The approach is well-structured, leveraging causal inference principles to inform data augmentation strategies, which is a fresh perspective in the context of audio classification.
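The evaluation protocol implied by this setup, where each stethoscope type is held out in turn, can be sketched as follows. This is a minimal illustration and not the authors' code: the function name, the `(device_id, sample)` record format, and the one-client-per-device grouping are all assumptions made for the example.

```python
from collections import defaultdict


def leave_one_device_out(records):
    """Yield (held_out_device, train_clients, test_samples) for every device.

    records: iterable of (device_id, sample) pairs (assumed format).
    train_clients maps each remaining device to its samples, mimicking
    one federated client per device (an assumption, not from the paper).
    """
    by_device = defaultdict(list)
    for device, sample in records:
        by_device[device].append(sample)

    for held_out in sorted(by_device):
        # The held-out device never contributes to training.
        train_clients = {d: s for d, s in by_device.items() if d != held_out}
        yield held_out, train_clients, by_device[held_out]


# Toy records with made-up device names.
records = [("A", "a1"), ("B", "b1"), ("A", "a2"), ("C", "c1")]
folds = list(leave_one_device_out(records))
for held_out, train_clients, test_samples in folds:
    print(held_out, sorted(train_clients), test_samples)
```

The point of the split is that the test device is unseen by every client, so any gain has to come from device-invariant features rather than from memorizing a stethoscope's signature.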
The experimental setup is robust, utilizing two well-defined datasets (ICBHI and SPRSound) and employing leave-one-device-out validation to rigorously assess the model's performance. The results demonstrate that the proposed method consistently outperforms conventional data augmentation and federated learning baselines, indicating its effectiveness in improving cross-device generalization. The ablation studies further substantiate the contributions of each component of the framework, providing clear evidence for the importance of the causality-inspired interventions.
While the paper mentions that code will be released upon publication, the absence of a current project URL limits immediate reproducibility. The methodology is described in sufficient detail to allow for replication, but access to the code and datasets would be essential for full verification of results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of respiratory sound recordings across different clinical settings. Additionally, the paper acknowledges the need for future work to address privacy concerns and computational efficiency in federated learning settings, which are critical for real-world applications.
The framework has significant potential implications for telemedicine and automated pulmonary disease detection, particularly in enhancing the reliability of AI-driven diagnostics across various healthcare environments. By addressing device-induced biases, the work contributes to the broader goal of equitable healthcare access and improved patient outcomes.
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of CoRe-KD, a structured complete-view distillation framework that significantly enhances the robustness of conversational multimodal emotion recognition under incomplete observations. The methodology effectively addresses key challenges in the field, and the experimental results validate its effectiveness, marking a meaningful advancement in multimodal learning.
The proposed CoRe-KD framework innovatively addresses the challenges of multimodal emotion recognition (MER) under incomplete observations. It introduces two key components: Complete-view State Anchoring (CSA) and Nonverbal Conflict Exposure (NCE), which enhance the robustness of emotion recognition by aligning incomplete-view predictions with structured references from a complete-view teacher. The methodology is well-structured, leveraging knowledge distillation effectively while avoiding the pitfalls of input reconstruction, which is a common issue in existing methods. The use of Gaussian-inspired states for modality fusion is a notable technical contribution that adds precision to the alignment process.
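The anchoring idea can be illustrated with a generic distillation loss that pulls an incomplete-view student toward a complete-view teacher at both the prediction and the state level. This is a sketch of standard knowledge distillation under assumed choices (temperature-scaled KL for predictions, mean squared error for states, a single `state_weight`); it is not the CSA objective as defined in the paper, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def anchoring_loss(teacher_logits, student_logits, teacher_state, student_state,
                   temperature=2.0, state_weight=1.0):
    """Prediction-level KL plus state-level MSE (assumed combination)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    prediction_term = kl_divergence(p_teacher, p_student)
    state_term = mse(teacher_state, student_state)
    return prediction_term + state_weight * state_term


same = anchoring_loss([2.0, 0.0, -1.0], [2.0, 0.0, -1.0], [1.0, 0.0], [1.0, 0.0])
different = anchoring_loss([2.0, 0.0, -1.0], [0.0, 2.0, -1.0], [1.0, 0.0], [0.0, 1.0])
print(same, different)
```

Note that nothing here reconstructs the missing modality: the student is supervised only through the teacher's outputs and states, which is the contrast with reconstruction-based methods drawn above.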
The experiments are comprehensive, utilizing established datasets (IEMOCAP, MELD, and CMU-MOSEI) to validate the effectiveness of CoRe-KD under both fixed- and random-missing protocols. The results demonstrate consistent improvements in accuracy and F1 scores compared to various baselines, indicating the robustness of the proposed method. The inclusion of ablation studies further strengthens the findings by elucidating the contributions of each component within the framework.
The paper provides detailed implementation specifics, including training protocols, hyperparameters, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results.
One significant limitation is that CoRe-KD requires complete multimodal observations for training the teacher model, which may not be feasible in all real-world scenarios. Additionally, the NCE module relies on controlled conflict views that might not comprehensively cover all possible real-world misalignments or corruptions in multimodal data.
The advancements in robust conversational MER have implications for various applications, including human-computer interaction, sentiment analysis, and affective computing. By improving the reliability of emotion recognition systems in the presence of missing or unreliable modalities, this work could enhance user experience in applications such as virtual assistants, mental health monitoring, and interactive entertainment.
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality. This fosters the assumption that low-WER tokens inherently preserve the information necessary for intelligible acoustic synthesis. We argue this is fundamentally deceptive. While high-frequency tokens succeed in generation tasks due to implicit information leakage, isolating pure semantic information at ultra-low frame rates strips away the fine-grained articulation and micro-dynamics essential for ODE-based generation. Empirically validating this requires extreme compression without sacrificing WER -- a methodological bottleneck, as standard fixed-stride downsampling arbitrarily truncates phonetic boundaries. To overcome this, we develop a dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, achieving ultra-low frame rates with exceptionally low WER. Using these isolated "pure" semantic tokens, we expose the WER trap: when conditioning generative models -- even with oracle duration alignments -- the reconstructed speech suffers from severe articulation blur and is rendered acoustically unintelligible. Our findings demonstrate that semantic categorization rewarded by low WER is inherently orthogonal to the continuous phonetic trajectories required for synthesis, shattering the illusion of the unified token and advocating for explicitly decoupled speech representations.
Primary: The University of New South Wales
All Institutions: The University of New South Wales, Nanyang Technological University
The paper exposes a fundamental flaw in the assumption that low WER tokens can universally serve both speech understanding and generation. It rigorously demonstrates that while these tokens may excel in comprehension tasks, they fail to preserve the necessary micro-dynamics for intelligible speech synthesis, advocating for decoupled representations in future speech models.
The paper presents a novel dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, addressing the methodological bottleneck of fixed-stride downsampling that corrupts phonetic boundaries. This approach is innovative as it allows for extreme compression while maintaining low WER, enabling a rigorous evaluation of the unified token hypothesis through the Dual-Probing Protocol. The methodology is well-structured, leveraging existing frameworks while introducing significant improvements in tokenization for speech synthesis.
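The difference between fixed-stride downsampling and boundary-aligned compression can be seen on a toy sequence. The sketch below uses scalar "frames" and hand-specified segment boundaries purely for illustration; in the paper the boundaries are produced by the tokenizer itself, and both function names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def fixed_stride_pool(frames, stride):
    """Average every `stride` consecutive frames, regardless of content."""
    return [mean(frames[i:i + stride]) for i in range(0, len(frames), stride)]


def boundary_pool(frames, boundaries):
    """Average frames within each segment.

    boundaries: sorted segment start indices, beginning with 0
    (given by hand here; learned in a real dynamic tokenizer).
    """
    edges = list(boundaries) + [len(frames)]
    return [mean(frames[start:end]) for start, end in zip(edges[:-1], edges[1:])]


# Three underlying units of unequal length: 1 1 1 | 5 5 | 9 9 9 9
frames = [1.0, 1.0, 1.0, 5.0, 5.0, 9.0, 9.0, 9.0, 9.0]
fixed = fixed_stride_pool(frames, stride=3)
aligned = boundary_pool(frames, boundaries=[0, 3, 5])
print(fixed)
print(aligned)
```

Both methods emit three tokens, but the middle fixed-stride token averages across a unit boundary and blends two units, while the aligned version keeps one token per unit. That blending is the kind of truncation the authors need to avoid in order to reach very low frame rates without degrading WER.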
The experiments are comprehensive, utilizing large-scale multilingual datasets and employing a dual-probing protocol to assess both discriminative understanding and generative viability. The results demonstrate that while the dynamic tokens achieve high performance in understanding tasks, they fail in generating intelligible speech, effectively illustrating the WER trap. The evaluation metrics, including CER and AVQA accuracy, are appropriate and provide a clear picture of the model's performance.
The paper provides detailed architectural specifications, hyperparameter configurations, and training methodologies, which enhance reproducibility. However, the absence of a public code repository limits the ease with which others can replicate the results. The thoroughness of the experimental setup and the clear delineation of methods contribute positively to reproducibility.
The study acknowledges its limitations, particularly that the generative probe employs a single synthesis paradigm, which may not generalize across different architectures. Additionally, the focus on Mandarin as the sole language for evaluation may restrict the applicability of findings to other languages with different phonetic structures. The paper also notes that while it identifies a critical flaw in the unified token approach, it does not propose a concrete solution for decoupled representations.
The findings have significant implications for the development of speech language models, challenging the prevailing assumption that a single token can suffice for both understanding and generation. This work advocates for a separation of semantic and acoustic representations, which could lead to more effective and intelligible speech synthesis systems. The insights gained from this research could influence future designs in multimodal AI systems, particularly in improving the quality of synthesized speech.
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
Primary: Cochin University of Science and Technology (CUSAT)
All Institutions: Cochin University of Science and Technology (CUSAT)
The paper presents CAFNet, a novel architecture for audio deepfake detection that effectively addresses the challenges of ternary classification and temporal localization of half-truth audio. The methodology is sound, and the experimental results demonstrate significant advancements over existing models, particularly in a multilingual context.
The proposed CAFNet architecture is innovative in its approach to jointly address the challenges of ternary classification and temporal boundary localization for half-truth audio deepfake detection. The use of cross-attentive feature fusion and depthwise-separable convolutions enhances the model's ability to process multiple acoustic features effectively. The integration of BiLSTM for boundary prediction is a well-justified choice, given the temporal nature of the task. However, the paper could benefit from a more detailed discussion on the design choices for the architecture and the rationale behind the specific feature sets used.
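For readers unfamiliar with how boundary localisation is scored, the sketch below computes a mean and median absolute boundary error over a few utterances. It assumes the per-utterance error is the average of the start-time and end-time deviations, which is one plausible convention and not necessarily the one used in the paper; the numbers are invented toy values.

```python
from statistics import mean, median


def boundary_errors(predicted, reference):
    """Per-utterance absolute boundary error in seconds.

    predicted, reference: lists of (start_s, end_s) pairs.
    The error averages the start and end deviations (assumed convention).
    """
    errors = []
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        errors.append((abs(p_start - r_start) + abs(p_end - r_end)) / 2)
    return errors


# Toy predictions and ground-truth spans (made-up values).
predicted = [(1.0, 2.0), (0.5, 1.5), (3.0, 4.0)]
reference = [(1.0, 2.0), (0.75, 1.75), (3.0, 5.0)]

errors = boundary_errors(predicted, reference)
print(mean(errors), median(errors))
```

Reporting both statistics is informative because a mean noticeably above the median, as in the paper's 0.075 s versus 0.052 s, indicates that a minority of utterances have comparatively large boundary errors.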
The experiments are robust, utilizing a comprehensive dataset (MLADDC) that covers a diverse range of languages and audio conditions. The performance metrics reported, including accuracy, AUC, and MAE for boundary localization, are convincing and demonstrate the effectiveness of CAFNet compared to existing models. The cross-dataset generalization study adds significant value, revealing critical insights into the limitations of current training paradigms in deepfake detection.
The authors provide sufficient details regarding the implementation, including hyperparameters, training protocols, and the architecture of CAFNet. The availability of code and trained models on GitHub enhances reproducibility. However, the paper lacks detailed information on the specific datasets used for training and evaluation, which could hinder full reproducibility.
One notable limitation is the confusion between the half-truth and real classes: a significant number of half-truth samples are misclassified as real. This indicates that while the model detects manipulated audio well overall, it struggles to separate half-truth samples from genuine speech, a distinction that is crucial for practical applications. Additionally, the study highlights the challenge of catastrophic forgetting during domain adaptation, suggesting that the current approach may not be robust across different datasets.
The findings of this research have significant implications for audio forensics and the detection of manipulated media, especially in contexts where misinformation can have serious consequences. The ability to localize manipulations within audio clips enhances the forensic value of detection systems, making them more actionable for users. As deepfake technology continues to evolve, advancements in detection methods like CAFNet will be critical in maintaining trust in audio communications.
Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.
Primary: PEE/COPPE, UFRJ
All Institutions: PEE/COPPE, UFRJ, Carlos Chagas Filho Foundation for Research Support in the State of Rio de Janeiro, National Council for Scientific and Technological Development, CAPES
This paper presents FiPA-SR, a GAN-based model for audio bandwidth extension, demonstrating significant improvements in reconstruction quality and computational efficiency. The innovative use of FiLM layers to adaptively handle multiple bandwidths marks a notable advancement in the field of audio super-resolution.
The methodology is robust, leveraging a GAN-based architecture with FiLM layers to adaptively handle different bandwidths. The use of perceptual metrics and a well-defined training procedure enhances the model's ability to generalize across various input configurations. The innovative approach of combining upsampling with conditional modulation through FiLM layers is a significant advancement over previous models.
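FiLM (feature-wise linear modulation) applies a per-channel scale and shift that depend on a conditioning signal, here the input bandwidth. The sketch below shows only that operation; the lookup table `FILM_PARAMS` and its values are hypothetical stand-ins for what would be a learned conditioning network in the actual model, and the channel count is arbitrary.

```python
# Hypothetical conditioning table: input sampling rate (Hz) -> (gamma, beta),
# one value per channel. A trained model would predict these with a small
# network instead of a fixed lookup.
FILM_PARAMS = {
    8_000: ([1.5, 0.5], [0.0, 0.1]),
    20_000: ([1.2, 0.8], [0.0, 0.0]),
    32_000: ([1.0, 1.0], [0.0, 0.0]),
}


def film(features, gamma, beta):
    """Apply y = gamma * x + beta independently to each channel.

    features: list of channels, each a list of time steps.
    gamma, beta: one scale and one shift per channel.
    """
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


def modulate_for_rate(features, sample_rate_hz):
    """Look up the modulation for a given input rate and apply it."""
    gamma, beta = FILM_PARAMS[sample_rate_hz]
    return film(features, gamma, beta)


features = [[1.0, 2.0], [3.0, 4.0]]
print(film(features, gamma=[2.0, 0.5], beta=[0.0, 1.0]))
print(modulate_for_rate(features, 32_000))
```

Because only the scale and shift change with the input rate, the same convolutional weights can serve all three bandwidths, which is what lets a single model replace one network per rate.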
The experiments are thorough, utilizing the MUSDB dataset and comparing against state-of-the-art models. The use of objective metrics like Log-Spectral Distance and ViSQOL provides a solid foundation for evaluating performance. However, the paper could benefit from more qualitative assessments, such as user studies or listening tests, to complement the objective metrics.
The paper provides sufficient details regarding the architecture, training setup, and evaluation metrics, which should enable other researchers to replicate the results. However, the absence of a publicly available code repository limits accessibility.
The study is limited to specific bandwidth configurations and does not explore the model's performance across a broader range of frequencies. Additionally, while the results are promising, the reliance on objective metrics alone may not fully capture perceptual audio quality.
The proposed model has significant implications for audio processing applications, particularly in telecommunications and music production, where bandwidth limitations are prevalent. The ability to reconstruct high-frequency content efficiently could enhance audio quality in various consumer and professional settings.
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaomi Inc.
Dasheng AudioGen represents a substantial advancement in unified audio generation, combining multiple audio types into coherent scenes from textual descriptions. The innovative methodology and comprehensive evaluation contribute significantly to the field, setting a new standard for future research in audio generation.
The paper introduces a novel framework, Dasheng AudioGen, which effectively integrates multiple audio generation tasks into a single model using structured multi-view captions and a unified semantic-acoustic representation. This approach addresses the fragmentation in audio generation by allowing for coherent mixed-audio scene generation from text, which is a significant advancement in the field. The methodology is well-structured, leveraging a flow-matching DiT architecture and a unique conditioning framework that enhances control over audio components. The use of high-dimensional latent spaces for audio representation is particularly innovative, as it allows for better modeling of overlapping audio elements.
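The flow-matching objective behind such a model can be sketched in a few lines. The version below uses the common linear interpolation path between a noise sample and a data latent, with the velocity target `data - noise`; this specific path, the function names, and the omission of text conditioning are assumptions for illustration, since the summary does not state the paper's exact formulation.

```python
def interpolate(noise, data, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Constant velocity along the straight path."""
    return [d - n for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, noise, data, t):
    """Mean squared error between predicted and target velocity at time t.

    predict_velocity: callable (x_t, t) -> velocity estimate; stands in
    for the DiT, which would also receive the caption conditioning.
    """
    x_t = interpolate(noise, data, t)
    predicted = predict_velocity(x_t, t)
    target = target_velocity(noise, data)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


noise = [0.0, 1.0]
data = [2.0, 3.0]

# An oracle that already knows the target incurs zero loss.
oracle = lambda x_t, t: target_velocity(noise, data)
# A model that predicts no motion is penalized.
still = lambda x_t, t: [0.0, 0.0]

print(interpolate(noise, data, 0.5))
print(flow_matching_loss(oracle, noise, data, 0.5))
print(flow_matching_loss(still, noise, data, 0.5))
```

In the mixed-audio setting the latent would be the high-dimensional semantic-acoustic representation rather than a two-element list, but the regression target has the same shape as the latent, which is why the dimensionality of that space matters for capacity.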
The experiments conducted are comprehensive, utilizing a large-scale dataset (ACAVCaps) and a robust evaluation pipeline that includes both objective and subjective metrics. The results demonstrate that Dasheng AudioGen outperforms existing specialized models in mixed-audio generation while maintaining competitive performance in single-type tasks. The introduction of the MECAT benchmark for mixed-audio evaluation is a valuable contribution, providing a new standard for assessing model performance in this area.
The paper mentions limitations in reproducibility due to reliance on a private dataset, which may hinder others from replicating the results. However, the detailed methodology and experimental setup provide a clear path for future researchers to build upon this work. The authors should consider releasing their dataset or providing a public version to enhance reproducibility.
Key limitations include the model's restriction to generating 10-second audio clips and the lack of advanced speaker control in TTS applications. Additionally, the performance in terms of speech intelligibility lags behind specialized TTS systems, indicating room for improvement. The reliance on a private dataset also poses challenges for reproducibility and broader accessibility.
The implications of this work are significant, as it paves the way for more integrated audio generation systems that can produce realistic and contextually coherent audio scenes. This could have applications in various fields, including film production, gaming, virtual reality, and assistive technologies. The ability to generate complex audio scenes from simple text prompts could also enhance user experiences in interactive media.
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates. End-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.
Primary: Future Living Lab, Alibaba
All Institutions: Future Living Lab, Alibaba
The paper presents VoiceGiraffe, a pioneering benchmark for evaluating hour-scale audio understanding in LALMs, addressing critical gaps in existing evaluation protocols. The comprehensive methodology and experimental results underscore the pressing need for advancements in long-context audio processing and reasoning, positioning this work as a significant contribution to the field.
The paper introduces a novel benchmark, VoiceGiraffe, designed specifically for evaluating long-context audio-language models (LALMs) in realistic scenarios. The methodology is robust, employing a dual-level taxonomy for question generation that captures both single-hop and multi-hop reasoning tasks. The data curation process is thorough, involving a multi-stage pipeline that includes voice activity detection, hierarchical captioning, and collaborative verification by human annotators. This rigorous approach ensures high-quality data for evaluation, addressing the limitations of existing benchmarks that rely on short clips or concatenated segments.
The experimental evaluation is comprehensive, benchmarking a wide range of LALMs against human performance across various tasks and inference paradigms. The results reveal significant challenges in long-context understanding, with only one proprietary model surpassing human performance. The findings highlight the limitations of current models in memory persistence and reasoning capabilities, providing valuable insights into areas for future research. The use of multiple inference settings (E2E, cascaded caption aggregation, and reasoning-enhanced cascading) allows for a nuanced understanding of model performance.
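To make the cascaded caption aggregation setting concrete, the sketch below splits a long recording into fixed windows, captions each window, and answers the question from the joined timestamped captions only. The window length, the caption and answer functions, and every name here are hypothetical stand-ins; the paper's actual segmentation and prompting are not described in this review.

```python
def split_into_windows(duration_s, window_s):
    """Return (start, end) pairs in seconds that cover [0, duration_s]."""
    windows = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        windows.append((start, end))
        start = end
    return windows


def format_timestamp(seconds):
    """Render seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


def cascaded_caption_aggregation(duration_s, window_s, caption_fn, answer_fn, question):
    """Caption each window, join the timestamped captions, then answer from text."""
    lines = []
    for start, end in split_into_windows(duration_s, window_s):
        caption = caption_fn(start, end)  # stand-in for a small audio captioner
        lines.append(f"[{format_timestamp(start)}-{format_timestamp(end)}] {caption}")
    context = "\n".join(lines)
    return answer_fn(context, question), context


# Hypothetical stand-ins for the captioner and the text-only answering model.
def toy_caption(start, end):
    if start == 0:
        return "guest introduces the topic"
    return "host and guest discuss details"


def toy_answer(context, question):
    first_line = context.splitlines()[0]
    return first_line.split("] ", 1)[1]


answer, context = cascaded_caption_aggregation(
    3600.0, 600.0, toy_caption, toy_answer, "What happens first?"
)
```

The design trade-off the findings point to is visible here: the answering model never hears the audio, so anything the captioner omits is lost, but each captioning call stays within a short context.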
While the paper outlines a detailed methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the study or build upon the findings.
The primary limitations include the lack of a publicly available dataset or benchmark for other researchers to use, which could hinder wider adoption and validation of the proposed methods. Additionally, the paper acknowledges that even human annotators found the tasks challenging, indicating that the benchmark may be too difficult for current models. There is also a potential bias in language performance, as the models exhibited varying capabilities across English and Chinese inputs.
The introduction of VoiceGiraffe has the potential to significantly advance the field of audio-language understanding by providing a rigorous evaluation framework that addresses real-world challenges. This benchmark can guide future research towards developing models with improved long-context reasoning and memory capabilities, which are essential for applications in audio assistants, automated transcription, and multimedia content analysis.
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, failing to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
Primary: Zhejiang University
All Institutions: Zhejiang University, Bytedance
The paper presents a comprehensive benchmarking framework for long-form speech generation, addressing critical gaps in existing evaluation methodologies. Its innovative approach, rigorous methodology, and extensive experimental validation contribute significantly to the advancement of the field, providing a valuable resource for future research.
The paper introduces SwanBench-Speech, a comprehensive benchmark for evaluating long-form speech generation models. It effectively addresses the limitations of existing evaluation methods by proposing a multi-dimensional framework that includes seven disentangled metrics across three core challenges: acoustics, semantics, and expressiveness. The methodology is well-structured, with a clear focus on real-world applications and the incorporation of human-aligned metrics, which enhances the relevance of the evaluation. The use of diverse scenarios and a rigorous data collection process further strengthens the methodology.
The experiments are extensive, involving over 20 models evaluated across 1,101 samples in 17 scenarios. The results provide valuable insights into the performance gaps of current models compared to human recordings, particularly in expressiveness and consistency. The use of both objective metrics and human evaluations adds robustness to the findings. However, while the experiments are thorough, the paper could benefit from more detailed statistical analyses to quantify the significance of the results.
The paper provides a clear description of the data collection and evaluation processes, along with the metrics used. The open-sourcing of the benchmark and the availability of evaluation scripts enhance reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other systems.
The study acknowledges limitations, including a narrow linguistic scope (only Chinese and English) and a lack of robustness in assessing emotional and stylistic transitions. Additionally, the dataset's speaker diversity is limited, which may introduce bias in evaluations. Future work should address these gaps to enhance the benchmark's applicability.
This work has significant implications for the field of speech synthesis, particularly in enhancing the evaluation of long-form speech generation systems. By establishing a standardized benchmark, it paves the way for future research and development in this area, potentially leading to more immersive and expressive speech synthesis applications. The focus on real-world scenarios and human-aligned metrics also suggests potential applications in education, entertainment, and customer service.
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. Meanwhile, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, The Chinese University of Hong Kong
This paper presents EigeNet, a novel geometry-informed multi-modal learning framework that significantly advances few-shot novel view RIR prediction through innovative architectural designs and empirical validation. The comprehensive approach to integrating geometric features with acoustic modeling represents a meaningful contribution to the field of spatial audio rendering.
The proposed methodology introduces a Cross-view Alternate-attention Transformer (CVAT) that effectively captures both local intra-view and global cross-view relationships, addressing the challenges of few-shot Room Impulse Response (RIR) prediction. The integration of a geometry-informed modulation block enhances the model's ability to leverage geometric features, which is a significant advancement over existing methods. The auxiliary loss for multi-task learning further strengthens the model's performance by promoting generalizability across different architectures.
The experiments are robust, utilizing both simulated and real-world datasets, and demonstrate state-of-the-art performance across various metrics. The ablation studies provide clear evidence of the contributions of each component, validating the effectiveness of the proposed architecture. The quantitative results indicate substantial improvements over baseline methods, particularly in sparse reference scenarios.
The paper provides sufficient implementation details, including architecture specifications and training configurations, which should facilitate reproducibility. The availability of code and checkpoints on GitHub enhances this aspect, although specific hyperparameters and training procedures could be elaborated further for clarity.
While the model shows impressive performance, it may still be limited by the quality of the input data and the assumptions made regarding room geometry. The reliance on geometric features may not generalize well to all acoustic environments, particularly those with complex or unconventional geometries.
The advancements in few-shot learning for RIR prediction have significant implications for immersive audio applications in AR/VR and spatial audio rendering, potentially enhancing user experiences in virtual environments. The methodology could inspire further research into integrating geometric and acoustic modeling in other domains.
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, ModelBest Inc.
The paper presents LoSATok, a unified low-dimensional tokenizer that enhances audio understanding and generation by effectively compressing high-dimensional semantic representations while preserving essential acoustic details. The methodology and results demonstrate its potential to significantly impact the field of audio processing and generation.
The paper introduces a novel low-dimensional audio tokenizer, LoSATok, which effectively compresses high-dimensional semantic representations while maintaining semantic richness and acoustic details. The methodology includes the Semantic Bottleneck (SemBo) for dimensionality reduction, and a dual-level semantic supervision strategy that enhances the learning process. The proposed time-relation loss is a significant innovation that ensures temporal consistency in the representations. Overall, the methodology is well-structured and addresses a critical gap in current audio modeling approaches.
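The compression step can be illustrated with a small sketch of a 1280-to-128 linear bottleneck and a temporal-consistency penalty. The review does not define the time-relation loss, so the version below, which compares frame-to-frame cosine-similarity matrices before and after compression, is only one plausible reading and should be treated as an assumption, as should the random projection weights and all function names.

```python
import math
import random


def project(frames, weight):
    """Linear bottleneck: map each d_in-dim frame to d_out dims.

    `weight` is a list of d_out rows, each of length d_in.
    """
    return [[sum(x * w for x, w in zip(frame, row)) for row in weight] for frame in frames]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def relation_matrix(frames):
    """Pairwise cosine similarity between all time frames."""
    return [[cosine(a, b) for b in frames] for a in frames]


def time_relation_loss(high, low):
    """Assumed form: mean squared gap between the two temporal relation matrices."""
    rel_high, rel_low = relation_matrix(high), relation_matrix(low)
    n = len(high)
    total = sum((rel_high[i][j] - rel_low[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


rng = random.Random(0)
D_IN, D_OUT, NUM_FRAMES = 1280, 128, 4  # dimensions from the abstract; frame count is arbitrary
high = [[rng.gauss(0.0, 1.0) for _ in range(D_IN)] for _ in range(NUM_FRAMES)]
weight = [[rng.gauss(0.0, 1.0) / math.sqrt(D_IN) for _ in range(D_IN)] for _ in range(D_OUT)]
low = project(high, weight)
loss = time_relation_loss(high, low)
```

A penalty of this kind is independent of the feature dimension, which is why it can be applied across the 1280-dimensional input and the 128-dimensional output without an extra projection.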
The experiments are comprehensive, covering various audio tasks across speech, music, and general audio domains. The results demonstrate that LoSATok achieves competitive performance in understanding tasks and outperforms existing models in generation tasks, particularly in terms of efficiency and quality. The use of objective metrics (e.g., FAD, CLAP) alongside subjective evaluations strengthens the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art methods in a broader range of tasks.
The paper provides a GitHub repository with the code, which is essential for reproducibility. However, specific implementation details, such as hyperparameter choices and training setups, could be more clearly outlined to facilitate replication by other researchers.
The authors acknowledge that LoSATok sacrifices some reconstruction fidelity for improved semantic organization and generative performance. Additionally, while it shows promise in understanding tasks, it does not fully reach the performance of high-dimensional semantic representations. Future work is needed to optimize the balance between semantics, acoustics, and generation.
The proposed tokenizer has significant implications for audio understanding and generation, potentially enhancing applications in speech recognition, music generation, and audio synthesis. By enabling more efficient models, it could lead to advancements in real-time audio processing and interactive applications. The research also opens avenues for further exploration of low-dimensional representations in multimodal contexts.
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
Primary: Renmin University of China
All Institutions: Renmin University of China
The main contribution of this paper is the introduction of PlanAudio, a unified framework for generating complex audio compositions from free-form text prompts, which significantly advances the state-of-the-art in audio synthesis by integrating semantic understanding with acoustic generation. The methodology is innovative, the experiments are rigorous, and the potential applications are broad, marking a meaningful contribution to the field of machine learning and audio generation.
The proposed methodology, PlanAudio, introduces a novel framework for generating unified audio from free-form text prompts, leveraging an autoregressive LLM architecture and a semantic latent Chain-of-Thought (CoT) mechanism. This approach is innovative as it avoids traditional text encoders and explicit text rewriting, which are common in existing models. The integration of semantic planning in the latent space before audio synthesis is a significant advancement, allowing for better alignment between high-level semantics and low-level audio generation. The methodology is well-structured, with clear phases for semantic planning and acoustic generation, which enhances the model's ability to produce coherent audio outputs.
The experiments are comprehensive, evaluating PlanAudio across multiple scenarios (sound, speech, and composite) using both objective metrics (FAD, KL divergence, WER) and subjective assessments (human ratings on acoustic quality, temporal correctness, etc.). The results demonstrate that PlanAudio outperforms existing pipeline and unified models, showcasing its versatility and effectiveness. The creation of PlanAudio-Bench as a specialized benchmark for composite audio scenarios adds value to the evaluation process, providing a structured way to assess the model's performance in real-world applications.
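Of the objective metrics listed, WER is the simplest to state exactly: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows that standard definition; the lowercasing and whitespace tokenization are assumptions, and the review does not say which ASR system or text normalization the authors used to obtain hypotheses.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


perfect = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
two_errors = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

In the second call one word is substituted and one is dropped, so the rate is two errors over six reference words. Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.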
The paper provides detailed implementation details, including the datasets used, training procedures, and evaluation metrics. However, the lack of a publicly available demo or project URL limits the reproducibility of the results. While the methodology is clearly described, access to the code and trained models would enhance the ability of other researchers to replicate the findings.
One limitation is the potential for the model to struggle with highly complex prompts that require intricate audio interactions, as indicated by the slight performance drop in speech generation compared to specialized models. Additionally, the reliance on the quality of the training data and the inherent challenges in synthesizing audio from free-form text prompts may introduce variability in performance across different contexts.
The implications of this research are significant for various applications, including content creation, game development, and assistive technologies for individuals with speech impairments. By enabling the generation of coherent audio from natural language prompts, this work could facilitate new forms of human-computer interaction and enhance multimedia experiences.
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
Primary: Daydream
All Institutions: Daydream
The main contribution of this paper is the introduction of DEMON, a real-time diffusion engine that allows for interactive control of audio generation, significantly enhancing the responsiveness and flexibility of music production tools. The technical contributions are robust, addressing key challenges in real-time audio processing and demonstrating a clear advancement in the field of machine learning for audio.
The methodology presented in the paper is innovative, leveraging a real-time diffusion engine that transforms the denoising process into a playable musical instrument. The authors introduce several mechanisms that enhance the responsiveness and control of audio generation, including per-slot heterogeneous denoise scheduling, shared mutable per-step state, per-frame source blending, and a windowed VAE decode. These contributions are well-structured and address significant challenges in real-time audio generation, particularly in maintaining high throughput while allowing for fine-grained control over audio parameters.
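The first mechanism, per-slot heterogeneous denoise scheduling, can be pictured with a toy ring in which every in-flight request carries its own timestep schedule. This is a simplified sketch under stated assumptions, not DEMON's engine: the linear schedule formula, the one-admission-per-tick policy, and the `Slot` and `Ring` classes are all hypothetical, and the actual solver step is replaced by recording the timestep.

```python
from collections import deque


def make_schedule(denoise, num_steps):
    """Hypothetical schedule: num_steps timesteps descending from `denoise` toward 0."""
    return [denoise * (num_steps - k) / num_steps for k in range(num_steps)]


class Slot:
    def __init__(self, request_id, schedule):
        self.request_id = request_id
        self.schedule = deque(schedule)  # owned by this slot, never rebuilt globally
        self.trace = []

    def step(self):
        """Consume one timestep; return True when the slot has finished."""
        t = self.schedule.popleft()
        self.trace.append(t)  # a real engine would run one solver step at t here
        return not self.schedule


class Ring:
    def __init__(self, num_steps):
        self.num_steps = num_steps
        self.slots = deque()
        self.next_id = 0

    def tick(self, denoise):
        """Admit one request at the current slider value, then advance every slot."""
        self.slots.append(Slot(self.next_id, make_schedule(denoise, self.num_steps)))
        self.next_id += 1
        finished = [slot for slot in list(self.slots) if slot.step()]
        for slot in finished:
            self.slots.remove(slot)
        return finished


# The slider moves from 0.8 to 0.4 at the third tick; in-flight slots are not discarded.
ring = Ring(num_steps=4)
slider = [0.8, 0.8, 0.4, 0.4, 0.4, 0.4]
done = []
for value in slider:
    done.extend(ring.tick(value))
```

Requests admitted before the slider moved complete on their original schedule, while the request admitted at the third tick starts from the new value, so nothing in the queue has to be rebuilt. A global-schedule design would instead have to flush the queue whenever the slider changed.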
The experimental evaluation is thorough, with a focus on latency, output quality, and responsiveness of parameter changes. The authors provide empirical results that substantiate their claims regarding the effectiveness of their proposed mechanisms, including quantitative comparisons with existing systems. The use of various audio sources and the detailed reporting of metrics such as CLAP and SNR demonstrate a rigorous approach to validating the system's performance.
The paper includes sufficient detail regarding the architecture and implementation of the DEMON system, including the use of TensorRT for acceleration and the specific configurations used for experiments. However, the absence of a detailed description of the datasets and the evaluation metrics used may pose challenges for complete reproducibility. The provided URLs for the project and demo enhance accessibility to the code and results.
One limitation of the paper is the reliance on a specific hardware setup (NVIDIA RTX 5090) for performance metrics, which may not generalize across different systems. Additionally, while the authors address the latency of their system, the practical implications of the onset latency in live performance contexts could be further explored. The paper does not discuss potential limitations in the quality of audio generated under varying conditions or the scalability of the system.
The work has significant implications for the fields of music generation and real-time audio processing, particularly for live performances. By enabling musicians to manipulate denoising parameters in real-time, DEMON opens up new avenues for creative expression and interaction with AI-generated music. The integration of machine learning into musical instruments could lead to innovative performance practices and new genres of music.
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the Audio-Mind framework, which enhances audio understanding through dynamic evidence acquisition and improved reasoning processes. This work is significant as it addresses key challenges in the field and proposes a method that could lead to more reliable audio question answering systems.
The proposed Audio-Mind framework introduces a novel approach to audio understanding by integrating a strong frontend with planner-guided tool use. This method allows for dynamic evidence acquisition, which is a significant improvement over existing audio-agent baselines. The framework's ability to preserve the frontend's judgment while addressing evidence gaps is a noteworthy contribution to the field, as it enhances the overall reasoning process in audio question answering.
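The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the confidence threshold, the tool budget, and the names `Trace` and `answer_with_conditional_evidence` are hypothetical, and the paper's planner is reduced here to a simple "is the frontend confident enough?" check.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """Auditable record of how an answer was produced (hypothetical structure)."""
    answer: str
    confidence: float
    evidence: List[str] = field(default_factory=list)


def answer_with_conditional_evidence(
    question: str,
    frontend: Callable[[str], Tuple[str, float]],
    tools: List[Callable[[str], str]],
    threshold: float = 0.8,   # assumed sufficiency criterion, not from the paper
    max_tool_calls: int = 2,  # assumed bound on external evidence
) -> Trace:
    answer, confidence = frontend(question)
    trace = Trace(answer, confidence)
    if confidence >= threshold:
        # Initial evidence is judged sufficient: keep the frontend's answer.
        return trace
    # Otherwise acquire a bounded amount of external evidence.
    for tool in tools[:max_tool_calls]:
        trace.evidence.append(tool(question))
    # Re-query the frontend with the collected evidence appended.
    augmented = question + " | evidence: " + "; ".join(trace.evidence)
    trace.answer, trace.confidence = frontend(augmented)
    return trace
```

The returned `Trace` keeps the tool outputs next to the final answer, which is one simple way to make the reasoning path inspectable after the fact.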
The experiments conducted on MMAR and MSU-Bench demonstrate the effectiveness of Audio-Mind, achieving impressive accuracy scores of 80.4% and 82.8%, respectively. The matched-backbone comparison further validates the framework's design by highlighting the orchestration bottleneck in agentic decomposition under strong audio frontends. However, the paper lacks detailed descriptions of the datasets and evaluation metrics used, which could enhance the transparency and reproducibility of the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Without access to the framework or clear guidelines on how to replicate the experiments, it is challenging for other researchers to validate the findings.
One limitation is the potential complexity introduced by the planner-guided tool use, which may not generalize well to all audio understanding tasks. Additionally, the framework's reliance on strong frontends could limit its applicability in scenarios where such models are not available.
The Audio-Mind framework has the potential to significantly impact the field of audio understanding and question answering by providing a more reliable and auditable reasoning process. Its contributions could lead to advancements in audio-QA annotation and error analysis, making it a valuable tool for researchers and practitioners in the domain.
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neural speech codec that operates entirely in the modified discrete cosine transform (MDCT) domain. CFMDCTCodec integrates a lightweight encoder-quantizer-decoder-style MDCT-spectral codec with a noise-prior-aware, conditional-flow-matching (CFM)-based MDCT-spectral enhancer. Within this framework, the codec serves as a base module that compactly discretizes the MDCT spectrum extracted from speech and produces an initial coarse reconstruction, while the enhancer further restores fine-grained spectral details. The enhancer improves the decoded MDCT spectrum by integrating a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver, under the guidance of an MDCT-derived magnitude-adaptive noise prior, aiming to emphasize perceptually significant high-energy regions while stabilizing low-energy and silent regions. Finally, the enhanced MDCT spectrum is reconstructed into the decoded speech using the inverse MDCT. When optimizing CFMDCTCodec, we adopt a unified non-adversarial training strategy that jointly combines reconstruction, quantization and CFM objectives. Both objective and subjective evaluations show that CFMDCTCodec outperforms competitive baselines in low-bitrate regimes, e.g., 0.65 kbps, while approaching the perceptual quality of large-scale codecs with significantly fewer parameters and computations.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Tsinghua University
The main contribution of this paper is the development of CFMDCTCodec, a low-bitrate neural speech codec that effectively enhances spectral quality through a novel conditional flow matching approach, demonstrating significant improvements in speech quality while maintaining low computational complexity. This work represents a meaningful advancement in the field of speech coding, particularly for applications requiring efficient bandwidth usage without compromising audio fidelity.
The proposed CFMDCTCodec introduces a novel architecture for low-bitrate speech coding that operates entirely in the MDCT domain, integrating a single-codebook quantization strategy with a noise-prior-aware conditional flow matching (CFM) enhancement mechanism. This approach effectively addresses the limitations of existing codecs by enhancing the spectral quality of decoded speech without increasing bitrate, utilizing a joint training strategy that simplifies the learning process. The methodology is well-structured, with clear descriptions of the encoder, decoder, and enhancer components, and the use of ordinary differential equations (ODE) for state evolution is particularly innovative.
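To make the enhancer's ODE step concrete, the sketch below integrates a velocity field from a noisy start point to an enhanced spectrum with fixed-step Euler. It is a toy under stated assumptions: `toy_velocity` stands in for the learned conditional velocity network, the form of the magnitude-adaptive prior (noise standard deviation proportional to coefficient magnitude, with a small floor) is a guess at one plausible design, and the paper may use a different solver.

```python
import random


def noise_prior_sample(coarse, scale=0.1, floor=1e-3, seed=0):
    """Draw a start point around the coarse MDCT coefficients.

    Assumed prior: noise std grows with |coefficient| and is floored in
    quiet bins, so silent regions start close to the coarse estimate.
    """
    rng = random.Random(seed)
    return [c + rng.gauss(0.0, scale * abs(c) + floor) for c in coarse]


def euler_solve(x0, velocity, steps=8):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target):
    """Stand-in for the learned network: points straight at a known target."""
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


coarse = [0.7, 0.0, 0.0, 1.0]    # hypothetical decoded coefficients
clean = [0.8, -0.05, 0.0, 1.2]   # hypothetical ground-truth coefficients
enhanced = euler_solve(noise_prior_sample(coarse), toy_velocity(clean))
```

In a real system the velocity network never sees the clean target; it is conditioned on the coarse reconstruction, and the number of Euler steps trades quality against compute.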
The experimental setup is robust, utilizing two different speech corpora and multiple bitrate settings to evaluate the codec's performance. The paper provides both objective and subjective evaluation metrics, including MUSHRA tests and various objective measures (STOI, SI-SDR, etc.), which demonstrate the codec's superiority over competitive baselines. The results indicate significant improvements in speech quality at low bitrates, validating the effectiveness of the proposed enhancements.
The paper includes detailed descriptions of the experimental setup, including hyperparameters, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers.
One limitation is the reliance on a single-codebook quantization strategy, which may not capture the full diversity of speech signals as effectively as multi-codebook approaches. Additionally, while the results are promising, further testing across a wider range of speech datasets and real-world scenarios would strengthen the findings.
The CFMDCTCodec has significant potential applications in bandwidth-constrained environments such as satellite communications, teleconferencing, and mobile applications, where high-quality speech transmission is critical. Its lightweight design and efficient processing could facilitate broader adoption in various speech processing applications, contributing to advancements in telecommunications and accessibility technologies.
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
Primary: CUHK MMLab
All Institutions: CUHK MMLab, SJTU, NTU, McMaster, CityUHK, JUFE
The paper presents OmniInteract, a benchmark for evaluating omnimodal large language models in real-time audio-visual interactions, significantly advancing the assessment of AI capabilities in dynamic environments. The innovative methodology and comprehensive experimental evaluations highlight critical gaps in current models, paving the way for future research and development in this area.
The methodology introduces a novel interaction slot formulation that captures real-time, multimodal interactions in a continuous audio-visual stream. This approach is innovative as it shifts the evaluation paradigm from static question-answer pairs to dynamic, temporally grounded interactions, allowing for a more realistic assessment of model capabilities in real-time settings. The proposed metrics (IA-QTF1, IDS, NCCS) are well-defined and tailored to the unique challenges of streaming interactions, effectively measuring not just correctness but also timing and context management.
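The slot formulation lends itself to a small data-structure sketch. The field names below are hypothetical, and `slot_f1` is a generic timing-aware F1 used only for illustration; it is not the paper's IA-QTF1, whose exact definition is not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ResponseSlot:
    """One temporally grounded slot: trigger, response window, target answer."""
    trigger_time: float   # seconds into the stream
    window_start: float
    window_end: float
    target_answer: str


def is_timely(slot: ResponseSlot, response_time: float) -> bool:
    """A response is timely if it falls inside the slot's response window."""
    return slot.window_start <= response_time <= slot.window_end


def slot_f1(slots: List[ResponseSlot], responses: List[Tuple[float, str]]) -> float:
    """Generic F1 where a hit must be both timely and an exact-match answer."""
    matched = set()
    hits = 0
    for time, answer in responses:
        for i, slot in enumerate(slots):
            if i not in matched and is_timely(slot, time) and answer == slot.target_answer:
                matched.add(i)
                hits += 1
                break
    precision = hits / len(responses) if responses else 0.0
    recall = hits / len(slots) if slots else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this toy scoring, a correct answer delivered after the window closes counts against both precision and recall, which captures the basic idea that timing matters as much as content.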
The experiments are comprehensive, evaluating multiple state-of-the-art omnimodal models under the new benchmark. The results reveal significant gaps in current models' abilities to handle real-time interactions, particularly in continuous task monitoring and nested query scenarios. The use of a diverse dataset of 250 videos with 1,430 response slots provides a solid foundation for the evaluations, although the performance scores indicate that there is considerable room for improvement in the models tested.
The paper mentions that the code and datasets will be made publicly accessible, which is crucial for reproducibility. However, details on the exact implementation of the models tested and the specific evaluation protocols could be elaborated upon to enhance reproducibility further.
The paper acknowledges limitations such as the narrow focus on specific interaction types and the reliance on synthesized speech for the 1QnA split. Additionally, the benchmark currently covers only Chinese and English scenarios, which may limit its applicability across different languages and cultures. The analysis is also limited to a small number of models, which may not represent the full landscape of omnimodal systems.
The introduction of OmniInteract has the potential to significantly advance the field of real-time human-AI interaction by providing a standardized benchmark for evaluating omnimodal models. This can lead to improved AI assistants that are more capable of understanding and responding to user queries in real-time, enhancing applications in accessibility, education, and everyday tasks. The focus on real-time interaction also raises important considerations regarding privacy and the ethical deployment of always-on systems.
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
Primary: Amap, Alibaba Group
All Institutions: Amap, Alibaba Group, The Chinese University of Hong Kong, Shenzhen
The main contribution of this paper is the introduction of PilotTTS, a lightweight and competitive TTS system that leverages rigorous data engineering and a disciplined modular architecture to achieve state-of-the-art performance with significantly less training data than existing systems. This work is significant as it addresses the barriers faced by resource-constrained teams in the field of speech synthesis, providing a practical solution that maintains high performance while promoting reproducibility and accessibility.
The methodology is robust, featuring a well-structured multi-stage data processing pipeline that enhances data quality and a compact autoregressive architecture that effectively decouples speaker identity from style. The use of Q-Former-based conditioning and cross-sample paired training is innovative and addresses common challenges in TTS systems.
The experiments are comprehensive, utilizing the Seed-TTS Eval benchmark to demonstrate superior performance in terms of WER, CER, and speaker similarity. The inclusion of human evaluations for emotion control and paralinguistic synthesis adds depth to the assessment of the system's capabilities.
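For readers less familiar with the reported metrics, WER and CER are both edit distance divided by reference length, computed over words and characters respectively. The sketch below shows the standard computation; it omits the text normalization and ASR transcription steps that a benchmark such as Seed-TTS Eval applies before scoring, so it is not a drop-in replacement for the official scripts.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def char_error_rate(reference: str, hypothesis: str) -> float:
    ref = list(reference.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, list(hypothesis.replace(" ", ""))) / len(ref)
```

For example, scoring "the cat sit on mat" against the reference "the cat sat on the mat" gives one substitution and one deletion over six reference words.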
The paper emphasizes reproducibility by providing a complete data processing pipeline built from publicly available tools, along with pretrained weights and code. This transparency enhances the likelihood of other researchers replicating the results.
The paper acknowledges limitations such as insufficient explicit style modeling and the constraints of single-codebook quantization, which may hinder performance in more complex scenarios. Additionally, the reliance on mel-spectrograms could introduce reconstruction artifacts.
The potential applications of PilotTTS are significant, particularly for resource-constrained teams seeking to develop competitive TTS systems. Its modular approach and open-source nature could democratize access to high-quality speech synthesis technology.
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, The University of Auckland, UNSW Sydney, KAIST
The paper provides a systematic investigation into the mechanisms underlying acoustic memory in long-context audio-language models, revealing critical insights into representational drift and attention dynamics that can inform future research and model design.
The methodology is robust, introducing the EnvMem framework to systematically analyze the retention of acoustic information in multi-turn interactions. The authors employ a combination of controlled experiments, linear probing, and attention analysis to dissect the representation and retrieval mechanisms in LALMs. The use of synthetic dialogues and a clear structure for the evaluation tasks enhances the clarity of the experimental design. However, the reliance on synthetic data may limit the generalizability of the findings to real-world scenarios.
The experiments are comprehensive, evaluating multiple LALMs across various context lengths. The results demonstrate a clear performance gap between semantic and acoustic memory, with detailed analyses of representational drift and attention allocation. The use of metrics like accuracy and relative degradation provides a solid basis for comparison, although the paper could benefit from additional qualitative assessments of model outputs.
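The review mentions relative degradation without defining it. One common reading, assumed here and not confirmed by the paper, is the accuracy drop at a longer context expressed as a fraction of the short-context accuracy:

```python
def relative_degradation(baseline_acc: float, long_context_acc: float) -> float:
    """Fraction of baseline accuracy lost at a longer context (assumed definition)."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - long_context_acc) / baseline_acc
```

Under this definition, a model that falls from 0.80 to 0.60 accuracy has lost a quarter of its baseline performance, which makes models with different starting accuracies comparable.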
The paper provides detailed descriptions of the experimental setup, including dataset construction and evaluation protocols. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the EnvMem benchmark and associated models to facilitate further research in this area.
The primary limitation is the use of synthetic data, which may not capture the complexities of natural conversations. Additionally, the interventions are post-hoc and may not translate to practical solutions for improving acoustic memory in deployed models. The study also acknowledges potential ethical concerns regarding privacy and surveillance in real-world applications.
This research has significant implications for the development of more robust audio language models, particularly in applications requiring persistent awareness of environmental sounds. By highlighting the representational bottlenecks in LALMs, the findings can guide future training strategies and benchmark designs, ultimately improving the integration of acoustic memory in multimodal systems.
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement: each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
The main contribution of this paper is the introduction of MERIT, a framework that effectively disentangles musical dimensions for improved audio similarity assessment. This work significantly advances the state of music representation learning by providing a novel approach that enhances interpretability and user control in music similarity queries.
The methodology presented in MERIT is innovative, focusing on disentangled representations of music based on melody, rhythm, and timbre. The use of a frozen MERT backbone combined with a novel triplet construction strategy allows for effective training on isolated musical dimensions without manual labeling. The approach of leveraging generative models for creating training data is particularly noteworthy, as it addresses the challenge of entangled real-world audio data. The Circle Loss optimization technique further enhances the training process by focusing on hard negatives, which is a sound choice for improving representation quality.
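For reference, the Circle Loss mentioned above can be written in a few lines. The sketch follows the published pair-similarity formulation (Sun et al., 2020); the `margin` and `gamma` values are illustrative defaults, not the hyperparameters used in MERIT, and the function operates on precomputed cosine similarities rather than on embeddings.

```python
import math


def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=64.0):
    """Circle Loss over anchor-positive and anchor-negative similarities.

    Each similarity is reweighted by its distance from its optimum
    (1 + margin for positives, -margin for negatives), so poorly
    optimized pairs such as hard negatives contribute larger terms.
    """
    pos_terms = [
        -gamma * max(0.0, 1.0 + margin - s) * (s - (1.0 - margin))
        for s in sim_pos
    ]
    neg_terms = [
        gamma * max(0.0, s + margin) * (s - margin)
        for s in sim_neg
    ]
    z = _logsumexp(pos_terms) + _logsumexp(neg_terms)
    # Numerically stable softplus: log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

For a single triplet, `circle_loss([0.9], [0.1])` is close to zero, while `circle_loss([0.3], [0.7])` is large, reflecting a negative that sits closer to the anchor than the positive does.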
The experiments are well-structured, utilizing both internal and external evaluations to assess the model's performance. The use of zero-shot probes on independent datasets demonstrates the generalizability of the learned representations. The results indicate strong factor-wise disentanglement, with high accuracy in distinguishing between the different musical dimensions. The human evaluation of triplet quality adds a valuable subjective perspective to the findings, reinforcing the model's effectiveness. Overall, the experimental design is robust and provides compelling evidence of the framework's capabilities.
The paper provides sufficient details regarding the architecture, training procedures, and datasets used, which supports reproducibility. The authors have made the code and pre-trained models publicly available, further facilitating replication of their results. However, the reliance on specific datasets like MoisesDB and the generative model JASCO may limit reproducibility if these resources are not accessible to all researchers.
Some limitations are acknowledged, such as the focus on only three musical dimensions (melody, rhythm, and timbre), which may overlook other important aspects like harmony and dynamics. Additionally, the operationalization of timbre at the instrument-class level may not capture within-class variations adequately. The authors also mention potential biases from the training data that could affect the model's performance in real-world scenarios.
The implications of MERIT are significant for music information retrieval, recommendation systems, and music analysis tools. By enabling users to query music based on specific dimensions, it enhances user control and interpretability, which can lead to more personalized music experiences. The framework could also inspire further research into disentangled representations in other domains, potentially influencing broader applications in audio processing and machine learning.
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Thoughtful Lab
The main contribution of this paper is the introduction of PitchBench, a systematic evaluation suite for measuring pitch hearing in audio-language models, which significantly enhances the understanding of how these models perceive musical pitch. This work represents a critical step toward improving the reliability and effectiveness of ALMs in real-world audio applications.
The methodology presented in PitchBench is robust and systematic, focusing on a hierarchical evaluation of pitch perception in audio-language models (ALMs). The paper introduces a comprehensive framework that includes 28 experiments designed to isolate and assess various aspects of pitch hearing, such as absolute and relative pitch perception. The use of controlled synthetic stimuli allows for precise measurement of model performance across different acoustic conditions and response formats. This structured approach is a significant improvement over existing benchmarks, which often fail to directly evaluate the fundamental ability to perceive pitch.
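As an illustration of what a controlled pitch stimulus involves, the sketch below maps MIDI note numbers to equal-tempered frequencies and note names, and renders a plain sine tone. It does not use the PitchBench package or its API; the function names are hypothetical, the sine tone is a stand-in for the benchmark's instrument rendering, and the octave numbering assumes the common convention in which MIDI note 60 is C4.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(note: int, a4: float = 440.0) -> float:
    """Equal-tempered frequency of a MIDI note (A4 = note 69)."""
    return a4 * 2.0 ** ((note - 69) / 12)


def midi_to_name(note: int) -> str:
    """Note name in scientific pitch notation, assuming MIDI 60 = C4."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"


def sine_tone(note: int, duration_s: float = 0.5,
              sample_rate: int = 16000, amplitude: float = 0.5):
    """Render one note as a list of float samples in [-amplitude, amplitude]."""
    freq = midi_to_hz(note)
    n_samples = int(duration_s * sample_rate)
    return [amplitude * math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(n_samples)]


def interval_semitones(first: int, second: int) -> int:
    """Signed interval for a relative-pitch question (positive = ascending)."""
    return second - first
```

An absolute-pitch item would pair `sine_tone(69)` with the expected answer `midi_to_name(69)`, while a relative-pitch item would ask for `interval_semitones` between two rendered notes.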
The experimental evaluation is thorough, involving six frontier ALMs across a wide range of tasks that assess pitch perception under varying conditions. The results reveal significant performance variability among models, highlighting specific failure modes that are not captured by higher-level benchmarks. The detailed analysis of model performance, including the effects of acoustic variations and response modalities, provides valuable insights into the strengths and weaknesses of current ALMs in pitch perception.
The paper emphasizes reproducibility by providing a Python package that includes the evaluation data and generation tools. The authors detail the deterministic generation of stimuli, ensuring that other researchers can replicate the experiments. The inclusion of metadata and standardized output formats further supports reproducibility.
While PitchBench offers a significant advancement in evaluating pitch perception, it relies entirely on algorithmically synthesized stimuli, which may not fully capture the complexities of real-world audio. The current instrument selection is limited to General MIDI instruments, and the benchmark does not address non-Western musical traditions or more complex rhythmic reasoning tasks. Future work is needed to incorporate real recordings and broaden the diversity of the instrument pool.
The implications of PitchBench are substantial for the development of audio-language models, particularly in applications requiring reliable musical understanding, such as music tutoring, transcription, and recommendation systems. By providing a diagnostic tool for evaluating pitch perception, this work lays the groundwork for future advancements in multimodal AI systems that integrate audio understanding with other sensory inputs.
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 datasets demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
Primary: National Technical University of Athens
All Institutions: National Technical University of Athens
This paper presents a multimodal deep learning framework for dementia detection that effectively combines acoustic and linguistic features, showcasing innovative methods and robust experimental validation. The technical contributions are significant, addressing critical gaps in existing approaches and offering a promising direction for future research in automatic dementia assessment.
The proposed methodology employs a novel multimodal deep learning framework that integrates both acoustic and linguistic representations for dementia detection. The use of HuBERT for acoustic representation and BERT for textual representation, combined with attentive statistics pooling and an innovative Audio-Text Fusion mechanism, demonstrates a sophisticated approach to capturing the nuances of speech relevant to cognitive decline. The introduction of the Mutual Information Neural Estimation (MINE) objective to enhance cross-modal representation alignment is particularly noteworthy, as it addresses a significant gap in existing multimodal approaches.
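Two of the components named above have compact standard forms, sketched here in plain Python. Attentive statistics pooling computes an attention-weighted mean and standard deviation over frames; MINE maximizes the Donsker-Varadhan lower bound on mutual information. In this sketch the attention scores and the critic scores are passed in as plain numbers, whereas in the paper they would come from learned networks whose architecture is not reproduced here.

```python
import math


def attentive_stats_pool(frames, scores):
    """Pool frame-level vectors into [weighted mean, weighted std].

    frames: list of equal-length feature vectors, one per frame.
    scores: one unnormalized attention score per frame (assumed given).
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax over frames
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    std = [math.sqrt(max(s2 - mu ** 2, 0.0)) for s2, mu in zip(second, mean)]
    return mean + std


def mine_lower_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)].

    joint_scores: critic outputs T(a, t) on matched audio-text pairs.
    marginal_scores: critic outputs on shuffled (mismatched) pairs.
    """
    joint_mean = sum(joint_scores) / len(joint_scores)
    m = max(marginal_scores)
    log_mean_exp = m + math.log(
        sum(math.exp(s - m) for s in marginal_scores) / len(marginal_scores)
    )
    return joint_mean - log_mean_exp
```

During training, the negative of `mine_lower_bound` would typically be added to the classification loss so that the critic and the encoders are pushed toward representations that share more information across modalities.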
The experiments are well-structured, utilizing two publicly available datasets (ADReSS Challenge and PROCESS-2) to validate the proposed framework. The results indicate competitive performance compared to state-of-the-art methods, with detailed metrics provided for accuracy, recall, and specificity. The ablation studies further strengthen the findings by demonstrating the effectiveness of various components of the proposed framework, such as pooling strategies and fusion methods.
The paper provides a clear description of the methodology and experimental setup, including details on the datasets and evaluation metrics. However, there is no mention of code availability or a repository for others to reproduce the results, which limits reproducibility.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of speech patterns in broader populations. Additionally, while the framework shows promising results, the performance on different demographic groups or in real-world settings remains untested. The absence of a demo or project URL also hinders practical application and further exploration by the community.
The framework has significant implications for early diagnosis and intervention in Alzheimer's disease, potentially improving patient care and outcomes. By leveraging speech analysis, the approach could facilitate non-invasive and efficient screening methods, which are crucial given the increasing prevalence of dementia globally. The integration of multimodal learning in this context also opens avenues for future research in cognitive health monitoring and related fields.
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
Primary: Nankai University
All Institutions: Nankai University
The paper presents CosyEdit2, a novel framework that enhances speech editing and zero-shot TTS through innovative reinforcement learning techniques and a well-structured methodology. The contributions are significant, addressing key limitations in the field and paving the way for future advancements in audio processing technologies.
The paper introduces CosyEdit2, a two-stage post-training framework that innovatively combines supervised fine-tuning with reinforcement learning (GRPO) to enhance speech editing capabilities while also improving zero-shot TTS performance. The methodology is well-structured, addressing the limitations of previous approaches by eliminating the need for imperfect paired data and optimizing through editing-specific rewards. The architecture leverages a unified text-speech language model and a conditional flow-matching model, showcasing a novel integration of LLMs with audio processing.
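The group-relative part of GRPO can be sketched independently of the model. The snippet below shows the standard advantage computation, in which each sampled output is scored against the other samples for the same prompt instead of against a learned value function. The reward values are hypothetical placeholders; the paper's editing-specific reward design is not reproduced here, and the use of the population standard deviation with a small `eps` is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than the group average get positive advantages.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for three candidate edits of the same utterance.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Because the baseline is the group mean, no paired target speech is needed to form the training signal, only a reward that can be computed on the generated sample itself.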
The experiments are extensive, utilizing multiple benchmarks for both speech editing and zero-shot TTS. The results demonstrate significant improvements over existing models, particularly in terms of acoustic consistency and editing fidelity. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's performance.
The paper provides detailed training and evaluation setups, including data sources, training parameters, and model architectures, which facilitate reproducibility. However, access to the datasets used for training and evaluation may be a limiting factor for complete reproducibility.
The authors acknowledge limitations in the design space of the reward formulation and the language coverage of the framework, which is currently constrained to a few languages. Additionally, broader acoustic editing capabilities remain unexplored, suggesting areas for future research.
The advancements in speech editing and zero-shot TTS have significant implications for applications in accessibility, multimedia production, and human-computer interaction. However, the potential for misuse in voice impersonation and misinformation propagation raises ethical concerns that need to be addressed through responsible deployment practices.
Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw sensor data without relying on a preceding detection stage. However, many existing TBD methods assume that each target's contribution to the sensor data is determined solely by its kinematic state. This assumption limits their applicability to passive MTT, where each target's contribution depends on both its kinematic state and the unknown emitted signal. We propose subspace TBD, a passive multi-target TBD method based on a likelihood derived from the complex Bingham distribution that does not require explicit modeling or estimation of the unknown emitted signals. In a particle filter (PF) framework, each multi-target hypothesis is mapped to a low-dimensional subspace spanned by the steering vectors corresponding to the hypothesized target states. The likelihood is then used to evaluate the alignment of the normalized multichannel sensor data with this subspace. Preliminary experiments with simulated acoustic measurements and a given target activity pattern show that the proposed method can track two moving targets emitting unknown signals at a signal-to-noise ratio (SNR) of -10dB, whereas a conventional TBD baseline yields substantially larger tracking errors.
Primary: National Institute of Advanced Industrial Science and Technology (AIST)
All Institutions: National Institute of Advanced Industrial Science and Technology (AIST)
The main contribution of this paper is the introduction of a novel subspace track-before-detect methodology for passive multi-target tracking that effectively addresses the challenges posed by unknown emitted signals. This work represents a significant advancement in the field of audio signal processing and multi-target tracking, offering a robust solution for low-SNR environments and paving the way for future research in more complex scenarios.
The proposed methodology, subspace track-before-detect (TBD), innovatively addresses the challenges of passive multi-target tracking (MTT) in environments where the emitted signals from targets are unknown. By leveraging the complex Bingham distribution to model the observation likelihood without requiring explicit estimation of the emitted signals, the authors effectively circumvent a significant limitation of conventional TBD methods. The use of a particle filter framework to implement this approach allows for robust tracking of multiple targets in low signal-to-noise ratio (SNR) conditions, which is a notable advancement in the field.
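The subspace idea can be made concrete with a small sketch. It is not the paper's likelihood: it only computes how much of a normalized multichannel observation lies in the span of the hypothesized steering vectors, then scales that quantity by a concentration parameter `kappa` to form an unnormalized log-likelihood of complex-Bingham type. The far-field `steering_vector` model, `kappa`, and all function names are assumptions made for illustration.

```python
import cmath
import math


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of the given complex vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = inner(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(inner(w, w).real)
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def subspace_alignment(x, steering_vectors):
    """Fraction of the observation's energy inside the hypothesized subspace."""
    norm = math.sqrt(inner(x, x).real)
    x_unit = [xi / norm for xi in x]
    basis = orthonormal_basis(steering_vectors)
    return sum(abs(inner(q, x_unit)) ** 2 for q in basis)


def log_likelihood_unnormalized(x, steering_vectors, kappa):
    """kappa * x^H P x for unit-norm x; the normalizing constant is omitted."""
    return kappa * subspace_alignment(x, steering_vectors)


def steering_vector(num_mics, phase_step):
    """Toy uniform-array steering vector (assumed model, one frequency bin)."""
    return [cmath.exp(1j * phase_step * m) for m in range(num_mics)]
```

The score is invariant to the scale and phase of the emitted signal, which is why no estimate of that signal is needed. Comparing hypotheses with different numbers of targets would additionally require the normalizing constant, which this sketch leaves out.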
The experiments conducted are well-structured, utilizing simulated acoustic measurements to validate the proposed method. The comparison against a conventional deterministic-contribution baseline highlights the effectiveness of the subspace TBD approach, particularly in low SNR scenarios. The results demonstrate a significant improvement in tracking accuracy, with lower root mean square errors (RMSE) across various conditions, reinforcing the practical applicability of the method.
The paper provides sufficient details regarding the experimental setup, including the simulation parameters and the configuration of the particle filter. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should include sharing the implementation to facilitate validation and further exploration by the research community.
One limitation of the study is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. The paper also assumes a fixed activity pattern for the targets, which may not be realistic in dynamic environments. Additionally, the method's performance in more complex acoustic settings, such as reverberant environments or with more than two targets, remains to be evaluated.
The proposed subspace TBD method has significant potential applications in various fields, including surveillance, autonomous vehicles, and audio signal processing. By improving the robustness of multi-target tracking in noisy environments, this research could enhance systems that rely on accurate target localization and tracking, thereby contributing to advancements in safety and efficiency in real-time applications.
While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-quality, fine-grained annotations, which restricts a model's ability to precisely control NV generation. To address this limitation, we propose a novel approach for fine-grained non-verbal expression synthesis. We curate and reprocess female NV utterances from the EARS corpus, develop a new annotation scheme using tags to encode NV types, frequencies, and durations, and build an emotional TTS benchmark to demonstrate its effectiveness. Our evaluation shows that while our NV approach leads to minor trade-offs in perceived naturalness, it significantly improves expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%). Emotion-specific analysis further reveals that NV cues are highly effective for high-arousal emotions like happy (82.5%) and fear (82.7%), and almost perfectly convey sadness (98.3%).
Primary: Nara Institute of Science and Technology
All Institutions: Nara Institute of Science and Technology
The main contribution of this paper is the introduction of a fine-grained non-verbal expression dataset and a corresponding TTS system that significantly enhances emotional expressiveness in synthesized speech. This work represents a meaningful advancement in the field of emotional TTS synthesis, addressing critical gaps in existing methodologies and datasets.
The methodology presented in this paper is robust, focusing on the development of a fine-grained non-verbal expression dataset and a corresponding TTS system. The authors effectively address the limitations of existing datasets by introducing a novel annotation scheme that allows for precise control over non-verbal vocalizations. The use of Grad-TTS as the backbone model, enhanced with an emotion encoder, demonstrates a thoughtful integration of emotional embeddings into the synthesis process. The segmentation and transcription processes are well-detailed, showcasing a clear understanding of audio processing and the importance of high-quality data in training TTS systems.
The experimental evaluation is comprehensive, involving subjective assessments of naturalness and emotional expressiveness, as well as emotion recognition accuracy. The use of a diverse set of evaluation metrics, including eMOS and nMOS, provides a nuanced understanding of the model's performance. The results indicate a significant improvement in expressiveness with the fine-grained NV approach, although there is a minor trade-off in perceived naturalness. The emotion-specific analysis adds depth to the findings, illustrating the effectiveness of NV cues in conveying various emotional states.
The paper provides sufficient detail regarding the dataset construction, model architecture, and evaluation procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to fully replicate the study. The authors could improve reproducibility by sharing their code and trained models.
One limitation is the focus on female NV utterances, which may not generalize well to male voices or other demographics. Additionally, the minor trade-off in naturalness when incorporating NVs could be a concern for practical applications. The subjective nature of the evaluations may also introduce variability, as individual preferences for emotional expression can differ widely.
This research has significant implications for the development of more emotionally intelligent conversational AI systems. By enhancing the expressiveness of TTS systems through the integration of non-verbal vocalizations, the work contributes to creating more engaging and human-like interactions in various applications, including virtual assistants, gaming, and mental health support systems.
Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Institute of Informatics, Baidu Speech Department, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of FMelCodec, a novel ultra-low-bitrate speech codec that effectively balances compression efficiency and speech quality through a sophisticated three-stage framework, demonstrating significant advancements in the field of neural speech coding. The methodology and results presented have the potential to influence future developments in audio processing and communication technologies.
The paper introduces FMelCodec, a novel three-stage coding-refinement-reconstruction (CRR) framework for ultra-low-bitrate speech coding that operates in the mel-spectrogram domain. The methodology is well-structured, leveraging a single-codebook vector quantization approach combined with conditional flow matching (CFM) for refinement and a HiFi-GAN vocoder for reconstruction. The online clustering strategy for codebook management is particularly innovative, addressing codebook collapse effectively. The self-consistency training scheme enhances computational efficiency, allowing fewer inference steps while maintaining quality.
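The bitrates quoted in the abstract follow from the stated design if the 640x factor is read as the ratio of waveform samples to code frames, which is an assumption here. The helper below is a hypothetical illustration of that arithmetic, not code from the paper.

```python
import math


def codec_bitrate(sample_rate_hz, compression_factor, codebook_size, num_codebooks=1):
    """Bits per second for a codec emitting one index per codebook per frame."""
    frames_per_second = sample_rate_hz / compression_factor
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# 16 kHz: 25 frames/s at 10 bits each; 48 kHz: 75 frames/s at 10 bits each.
bitrate_16k = codec_bitrate(16_000, 640, 1024)
bitrate_48k = codec_bitrate(48_000, 640, 1024)
```

Under this reading a single 1024-entry codebook contributes 10 bits per frame, so the bitrate scales only with the frame rate, matching the 250 bps and 750 bps figures reported for 16 kHz and 48 kHz.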
The experiments are robust, utilizing two datasets (LibriTTS and VCTK) across different sampling rates. The evaluation metrics include both objective and subjective assessments, showcasing FMelCodec's superiority in reconstruction quality and speaker similarity at ultra-low bitrates. The results are statistically significant, demonstrating the codec's effectiveness compared to existing baselines, which is crucial for validating the proposed approach.
The paper provides detailed implementation configurations, including model architectures, training procedures, and hyperparameters, which enhances reproducibility. The availability of code and trained models on GitHub further supports this aspect, allowing other researchers to replicate the results.
While the proposed method shows promising results, the reliance on a single codebook may limit flexibility in representing diverse speech characteristics. Additionally, the computational efficiency, although improved, may still be a concern in extremely resource-constrained environments. The paper does not extensively discuss the scalability of the approach to other languages or dialects, which could be a limitation in broader applications.
FMelCodec has significant implications for bandwidth-constrained communication systems, such as satellite communications and mobile devices, where low-bitrate speech coding is essential. Its potential applications extend to telecommunication, voice-over-IP services, and assistive technologies for individuals with speech impairments. The advancements in neural speech coding could also influence future research in audio processing and machine learning.
Most neural vocoders are limited to one type: either GAN-based or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, diffusion models, despite training faster than GANs, have slow CPU inference. In this paper, we introduce WaveNeXt 2, a unified ConvNeXt-based framework compatible with both GAN and diffusion vocoders. Its core innovation is residual denoising and sub-modeling, where each sub-model progressively refines the waveform. Experimental results on a multi-speaker dataset demonstrate the effectiveness of our approach: (1) GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit, and (2) Diff-WaveNeXt 2 also delivers much faster inference and competitive synthesis quality compared with FastDiff with 4 steps. Diff-WaveNeXt 2 is also training-efficient, training in only 32 hours, making it well suited to resource-constrained applications.
Primary: Nara Institute of Science and Technology
All Institutions: Nara Institute of Science and Technology, National Institute of Information and Communications Technology
WaveNeXt 2 represents a significant step forward in the development of neural vocoders, providing a unified framework that enhances performance and efficiency in both GAN and diffusion contexts. The comprehensive methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and audio processing.
The proposed WaveNeXt 2 framework introduces a novel architecture that integrates ConvNeXt-based residual denoising and sub-modeling, allowing it to function effectively in both GAN and diffusion vocoder contexts. This dual compatibility is a significant advancement, as it addresses the limitations of existing models that are typically confined to one framework. The methodology is well-structured, with clear delineation between the GAN and diffusion approaches, and the use of sub-models for noise-level conditioning is a clever adaptation that enhances performance and efficiency. The authors provide a comprehensive description of the architecture, training strategies, and inference processes, which demonstrates a solid understanding of the challenges in neural vocoding.
The experiments are robust, utilizing a substantial dataset (LibriTTS-R) and employing both subjective (MOS) and objective (UTMOS, NISQA, MCD, log F0 RMSE) evaluation metrics. The results indicate that both GAN-WaveNeXt 2 and Diff-WaveNeXt 2 outperform existing models in terms of inference speed and synthesis quality. The comparative analysis with baseline models is thorough, providing clear evidence of the proposed models' advantages. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.
The authors provide sufficient implementation details, including the use of PyTorch and specific training configurations, which aids reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. Including a link to the implementation or a GitHub repository would enhance reproducibility significantly.
While the paper presents strong results, it acknowledges that the increased model size due to sub-modeling could be a drawback for deployment in resource-constrained environments. Additionally, the reliance on specific architectures may limit the generalizability of the findings to other vocoder designs. The paper could also explore the trade-offs between model complexity and performance in more depth.
The advancements presented in WaveNeXt 2 have significant implications for real-time speech synthesis applications, particularly in multi-speaker scenarios and resource-constrained environments. The ability to unify GAN and diffusion frameworks could lead to more versatile and efficient vocoders, potentially enhancing the quality of synthesized speech in various applications, including virtual assistants, audiobooks, and gaming. The work could inspire further research into hybrid models that leverage the strengths of both GANs and diffusion processes.
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, implicitly assume that all unlearning requests arrive at once, which is unrealistic because privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/.
Primary: Sungkyunkwan University
All Institutions: Sungkyunkwan University, Korea University
The paper presents CORTIS, a novel framework for continual speaker identity unlearning in zero-shot text-to-speech systems, effectively addressing privacy concerns while maintaining model performance. The integration of advanced techniques in machine unlearning and continual learning marks a significant contribution to the field, with strong experimental validation and practical implications for privacy in AI.
The proposed CORTIS framework innovatively addresses the problem of continual speaker identity unlearning in zero-shot text-to-speech systems. By combining Fisher-information-based parameter masking with orthogonal projection, it effectively prevents catastrophic re-learning of previously unlearned speakers while maintaining the quality of the remaining speakers. This dual approach is a significant advancement over previous methods that assumed simultaneous unlearning requests and failed to account for sequential requests, which is a more realistic deployment scenario. The methodology is well-justified and grounded in the principles of continual learning and machine unlearning, showcasing a thoughtful integration of concepts from both fields.
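The two ingredients described above can be sketched on flat parameter vectors. This is an illustrative approximation, not the CORTIS algorithm: it assumes a diagonal Fisher estimate, a top-fraction threshold for the mask (`keep_fraction`), and a mask-then-project ordering, none of which are specified in the text. All function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fisher_mask(fisher_diag, keep_fraction):
    """Binary mask keeping the parameters with the largest Fisher values."""
    n = len(fisher_diag)
    k = max(1, round(keep_fraction * n))
    top = sorted(range(n), key=lambda i: fisher_diag[i], reverse=True)[:k]
    keep = set(top)
    return [1.0 if i in keep else 0.0 for i in range(n)]


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt basis for the span of earlier unlearning updates."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = dot(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def constrained_update(gradient, fisher_diag, prior_updates, keep_fraction=0.5):
    """Mask the gradient, then remove its component along prior updates."""
    mask = fisher_mask(fisher_diag, keep_fraction)
    update = [g * m for g, m in zip(gradient, mask)]
    for q in orthonormalize(prior_updates):
        c = dot(q, update)
        update = [u - c * qi for u, qi in zip(update, q)]
    return update
```

The projection only needs the stored update directions from earlier requests, which is consistent with the claim that no previously unlearned speaker data is required. Note that projecting after masking can reintroduce small values at masked positions; a real implementation would have to choose how to reconcile the two constraints.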
The experiments are robust, utilizing a well-defined evaluation scenario with clear metrics for assessing both retention of previously learned speakers and the quality of the generated speech. The results demonstrate that CORTIS outperforms existing methods in maintaining speaker identity suppression across multiple requests, with quantitative metrics supporting the claims made. The use of a controlled backbone (VoiceBox) ensures fair comparisons, and the detailed ablation studies provide insights into the contributions of each component of the proposed method.
The paper provides comprehensive implementation details, including the architecture of the backbone model and the specific configurations used for training and evaluation. This level of detail enhances reproducibility, allowing other researchers to replicate the experiments effectively. However, the reliance on specific datasets and models may limit broader applicability without further validation across different architectures.
The paper acknowledges limitations such as the lack of adversarial robustness and the focus on a single backbone model (VoiceBox). Additionally, while the proposed method is effective, the computational overhead introduced by the CORTIS framework may pose challenges for real-time applications. Future work could explore the scalability of the method and its performance across various architectures and datasets.
The implications of this work are significant, particularly in the context of privacy and data protection regulations like GDPR and CCPA. By providing a mechanism for continual speaker identity unlearning, the research contributes to the responsible deployment of zero-shot text-to-speech systems, which can have far-reaching effects on user privacy and consent in AI applications. The framework could be adapted for other domains requiring similar unlearning capabilities, thus broadening its impact.
Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without explicitly extracting phase and level difference features based on the plane-wave or spherical-wave assumptions. However, prior studies have mostly compared a small number of separately defined directional statistical mixture models, whereas a broader distribution family would enable a more systematic study of how density profiles affect separation performance. We propose the complex spherical Student's t mixture model (cSTMM), a directional mixture model that connects the complex angular central Gaussian mixture model (cACGMM), complex Bingham mixture model (cBMM), and complex Watson mixture model (cWMM) through the degrees-of-freedom parameter $ν$. We also derive a generalized minorization-maximization (MM) based procedure for parameter estimation. A no-restart evaluation on noise-free LibriSpeech mixtures reverberated with measured room impulse responses shows that a single development-selected value $ν^\ast=1$ achieved higher test-set mean signal-to-distortion ratio improvements (SDRi) than the cACGMM-equivalent setting $ν=M$ in all acoustic conditions, with an average condition-wise gain of 0.25 dB. The experiments also verify numerically that the proposed formulation recovers the cACGMM, cBMM, and cWMM cases.
Primary: Artificial Intelligence Research Center, AIST, Japan
All Institutions: Artificial Intelligence Research Center, AIST, Japan
The main contribution of this paper is the introduction of the cSTMM, which unifies existing directional statistical models for blind speech separation and demonstrates its effectiveness through rigorous experimental evaluation. Overall, the paper makes a meaningful contribution to the field of audio signal processing, particularly in enhancing the performance of mask-based speech separation techniques.
The paper introduces the complex spherical Student's t mixture model (cSTMM), which unifies several existing directional statistical mixture models (cACGMM, cBMM, cWMM) under a single framework. The methodology is robust, employing a generalized minorization-maximization (MM) procedure for parameter estimation, which is a significant contribution to the field. The approach allows for systematic exploration of how different density profiles impact speech separation performance, addressing a gap in prior research that focused on isolated models. The derivation of the model and the updates for parameter estimation are well-articulated, showing a clear understanding of the underlying statistical principles.
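To make the clustering step concrete, the sketch below computes mask posteriors for one time-frequency bin under the simplest member of the family discussed above, a complex Watson mixture. It is not the cSTMM density, which is not spelled out here. It assumes a shared concentration `kappa` and equal mixture weights, so the Watson normalizing constants cancel in the posterior. Function names are hypothetical.

```python
import math


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(complex(a).conjugate() * b for a, b in zip(u, v))


def to_unit_sphere(z):
    """Project a multichannel observation onto the complex unit sphere."""
    norm = math.sqrt(inner(z, z).real)
    return [zi / norm for zi in z]


def watson_mask_posteriors(z, mean_directions, kappa):
    """Posterior source probabilities for one TF bin (shared kappa, equal priors)."""
    z_unit = to_unit_sphere(z)
    logits = [
        kappa * abs(inner(to_unit_sphere(a), z_unit)) ** 2
        for a in mean_directions
    ]
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]
```

The posterior depends on the observation only through `|a^H z|^2`, so it is unchanged by a global phase or gain applied to `z`. That invariance is what lets these models cluster normalized observations without extracting explicit phase or level difference features.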
The experiments are well-structured, using LibriSpeech mixtures reverberated with measured room impulse responses across a variety of acoustic conditions. The results show a consistent gain in mean signal-to-distortion ratio improvement (SDRi) over the cACGMM-equivalent setting in every acoustic condition, with a clear methodology for selecting the hyperparameter $ν$ on a development set. The inclusion of model recovery tests further strengthens the experimental validation, confirming numerically that the cSTMM reduces to the cACGMM, cBMM, and cWMM cases it encompasses.
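For readers unfamiliar with the metric, SDRi is the SDR of the separated estimate minus the SDR of the unprocessed mixture, both measured against the same reference. The sketch below uses the plain energy-ratio definition of SDR on toy signals; the paper may well use a BSS-Eval style SDR, so the function names and the definition here are assumptions.

```python
import math


def sdr_db(reference, estimate):
    """Plain SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal_power = sum(s * s for s in reference)
    error_power = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    if error_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / error_power)


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: how much the estimate improves on the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy signals (illustrative values only, not data from the paper).
reference = [1.0, -1.0, 1.0, -1.0]
mixture = [1.5, -0.5, 1.5, -0.5]    # reference plus a large interference term
estimate = [1.1, -0.9, 1.1, -0.9]   # reference plus a small residual

sdri = sdr_improvement_db(reference, mixture, estimate)
```

On this scale, a 0.25 dB difference corresponds to roughly a 6% change in the signal-to-error power ratio.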
The paper provides sufficient detail regarding the experimental setup, including the choice of datasets, evaluation metrics, and parameter settings. However, the absence of a publicly available implementation or code repository limits reproducibility. Future work should consider making the model and experiments accessible to facilitate validation by other researchers.
While the paper presents a novel model and shows promising results, the improvements in SDRi are modest (averaging 0.25 dB), which may not be substantial enough to warrant a shift from existing methods in practical applications. Additionally, the model's performance in noisy or real-world environments remains untested, which could be a significant limitation for its applicability.
The cSTMM has the potential to advance the field of blind speech separation, particularly in scenarios where supervised learning is impractical. By providing a unified framework for directional statistics, it could lead to more robust speech separation systems, benefiting applications in telecommunications, hearing aids, and automatic speech recognition. The systematic exploration of density profiles may also inspire further research into adaptive signal processing techniques.
In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.
Primary: Johannes Kepler University Linz
All Institutions: Johannes Kepler University Linz, LIT AI Lab, Linz Institute of Technology
The paper presents a novel approach to automatically align and cluster transcriptions of musical performances based on structural interpretations. It significantly contributes to the field by providing a scalable, reference-free method for evaluating large-scale transcribed datasets, which is essential as the volume of available music data continues to grow.
The proposed methodology effectively combines sequence-to-sequence alignment using Dynamic Time Warping (DTW) with hierarchical clustering to address the challenge of grouping transcriptions based on structural interpretations. The use of a custom distance metric that balances harmonic similarity and timing differences is innovative and tailored to the nuances of musical performance. The two-step approach, which includes both alignment and clustering, is well-structured and demonstrates a clear understanding of the complexities involved in music performance analysis. However, the paper could benefit from a more detailed discussion on the choice of parameters and their impact on the results.
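The two-step idea (pairwise alignment, then hierarchical clustering on alignment cost and length dissimilarity) can be sketched in a few lines. This is a minimal illustration under stated assumptions: a 0/1 pitch-mismatch cost stands in for the paper's custom distance, and `pair_distance`, `length_weight`, and `threshold` are hypothetical names and values, not the mpteval API.

```python
def dtw_cost(a, b):
    """Length-normalised DTW cost between two pitch sequences (0/1 local cost)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    prev = [0.0] + [inf] * m
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        for j in range(1, m + 1):
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            cur[j] = local + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m] / (n + m)


def pair_distance(a, b, length_weight=1.0):
    """Alignment cost plus a penalty for differing performed lengths."""
    length_gap = abs(len(a) - len(b)) / max(len(a), len(b))
    return dtw_cost(a, b) + length_weight * length_gap


def cluster(seqs, threshold):
    """Average-linkage agglomerative clustering; stops merging above threshold."""
    n = len(seqs)
    dist = [[pair_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > threshold:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy MIDI-pitch sequences: two renditions without a repeat, two with it.
theme = [60, 62, 64, 65]
seqs = [
    theme,
    [60, 62, 63, 65],                  # one transcription error
    theme + theme,                     # repeat taken
    theme + [60, 62, 64, 66],          # repeat taken, one transcription error
]
labels = cluster(seqs, threshold=0.4)
```

In this toy case the isolated transcription errors add little alignment cost, whereas the repeat doubles the performed length, so the two structural realisations end up in separate groups.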
The experiments conducted on the ATEPP dataset are comprehensive, covering a significant number of transcriptions and compositions. The evaluation metrics used, such as homogeneity, completeness, and V-Measure, are appropriate for assessing clustering performance. The results indicate that the proposed method is robust against structural differences and transcription artifacts, which is a critical aspect of the research. However, the paper could enhance its impact by providing more comparative analyses with existing methods beyond the baseline score-dependent repeat estimator.
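For reference, homogeneity, completeness, and V-Measure are entropy-based: homogeneity is $1 - H(C|K)/H(C)$, completeness is $1 - H(K|C)/H(K)$, and V-Measure is their harmonic mean, where $C$ are the reference classes and $K$ the predicted clusters. The sketch below implements these standard definitions; the label lists are toy values, not results from the paper.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(target, given):
    """H(target | given) for two parallel label lists."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * math.log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(true_labels, pred_labels):
    """Return (homogeneity, completeness, V-Measure)."""
    h_true = _entropy(true_labels)
    h_pred = _entropy(pred_labels)
    homogeneity = (
        1.0 if h_true == 0.0
        else 1.0 - _conditional_entropy(true_labels, pred_labels) / h_true
    )
    completeness = (
        1.0 if h_pred == 0.0
        else 1.0 - _conditional_entropy(pred_labels, true_labels) / h_pred
    )
    if homogeneity + completeness == 0.0:
        return homogeneity, completeness, 0.0
    v = 2.0 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v


# Over-splitting keeps clusters pure (homogeneity 1) but lowers completeness.
true_labels = [0, 0, 1, 1]
over_split = [0, 1, 2, 3]
h, c, v = v_measure(true_labels, over_split)
```

The metric is invariant to how cluster labels are numbered, which suits a setting where groups of structural realisations have no canonical names.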
The paper provides a link to the implementation in the Python library mpteval, which is a positive aspect for reproducibility. However, the details regarding the parameter settings and the specific configurations used in the experiments could be more explicitly stated to facilitate replication. Additionally, providing a sample dataset or a more detailed description of the data preprocessing steps would further enhance reproducibility.
One limitation is that the method relies heavily on the quality of the transcriptions, which can vary significantly due to the nature of automatic music transcription. The paper acknowledges this but does not explore potential solutions or mitigations for low-quality transcriptions. Furthermore, the focus on classical music may limit the generalizability of the approach to other genres or forms of music, which could be a point of consideration for future work.
The approach has significant implications for the field of music performance analysis, particularly in automating the evaluation of large-scale datasets that lack ground-truth scores. This can lead to more efficient curation and maintenance of music collections, enabling researchers to focus on higher-level analyses rather than manual quality control. The method could also inspire further research into score-agnostic evaluation techniques across various musical genres and applications.
Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.
Primary: University of Melbourne
All Institutions: University of Melbourne
This paper makes a meaningful contribution by proposing a representation-centric approach to continual learning in speech and audio, addressing the unique challenges posed by the dynamic nature of acoustic environments. The framework established in this work has the potential to guide future research and development in the field, although empirical validation and implementation details are needed to fully realize its impact.
The paper presents a novel representation-centric taxonomy for continual learning (CL) in speech and audio, addressing the unique challenges posed by the non-stationary nature of acoustic environments. The authors effectively categorize CL scenarios based on representational evolution, which is a significant advancement over traditional task-based taxonomies. The methodology is well-structured, clearly articulating the need for preserving representational geometry in modern speech systems, and it proposes a comprehensive framework for understanding the interaction between representation dynamics and adaptation mechanisms.
While the paper does not present empirical experiments or quantitative results, it offers a thorough analysis of existing CL methods and their limitations in the context of speech and audio. The authors identify gaps in current methodologies and suggest future research directions, which is valuable for guiding subsequent empirical studies. The lack of experimental validation is a notable gap, as it limits the ability to assess the practical effectiveness of the proposed taxonomy.
The paper does not provide specific implementation details or datasets, which could hinder reproducibility. However, it does reference existing methods and frameworks, suggesting that future work could build upon these established techniques. The inclusion of a GitHub repository for related resources is a positive step towards facilitating reproducibility.
A key limitation of the paper is the absence of experimental validation, which makes it difficult to assess the practical applicability of the proposed taxonomy. Additionally, while the authors identify several open problems, they do not provide concrete solutions or methodologies to address these challenges, leaving a gap for future exploration.
The implications of this work are significant for the fields of speech processing and continual learning. By reframing CL in the context of speech and audio, the authors highlight the need for new strategies that accommodate the complexities of acoustic representations. This work could influence the development of more robust and adaptable speech systems, with applications in areas such as automatic speech recognition, speaker verification, and emotion recognition.