We consider the conversion of musical recordings into human-readable sheet music annotated with timestamps. Such output lets a listener clearly visualize rubato (temporally expressive playing), a learner diagnose ensemble precision and timing choices against the written music, and a musicology scholar compare performance styles across recordings of the same work. We introduce (1) a prompt-conditioned encoder-decoder model, named Rubato, trained to output (2) a new textual representation for polyphonic music, named InterMo, which we designed for compatibility with sequence-to-sequence training. Our experiments demonstrate that Rubato produces timestamped piano sheet music from audio with higher notational accuracy than the best existing approaches, which are based on cascades. We find that even if the cascade is given ground-truth MIDI instead of audio, Rubato performs better, suggesting that the ceiling of existing approaches is primarily representational, not acoustic. Further, because Rubato is trained on several related tasks (with prompts), it competes with or outperforms the best single-task systems on related but simpler tasks like MIDI note grounding and beat/downbeat detection. A demo is available at https://nctamer.github.io/rubato-transcription.
Primary: Paul G. Allen School of Computer Science & Engineering, University of Washington
All Institutions: Paul G. Allen School of Computer Science & Engineering, University of Washington, Allen Institute for AI
The paper presents Rubato, a comprehensive system for transcribing piano music into time-aligned scores, significantly advancing the field of automatic music transcription by integrating audio input with structured musical output. The innovative approach and strong experimental results position this work as a meaningful contribution to both machine learning and music technology.
The paper introduces a novel prompt-conditioned encoder-decoder model, Rubato, which integrates audio-to-score transcription with temporal alignment in a single autoregressive pass. The use of a new textual representation, InterMo, for polyphonic music is innovative, as it allows for a more coherent and structured output that maintains both musical and temporal integrity. The methodology effectively addresses the limitations of existing cascade approaches by eliminating the need for intermediate representations, thus streamlining the transcription process. The multitask training approach leveraging various dialects enhances the model's robustness and versatility.
The experiments are thorough, comparing Rubato against multiple state-of-the-art systems across different tasks, including MIDI note grounding and beat detection. The results demonstrate that Rubato consistently outperforms existing methods in terms of notational accuracy and temporal alignment, even under conditions where oracle inputs are provided to baselines. The use of multiple datasets for evaluation strengthens the validity of the findings.
The paper provides sufficient details about the training data, model architecture, and evaluation metrics, which supports reproducibility. However, the lack of a public code repository limits the ease with which other researchers can replicate the results. The authors do release score excerpts and synthesized audio for reproducibility, which is a positive aspect.
One limitation is the potential overfitting to the training datasets, particularly given the complexity of the model and the variety of tasks it is trained on. Additionally, the reliance on synthesized audio for some training data may not fully capture the nuances of real-world performances. The paper does not address how the model might perform with non-piano instruments or in more complex musical contexts.
The implications of this research are significant for music education, performance analysis, and musicology, as it allows for a more nuanced understanding of expressive timing in musical performances. The ability to generate accurate, timestamped sheet music can aid learners in diagnosing timing choices and enhance the comparative analysis of performance styles. This work could also inspire further advancements in automatic music transcription and multimodal music analysis systems.
Evaluating speech generation still relies heavily on human judgments, such as Mean Opinion Score (MOS), which are expensive, subjective, and difficult to reproduce at scale. While a few recent studies have begun to explore AudioLLM-based judge models, existing efforts typically target only a narrow set of scenarios (e.g., utterance-level quality or single-turn dialogue) and provide limited coverage of diverse speech generation tasks and evaluation dimensions. In this work, we propose UniSRM, a unified speech reward model that can support multi-dimensional, interpretable reward signals with reliable reasoning. To support training and evaluation, we introduce UniSRM-Data and UniSRM-Bench, covering speech evaluation tasks from utterance-level quality to context-level coherence. Based on this dataset, we present the unified speech reward model, UniSRM, with a two-stage pipeline that enables reasoning-based fine-grained assessment. Furthermore, we introduce Reasoning-Consistent Rewards to improve the reliability of the reasoning process. Experiments show that UniSRM delivers more reliable and human-aligned judgments across a broad range of speech evaluation tasks, offering a practical foundation for scalable and unified evaluation of speech quality.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong, Tsinghua University, Independent Researcher
The main contribution of this paper is the introduction of UniSRM, a unified speech reward model that provides multi-dimensional, interpretable assessments for speech evaluation, supported by a comprehensive dataset and benchmark. This work significantly advances the state of the art in speech quality evaluation, addressing critical limitations of existing models and offering a practical foundation for future research and applications in the field.
The methodology presented in this paper is robust and innovative, introducing UniSRM, a unified speech reward model that leverages a two-stage training approach combining supervised fine-tuning (SFT) and reinforcement learning (RL) with reasoning-consistent rewards (RCR-GRPO). The model's architecture allows for multi-dimensional evaluations of speech quality, addressing key limitations in existing models. The introduction of UniSRM-Data and UniSRM-Bench provides a comprehensive dataset and benchmark for evaluating speech generation, which is a significant contribution to the field. The explicit decomposition of speech quality assessment into multiple dimensions enhances interpretability and aligns model outputs with human judgments.
The experimental evaluation is thorough, demonstrating the effectiveness of UniSRM across various tasks, including utterance-level preference judgments, fine-grained quality assessments, scenario-aware evaluations, and multi-turn dialogue assessments. The results show that UniSRM outperforms existing models in terms of accuracy and human alignment, indicating its practical applicability. The use of diverse datasets and rigorous evaluation metrics (e.g., accuracy and Pearson correlation coefficient) strengthens the findings. The ablation studies further validate the necessity of RCR-GRPO in improving model performance.
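The two metrics named above can be illustrated with a minimal sketch. It assumes preference tasks are scored as label agreement with human choices and rating tasks as the Pearson correlation between model scores and human scores; the function names and toy inputs are hypothetical and are not taken from the paper.

```python
import math


def accuracy(predicted, reference):
    """Fraction of items where the judge's choice matches the human label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

Accuracy suits pairwise preference judgments, where only the chosen side matters, while Pearson correlation suits graded scores, where the ordering and spacing of ratings should track human opinion.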
The paper provides detailed implementation details, including model configurations, training hyperparameters, and dataset statistics, which enhance reproducibility. The authors also commit to releasing the datasets and code, which is crucial for enabling other researchers to replicate their work. However, the computational cost associated with training and inference may pose challenges for broader accessibility.
The paper acknowledges limitations, such as the restricted coverage of challenging speech scenarios (e.g., heavy accents and overlapping speech) and the computational demands of the proposed model. These factors may limit the scalability and practical deployment of UniSRM in real-world applications. Additionally, while the benchmarks are comprehensive, they may not encompass all potential evaluation scenarios in speech generation.
The development of a unified and interpretable speech reward model has significant implications for the field of speech generation and evaluation. By providing a scalable and reliable framework for assessing speech quality, UniSRM can facilitate advancements in various applications, including virtual assistants, automated customer service, and content creation. The model's ability to produce human-aligned judgments can enhance user experience and trust in AI-generated speech systems.
Traditional RGB-based speech generation faces Temporal Granularity Mismatch since fixed camera exposure times inevitably blur the high-frequency articulatory transients essential for rendering emotional speech. To break this ceiling, we propose EventSpeech as a novel text-conditioned framework pioneering the use of neuromorphic events for expressive speech generation, since these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture integrates a dedicated Event Encoder to model sparse neuromorphic events alongside a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism seamlessly synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK as the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines by preserving fine-grained emotions and resisting motion blur to establish a new paradigm for multimodal speech generation. Code and demo are available at https://xrfang-0102.github.io/EventSpeechWeb/.
Primary: Beijing Technology and Business University
All Institutions: University of Sydney, Beijing Technology and Business University, Xidian University, Tongji University
The paper presents EventSpeech, a pioneering framework that utilizes neuromorphic events for expressive speech generation, significantly advancing the state of the art in multimodal speech synthesis. The innovative approach and robust experimental validation position this work as a substantial contribution to the field, addressing key limitations of existing methods and opening new avenues for research and application.
The proposed EventSpeech framework introduces a novel architecture that leverages neuromorphic events for speech generation, addressing the limitations of traditional RGB-based methods. The integration of a dedicated Event Encoder and a multi-scale Audio Encoder, along with a bidirectional alignment mechanism, demonstrates a sophisticated approach to synchronizing visual and auditory modalities. The methodology is well-structured, with a clear focus on addressing the Temporal Granularity Mismatch, and the use of specialized components like the Hierarchical Wavelet Contextualizer (HWC) enhances the model's ability to capture fine-grained emotional nuances in speech.
The paper presents extensive evaluations on the EVT-SPK benchmark, which is a significant contribution to the field as it includes both synthetic and real-world datasets. The results indicate that EventSpeech outperforms state-of-the-art methods across various metrics, showcasing its robustness in handling rapid articulation and subtle facial dynamics. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings.
The paper provides implementation details, including the training setup and optimization strategies, which are crucial for reproducibility. The abstract points to a project page for code and a demo, so how easily other researchers can replicate the results depends on what is actually released there.
The EVT-SPK benchmark's limited scale and the reliance on simulated events may restrict the model's generalization capabilities. Additionally, the paper acknowledges the challenges associated with capturing complex physical sensor noise in real-world scenarios, which could affect performance.
The introduction of neuromorphic events for speech generation has the potential to revolutionize multimodal speech synthesis, enabling more expressive and natural-sounding speech. This could have applications in various domains, including virtual assistants, entertainment, and accessibility technologies.
Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper provides a unified taxonomy and empirical evaluation of jailbreak attacks and defenses for LALMs, contributing significantly to the understanding of vulnerabilities in audio-based models. The comprehensive approach and findings underscore the importance of considering multiple dimensions of safety and usability in the design of LALMs.
The paper presents a comprehensive taxonomy of jailbreak attacks and defenses in Large Audio Language Models (LALMs), categorizing them into semantic, acoustic, signal, and embedding-layer attacks, as well as guard-based, training-free, and training-based defenses. The methodology is robust, combining a structured survey with empirical evaluations across ten open-source LALMs, which allows for a fair comparison of various attack and defense strategies. The authors also introduce a cost-aware evaluation framework that considers not just attack success rates but also benign refusal and latency, which is a significant improvement over previous works that focused solely on success rates.
The experiments are well-structured, utilizing a controlled dataset from JailbreakBench with 100 harmful and 100 benign requests, allowing for a clear assessment of the effectiveness of various attacks and defenses. The results indicate that different attack strategies yield varying success rates, with the Acoustic Best-of-N attack demonstrating the highest vulnerability. The empirical evaluation of defenses reveals a trade-off between robustness and usability, highlighting the complexity of ensuring safety in LALMs.
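The three quantities reported in this evaluation (attack success rate, benign refusal rate, and latency) can be aggregated as in the sketch below. The `Trial` record and the `complied` flag are hypothetical: how a response is judged as compliant or refused is decided upstream and is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    harmful: bool      # True if the request came from the harmful set
    complied: bool     # True if the model answered instead of refusing (assumed label)
    latency_s: float   # wall-clock seconds spent on this request


def summarize(trials):
    """Aggregate per-request outcomes into the three reported quantities."""
    harmful = [t for t in trials if t.harmful]
    benign = [t for t in trials if not t.harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign trial")
    return {
        # Share of harmful requests that the model complied with.
        "attack_success_rate": sum(t.complied for t in harmful) / len(harmful),
        # Share of benign requests that the model wrongly refused.
        "benign_refusal_rate": sum(not t.complied for t in benign) / len(benign),
        "mean_latency_s": sum(t.latency_s for t in trials) / len(trials),
    }
```

Reporting the benign refusal rate next to the attack success rate makes the robustness versus usability trade-off visible: a defense that refuses everything scores zero on the first metric and one on the second.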
The paper provides detailed descriptions of the experimental setup, including the datasets used, the models evaluated, and the specific attack and defense methods employed. However, the reliance on specific hardware and configurations may limit the reproducibility of results in different environments. The authors do not provide code or data access, which could further hinder reproducibility.
The authors acknowledge several limitations, including the restricted model coverage to ten open-source LALMs and the controlled nature of the dataset, which may not fully represent real-world scenarios. Additionally, the evaluation metrics used may not capture all aspects of deployment, such as user satisfaction with benign responses. The paper also does not explore all possible attack and defense categories outlined in the taxonomy.
The findings of this paper have significant implications for the development of safe and robust LALMs, particularly in applications involving voice assistants and interactive systems. The emphasis on cost-aware evaluation and the identification of vulnerabilities across different modalities can guide future research in creating more resilient audio systems. The work also raises awareness about the potential for misuse of LALMs in bypassing safety mechanisms, highlighting the need for ongoing research into equitable and effective safety measures.
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
Primary: University of Southern California
All Institutions: University of Southern California, The Ohio State University, University of California, Los Angeles, Harvard University, Boston University, University of Miami
The main contribution of this paper is the introduction of the ChildVox benchmark, which systematically evaluates a wide range of child-centered audio and speech tasks, significantly advancing the field of child communication research. The comprehensive methodology, rigorous experimental design, and acknowledgment of limitations highlight the paper's significance and potential impact on future research and applications in audio processing for children.
The methodology presented in the paper is robust, as it introduces the ChildVox benchmark, which encompasses a wide range of child-centered audio and speech tasks. The integration of over 20 sub-tasks across 17 datasets is a significant advancement, allowing for a comprehensive evaluation of various audio and speech foundation models. The decision to define "voice" in children broadly, including physiological sounds and non-linguistic vocalizations, is innovative and necessary for understanding child communication. The evaluation of multiple model architectures, including self-supervised and ASR-oriented models, provides a well-rounded perspective on the capabilities of current technologies in this domain.
The experiments are thorough, with a clear structure that includes a variety of tasks and datasets. The benchmark results demonstrate that ChildVox provides high-performance models for recognizing a wide range of acoustic signals from children. The paper effectively compares the performance of different models on specific tasks, highlighting the strengths and weaknesses of each. The use of Macro-F1 scores for classification tasks and WER for ASR tasks is appropriate, ensuring that the evaluation metrics are relevant to the goals of the benchmark.
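For readers unfamiliar with the two metrics, the sketch below shows how Macro-F1 and WER are usually computed. It is a plain-Python illustration with hypothetical function names and toy labels, not the benchmark's own scoring code; text normalization before WER is left out and would be an additional assumption.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Macro averaging weights every class equally, which matters when rare vocalization types would otherwise be swamped by frequent ones.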
The paper provides detailed information about the datasets, experimental setup, and model training parameters, which enhances reproducibility. However, the lack of publicly available code or models limits other researchers' ability to fully replicate the results. The authors mention plans to release models under a Responsible AI License, which is a positive step towards improving reproducibility in the future.
The paper acknowledges several limitations, including the focus on English-language recordings, which may restrict generalizability to other languages and dialects. Additionally, the subjective nature of some tasks, such as affective vocalization classification, may introduce variability in annotation reliability. The authors also note that the benchmark does not cover all recent advancements in audio foundation models, which could limit its comprehensiveness.
The ChildVox benchmark has significant implications for research in child development, speech therapy, and early childhood education. By providing a structured framework for evaluating child-centered audio processing, it can facilitate advancements in understanding children's communication and support the development of tools for monitoring and enhancing language skills. The potential applications in clinical settings for tracking speech production and language development are particularly noteworthy.
Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications, University of Surrey
The main contribution of this paper is the introduction of COMET, a novel framework for analyzing and mitigating the modality gap in audio-text multimodal contrastive embeddings, which significantly enhances the performance of zero-shot audio captioning tasks. The comprehensive analysis and innovative methodology position this work as a meaningful advancement in the field of multimodal machine learning.
The paper introduces a novel framework, COMET, utilizing Partial Least Squares Singular Value Decomposition (PLS-SVD) to analyze and mitigate the modality gap between audio and text embeddings in CLAP models. The methodology is well-structured, offering a fresh perspective on the decomposition of multimodal embeddings into interpretable concepts. The spectral truncation method proposed is innovative, allowing for effective dimensionality reduction while maintaining performance, which is a significant contribution to the field of multimodal contrastive learning.
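The general idea behind PLS-SVD with spectral truncation can be sketched as follows. This is the textbook formulation (SVD of the cross-covariance between paired, mean-centered embeddings, then keeping the leading axes) and is only assumed to resemble what COMET does; the function names, the choice of `k`, and the centering step are illustrative assumptions.

```python
import numpy as np


def fit_pls_svd(audio, text):
    """Fit PLS-SVD on paired embeddings of shape (n_pairs, dim).

    Returns the two modality means, the audio-side axes (columns of u),
    the singular values, and the text-side axes (columns of v).
    """
    a_mean = audio.mean(axis=0)
    t_mean = text.mean(axis=0)
    # Cross-covariance between the mean-centered modalities.
    cross_cov = (audio - a_mean).T @ (text - t_mean) / (len(audio) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return a_mean, t_mean, u, s, vt.T


def truncate(emb, mean, axes, k):
    """Center embeddings and project them onto the k leading shared axes."""
    return (emb - mean) @ axes[:, :k]
```

After fitting, audio embeddings are projected with the audio-side axes and text embeddings with the text-side axes; similarity is then computed in the shared `k`-dimensional space. The singular values indicate how much each axis contributes to cross-modal covariance, which is one way to see why a small subset of axes can dominate.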
The experiments are comprehensive, utilizing standard datasets like Clotho and AudioCaps for evaluation. The results demonstrate that the proposed PLSHead method achieves comparable or improved performance over the original embeddings, validating the effectiveness of the approach. The paper provides detailed metrics for retrieval tasks, showcasing the robustness of the method across different scenarios, including in-domain and cross-domain evaluations.
The paper gives no explicit implementation details and does not mention code availability, which could hinder reproducibility. While the methodology is clearly described, the absence of a publicly available codebase or demo limits other researchers' ability to replicate the findings.
One limitation is the reliance on existing CLAP models, which may introduce biases based on their training data. Additionally, while the proposed methods show promise, the paper does not explore the potential impacts of varying the number of retained dimensions in the spectral truncation, which could affect generalization in different contexts.
The findings have significant implications for audio understanding and generation tasks, particularly in zero-shot scenarios. By effectively bridging the modality gap, the proposed methods could enhance the performance of multimodal applications, making them more accessible and efficient. This work could pave the way for future research in multimodal learning and its applications in real-world scenarios.
Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaohongshu Inc
The paper presents HoliTok, a continuous holistic tokenization model that effectively bridges the gap between speech generation and understanding tasks. Its innovative approach and strong experimental results position it as a significant contribution to the field of audio machine learning.
The proposed HoliTok model introduces a novel continuous tokenization approach that effectively balances the requirements of learnability and decodability for unified speech generation and understanding. The progressive training strategy enhances the model's ability to preserve signal fidelity while incorporating semantic information, which is a significant advancement over existing tokenization methods. The architecture's integration of a variational autoencoder with a temporal bottleneck and a downstream-aware supervision network is a thoughtful design choice that addresses the limitations of traditional tokenizers.
The experiments conducted demonstrate the model's competitive performance in reconstruction fidelity, speech synthesis, and unified generation-understanding tasks. The evaluation metrics used, including PESQ, STOI, and WER, provide a robust framework for assessing the quality of the generated outputs. The results indicate that HoliTok not only outperforms existing methods but also maintains a compact latent representation, which is crucial for practical applications in speech technology.
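To make "compact" concrete, the figures stated in the abstract (48 kHz input, a 25 Hz sequence, 128-dimensional latents) imply the arithmetic below. The helper names are hypothetical, and the flooring of partial frames is an assumption, since the padding policy is not described here.

```python
import math

SAMPLE_RATE_HZ = 48_000   # input waveform rate, from the abstract
LATENT_RATE_HZ = 25       # latent frame rate, from the abstract
LATENT_DIM = 128          # values per latent frame, from the abstract


def samples_per_frame():
    """Waveform samples summarized by one latent frame."""
    if SAMPLE_RATE_HZ % LATENT_RATE_HZ:
        raise ValueError("frame rate must divide the sample rate evenly")
    return SAMPLE_RATE_HZ // LATENT_RATE_HZ


def latent_shape(duration_s):
    """(frames, dim) for a clip, flooring any partial final frame (assumed)."""
    return (math.floor(duration_s * LATENT_RATE_HZ), LATENT_DIM)


def values_per_second():
    """Continuous latent values emitted per second of speech."""
    return LATENT_RATE_HZ * LATENT_DIM
```

Each latent frame therefore covers 1920 waveform samples (40 ms), and the latent stream carries 3200 continuous values per second against 48,000 raw samples per second, a 15-fold reduction in value count.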
The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of detailed hyperparameter settings and specific training configurations in the main text may pose challenges for full replication. The availability of the code on GitHub is a positive aspect that aids in reproducibility efforts.
The study primarily focuses on speech generation and understanding, leaving out broader audio applications such as environmental sounds and music. The evaluation is limited to a specific architecture (AR+DiT), which may not capture the full potential of the proposed tokenizer across various modeling paradigms. Future work should explore these areas to validate the generalizability of the approach.
The advancements presented in this paper have the potential to significantly enhance speech synthesis and recognition technologies, making them more efficient and effective. The model's ability to serve as a unified interface for both tasks could lead to improvements in applications such as virtual assistants, automated transcription services, and interactive voice response systems. The implications for accessibility and user interaction with technology are substantial, as improved speech models can facilitate better communication for individuals with speech impairments.
Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.
Primary: University of Edinburgh
All Institutions: University of Edinburgh, Google DeepMind, Meta Superintelligence Labs
The main contribution of this work is the introduction of MELD, a joint optimization framework for speech language modeling that effectively integrates discrete latent variables to enhance TTS and STT performance. This approach represents a significant advancement in the field, addressing key limitations of existing methods and paving the way for future research in multimodal speech processing.
The paper presents a novel approach to speech language modeling by introducing MELD, which integrates discrete latent variables into the autoregressive modeling of mel-spectrograms. This joint optimization of the encoder and autoregressive model addresses limitations of previous two-stage methods, particularly in preserving task-relevant information. The methodology is well-structured, leveraging variational inference to optimize a lower bound on the log likelihood, and effectively incorporates both TTS and STT tasks within a single framework. The use of discrete latent variables to suppress silence generation is a significant innovation, enhancing the model's performance over existing methods.
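For reference, the generic variational lower bound that a joint objective of this kind builds on can be written as follows. This is the textbook form, not necessarily MELD's exact objective. The symbols are notation assumed here for illustration: $x$ is the mel spectrogram, $z$ the discrete latent sequence, $q_\phi$ the encoder, and $p_\theta$ the autoregressive prior and decoder.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)
```

Maximizing the right-hand side over both $\phi$ and $\theta$ is what couples the encoder to the downstream model: the KL term rewards latents that the autoregressive prior can predict, while the reconstruction term keeps them informative about the spectrogram.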
The experiments are comprehensive, utilizing the 960-hour subset of the LibriSpeech dataset for training and evaluation. The authors compare MELD against several baselines, including codec-based models and other mel-spectrogram-based approaches, demonstrating clear improvements in both TTS and STT tasks. The evaluation metrics include both subjective (MOS, speaker similarity) and objective (WER) assessments, providing a well-rounded view of the model's performance. The results indicate that MELD outperforms its competitors, particularly in reducing silence and improving word error rates.
The paper provides detailed implementation specifics, including model architecture, training configurations, and evaluation protocols. However, the authors acknowledge challenges in reproducing results from related work (e.g., MELLE), which may affect the perceived reliability of their comparisons. The use of specific datasets and training strategies is well-documented, but the lack of a public code repository or demo limits reproducibility.
The authors note several limitations, including the difficulty in making fair comparisons between codec-based and mel-spectrogram-based methods due to differences in representation mapping. Additionally, while the joint optimization framework is promising, the paper does not explore its application to other speech tasks beyond TTS and STT. The potential for overfitting or collapsing solutions in the discrete latent space is also mentioned, although not observed in their experiments.
The proposed model has significant implications for real-world applications in speech synthesis and recognition, particularly in enhancing the quality and efficiency of TTS systems. The ability to jointly model TTS and STT tasks could streamline workflows in various applications, such as virtual assistants and automated transcription services. However, ethical considerations regarding the misuse of speech generation technologies, such as voice cloning, must be addressed to ensure responsible use.
AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.
Primary: University of Illinois Urbana-Champaign
All Institutions: University of Illinois Urbana-Champaign, Wonkwang University
The main contribution of this paper is the introduction of a causality-inspired multimodal federated domain generalization framework for respiratory sound classification, which effectively mitigates stethoscope-induced biases and enhances model robustness across heterogeneous devices. The technical contributions are substantial, offering a new lens through which to view the challenges of audio classification in medical contexts, thereby advancing the field significantly.
The proposed methodology introduces a novel federated domain generalization framework specifically tailored for respiratory sound classification, addressing the critical issue of inter-stethoscope variability. The integration of a causality-inspired device style intervention network, counterfactual text augmentation, and gradient alignment represents a significant advancement in the field, as it not only tackles the entanglement of device style and disease content but also enhances the robustness of the model across heterogeneous devices. The approach is well-structured, leveraging causal inference principles to inform data augmentation strategies, which is a fresh perspective in the context of audio classification.
The experimental setup is robust, utilizing two well-defined datasets (ICBHI and SPRSound) and employing leave-one-device-out validation to rigorously assess the model's performance. The results demonstrate that the proposed method consistently outperforms conventional data augmentation and federated learning baselines, indicating its effectiveness in improving cross-device generalization. The ablation studies further substantiate the contributions of each component of the framework, providing clear evidence for the importance of the causality-inspired interventions.
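The leave-one-device-out protocol mentioned above can be sketched as a simple split generator: each device is held out in turn as the unseen test domain while the remaining devices supply training data. This is a minimal sketch, not the authors' pipeline. The function name `leave_one_device_out`, the `(device, sample)` record layout, and the toy device names are all assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_device_out(records):
    """Yield (held_out_device, train_samples, test_samples) splits.

    `records` is a list of (device_name, sample) pairs; both fields are
    hypothetical stand-ins for recordings and their stethoscope labels.
    """
    by_device = defaultdict(list)
    for device, sample in records:
        by_device[device].append(sample)

    for held_out in sorted(by_device):
        # All remaining devices contribute training data.
        train = [s for d in sorted(by_device) if d != held_out
                 for s in by_device[d]]
        # The held-out device is never seen during training.
        test = list(by_device[held_out])
        yield held_out, train, test


# Toy data: device names and sample IDs are made up.
records = [("dev_a", "a1"), ("dev_a", "a2"), ("dev_b", "b1"), ("dev_c", "c1")]
splits = list(leave_one_device_out(records))
```

In a federated setting, each training device would typically map to its own client rather than being pooled; the pooling here only keeps the example short.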
While the paper mentions that code will be released upon publication, the absence of a current project URL limits immediate reproducibility. The methodology is described in sufficient detail to allow for replication, but access to the code and datasets would be essential for full verification of results.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of respiratory sound recordings across different clinical settings. Additionally, the paper acknowledges the need for future work to address privacy concerns and computational efficiency in federated learning settings, which are critical for real-world applications.
The framework has significant potential implications for telemedicine and automated pulmonary disease detection, particularly in enhancing the reliability of AI-driven diagnostics across various healthcare environments. By addressing device-induced biases, the work contributes to the broader goal of equitable healthcare access and improved patient outcomes.
Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.
Primary: Zhejiang University
All Institutions: Zhejiang University
The main contribution of this paper is the introduction of CoRe-KD, a structured complete-view distillation framework that significantly enhances the robustness of conversational multimodal emotion recognition under incomplete observations. The methodology effectively addresses key challenges in the field, and the experimental results validate its effectiveness, marking a meaningful advancement in multimodal learning.
The proposed CoRe-KD framework innovatively addresses the challenges of multimodal emotion recognition (MER) under incomplete observations. It introduces two key components: Complete-view State Anchoring (CSA) and Nonverbal Conflict Exposure (NCE), which enhance the robustness of emotion recognition by aligning incomplete-view predictions with structured references from a complete-view teacher. The methodology is well-structured, leveraging knowledge distillation effectively while avoiding the pitfalls of input reconstruction, which is a common issue in existing methods. The use of Gaussian-inspired states for modality fusion is a notable technical contribution that adds precision to the alignment process.
The experiments are comprehensive, utilizing established datasets (IEMOCAP, MELD, and CMU-MOSEI) to validate the effectiveness of CoRe-KD under both fixed- and random-missing protocols. The results demonstrate consistent improvements in accuracy and F1 scores compared to various baselines, indicating the robustness of the proposed method. The inclusion of ablation studies further strengthens the findings by elucidating the contributions of each component within the framework.
The paper provides detailed implementation specifics, including training protocols, hyperparameters, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results.
One significant limitation is that CoRe-KD requires complete multimodal observations for training the teacher model, which may not be feasible in all real-world scenarios. Additionally, the NCE module relies on controlled conflict views that might not comprehensively cover all possible real-world misalignments or corruptions in multimodal data.
The advancements in robust conversational MER have implications for various applications, including human-computer interaction, sentiment analysis, and affective computing. By improving the reliability of emotion recognition systems in the presence of missing or unreliable modalities, this work could enhance user experience in applications such as virtual assistants, mental health monitoring, and interactive entertainment.
The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality. This fosters the assumption that low-WER tokens inherently preserve the information necessary for intelligible acoustic synthesis. We argue this is fundamentally deceptive. While high-frequency tokens succeed in generation tasks due to implicit information leakage, isolating pure semantic information at ultra-low frame rates strips away the fine-grained articulation and micro-dynamics essential for ODE-based generation. Empirically validating this requires extreme compression without sacrificing WER -- a methodological bottleneck, as standard fixed-stride downsampling arbitrarily truncates phonetic boundaries. To overcome this, we develop a dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, achieving ultra-low frame rates with exceptionally low WER. Using these isolated "pure" semantic tokens, we expose the WER trap: when conditioning generative models -- even with oracle duration alignments -- the reconstructed speech suffers from severe articulation blur and is rendered acoustically unintelligible. Our findings demonstrate that semantic categorization rewarded by low WER is inherently orthogonal to the continuous phonetic trajectories required for synthesis, shattering the illusion of the unified token and advocating for explicitly decoupled speech representations.
Primary: The University of New South Wales
All Institutions: The University of New South Wales, Nanyang Technological University
The paper exposes a fundamental flaw in the assumption that low WER tokens can universally serve both speech understanding and generation. It rigorously demonstrates that while these tokens may excel in comprehension tasks, they fail to preserve the necessary micro-dynamics for intelligible speech synthesis, advocating for decoupled representations in future speech models.
The paper presents a novel dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, addressing the methodological bottleneck of fixed-stride downsampling that corrupts phonetic boundaries. This approach is innovative as it allows for extreme compression while maintaining low WER, enabling a rigorous evaluation of the unified token hypothesis through the Dual-Probing Protocol. The methodology is well-structured, leveraging existing frameworks while introducing significant improvements in tokenization for speech synthesis.
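The contrast between fixed-stride and boundary-aligned downsampling can be made concrete with a toy example. The sketch below is not the paper's tokenizer, which learns its boundaries; it assumes oracle frame-level unit labels are available and only shows why fixed windows cut across segment boundaries. All function names and the label sequence are hypothetical.

```python
def fixed_stride_groups(labels, stride):
    """Group frame indices into consecutive windows of `stride` frames."""
    return [list(range(i, min(i + stride, len(labels))))
            for i in range(0, len(labels), stride)]


def boundary_aligned_groups(labels):
    """Group frame indices so each group covers one run of identical labels."""
    groups = []
    for i, label in enumerate(labels):
        if groups and labels[groups[-1][-1]] == label:
            groups[-1].append(i)   # same unit as the previous frame
        else:
            groups.append([i])     # a new unit starts here
    return groups


def mixed_groups(labels, groups):
    """Count groups that straddle a boundary (contain more than one label)."""
    return sum(1 for g in groups if len({labels[i] for i in g}) > 1)


# Hypothetical frame-level unit labels for a short utterance.
labels = ["h", "h", "h", "e", "e", "l", "l", "l", "l", "o"]
fixed = fixed_stride_groups(labels, stride=4)
aligned = boundary_aligned_groups(labels)
```

With a stride of 4, every fixed window in this toy sequence mixes two units, whereas the boundary-aligned grouping yields four variable-length groups that each contain a single unit.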
The experiments are comprehensive, utilizing large-scale multilingual datasets and employing a dual-probing protocol to assess both discriminative understanding and generative viability. The results demonstrate that while the dynamic tokens achieve high performance in understanding tasks, they fail in generating intelligible speech, effectively illustrating the WER trap. The evaluation metrics, including CER and AVQA accuracy, are appropriate and provide a clear picture of the model's performance.
The paper provides detailed architectural specifications, hyperparameter configurations, and training methodologies, which enhance reproducibility. However, the absence of a public code repository limits the ease with which others can replicate the results. The thoroughness of the experimental setup and the clear delineation of methods contribute positively to reproducibility.
The study acknowledges its limitations, particularly that the generative probe employs a single synthesis paradigm, which may not generalize across different architectures. Additionally, the focus on Mandarin as the sole language for evaluation may restrict the applicability of findings to other languages with different phonetic structures. The paper also notes that while it identifies a critical flaw in the unified token approach, it does not propose a concrete solution for decoupled representations.
The findings have significant implications for the development of speech language models, challenging the prevailing assumption that a single token can suffice for both understanding and generation. This work advocates for a separation of semantic and acoustic representations, which could lead to more effective and intelligible speech synthesis systems. The insights gained from this research could influence future designs in multimodal AI systems, particularly in improving the quality of synthesized speech.
Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.
Primary: Cochin University of Science and Technology (CUSAT)
All Institutions: Cochin University of Science and Technology (CUSAT)
The paper presents CAFNet, a novel architecture for audio deepfake detection that effectively addresses the challenges of ternary classification and temporal localization of half-truth audio. The methodology is sound, and the experimental results demonstrate significant advancements over existing models, particularly in a multilingual context.
The proposed CAFNet architecture is innovative in its approach to jointly address the challenges of ternary classification and temporal boundary localization for half-truth audio deepfake detection. The use of cross-attentive feature fusion and depthwise-separable convolutions enhances the model's ability to process multiple acoustic features effectively. The integration of BiLSTM for boundary prediction is a well-justified choice, given the temporal nature of the task. However, the paper could benefit from a more detailed discussion on the design choices for the architecture and the rationale behind the specific feature sets used.
The experiments are robust, utilizing a comprehensive dataset (MLADDC) that covers a diverse range of languages and audio conditions. The performance metrics reported, including accuracy, AUC, and MAE for boundary localization, are convincing and demonstrate the effectiveness of CAFNet compared to existing models. The cross-dataset generalization study adds significant value, revealing critical insights into the limitations of current training paradigms in deepfake detection.
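For readers less familiar with the binary-detection metric, the Equal Error Rate is the operating point at which the false positive rate and the false negative rate coincide. The sketch below approximates it by scanning thresholds at the observed scores. It is an illustration under stated assumptions, not the authors' evaluation script: the function name `equal_error_rate` is hypothetical, higher scores are assumed to mean "more likely fake", and no interpolation between thresholds is performed.

```python
def equal_error_rate(scores, labels):
    """Approximate EER by scanning thresholds at the observed scores.

    scores: higher means "more likely fake"; labels: 1 = fake, 0 = real.
    Returns the mean of the two error rates at the threshold where they
    are closest to each other.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one sample of each class")

    best_gap, best_eer = None, None
    for t in sorted(set(scores)):
        fpr = sum(s >= t for s in negatives) / len(negatives)  # real flagged as fake
        fnr = sum(s < t for s in positives) / len(positives)   # fake passed as real
        gap = abs(fpr - fnr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fpr + fnr) / 2
    return best_eer


# Toy scores: one real sample (0.4) outranks one fake sample (0.35).
eer = equal_error_rate([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

On small test sets the two error curves may never cross exactly, which is why the sketch reports the midpoint at the closest threshold rather than an exact crossing.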
The authors provide sufficient details regarding the implementation, including hyperparameters, training protocols, and the architecture of CAFNet. The availability of code and trained models on GitHub enhances reproducibility. However, the paper lacks detailed information on the specific datasets used for training and evaluation, which could hinder full reproducibility.
One notable limitation is confusion between the half-truth and real classes: a significant number of half-truth samples are misclassified as real. This indicates that while the model reliably flags fully fake audio, it struggles to separate partially manipulated speech from genuine speech, which is crucial for practical applications. Additionally, the study highlights the challenge of catastrophic forgetting during domain adaptation, suggesting that the current approach may not be robust across different datasets.
The findings of this research have significant implications for audio forensics and the detection of manipulated media, especially in contexts where misinformation can have serious consequences. The ability to localize manipulations within audio clips enhances the forensic value of detection systems, making them more actionable for users. As deepfake technology continues to evolve, advancements in detection methods like CAFNet will be critical in maintaining trust in audio communications.
Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Xiaomi Inc.
Dasheng AudioGen represents a substantial advancement in unified audio generation, combining multiple audio types into coherent scenes from textual descriptions. The innovative methodology and comprehensive evaluation contribute significantly to the field, setting a new standard for future research in audio generation.
The paper introduces a novel framework, Dasheng AudioGen, which effectively integrates multiple audio generation tasks into a single model using structured multi-view captions and a unified semantic-acoustic representation. This approach addresses the fragmentation in audio generation by allowing for coherent mixed-audio scene generation from text, which is a significant advancement in the field. The methodology is well-structured, leveraging a flow-matching DiT architecture and a unique conditioning framework that enhances control over audio components. The use of high-dimensional latent spaces for audio representation is particularly innovative, as it allows for better modeling of overlapping audio elements.
The experiments conducted are comprehensive, utilizing a large-scale dataset (ACAVCaps) and a robust evaluation pipeline that includes both objective and subjective metrics. The results demonstrate that Dasheng AudioGen outperforms existing specialized models in mixed-audio generation while maintaining competitive performance in single-type tasks. The introduction of the MECAT benchmark for mixed-audio evaluation is a valuable contribution, providing a new standard for assessing model performance in this area.
The paper mentions limitations in reproducibility due to reliance on a private dataset, which may hinder others from replicating the results. However, the detailed methodology and experimental setup provide a clear path for future researchers to build upon this work. The authors should consider releasing their dataset or providing a public version to enhance reproducibility.
Key limitations include the model's restriction to generating 10-second audio clips and the lack of advanced speaker control in TTS applications. Additionally, the performance in terms of speech intelligibility lags behind specialized TTS systems, indicating room for improvement. The reliance on a private dataset also poses challenges for reproducibility and broader accessibility.
The implications of this work are significant, as it paves the way for more integrated audio generation systems that can produce realistic and contextually coherent audio scenes. This could have applications in various fields, including film production, gaming, virtual reality, and assistive technologies. The ability to generate complex audio scenes from simple text prompts could also enhance user experiences in interactive media.
While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates: E2E inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck. LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.
Primary: Future Living Lab, Alibaba
All Institutions: Future Living Lab, Alibaba
The paper presents VoiceGiraffe, a pioneering benchmark for evaluating hour-scale audio understanding in LALMs, addressing critical gaps in existing evaluation protocols. The comprehensive methodology and experimental results underscore the pressing need for advancements in long-context audio processing and reasoning, positioning this work as a significant contribution to the field.
The paper introduces a novel benchmark, VoiceGiraffe, designed specifically for evaluating long-context audio-language models (LALMs) in realistic scenarios. The methodology is robust, employing a dual-level taxonomy for question generation that captures both single-hop and multi-hop reasoning tasks. The data curation process is thorough, involving a multi-stage pipeline that includes voice activity detection, hierarchical captioning, and collaborative verification by human annotators. This rigorous approach ensures high-quality data for evaluation, addressing the limitations of existing benchmarks that rely on short clips or concatenated segments.
The experimental evaluation is comprehensive, benchmarking a wide range of LALMs against human performance across various tasks and inference paradigms. The results reveal significant challenges in long-context understanding, with only one proprietary model surpassing human performance. The findings highlight the limitations of current models in memory persistence and reasoning capabilities, providing valuable insights into areas for future research. The use of multiple inference settings (E2E, cascaded caption aggregation, and reasoning-enhanced cascading) allows for a nuanced understanding of model performance.
While the paper outlines a detailed methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the study or build upon the findings.
The primary limitations include the lack of a publicly available dataset or benchmark for other researchers to use, which could hinder wider adoption and validation of the proposed methods. Additionally, the paper acknowledges that even human annotators found the tasks challenging, indicating that the benchmark may be too difficult for current models. There is also a potential bias in language performance, as the models exhibited varying capabilities across English and Chinese inputs.
The introduction of VoiceGiraffe has the potential to significantly advance the field of audio-language understanding by providing a rigorous evaluation framework that addresses real-world challenges. This benchmark can guide future research towards developing models with improved long-context reasoning and memory capabilities, which are essential for applications in audio assistants, automated transcription, and multimedia content analysis.
Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, failing to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.
Primary: Zhejiang University
All Institutions: Zhejiang University, Bytedance
The paper presents a comprehensive benchmarking framework for long-form speech generation, addressing critical gaps in existing evaluation methodologies. Its innovative approach, rigorous methodology, and extensive experimental validation contribute significantly to the advancement of the field, providing a valuable resource for future research.
The paper introduces SwanBench-Speech, a comprehensive benchmark for evaluating long-form speech generation models. It effectively addresses the limitations of existing evaluation methods by proposing a multi-dimensional framework that includes seven disentangled metrics across three core challenges: acoustics, semantics, and expressiveness. The methodology is well-structured, with a clear focus on real-world applications and the incorporation of human-aligned metrics, which enhances the relevance of the evaluation. The use of diverse scenarios and a rigorous data collection process further strengthens the methodology.
The experiments are extensive, involving over 20 models evaluated across 1,101 samples in 17 scenarios. The results provide valuable insights into the performance gaps of current models compared to human recordings, particularly in expressiveness and consistency. The use of both objective metrics and human evaluations adds robustness to the findings. However, while the experiments are thorough, the paper could benefit from more detailed statistical analyses to quantify the significance of the results.
The paper provides a clear description of the data collection and evaluation processes, along with the metrics used. The open-sourcing of the benchmark and the availability of evaluation scripts enhance reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other systems.
The study acknowledges limitations, including a narrow linguistic scope (only Chinese and English) and a lack of robustness in assessing emotional and stylistic transitions. Additionally, the dataset's speaker diversity is limited, which may introduce bias in evaluations. Future work should address these gaps to enhance the benchmark's applicability.
This work has significant implications for the field of speech synthesis, particularly in enhancing the evaluation of long-form speech generation systems. By establishing a standardized benchmark, it paves the way for future research and development in this area, potentially leading to more immersive and expressive speech synthesis applications. The focus on real-world scenarios and human-aligned metrics also suggests potential applications in education, entertainment, and customer service.
Predicting spatially varying Room Impulse Response (RIR) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EIGENET, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. In addition, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EIGENET achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.
Primary: University of Pennsylvania
All Institutions: University of Pennsylvania, The Chinese University of Hong Kong
This paper presents EigeNet, a novel geometry-informed multi-modal learning framework that significantly advances few-shot novel view RIR prediction through innovative architectural designs and empirical validation. The comprehensive approach to integrating geometric features with acoustic modeling represents a meaningful contribution to the field of spatial audio rendering.
The proposed methodology introduces a Cross-view Alternate-attention Transformer (CVAT) that effectively captures both local intra-view and global cross-view relationships, addressing the challenges of few-shot Room Impulse Response (RIR) prediction. The integration of a geometry-informed modulation block enhances the model's ability to leverage geometric features, which is a significant advancement over existing methods. The auxiliary loss for multi-task learning further strengthens the model's performance by promoting generalizability across different architectures.
The experiments are robust, utilizing both simulated and real-world datasets, and demonstrate state-of-the-art performance across various metrics. The ablation studies provide clear evidence of the contributions of each component, validating the effectiveness of the proposed architecture. The quantitative results indicate substantial improvements over baseline methods, particularly in sparse reference scenarios.
The paper provides sufficient implementation details, including architecture specifications and training configurations, which should facilitate reproducibility. The availability of code and checkpoints on GitHub enhances this aspect, although specific hyperparameters and training procedures could be elaborated further for clarity.
While the model shows impressive performance, it may still be limited by the quality of the input data and the assumptions made regarding room geometry. The reliance on geometric features may not generalize well to all acoustic environments, particularly those with complex or unconventional geometries.
The advancements in few-shot learning for RIR prediction have significant implications for immersive audio applications in AR/VR and spatial audio rendering, potentially enhancing user experiences in virtual environments. The methodology could inspire further research into integrating geometric and acoustic modeling in other domains.
Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.
Primary: Shenzhen International Graduate School, Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, ModelBest Inc.
The paper presents LoSATok, a unified low-dimensional tokenizer that enhances audio understanding and generation by effectively compressing high-dimensional semantic representations while preserving essential acoustic details. The methodology and results demonstrate its potential to significantly impact the field of audio processing and generation.
The paper introduces a novel low-dimensional audio tokenizer, LoSATok, which effectively compresses high-dimensional semantic representations while maintaining semantic richness and acoustic details. The methodology includes the Semantic Bottleneck (SemBo) for dimensionality reduction, and a dual-level semantic supervision strategy that enhances the learning process. The proposed time-relation loss is a significant innovation that ensures temporal consistency in the representations. Overall, the methodology is well-structured and addresses a critical gap in current audio modeling approaches.
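To make the bottleneck and the temporal-consistency idea concrete, here is a minimal sketch. It is not the authors' implementation: the exact form of the time-relation loss is not given in this summary, so the cosine-similarity formulation, the function names (`project`, `time_relation_loss`), and the plain linear projection are all assumptions chosen for illustration. In the paper's setting the input width would be 1280 and the output width 128.

```python
import math


def project(frames, weights):
    """Linear bottleneck (assumed form): map each frame of width D_in to width D_out.

    `weights` holds D_out rows, each of length D_in.
    """
    return [[sum(x * w for x, w in zip(frame, row)) for row in weights]
            for frame in frames]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def relation_matrix(frames):
    """T x T matrix of frame-to-frame similarities across time."""
    return [[cosine(a, b) for b in frames] for a in frames]


def time_relation_loss(high_dim_frames, low_dim_frames):
    """Mean squared gap between the two relation matrices (hypothetical loss).

    A value near zero means the compressed frames relate to one another
    over time in the same way the original frames do.
    """
    rel_high = relation_matrix(high_dim_frames)
    rel_low = relation_matrix(low_dim_frames)
    t = len(high_dim_frames)
    return sum((rel_high[i][j] - rel_low[i][j]) ** 2
               for i in range(t) for j in range(t)) / (t * t)
```

The loss compares relations between frames rather than the frames themselves, which is why it can be evaluated even though the two feature spaces have different widths.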
The experiments are comprehensive, covering various audio tasks across speech, music, and general audio domains. The results demonstrate that LoSATok achieves competitive performance in understanding tasks and outperforms existing models in generation tasks, particularly in terms of efficiency and quality. The use of objective metrics (e.g., FAD, CLAP) alongside subjective evaluations strengthens the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art methods in a broader range of tasks.
The paper provides a GitHub repository with the code, which is essential for reproducibility. However, specific implementation details, such as hyperparameter choices and training setups, could be more clearly outlined to facilitate replication by other researchers.
The authors acknowledge that LoSATok sacrifices some reconstruction fidelity for improved semantic organization and generative performance. Additionally, while it shows promise in understanding tasks, it does not fully reach the performance of high-dimensional semantic representations. Future work is needed to optimize the balance between semantics, acoustics, and generation.
The proposed tokenizer has significant implications for audio understanding and generation, potentially enhancing applications in speech recognition, music generation, and audio synthesis. By enabling more efficient models, it could lead to advancements in real-time audio processing and interactive applications. The research also opens avenues for further exploration of low-dimensional representations in multimodal contexts.
Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
Primary: Renmin University of China
All Institutions: Renmin University of China
The main contribution of this paper is the introduction of PlanAudio, a unified framework for generating complex audio compositions from free-form text prompts, which significantly advances the state-of-the-art in audio synthesis by integrating semantic understanding with acoustic generation. The methodology is innovative, the experiments are rigorous, and the potential applications are broad, marking a meaningful contribution to the field of machine learning and audio generation.
The proposed methodology, PlanAudio, introduces a novel framework for generating unified audio from free-form text prompts, leveraging an autoregressive LLM architecture and a semantic latent Chain-of-Thought (CoT) mechanism. This approach is innovative as it avoids traditional text encoders and explicit text rewriting, which are common in existing models. The integration of semantic planning in the latent space before audio synthesis is a significant advancement, allowing for better alignment between high-level semantics and low-level audio generation. The methodology is well-structured, with clear phases for semantic planning and acoustic generation, which enhances the model's ability to produce coherent audio outputs.
The experiments are comprehensive, evaluating PlanAudio across multiple scenarios (sound, speech, and composite) using both objective metrics (FAD, KL divergence, WER) and subjective assessments (human ratings on acoustic quality, temporal correctness, etc.). The results demonstrate that PlanAudio outperforms existing pipeline and unified models, showcasing its versatility and effectiveness. The creation of PlanAudio-Bench as a specialized benchmark for composite audio scenarios adds value to the evaluation process, providing a structured way to assess the model's performance in real-world applications.
The paper provides detailed implementation details, including the datasets used, training procedures, and evaluation metrics. However, the lack of a publicly available demo or project URL limits the reproducibility of the results. While the methodology is clearly described, access to the code and trained models would enhance the ability of other researchers to replicate the findings.
One limitation is the potential for the model to struggle with highly complex prompts that require intricate audio interactions, as indicated by the slight performance drop in speech generation compared to specialized models. Additionally, the reliance on the quality of the training data and the inherent challenges in synthesizing audio from free-form text prompts may introduce variability in performance across different contexts.
The implications of this research are significant for various applications, including content creation, game development, and assistive technologies for individuals with speech impairments. By enabling the generation of coherent audio from natural language prompts, this work could facilitate new forms of human-computer interaction and enhance multimedia experiences.
We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
Primary: Daydream
All Institutions: Daydream
The main contribution of this paper is the introduction of DEMON, a real-time diffusion engine that allows for interactive control of audio generation, significantly enhancing the responsiveness and flexibility of music production tools. The technical contributions are robust, addressing key challenges in real-time audio processing and demonstrating a clear advancement in the field of machine learning for audio.
The methodology presented in the paper is innovative, leveraging a real-time diffusion engine that transforms the denoising process into a playable musical instrument. The authors introduce several mechanisms that enhance the responsiveness and control of audio generation, including per-slot heterogeneous denoise scheduling, shared mutable per-step state, per-frame source blending, and a windowed VAE decode. These contributions are well-structured and address significant challenges in real-time audio generation, particularly in maintaining high throughput while allowing for fine-grained control over audio parameters.
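The per-slot scheduling idea can be illustrated with a toy ring buffer. This is a simplified sketch, not DEMON's code: the class names (`Slot`, `Ring`), the linear timestep schedule, and the one-admission-per-tick policy are hypothetical, and no actual denoising is performed. What it shows is the property the paper describes: each slot keeps the schedule it was admitted with, so moving the denoise slider affects only newly admitted requests and nothing in flight is discarded.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    request_id: int
    schedule: list  # this slot's own remaining timesteps


class Ring:
    """Toy ring buffer in which every slot owns its timestep schedule."""

    def __init__(self, depth, steps):
        self.depth = depth    # maximum number of in-flight requests
        self.steps = steps    # solver steps per request
        self.slots = []
        self.next_id = 0

    def make_schedule(self, denoise):
        # Assumed linear schedule from `denoise` down toward zero.
        return [denoise * (self.steps - k) / self.steps
                for k in range(self.steps)]

    def tick(self, denoise):
        """Advance every slot by one step; return ids that finished this tick."""
        # Admit one new request per tick, using the slider value as of now.
        if len(self.slots) < self.depth:
            self.slots.append(Slot(self.next_id, self.make_schedule(denoise)))
            self.next_id += 1
        # Each slot consumes the next timestep of its own schedule.
        for slot in self.slots:
            slot.schedule.pop(0)
        finished = [s.request_id for s in self.slots if not s.schedule]
        self.slots = [s for s in self.slots if s.schedule]
        return finished
```

In a global-schedule design, by contrast, a slider change would invalidate every in-flight slot, which corresponds to the rebuild-and-discard behaviour the abstract attributes to the upstream architecture.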
The experimental evaluation is thorough, with a focus on latency, output quality, and responsiveness of parameter changes. The authors provide empirical results that substantiate their claims regarding the effectiveness of their proposed mechanisms, including quantitative comparisons with existing systems. The use of various audio sources and the detailed reporting of metrics such as CLAP and SNR demonstrate a rigorous approach to validating the system's performance.
The paper includes sufficient detail regarding the architecture and implementation of the DEMON system, including the use of TensorRT for acceleration and the specific configurations used for experiments. However, the absence of a detailed description of the datasets and the evaluation metrics used may pose challenges for complete reproducibility. The provided URLs for the project and demo enhance accessibility to the code and results.
One limitation of the paper is the reliance on a specific hardware setup (NVIDIA RTX 5090) for performance metrics, which may not generalize across different systems. Additionally, while the authors address the latency of their system, the practical implications of the onset latency in live performance contexts could be further explored. The paper does not discuss potential limitations in the quality of audio generated under varying conditions or the scalability of the system.
The work has significant implications for the fields of music generation and real-time audio processing, particularly for live performances. By enabling musicians to manipulate denoising parameters in real-time, DEMON opens up new avenues for creative expression and interaction with AI-generated music. The integration of machine learning into musical instruments could lead to innovative performance practices and new genres of music.
Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the Audio-Mind framework, which enhances audio understanding through dynamic evidence acquisition and improved reasoning processes. This work is significant as it addresses key challenges in the field and proposes a method that could lead to more reliable audio question answering systems.
The proposed Audio-Mind framework introduces a novel approach to audio understanding by integrating a strong frontend with planner-guided tool use. This method allows for dynamic evidence acquisition, which is a significant improvement over existing audio-agent baselines. The framework's ability to preserve the frontend's judgment while addressing evidence gaps is a noteworthy contribution to the field, as it enhances the overall reasoning process in audio question answering.
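The control flow of conditional evidence acquisition might look roughly like the sketch below. It is an illustration under stated assumptions, not the Audio-Mind implementation: the paper uses a planner to decide when evidence is missing, whereas this sketch substitutes a simple confidence threshold, and the function name, the `(answer, confidence)` frontend signature, and the `max_calls` budget are all hypothetical.

```python
def answer_with_conditional_evidence(question, frontend, tools,
                                     threshold=0.7, max_calls=2):
    """Keep the frontend's answer when it is confident; otherwise gather
    a bounded amount of tool evidence and ask the frontend again.

    frontend: callable (question, evidence_list) -> (answer, confidence)
    tools:    list of (name, callable(question) -> evidence) pairs
    Returns the final answer and an auditable trace of each decision.
    """
    evidence = []
    answer, confidence = frontend(question, evidence)
    trace = [("frontend", answer, confidence)]

    for name, tool in tools[:max_calls]:
        if confidence >= threshold:
            break  # initial evidence is sufficient; preserve the judgment
        evidence.append(tool(question))
        answer, confidence = frontend(question, evidence)
        trace.append((name, answer, confidence))

    return answer, trace
```

The returned trace records which tools were consulted and how the confidence moved, which is the kind of audit trail the review credits the framework with producing.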
The experiments conducted on MMAR and MSU-Bench demonstrate the effectiveness of Audio-Mind, achieving impressive accuracy scores of 80.4% and 82.8%, respectively. The matched-backbone comparison further validates the framework's design by highlighting the orchestration bottleneck in agentic decomposition under strong audio frontends. However, the paper lacks detailed descriptions of the datasets and evaluation metrics used, which could enhance the transparency and reproducibility of the results.
The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Without access to the framework or clear guidelines on how to replicate the experiments, it is challenging for other researchers to validate the findings.
One limitation is the potential complexity introduced by the planner-guided tool use, which may not generalize well to all audio understanding tasks. Additionally, the framework's reliance on strong frontends could limit its applicability in scenarios where such models are not available.
The Audio-Mind framework has the potential to significantly impact the field of audio understanding and question answering by providing a more reliable and auditable reasoning process. Its contributions could lead to advancements in audio-QA annotation and error analysis, making it a valuable tool for researchers and practitioners in the domain.
Audio large language models (Audio LLMs) demonstrate strong performance on speech understanding tasks, yet their ability to understand paralinguistic information remains limited. To systematically quantify this issue, we introduce VoxParadox, an adversarial benchmark with 2,000 verified examples, spanning 10 paralinguistic tasks, created with controlled speech synthesis to intentionally mismatch transcript claims and speaking style, enabling direct measurement of speech paralinguistic understanding. Evaluation of a diverse set of Audio LLMs reveals consistently low accuracy on acoustic ground truth and a strong tendency to follow language-implied (incorrect) answers. To understand the cause of this gap, we perform layer-wise probing and find that (i) paralinguistic cues can degrade in deeper encoder layers and at the encoder--LLM interface, and (ii) even when such cues are available in audio tokens, the language model frequently ignores them. To address these problems, we propose Prompt-Conditioned Layer Mixer (PCLM), which adaptively combines information from multiple audio layers based on the input prompt, and pair it with Direct Preference Optimization (DPO) to explicitly prefer acoustically supported options over language-implied alternatives. These methods substantially improve Audio LLM paralinguistic understanding, improving Audio Flamingo 3 from 17.40% to 65.20% on VoxParadox, and from 37.74% to 54.78% on MMSU paralinguistic subset. Our project page is available at https://voxparadox.github.io/.
Primary: University of Southern California
All Institutions: University of Southern California
The main contribution of this paper is the introduction of VoxParadox, a benchmark that effectively isolates and evaluates the paralinguistic understanding of Audio LLMs, alongside innovative methods to enhance model performance in this domain. The work is significant as it addresses a critical gap in the capabilities of current Audio LLMs and proposes actionable solutions that could lead to more robust multimodal systems.
The paper introduces VoxParadox, a novel adversarial benchmark designed to evaluate the paralinguistic understanding of Audio LLMs by creating controlled linguistic-acoustic contradictions. The methodology is robust, employing a systematic approach to generate adversarial examples and utilizing layer-wise probing to diagnose model limitations. The proposed Prompt-Conditioned Layer Mixer (PCLM) is a significant innovation that adaptively combines information from multiple audio layers based on the input prompt, addressing identified bottlenecks in model performance.
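A prompt-conditioned mix over encoder layers can be sketched as a softmax-weighted sum. This is only an assumed form: the paper's actual PCLM architecture is not detailed in this summary, and the gating matrix `gate`, the function names, and the use of a single feature vector per layer are hypothetical simplifications.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mix_layers(layer_feats, prompt_emb, gate):
    """Combine per-layer audio features with prompt-dependent weights.

    layer_feats: L feature vectors (one per encoder layer), each of width D
    prompt_emb:  prompt embedding of width P
    gate:        L rows of width P (assumed learnable gating matrix)
    Returns the mixed feature vector and the layer weights.
    """
    # One logit per layer, computed from the prompt alone.
    logits = [sum(g * p for g, p in zip(row, prompt_emb)) for row in gate]
    weights = softmax(logits)
    width = len(layer_feats[0])
    mixed = [sum(weights[l] * layer_feats[l][i] for l in range(len(layer_feats)))
             for i in range(width)]
    return mixed, weights
```

Because the weights depend on the prompt, a question about speaking style could in principle favour shallower layers where, per the probing results, paralinguistic cues are better preserved.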
The experiments are comprehensive, evaluating a diverse set of Audio LLMs against the VoxParadox benchmark. The results demonstrate a clear performance gap in paralinguistic tasks, with models showing a tendency to rely on transcript-implied answers rather than acoustic evidence. The paper provides detailed metrics, including ground truth accuracy and adversarial-label agreement, which effectively illustrate the models' weaknesses and the improvements achieved through the proposed methods.
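The two metrics named above can be computed as follows. This is a sketch under the assumption that both are simple match rates over the benchmark items; the function name and argument names are hypothetical, and the paper's exact definitions may differ.

```python
def paralinguistic_scores(predictions, acoustic_labels, transcript_labels):
    """Return (ground-truth accuracy, adversarial-label agreement).

    acoustic_labels:   what the audio actually conveys (ground truth)
    transcript_labels: what the transcript text implies (adversarial label)
    """
    n = len(predictions)
    if n == 0:
        raise ValueError("need at least one prediction")
    accuracy = sum(p == a for p, a in zip(predictions, acoustic_labels)) / n
    agreement = sum(p == t for p, t in zip(predictions, transcript_labels)) / n
    return accuracy, agreement
```

Low accuracy combined with high agreement would indicate that a model is following the transcript rather than the acoustics, which is the failure pattern the benchmark is built to expose.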
The paper includes sufficient detail regarding the experimental setup, data generation pipeline, and evaluation metrics, which supports reproducibility. However, the implementation specifics of the PCLM and DPO methods could benefit from additional clarity to ensure that other researchers can replicate the results accurately.
The authors acknowledge that PCLM is a post-hoc solution and that the degradation of paralinguistic information in deeper layers and at the encoder-LLM interface presents inherent limitations. Additionally, while VoxParadox serves as a controlled stress test, it may not fully capture the complexities of naturalistic speech scenarios. The reliance on TTS-generated audio also raises questions about the generalizability of the findings.
The research has significant implications for improving speech-based interfaces and accessibility technologies, enhancing the ability of Audio LLMs to interpret non-verbal cues accurately. However, the potential for misuse in profiling and surveillance contexts necessitates careful consideration of ethical implications and the establishment of safeguards in deployment.
High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neural speech codec that operates entirely in the modified discrete cosine transform (MDCT) domain. CFMDCTCodec integrates a lightweight encoder-quantizer-decoder-style MDCT-spectral codec with a noise-prior-aware, conditional-flow-matching (CFM)-based MDCT-spectral enhancer. Within this framework, the codec serves as a base module that compactly discretizes the MDCT spectrum extracted from speech and produces an initial coarse reconstruction, while the enhancer further restores fine-grained spectral details. The enhancer improves the decoded MDCT spectrum by integrating a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver, under the guidance of an MDCT-derived magnitude-adaptive noise prior, aiming to emphasize perceptually significant high-energy regions while stabilizing low-energy and silent regions. Finally, the enhanced MDCT spectrum is reconstructed into the decoded speech using the inverse MDCT. When optimizing CFMDCTCodec, we adopt a unified non-adversarial training strategy that jointly combines reconstruction, quantization and CFM objectives. Both objective and subjective evaluations show that CFMDCTCodec outperforms competitive baselines in low-bitrate regimes, e.g., 0.65 kbps, while approaching the perceptual quality of large-scale codecs with significantly fewer parameters and computations.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Tsinghua University
The main contribution of this paper is the development of CFMDCTCodec, a low-bitrate neural speech codec that effectively enhances spectral quality through a novel conditional flow matching approach, demonstrating significant improvements in speech quality while maintaining low computational complexity. This work represents a meaningful advancement in the field of speech coding, particularly for applications requiring efficient bandwidth usage without compromising audio fidelity.
The proposed CFMDCTCodec introduces a novel architecture for low-bitrate speech coding that operates entirely in the MDCT domain, integrating a single-codebook quantization strategy with a noise-prior-aware conditional flow matching (CFM) enhancement mechanism. This approach effectively addresses the limitations of existing codecs by enhancing the spectral quality of decoded speech without increasing bitrate, utilizing a joint training strategy that simplifies the learning process. The methodology is well-structured, with clear descriptions of the encoder, decoder, and enhancer components, and the use of ordinary differential equations (ODE) for state evolution is particularly innovative.
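To make the role of the ODE solver concrete, the sketch below integrates a velocity field from t = 0 to t = 1 with fixed-step Euler updates, which is the generic sampling loop used with flow matching. It is an illustration under stated assumptions, not the codec's enhancer: the velocity field here is a toy oracle that points straight at a known target, whereas the paper's field is a learned network conditioned on the coarse decoded MDCT spectrum. The names `euler_integrate` and `make_toy_velocity` are hypothetical.

```python
def euler_integrate(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Toy stand-in for a learned field: straight-line transport toward `target`.

    The field is scaled so that the state arrives at `target` exactly at t = 1.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity
```

Starting from a noise sample `x0`, `euler_integrate(make_toy_velocity(target), x0)` lands on `target` up to floating-point error. In a real system the number of Euler steps trades reconstruction quality against decoding cost.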
The experimental setup is robust, utilizing two different speech corpora and multiple bitrate settings to evaluate the codec's performance. The paper provides both objective and subjective evaluation metrics, including MUSHRA tests and various objective measures (STOI, SI-SDR, etc.), which demonstrate the codec's superiority over competitive baselines. The results indicate significant improvements in speech quality at low bitrates, validating the effectiveness of the proposed enhancements.
The paper includes detailed descriptions of the experimental setup, including hyperparameters, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers.
One limitation is the reliance on a single-codebook quantization strategy, which may not capture the full diversity of speech signals as effectively as multi-codebook approaches. Additionally, while the results are promising, further testing across a wider range of speech datasets and real-world scenarios would strengthen the findings.
The CFMDCTCodec has significant potential applications in bandwidth-constrained environments such as satellite communications, teleconferencing, and mobile applications, where high-quality speech transmission is critical. Its lightweight design and efficient processing could facilitate broader adoption in various speech processing applications, contributing to advancements in telecommunications and accessibility technologies.
We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
Primary: CUHK MMLab
All Institutions: CUHK MMLab, SJTU, NTU, McMaster, CityUHK, JUFE
The paper presents OmniInteract, a benchmark for evaluating omnimodal large language models in real-time audio-visual interactions, significantly advancing the assessment of AI capabilities in dynamic environments. The innovative methodology and comprehensive experimental evaluations highlight critical gaps in current models, paving the way for future research and development in this area.
The methodology introduces a novel interaction slot formulation that captures real-time, multimodal interactions in a continuous audio-visual stream. This approach is innovative as it shifts the evaluation paradigm from static question-answer pairs to dynamic, temporally grounded interactions, allowing for a more realistic assessment of model capabilities in real-time settings. The proposed metrics (IA-QTF1, IDS, NCCS) are well-defined and tailored to the unique challenges of streaming interactions, effectively measuring not just correctness but also timing and context management.
The experiments are comprehensive, evaluating multiple state-of-the-art omnimodal models under the new benchmark. The results reveal significant gaps in current models' abilities to handle real-time interactions, particularly in continuous task monitoring and nested query scenarios. The use of a diverse dataset of 250 videos with 1,430 response slots provides a solid foundation for the evaluations, although the performance scores indicate that there is considerable room for improvement in the models tested.
The paper mentions that the code and datasets will be made publicly accessible, which is crucial for reproducibility. However, details on the exact implementation of the models tested and the specific evaluation protocols could be elaborated upon to enhance reproducibility further.
The paper acknowledges limitations such as the narrow focus on specific interaction types and the reliance on synthesized speech for the 1QnA split. Additionally, the benchmark currently covers only Chinese and English scenarios, which may limit its applicability across different languages and cultures. The analysis is also limited to a small number of models, which may not represent the full landscape of omnimodal systems.
The introduction of OmniInteract has the potential to significantly advance the field of real-time human-AI interaction by providing a standardized benchmark for evaluating omnimodal models. This can lead to improved AI assistants that are more capable of understanding and responding to user queries in real-time, enhancing applications in accessibility, education, and everyday tasks. The focus on real-time interaction also raises important considerations regarding privacy and the ethical deployment of always-on systems.
Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
Primary: Amap, Alibaba Group
All Institutions: Amap, Alibaba Group, The Chinese University of Hong Kong, Shenzhen
The main contribution of this paper is the introduction of PilotTTS, a lightweight and competitive TTS system that leverages rigorous data engineering and a disciplined modular architecture to achieve state-of-the-art performance with significantly less training data than existing systems. This work is significant as it addresses the barriers faced by resource-constrained teams in the field of speech synthesis, providing a practical solution that maintains high performance while promoting reproducibility and accessibility.
The methodology is robust, featuring a well-structured multi-stage data processing pipeline that enhances data quality and a compact autoregressive architecture that effectively decouples speaker identity from style. The use of Q-Former-based conditioning and cross-sample paired training is innovative and addresses common challenges in TTS systems.
The experiments are comprehensive, utilizing the Seed-TTS Eval benchmark to demonstrate superior performance in terms of WER, CER, and speaker similarity. The inclusion of human evaluations for emotion control and paralinguistic synthesis adds depth to the assessment of the system's capabilities.
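For readers unfamiliar with the intelligibility metrics cited here, the following sketch shows how WER and CER are conventionally computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words or characters respectively. It is a minimal illustration rather than the Seed-TTS Eval scoring script, which also involves an ASR model and text normalization that are omitted here. Function names are hypothetical, and dropping spaces for CER is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words.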
The paper emphasizes reproducibility by providing a complete data processing pipeline built from publicly available tools, along with pretrained weights and code. This transparency enhances the likelihood of other researchers replicating the results.
The paper acknowledges limitations such as insufficient explicit style modeling and the constraints of single-codebook quantization, which may hinder performance in more complex scenarios. Additionally, the reliance on mel-spectrograms could introduce reconstruction artifacts.
The potential applications of PilotTTS are significant, particularly for resource-constrained teams seeking to develop competitive TTS systems. Its modular approach and open-source nature could democratize access to high-quality speech synthesis technology.
Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.
Primary: The University of Melbourne
All Institutions: The University of Melbourne, The University of Auckland, UNSW Sydney, KAIST
The paper provides a systematic investigation into the mechanisms underlying acoustic memory in long-context audio-language models, revealing critical insights into representational drift and attention dynamics that can inform future research and model design.
The methodology is robust, introducing the EnvMem framework to systematically analyze the retention of acoustic information in multi-turn interactions. The authors employ a combination of controlled experiments, linear probing, and attention analysis to dissect the representation and retrieval mechanisms in LALMs. The use of synthetic dialogues and a clear structure for the evaluation tasks enhances the clarity of the experimental design. However, the reliance on synthetic data may limit the generalizability of the findings to real-world scenarios.
The experiments are comprehensive, evaluating multiple LALMs across various context lengths. The results demonstrate a clear performance gap between semantic and acoustic memory, with detailed analyses of representational drift and attention allocation. The use of metrics like accuracy and relative degradation provides a solid basis for comparison, although the paper could benefit from additional qualitative assessments of model outputs.
The paper provides detailed descriptions of the experimental setup, including dataset construction and evaluation protocols. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the EnvMem benchmark and associated models to facilitate further research in this area.
The primary limitation is the use of synthetic data, which may not capture the complexities of natural conversations. Additionally, the interventions are post-hoc and may not translate to practical solutions for improving acoustic memory in deployed models. The study also acknowledges potential ethical concerns regarding privacy and surveillance in real-world applications.
This research has significant implications for the development of more robust audio language models, particularly in applications requiring persistent awareness of environmental sounds. By highlighting the representational bottlenecks in LALMs, the findings can guide future training strategies and benchmark designs, ultimately improving the integration of acoustic memory in multimodal systems.
Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we adopt a novel training strategy that combines conditional audio generation with source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.
Primary: University
All Institutions: Company, Department of Computer Science, International Laboratories, University
The main contribution of this paper is the introduction of MERIT, a framework that effectively disentangles musical dimensions for improved audio similarity assessment. This work significantly advances the state of music representation learning by providing a novel approach that enhances interpretability and user control in music similarity queries.
The methodology presented in MERIT is innovative, focusing on disentangled representations of music based on melody, rhythm, and timbre. The use of a frozen MERT backbone combined with a novel triplet construction strategy allows for effective training on isolated musical dimensions without manual labeling. The approach of leveraging generative models for creating training data is particularly noteworthy, as it addresses the challenge of entangled real-world audio data. The Circle Loss optimization technique further enhances the training process by focusing on hard negatives, which is a sound choice for improving representation quality.
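The emphasis on hard negatives follows from how Circle Loss weights each similarity score. The sketch below implements the standard pair-wise form of the loss for one anchor, using the usual definitions of the optima and margins in terms of a single relaxation factor. The hyperparameter values (`margin=0.25`, `gamma=32.0`) are illustrative assumptions, not MERIT's settings, and the code omits the numerical stabilization a training implementation would need.

```python
import math


def circle_loss(pos_sims, neg_sims, margin=0.25, gamma=32.0):
    """Circle Loss for one anchor, given similarities to positives and negatives.

    Each score is re-weighted by its distance from its optimum, so negatives
    that are already very similar to the anchor (hard negatives) and positives
    that are still dissimilar dominate the loss.
    """
    optimum_pos, optimum_neg = 1.0 + margin, -margin
    delta_pos, delta_neg = 1.0 - margin, margin
    pos_terms = [
        math.exp(-gamma * max(0.0, optimum_pos - s) * (s - delta_pos))
        for s in pos_sims
    ]
    neg_terms = [
        math.exp(gamma * max(0.0, s - optimum_neg) * (s - delta_neg))
        for s in neg_sims
    ]
    return math.log1p(sum(pos_terms) * sum(neg_terms))
```

A well-separated triplet (positive similarity near 1, negative near 0) yields a loss close to zero, while a triplet whose negative is more similar than its positive is penalized heavily.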
The experiments are well-structured, utilizing both internal and external evaluations to assess the model's performance. The use of zero-shot probes on independent datasets demonstrates the generalizability of the learned representations. The results indicate strong factor-wise disentanglement, with high accuracy in distinguishing between the different musical dimensions. The human evaluation of triplet quality adds a valuable subjective perspective to the findings, reinforcing the model's effectiveness. Overall, the experimental design is robust and provides compelling evidence of the framework's capabilities.
The paper provides sufficient details regarding the architecture, training procedures, and datasets used, which supports reproducibility. The authors have made the code and pre-trained models publicly available, further facilitating replication of their results. However, the reliance on specific datasets like MoisesDB and the generative model JASCO may limit reproducibility if these resources are not accessible to all researchers.
Some limitations are acknowledged, such as the focus on only three musical dimensions (melody, rhythm, and timbre), which may overlook other important aspects like harmony and dynamics. Additionally, the operationalization of timbre at the instrument-class level may not capture within-class variations adequately. The authors also mention potential biases from the training data that could affect the model's performance in real-world scenarios.
The implications of MERIT are significant for music information retrieval, recommendation systems, and music analysis tools. By enabling users to query music based on specific dimensions, it enhances user control and interpretability, which can lead to more personalized music experiences. The framework could also inspire further research into disentangled representations in other domains, potentially influencing broader applications in audio processing and machine learning.
Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley, Thoughtful Lab
The main contribution of this paper is the introduction of PitchBench, a systematic evaluation suite for measuring pitch hearing in audio-language models, which significantly enhances the understanding of how these models perceive musical pitch. This work represents a critical step toward improving the reliability and effectiveness of ALMs in real-world audio applications.
The methodology presented in PitchBench is robust and systematic, focusing on a hierarchical evaluation of pitch perception in audio-language models (ALMs). The paper introduces a comprehensive framework that includes 28 experiments designed to isolate and assess various aspects of pitch hearing, such as absolute and relative pitch perception. The use of controlled synthetic stimuli allows for precise measurement of model performance across different acoustic conditions and response formats. This structured approach is a significant improvement over existing benchmarks, which often fail to directly evaluate the fundamental ability to perceive pitch.
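The distinction between absolute and relative pitch tasks can be grounded in a few lines of arithmetic. The sketch below is not taken from the PitchBench package. It assumes 12-tone equal temperament with A4 = 440 Hz, MIDI note numbering, and the convention that MIDI note 60 is C4; all function names are hypothetical.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(midi_note, a4_hz=440.0):
    """Frequency of a MIDI note in equal temperament (A4 = MIDI 69)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def midi_to_name(midi_note):
    """Note name with octave, using the convention that MIDI 60 is C4."""
    octave = midi_note // 12 - 1
    return f"{NOTE_NAMES[midi_note % 12]}{octave}"


def interval_semitones(first_note, second_note):
    """Signed interval between two MIDI notes, in semitones."""
    return second_note - first_note
```

An absolute-pitch item asks for the equivalent of `midi_to_name` applied to what the model hears, while a relative-pitch item asks only for `interval_semitones` between two notes, which stays the same under transposition.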
The experimental evaluation is thorough, involving six frontier ALMs across a wide range of tasks that assess pitch perception under varying conditions. The results reveal significant performance variability among models, highlighting specific failure modes that are not captured by higher-level benchmarks. The detailed analysis of model performance, including the effects of acoustic variations and response modalities, provides valuable insights into the strengths and weaknesses of current ALMs in pitch perception.
The paper emphasizes reproducibility by providing a Python package that includes the evaluation data and generation tools. The authors detail the deterministic generation of stimuli, ensuring that other researchers can replicate the experiments. The inclusion of metadata and standardized output formats further supports reproducibility.
While PitchBench offers a significant advancement in evaluating pitch perception, it relies entirely on algorithmically synthesized stimuli, which may not fully capture the complexities of real-world audio. The current instrument selection is limited to General MIDI instruments, and the benchmark does not address non-Western musical traditions or more complex rhythmic reasoning tasks. Future work is needed to incorporate real recordings and broaden the diversity of the instrument pool.
The implications of PitchBench are substantial for the development of audio-language models, particularly in applications requiring reliable musical understanding, such as music tutoring, transcription, and recommendation systems. By providing a diagnostic tool for evaluating pitch perception, this work lays the groundwork for future advancements in multimodal AI systems that integrate audio understanding with other sensory inputs.
Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 datasets demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.
Primary: National Technical University of Athens
All Institutions: National Technical University of Athens
This paper presents a multimodal deep learning framework for dementia detection that effectively combines acoustic and linguistic features, showcasing innovative methods and robust experimental validation. The technical contributions are significant, addressing critical gaps in existing approaches and offering a promising direction for future research in automatic dementia assessment.
The proposed methodology employs a novel multimodal deep learning framework that integrates both acoustic and linguistic representations for dementia detection. The use of HuBERT for acoustic representation and BERT for textual representation, combined with attentive statistics pooling and an innovative Audio-Text Fusion mechanism, demonstrates a sophisticated approach to capturing the nuances of speech relevant to cognitive decline. The introduction of the Mutual Information Neural Estimation (MINE) objective to enhance cross-modal representation alignment is particularly noteworthy, as it addresses a significant gap in existing multimodal approaches.
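Attentive statistics pooling, mentioned above, reduces a variable-length sequence of frame embeddings to one fixed-size vector by concatenating an attention-weighted mean with an attention-weighted standard deviation. The sketch below shows that computation in plain Python. The scoring function (a dot product between a vector `attn_vector` and the tanh of each frame) is a simplifying assumption, since published variants use a small learned network, and none of the names come from the paper.

```python
import math


def attentive_stats_pool(frames, attn_vector):
    """Pool T frame embeddings (each of dimension D) into a 2*D vector.

    frames:      list of T vectors from the acoustic encoder
    attn_vector: D-dimensional scoring vector (hypothetical learned parameter)
    """
    # One scalar attention score per frame, then a stable softmax over time.
    scores = [sum(v * math.tanh(h) for v, h in zip(attn_vector, frame))
              for frame in frames]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second_moment = [sum(w * f[d] ** 2 for w, f in zip(weights, frames))
                     for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(m2 - mu * mu, 0.0))
           for m2, mu in zip(second_moment, mean)]
    return mean + std
```

With an all-zero `attn_vector` every frame receives equal weight and the result reduces to ordinary mean and standard deviation pooling; non-zero weights let informative frames contribute more to both statistics.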
The experiments are well-structured, utilizing two publicly available datasets (ADReSS Challenge and PROCESS-2) to validate the proposed framework. The results indicate competitive performance compared to state-of-the-art methods, with detailed metrics provided for accuracy, recall, and specificity. The ablation studies further strengthen the findings by demonstrating the effectiveness of various components of the proposed framework, such as pooling strategies and fusion methods.
The paper provides a clear description of the methodology and experimental setup, including details on the datasets and evaluation metrics. However, there is no mention of code availability or a repository for others to reproduce the results, which limits the reproducibility aspect.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of speech patterns in broader populations. Additionally, while the framework shows promising results, the performance on different demographic groups or in real-world settings remains untested. The absence of a demo or project URL also hinders practical application and further exploration by the community.
The framework has significant implications for early diagnosis and intervention in Alzheimer's disease, potentially improving patient care and outcomes. By leveraging speech analysis, the approach could facilitate non-invasive and efficient screening methods, which are crucial given the increasing prevalence of dementia globally. The integration of multimodal learning in this context also opens avenues for future research in cognitive health monitoring and related fields.
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
Primary: Nankai University
All Institutions: Nankai University
The paper presents CosyEdit2, a novel framework that enhances speech editing and zero-shot TTS through innovative reinforcement learning techniques and a well-structured methodology. The contributions are significant, addressing key limitations in the field and paving the way for future advancements in audio processing technologies.
The paper introduces CosyEdit2, a two-stage post-training framework that innovatively combines supervised fine-tuning with reinforcement learning (GRPO) to enhance speech editing capabilities while also improving zero-shot TTS performance. The methodology is well-structured, addressing the limitations of previous approaches by eliminating the need for imperfect paired data and optimizing through editing-specific rewards. The architecture leverages a unified text-speech language model and a conditional flow-matching model, showcasing a novel integration of LLMs with audio processing.
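The group-relative part of GRPO can be illustrated with a short sketch. It uses the commonly used formulation in which each sampled output's reward is standardised against the other samples drawn for the same prompt; the reward values are hypothetical, and the editing-specific reward design of CosyEdit2 is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Standardise rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so no
    separate value network is needed to form a baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four edited utterances sampled from one prompt.
rewards = [0.2, 0.5, 0.8, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples that score above the group mean receive positive advantages and are reinforced, while below-average samples are pushed down, which is why the method only needs a reward signal and not target speech.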
The experiments are extensive, utilizing multiple benchmarks for both speech editing and zero-shot TTS. The results demonstrate significant improvements over existing models, particularly in terms of acoustic consistency and editing fidelity. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's performance.
The paper provides detailed training and evaluation setups, including data sources, training parameters, and model architectures, which facilitate reproducibility. However, access to the datasets used for training and evaluation may be a limiting factor for complete reproducibility.
The authors acknowledge limitations in the design space of the reward formulation and the language coverage of the framework, which is currently constrained to a few languages. Additionally, broader acoustic editing capabilities remain unexplored, suggesting areas for future research.
The advancements in speech editing and zero-shot TTS have significant implications for applications in accessibility, multimedia production, and human-computer interaction. However, the potential for misuse in voice impersonation and misinformation propagation raises ethical concerns that need to be addressed through responsible deployment practices.
Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw sensor data without relying on a preceding detection stage. However, many existing TBD methods assume that each target's contribution to the sensor data is determined solely by its kinematic state. This assumption limits their applicability to passive MTT, where each target's contribution depends on both its kinematic state and the unknown emitted signal. We propose subspace TBD, a passive multi-target TBD method based on a likelihood derived from the complex Bingham distribution that does not require explicit modeling or estimation of the unknown emitted signals. In a particle filter (PF) framework, each multi-target hypothesis is mapped to a low-dimensional subspace spanned by the steering vectors corresponding to the hypothesized target states. The likelihood is then used to evaluate the alignment of the normalized multichannel sensor data with this subspace. Preliminary experiments with simulated acoustic measurements and a given target activity pattern show that the proposed method can track two moving targets emitting unknown signals at a signal-to-noise ratio (SNR) of -10 dB, whereas a conventional TBD baseline yields substantially larger tracking errors.
Primary: National Institute of Advanced Industrial Science and Technology (AIST)
All Institutions: National Institute of Advanced Industrial Science and Technology (AIST)
The main contribution of this paper is the introduction of a novel subspace track-before-detect methodology for passive multi-target tracking that effectively addresses the challenges posed by unknown emitted signals. This work represents a significant advancement in the field of audio signal processing and multi-target tracking, offering a robust solution for low-SNR environments and paving the way for future research in more complex scenarios.
The proposed methodology, subspace track-before-detect (TBD), innovatively addresses the challenges of passive multi-target tracking (MTT) in environments where the emitted signals from targets are unknown. By leveraging the complex Bingham distribution to model the observation likelihood without requiring explicit estimation of the emitted signals, the authors effectively circumvent a significant limitation of conventional TBD methods. The use of a particle filter framework to implement this approach allows for robust tracking of multiple targets in low signal-to-noise ratio (SNR) conditions, which is a notable advancement in the field.
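The subspace-alignment idea can be sketched as follows. The code scores a hypothesis by the quadratic form z^H P z, where z is the normalized observation and P projects onto the span of the hypothesized steering vectors. A complex Bingham density is proportional to exp(z^H B z), so this corresponds to choosing B proportional to P; that choice, the uniform linear array geometry, the angles, and the amplitudes are all assumptions made for illustration, not details taken from the paper.

```python
import cmath
import math

def steering_vector(angle_rad, num_mics, spacing_over_wavelength=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    return [cmath.exp(-2j * math.pi * spacing_over_wavelength * m * math.sin(angle_rad))
            for m in range(num_mics)]

def inner(a, b):
    """Hermitian inner product a^H b."""
    return sum(x.conjugate() * y for x, y in zip(a, b))

def orthonormal_basis(vectors):
    """Gram-Schmidt orthonormalisation of the hypothesised steering vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            c = inner(q, w)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(inner(w, w).real)
        if norm > 1e-10:
            basis.append([wi / norm for wi in w])
    return basis

def subspace_alignment(observation, steering_vectors):
    """Fraction of the normalised observation's energy inside the subspace (0 to 1)."""
    norm = math.sqrt(inner(observation, observation).real)
    z = [x / norm for x in observation]
    return sum(abs(inner(q, z)) ** 2 for q in orthonormal_basis(steering_vectors))

# Two targets with unknown complex amplitudes (hypothetical values), noise-free.
num_mics = 4
true_vectors = [steering_vector(a, num_mics) for a in (0.2, -0.5)]
amplitudes = [0.8 + 0.3j, -0.5 + 1.1j]
observation = [sum(s * v[m] for s, v in zip(amplitudes, true_vectors))
               for m in range(num_mics)]

score_true = subspace_alignment(observation, true_vectors)
score_wrong = subspace_alignment(
    observation, [steering_vector(a, num_mics) for a in (1.0, -1.2)])
```

The score never uses the amplitudes, which is the point of the construction: a hypothesis is rewarded when the observation lies in its subspace regardless of what the targets emitted. In a particle filter, a weight proportional to the exponential of a scaled version of this score would play the role of the likelihood.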
The experiments conducted are well-structured, utilizing simulated acoustic measurements to validate the proposed method. The comparison against a conventional deterministic-contribution baseline highlights the effectiveness of the subspace TBD approach, particularly in low SNR scenarios. The results demonstrate a significant improvement in tracking accuracy, with lower root mean square errors (RMSE) across various conditions, reinforcing the practical applicability of the method.
The paper provides sufficient details regarding the experimental setup, including the simulation parameters and the configuration of the particle filter. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should include sharing the implementation to facilitate validation and further exploration by the research community.
One limitation of the study is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. The paper also assumes a fixed activity pattern for the targets, which may not be realistic in dynamic environments. Additionally, the method's performance in more complex acoustic settings, such as reverberant environments or with more than two targets, remains to be evaluated.
The proposed subspace TBD method has significant potential applications in various fields, including surveillance, autonomous vehicles, and audio signal processing. By improving the robustness of multi-target tracking in noisy environments, this research could enhance systems that rely on accurate target localization and tracking, thereby contributing to advancements in safety and efficiency in real-time applications.
While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-quality, fine-grained annotations, which restricts a model's ability to precisely control NV generation. To address this limitation, we propose a novel approach for fine-grained non-verbal expression synthesis. We curate and reprocess female NV utterances from the EARS corpus, develop a new annotation scheme using tags to encode NV types, frequencies, and durations, and build an emotional TTS benchmark to demonstrate its effectiveness. Our evaluation shows that while our NV approach leads to minor trade-offs in perceived naturalness, it significantly improves expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%). Emotion-specific analysis further reveals that NV cues are highly effective for high-arousal emotions like happy (82.5%) and fear (82.7%), and almost perfectly convey sadness (98.3%).
Primary: Nara Institute of Science and Technology
All Institutions: Nara Institute of Science and Technology
The main contribution of this paper is the introduction of a fine-grained non-verbal expression dataset and a corresponding TTS system that significantly enhances emotional expressiveness in synthesized speech. This work represents a meaningful advancement in the field of emotional TTS synthesis, addressing critical gaps in existing methodologies and datasets.
The methodology presented in this paper is robust, focusing on the development of a fine-grained non-verbal expression dataset and a corresponding TTS system. The authors effectively address the limitations of existing datasets by introducing a novel annotation scheme that allows for precise control over non-verbal vocalizations. The use of Grad-TTS as the backbone model, enhanced with an emotion encoder, demonstrates a thoughtful integration of emotional embeddings into the synthesis process. The segmentation and transcription processes are well-detailed, showcasing a clear understanding of audio processing and the importance of high-quality data in training TTS systems.
The experimental evaluation is comprehensive, involving subjective assessments of naturalness and emotional expressiveness, as well as emotion recognition accuracy. The use of a diverse set of evaluation metrics, including eMOS and nMOS, provides a nuanced understanding of the model's performance. The results indicate a significant improvement in expressiveness with the fine-grained NV approach, although there is a minor trade-off in perceived naturalness. The emotion-specific analysis adds depth to the findings, illustrating the effectiveness of NV cues in conveying various emotional states.
The paper provides sufficient detail regarding the dataset construction, model architecture, and evaluation procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability for other researchers to fully replicate the study. The authors could improve reproducibility by sharing their code and trained models.
One limitation is the focus on female NV utterances, which may not generalize well to male voices or other demographics. Additionally, the minor trade-off in naturalness when incorporating NVs could be a concern for practical applications. The subjective nature of the evaluations may also introduce variability, as individual preferences for emotional expression can differ widely.
This research has significant implications for the development of more emotionally intelligent conversational AI systems. By enhancing the expressiveness of TTS systems through the integration of non-verbal vocalizations, the work contributes to creating more engaging and human-like interactions in various applications, including virtual assistants, gaming, and mental health support systems.
Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Institute of Informatics, Baidu Speech Department, National Engineering Research Center of Speech and Language Information Processing
The main contribution of this paper is the introduction of FMelCodec, a novel ultra-low-bitrate speech codec that effectively balances compression efficiency and speech quality through a sophisticated three-stage framework, demonstrating significant advancements in the field of neural speech coding. The methodology and results presented have the potential to influence future developments in audio processing and communication technologies.
The paper introduces FMelCodec, a novel three-stage coding-refinement-reconstruction (CRR) framework for ultra-low-bitrate speech coding that operates in the mel-spectrogram domain. The methodology is well-structured, leveraging a single-codebook vector quantization approach combined with conditional flow matching (CFM) for refinement and a HiFi-GAN vocoder for reconstruction. The online clustering strategy for codebook management is particularly innovative, addressing codebook collapse effectively. The self-consistency training scheme enhances computational efficiency, allowing fewer inference steps while maintaining quality.
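The reported bitrates follow from the figures given in the abstract, as the short calculation below shows. It assumes that the 640x compression factor is the ratio of waveform samples to code frames and that each frame is represented by a single index into the 1024-entry codebook; the function name is illustrative.

```python
import math

def bitrate_bps(sample_rate_hz, compression_factor, codebook_size, num_codebooks=1):
    """Bitrate of a VQ codec: frames per second times bits per frame."""
    frames_per_second = sample_rate_hz / compression_factor
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame

rate_16k = bitrate_bps(16_000, 640, 1024)  # 25 frames/s * 10 bits
rate_48k = bitrate_bps(48_000, 640, 1024)  # 75 frames/s * 10 bits
```

Under these assumptions the two rates come out at 250 bps and 750 bps, matching the operating points quoted for 16 kHz and 48 kHz speech.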
The experiments are robust, utilizing two datasets (LibriTTS and VCTK) across different sampling rates. The evaluation metrics include both objective and subjective assessments, showcasing FMelCodec's superiority in reconstruction quality and speaker similarity at ultra-low bitrates. The results are statistically significant, demonstrating the codec's effectiveness compared to existing baselines, which is crucial for validating the proposed approach.
The paper provides detailed implementation configurations, including model architectures, training procedures, and hyperparameters, which enhances reproducibility. The availability of code and trained models on GitHub further supports this aspect, allowing other researchers to replicate the results.
While the proposed method shows promising results, the reliance on a single codebook may limit flexibility in representing diverse speech characteristics. Additionally, the computational efficiency, although improved, may still be a concern in extremely resource-constrained environments. The paper does not extensively discuss the scalability of the approach to other languages or dialects, which could be a limitation in broader applications.
The FMelCodec has significant implications for bandwidth-constrained communication systems, such as satellite communications and mobile devices, where low-bitrate speech coding is essential. Its potential applications extend to telecommunication, voice-over-IP services, and assistive technologies for individuals with speech impairments. The advancements in neural speech coding could also influence future research in audio processing and machine learning.
Most neural vocoders are limited to one type: either GAN-based or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, diffusion models, despite training faster than GANs, have slow CPU inference. In this paper, we introduce WaveNeXt 2, a unified ConvNeXt-based framework compatible with both GAN and diffusion vocoders. Its core innovation is residual denoising and sub-modeling, where each sub-model progressively refines the waveform. Experimental results on a multi-speaker dataset demonstrate the effectiveness of our approach: (1) GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit, and (2) Diff-WaveNeXt 2 also delivers much faster inference and competitive synthesis quality compared with FastDiff with 4 steps. Diff-WaveNeXt 2 is also training-efficient, requiring only 32 hours of training, which makes it well suited to resource-constrained applications.
Primary: Nara Institute of Science and Technology
All Institutions: Nara Institute of Science and Technology, National Institute of Information and Communications Technology
WaveNeXt 2 represents a significant step forward in the development of neural vocoders, providing a unified framework that enhances performance and efficiency in both GAN and diffusion contexts. The comprehensive methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and audio processing.
The proposed WaveNeXt 2 framework introduces a novel architecture that integrates ConvNeXt-based residual denoising and sub-modeling, allowing it to function effectively in both GAN and diffusion vocoder contexts. This dual compatibility is a significant advancement, as it addresses the limitations of existing models that are typically confined to one framework. The methodology is well-structured, with clear delineation between the GAN and diffusion approaches, and the use of sub-models for noise-level conditioning is a clever adaptation that enhances performance and efficiency. The authors provide a comprehensive description of the architecture, training strategies, and inference processes, which demonstrates a solid understanding of the challenges in neural vocoding.
The experiments are robust, utilizing a substantial dataset (LibriTTS-R) and employing both subjective (MOS) and objective (UTMOS, NISQA, MCD, log F0 RMSE) evaluation metrics. The results indicate that both GAN-WaveNeXt 2 and Diff-WaveNeXt 2 outperform existing models in terms of inference speed and synthesis quality. The comparative analysis with baseline models is thorough, providing clear evidence of the proposed models' advantages. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.
The authors provide sufficient implementation details, including the use of PyTorch and specific training configurations, which aids reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. Including a link to the implementation or a GitHub repository would enhance reproducibility significantly.
While the paper presents strong results, it acknowledges that the increased model size due to sub-modeling could be a drawback for deployment in resource-constrained environments. Additionally, the reliance on specific architectures may limit the generalizability of the findings to other vocoder designs. The paper could also explore the trade-offs between model complexity and performance in more depth.
The advancements presented in WaveNeXt 2 have significant implications for real-time speech synthesis applications, particularly in multi-speaker scenarios and resource-constrained environments. The ability to unify GAN and diffusion frameworks could lead to more versatile and efficient vocoders, potentially enhancing the quality of synthesized speech in various applications, including virtual assistants, audiobooks, and gaming. The work could inspire further research into hybrid models that leverage the strengths of both GANs and diffusion processes.
Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, implicitly assume that all unlearning requests arrive at once, which is unrealistic because privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/ .
Primary: Sungkyunkwan University
All Institutions: Sungkyunkwan University, Korea University
The paper presents CORTIS, a novel framework for continual speaker identity unlearning in zero-shot text-to-speech systems, effectively addressing privacy concerns while maintaining model performance. The integration of advanced techniques in machine unlearning and continual learning marks a significant contribution to the field, with strong experimental validation and practical implications for privacy in AI.
The proposed CORTIS framework innovatively addresses the problem of continual speaker identity unlearning in zero-shot text-to-speech systems. By combining Fisher-information-based parameter masking with orthogonal projection, it effectively prevents catastrophic re-learning of previously unlearned speakers while maintaining the quality of the remaining speakers. This dual approach is a significant advancement over previous methods that assumed simultaneous unlearning requests and failed to account for sequential requests, which is a more realistic deployment scenario. The methodology is well-justified and grounded in the principles of continual learning and machine unlearning, showcasing a thoughtful integration of concepts from both fields.
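The two ingredients named above can be illustrated on a flat parameter vector. This is a minimal sketch and not the CORTIS implementation: the diagonal Fisher estimate is approximated by squared gradients, the mask is applied before the projection (the order is an assumption), and all numbers and function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fisher_mask(squared_grads, keep_ratio):
    """Binary mask keeping the parameters with the largest Fisher estimates.

    Ties at the threshold are all kept, so slightly more than keep_ratio
    of the parameters may be selected.
    """
    k = max(1, int(round(keep_ratio * len(squared_grads))))
    threshold = sorted(squared_grads, reverse=True)[k - 1]
    return [1.0 if f >= threshold else 0.0 for f in squared_grads]

def project_out(update, prior_basis):
    """Remove the components of update that lie along prior update directions."""
    result = list(update)
    for q in prior_basis:  # prior_basis is kept orthonormal
        c = dot(q, result)
        result = [r - c * qi for r, qi in zip(result, q)]
    return result

def extend_basis(prior_basis, update):
    """Add the new update direction to the orthonormal basis of past updates."""
    residual = project_out(update, prior_basis)
    norm = math.sqrt(dot(residual, residual))
    if norm > 1e-12:
        prior_basis.append([r / norm for r in residual])

# First unlearning request: its update direction is stored.
basis = []
extend_basis(basis, [2.0, 0.0, 0.0, 0.0])

# Second request: mask by Fisher estimate, then project away from the first.
grad = [0.5, 2.0, -1.0, 0.1]
mask = fisher_mask([g * g for g in grad], keep_ratio=0.75)
masked = [m * g for m, g in zip(mask, grad)]
new_update = project_out(masked, basis)
extend_basis(basis, new_update)
```

Because the second update has no component along the first, applying it cannot undo the earlier change to first order, which is the intuition behind keeping previously unlearned speakers forgotten without storing their data.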
The experiments are robust, utilizing a well-defined evaluation scenario with clear metrics for assessing both retention of previously learned speakers and the quality of the generated speech. The results demonstrate that CORTIS outperforms existing methods in maintaining speaker identity suppression across multiple requests, with quantitative metrics supporting the claims made. The use of a controlled backbone (VoiceBox) ensures fair comparisons, and the detailed ablation studies provide insights into the contributions of each component of the proposed method.
The paper provides comprehensive implementation details, including the architecture of the backbone model and the specific configurations used for training and evaluation. This level of detail enhances reproducibility, allowing other researchers to replicate the experiments effectively. However, the reliance on specific datasets and models may limit broader applicability without further validation across different architectures.
The paper acknowledges limitations such as the lack of adversarial robustness and the focus on a single backbone model (VoiceBox). Additionally, while the proposed method is effective, the computational overhead introduced by the CORTIS framework may pose challenges for real-time applications. Future work could explore the scalability of the method and its performance across various architectures and datasets.
The implications of this work are significant, particularly in the context of privacy and data protection regulations like GDPR and CCPA. By providing a mechanism for continual speaker identity unlearning, the research contributes to the responsible deployment of zero-shot text-to-speech systems, which can have far-reaching effects on user privacy and consent in AI applications. The framework could be adapted for other domains requiring similar unlearning capabilities, thus broadening its impact.
Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without explicitly extracting phase and level difference features based on the plane-wave or spherical-wave assumptions. However, prior studies have mostly compared a small number of separately defined directional statistical mixture models, whereas a broader distribution family would enable a more systematic study of how density profiles affect separation performance. We propose the complex spherical Student's t mixture model (cSTMM), a directional mixture model that connects the complex angular central Gaussian mixture model (cACGMM), complex Bingham mixture model (cBMM), and complex Watson mixture model (cWMM) through the degrees-of-freedom parameter $ν$. We also derive a generalized minorization-maximization (MM) based procedure for parameter estimation. A no-restart evaluation on noise-free LibriSpeech mixtures reverberated with measured room impulse responses shows that a single development-selected value $ν^\ast=1$ achieved higher test-set mean signal-to-distortion ratio improvements (SDRi) than the cACGMM-equivalent setting $ν=M$ in all acoustic conditions, with an average condition-wise gain of 0.25 dB. The experiments also verify numerically that the proposed formulation recovers the cACGMM, cBMM, and cWMM cases.
Primary: Artificial Intelligence Research Center, AIST, Japan
All Institutions: Artificial Intelligence Research Center, AIST, Japan
The main contribution of this paper is the introduction of the cSTMM, which unifies existing directional statistical models for blind speech separation and demonstrates its effectiveness through rigorous experimental evaluation. Overall, the paper makes a meaningful contribution to the field of audio signal processing, particularly in enhancing the performance of mask-based speech separation techniques.
The paper introduces the complex spherical Student's t mixture model (cSTMM), which unifies several existing directional statistical mixture models (cACGMM, cBMM, cWMM) under a single framework. The methodology is robust, employing a generalized minorization-maximization (MM) procedure for parameter estimation, which is a significant contribution to the field. The approach allows for systematic exploration of how different density profiles impact speech separation performance, addressing a gap in prior research that focused on isolated models. The derivation of the model and the updates for parameter estimation are well-articulated, showing a clear understanding of the underlying statistical principles.
The experiments are well-structured, utilizing the LibriSpeech dataset and a variety of acoustic conditions to evaluate the performance of the proposed model. The results demonstrate a statistically significant gain in mean signal-to-distortion ratio improvement (SDRi) across different conditions, with a clear methodology for selecting hyperparameters. The inclusion of model recovery tests further strengthens the experimental validation, confirming that the cSTMM can effectively recover the properties of the models it encompasses.
The paper provides sufficient detail regarding the experimental setup, including the choice of datasets, evaluation metrics, and parameter settings. However, the absence of a publicly available implementation or code repository limits reproducibility. Future work should consider making the model and experiments accessible to facilitate validation by other researchers.
While the paper presents a novel model and shows promising results, the improvements in SDRi are modest (averaging 0.25 dB), which may not be substantial enough to warrant a shift from existing methods in practical applications. Additionally, the model's performance in noisy or real-world environments remains untested, which could be a significant limitation for its applicability.
The cSTMM has the potential to advance the field of blind speech separation, particularly in scenarios where supervised learning is impractical. By providing a unified framework for directional statistics, it could lead to more robust speech separation systems, benefiting applications in telecommunications, hearing aids, and automatic speech recognition. The systematic exploration of density profiles may also inspire further research into adaptive signal processing techniques.
In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.
Primary: Johannes Kepler University Linz
All Institutions: Johannes Kepler University Linz, LIT AI Lab, Linz Institute of Technology
The paper presents a novel approach to automatically align and cluster transcriptions of musical performances based on structural interpretations. It significantly contributes to the field by providing a scalable, reference-free method for evaluating large-scale transcribed datasets, which is essential as the volume of available music data continues to grow.
The proposed methodology effectively combines sequence-to-sequence alignment using Dynamic Time Warping (DTW) with hierarchical clustering to address the challenge of grouping transcriptions based on structural interpretations. The use of a custom distance metric that balances harmonic similarity and timing differences is innovative and tailored to the nuances of musical performance. The two-step approach, which includes both alignment and clustering, is well-structured and demonstrates a clear understanding of the complexities involved in music performance analysis. However, the paper could benefit from a more detailed discussion on the choice of parameters and their impact on the results.
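The align-then-cluster idea can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the 0/1 pitch-mismatch cost, the way alignment cost and length difference are summed into one distance, the average-linkage rule, and the threshold of 0.4 are all hypothetical stand-ins, not the paper's custom metric or the mpteval implementation.

```python
def dtw_cost(a, b):
    """Classic DTW between two pitch sequences with a 0/1 mismatch cost (assumed)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            local = 0.0 if x == y else 1.0
            cur[j] = local + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[len(b)]

def distance(a, b):
    """Hypothetical feature mix: normalised alignment cost plus length dissimilarity."""
    longest = max(len(a), len(b))
    return dtw_cost(a, b) / longest + abs(len(a) - len(b)) / longest

def agglomerative(dist, threshold):
    """Average-linkage clustering: merge the closest pair until it exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > threshold:
            break
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return sorted(sorted(c) for c in clusters)

# Four toy "transcriptions" of one piece: two without the repeat, two with it,
# each pair containing one wrong note to mimic transcription errors.
performances = [
    [60, 62, 64, 65],
    [60, 62, 63, 65],
    [60, 62, 64, 65, 60, 62, 64, 65],
    [60, 62, 64, 65, 60, 62, 64, 66],
]
dist = [[distance(a, b) for b in performances] for a in performances]
clusters = agglomerative(dist, threshold=0.4)
```

In this toy case the length term keeps the repeated and non-repeated versions apart even though DTW can absorb part of the repeat, which is the intuition behind using both features for grouping.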
The experiments conducted on the ATEPP dataset are comprehensive, covering a significant number of transcriptions and compositions. The evaluation metrics used, such as homogeneity, completeness, and V-Measure, are appropriate for assessing clustering performance. The results indicate that the proposed method is robust against structural differences and transcription artifacts, which is a critical aspect of the research. However, the paper could enhance its impact by providing more comparative analyses with existing methods beyond the baseline score-dependent repeat estimator.
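For readers unfamiliar with the clustering metrics named above, here is a small sketch of homogeneity, completeness, and V-Measure computed from their standard entropy-based definitions (with equal weighting of the two terms). The helper names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def _conditional_entropy(labels, given):
    """H(labels | given), computed from the joint counts of the two labelings."""
    n = len(labels)
    joint = Counter(zip(given, labels))
    marginal = Counter(given)
    return -sum((c / n) * math.log(c / marginal[g]) for (g, _), c in joint.items())

def v_measure(truth, predicted):
    """Return (homogeneity, completeness, V-Measure) for two labelings."""
    h_truth, h_pred = _entropy(truth), _entropy(predicted)
    homogeneity = (
        1.0 if h_truth == 0 else 1.0 - _conditional_entropy(truth, predicted) / h_truth
    )
    completeness = (
        1.0 if h_pred == 0 else 1.0 - _conditional_entropy(predicted, truth) / h_pred
    )
    if homogeneity + completeness == 0:
        return homogeneity, completeness, 0.0
    v = 2 * homogeneity * completeness / (homogeneity + completeness)
    return homogeneity, completeness, v

# Cluster ids are arbitrary, so a relabelled but identical grouping scores perfectly.
perfect = v_measure([0, 0, 1, 1], [1, 1, 0, 0])
# Putting everything in one cluster is complete but not homogeneous.
lumped = v_measure([0, 0, 1, 1], [0, 0, 0, 0])
```

Homogeneity penalises clusters that mix several structural versions, while completeness penalises splitting one version across clusters; V-Measure is their harmonic mean.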
The paper provides a link to the implementation in the Python library mpteval, which is a positive aspect for reproducibility. However, the details regarding the parameter settings and the specific configurations used in the experiments could be more explicitly stated to facilitate replication. Additionally, providing a sample dataset or a more detailed description of the data preprocessing steps would further enhance reproducibility.
One limitation is that the method relies heavily on the quality of the transcriptions, which can vary significantly due to the nature of automatic music transcription. The paper acknowledges this but does not explore potential solutions or mitigations for low-quality transcriptions. Furthermore, the focus on classical music may limit the generalizability of the approach to other genres or forms of music, which could be a point of consideration for future work.
The approach has significant implications for the field of music performance analysis, particularly in automating the evaluation of large-scale datasets that lack ground-truth scores. This can lead to more efficient curation and maintenance of music collections, enabling researchers to focus on higher-level analyses rather than manual quality control. The method could also inspire further research into score-agnostic evaluation techniques across various musical genres and applications.
Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.
Primary: University of Melbourne
All Institutions: University of Melbourne
This paper makes a meaningful contribution by proposing a representation-centric approach to continual learning in speech and audio, addressing the unique challenges posed by the dynamic nature of acoustic environments. The framework established in this work has the potential to guide future research and development in the field, although empirical validation and implementation details are needed to fully realize its impact.
The paper presents a novel representation-centric taxonomy for continual learning (CL) in speech and audio, addressing the unique challenges posed by the non-stationary nature of acoustic environments. The authors effectively categorize CL scenarios based on representational evolution, which is a significant advancement over traditional task-based taxonomies. The methodology is well-structured, clearly articulating the need for preserving representational geometry in modern speech systems, and it proposes a comprehensive framework for understanding the interaction between representation dynamics and adaptation mechanisms.
While the paper does not present empirical experiments or quantitative results, it offers a thorough analysis of existing CL methods and their limitations in the context of speech and audio. The authors identify gaps in current methodologies and suggest future research directions, which is valuable for guiding subsequent empirical studies. The lack of experimental validation is a notable gap, as it limits the ability to assess the practical effectiveness of the proposed taxonomy.
The paper does not provide specific implementation details or datasets, which could hinder reproducibility. However, it does reference existing methods and frameworks, suggesting that future work could build upon these established techniques. The inclusion of a GitHub repository for related resources is a positive step towards facilitating reproducibility.
A key limitation of the paper is the absence of experimental validation, which makes it difficult to assess the practical applicability of the proposed taxonomy. Additionally, while the authors identify several open problems, they do not provide concrete solutions or methodologies to address these challenges, leaving a gap for future exploration.
The implications of this work are significant for the fields of speech processing and continual learning. By reframing CL in the context of speech and audio, the authors highlight the need for new strategies that accommodate the complexities of acoustic representations. This work could influence the development of more robust and adaptable speech systems, with applications in areas such as automatic speech recognition, speaker verification, and emotion recognition.
Recent advances in zero-shot text-to-speech (TTS) have enabled accurate imitation of reference speech in terms of both speaking style and speaker timbre. However, achieving disentangled control over these aspects from separate references remains a challenging task. Several studies have proposed disentangled speech representations that decompose speech into interpretable attributes (e.g., timbre, prosody, and content), providing a promising foundation for TTS with attribute control from separate references. Yet, how to effectively integrate such representations into TTS systems to achieve independent and precise control remains underexplored. In this paper, we present FC-TTS, a zero-shot TTS framework that enables disentangled control of style and timbre by conditioning on two distinct reference utterances. Unlike existing systems that inherit limitations from those pre-trained disentangled representations, FC-TTS introduces key design strategies, including architectural choices, training framework, and auxiliary training objectives, which improve the reliability of attribute separation and dual-reference control. Experiments show that FC-TTS achieves high-fidelity synthesis and competitive zero-shot naturalness, while uniquely supporting consistent and independent manipulation of style and timbre. Audio samples are available at https://qualcomm-ai-research.github.io/fc-tts
Primary: Qualcomm AI Research
All Institutions: Qualcomm AI Research
The main contribution of this paper is the development of FC-TTS, a zero-shot TTS framework that allows for independent control over speaking style and timbre, significantly advancing the capabilities of text-to-speech synthesis. The methodology is innovative, addressing critical challenges in the field, and the experimental results demonstrate its effectiveness, marking a notable step forward in TTS technology.
The paper introduces FC-TTS, a novel framework for zero-shot text-to-speech synthesis that enables disentangled control over style and timbre using separate reference utterances. The methodology is well-structured, utilizing a two-stage spectrogram generation pipeline, a VQ-VAE-based style encoder, and a conditional consistency loss to enhance the robustness and interpretability of the generated audio. The approach is innovative, addressing the limitations of existing methods that typically entangle style and timbre in a single reference. The proposed architecture and training strategies are clearly articulated and demonstrate a significant advancement in TTS technology.
The experimental evaluation is comprehensive, employing a variety of datasets (LibriSpeech and RAVDESS) to assess the performance of FC-TTS against state-of-the-art models. The use of both objective metrics (UTMOS, WER, SPK, MCD) and subjective evaluations (ABX tests) provides a robust framework for measuring the effectiveness of the proposed system. The results indicate that FC-TTS achieves competitive performance while maintaining the ability to manipulate style and timbre independently, which is a critical contribution to the field.
The paper provides detailed implementation details, including model architecture, training procedures, and hyperparameter settings, which enhance reproducibility. However, the reliance on proprietary datasets and the absence of publicly available checkpoints for some baseline models may limit the ease of reproduction for external researchers.
The paper acknowledges limitations such as the dependency on the quality of codec representations, potential issues with generalization to non-English languages, and the ethical implications of using such technology for deepfake generation. These factors highlight the need for further research to address robustness and ethical concerns in TTS applications.
The advancements presented in this paper have significant implications for various applications, including virtual assistants, audiobooks, and accessibility tools. The ability to control style and timbre independently can enhance user experience and personalization in TTS systems. However, the potential for misuse in generating deepfake audio raises ethical concerns that must be addressed in future developments.
Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
Primary: StepFun-Audio Team
All Institutions: StepFun-Audio Team
Neural speech codecs have become the discrete interface between raw audio and speech language models, yet they remain optimized primarily for acoustic reconstruction fidelity, which leaves emotion-relevant cues vulnerable to being discarded during quantization, limiting the affective capacity of downstream models. We trace this degradation to two mechanisms: reconstruction-driven bit allocation under limited bitrate and cross-stream leakage in concatenation-based codecs, where acoustic gradients can overwrite nominally emotion-reserved dimensions. We propose AffectCodec, an emotion-preserving neural speech codec built on Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ). By imposing block-diagonal input and output projections over emotion and acoustic subspaces, BD-RFSQ transforms bit allocation from implicit and loss-driven to explicit and structurally guaranteed, while still preserving a flat token interface for downstream speech language models. AffectCodec further combines this structurally constrained quantizer with multi-granularity emotion conditioning and multi-rate training, enabling robust affect preservation at low bitrates. Experiments across multiple emotional speech benchmarks show that AffectCodec substantially improves emotion preservation, especially in the low-bitrate regime, while maintaining competitive acoustic quality and intelligibility. These results suggest that structurally protected quantization is an effective principle for preserving emotion-relevant information and may provide a general route toward attribute-aware neural speech compression.
Primary: Beijing University of Posts and Telecommunications
All Institutions: Beijing University of Posts and Telecommunications
The paper presents AffectCodec, an emotion-preserving neural speech codec that effectively addresses the loss of affective information in existing quantization pipelines. The innovative BD-RFSQ architecture and comprehensive experimental evaluation position this work as a significant contribution to the field of audio processing and machine learning.
The methodology proposed in AffectCodec is innovative, particularly with the introduction of Block-Diagonal Residual Finite Scalar Quantization (BD-RFSQ), which structurally separates emotion and acoustic information during quantization. This architectural choice is well-justified and addresses a significant gap in existing neural speech codecs, which typically prioritize acoustic fidelity over emotional content. The multi-rate training strategy and multi-granularity emotion conditioning further enhance the model's ability to preserve emotional cues, especially at low bitrates, demonstrating a thoughtful integration of various techniques to achieve the desired outcomes.
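The structural guarantee behind a block-diagonal projection can be illustrated with a toy example: when the projection matrix has zero off-diagonal blocks, the emotion outputs cannot depend on the acoustic inputs, so no acoustic signal (or gradient) can flow into them. The sketch below shows only this linear-algebra property; the block shapes, weights, and variable names are hypothetical and it is not an implementation of BD-RFSQ.

```python
def block_diag(a, b):
    """Place matrices a and b on the diagonal of a larger zero-filled matrix."""
    cols_a, cols_b = len(a[0]), len(b[0])
    top = [row + [0.0] * cols_b for row in a]
    bottom = [[0.0] * cols_a + row for row in b]
    return top + bottom

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

w_emotion = [[0.5, -1.0], [2.0, 0.25]]            # hypothetical 2x2 emotion block
w_acoustic = [[1.0, 0.0, 3.0], [0.0, -2.0, 1.0]]  # hypothetical 2x3 acoustic block
projection = block_diag(w_emotion, w_acoustic)    # 4x5 matrix, zero off-diagonal blocks

x = [1.0, 2.0, 0.1, 0.2, 0.3]              # layout: [emotion dims | acoustic dims]
x_perturbed = [1.0, 2.0, 9.0, -9.0, 9.0]   # only the acoustic dims are changed
y = matvec(projection, x)
y_perturbed = matvec(projection, x_perturbed)
# The first two outputs (emotion subspace) are identical for both inputs,
# while the last two (acoustic subspace) differ.
```

A dense projection would offer no such guarantee, which is the contrast the review draws between implicit, loss-driven bit allocation and explicit, structural allocation.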
The experiments are comprehensive, utilizing multiple emotional speech benchmarks (IEMOCAP, CREMA-D, ESD) to evaluate the performance of AffectCodec against existing codecs. The results show substantial improvements in emotion preservation, particularly at low bitrates, while maintaining competitive acoustic quality. The use of Emotion Degradation Rate (EDR) and other metrics provides a robust evaluation framework, though the paper could benefit from more subjective evaluations such as Mean Opinion Score (MOS) to complement the objective metrics.
The paper outlines the training setup and provides sufficient detail on the architecture and training procedures, which aids reproducibility. However, the lack of a public code repository or demo URL limits immediate accessibility for other researchers. Future work should include sharing the code to enhance reproducibility further.
The reliance on a frozen emotion2vec encoder may introduce biases and limit the model's adaptability to different emotion categories. Additionally, the manual selection of BD-RFSQ partition sizes and multi-rate stage targets suggests a need for more automated or adaptive methods in future iterations. The evaluation focuses on specific datasets, and the generalizability of the findings to other contexts remains to be tested.
The ability to preserve emotional fidelity in speech codecs has significant implications for applications in empathetic dialogue systems, mental health monitoring, and expressive speech synthesis. However, there are ethical considerations regarding the potential misuse of such technologies for emotional manipulation or deceptive practices, necessitating responsible deployment practices.
Speech deepfake detection has achieved remarkable success in clean environments but faces significant challenges in complex, real-world scenarios where speech is often mixed with background music or noise. Current state-of-the-art methods rely on semantic features from self-supervised learning (SSL) models, which often fail when processing non-speech or mixed-source audio. In this paper, we first introduce MixFake, a large-scale benchmark dataset designed to simulate diverse acoustic environments with varying SNR levels and mixed authenticity components. To address the "semantic-centric" limitation, we propose a Multi-stream Prompt Tuning framework that injects signal-level priors into SSL backbones. By integrating base, frequency, and texture streams through deep prompt injection, our model effectively captures acoustic artifacts. Experimental results demonstrate that our method significantly outperforms existing baselines, achieving a 0.95% EER in foreground detection and a substantial 7.72% absolute improvement in complex background detection tasks. Our dataset and code are available at https://github.com/saltfish233/MixFake.
Primary: Zhejiang University
All Institutions: Nanjing University of Science and Technology, Zhejiang University
This paper presents a significant advancement in audio deepfake detection by introducing a novel dataset and a multi-stream framework that enhances detection capabilities in complex acoustic environments. The comprehensive methodology and rigorous experimental evaluation contribute to its potential impact on the field of machine learning and audio processing.
The proposed methodology introduces a novel Multi-stream Prompt Tuning framework that enhances the capabilities of self-supervised learning models by integrating signal-level acoustic priors through three distinct streams (Base, Frequency, and Texture). This approach effectively addresses the limitations of existing models in detecting audio deepfakes in complex mixed environments, making it a significant advancement in the field. The use of the Hilbert-Huang Transform and Teager-Kaiser Energy Operator for feature extraction is particularly innovative, allowing for a more nuanced understanding of audio signals. The systematic construction of the MixFake dataset, which simulates real-world acoustic environments, further strengthens the methodological framework by providing a robust testing ground for the proposed techniques.
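As background on one of the signal-level features mentioned above, the discrete Teager-Kaiser Energy Operator has a very compact textbook form. The sketch below implements that standard definition only; how the paper integrates it into the texture or frequency stream is not specified here, and the test tone parameters are arbitrary assumptions.

```python
import math

def teager_kaiser(x):
    """Discrete TKEO: psi[n] = x[n]^2 - x[n-1] * x[n+1], for interior samples."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]

# For a pure tone A*cos(omega*n) the operator is constant: (A * sin(omega))^2,
# i.e. it tracks amplitude and frequency jointly.
amplitude, omega = 0.5, 0.3  # omega in radians per sample (illustrative values)
tone = [amplitude * math.cos(omega * n) for n in range(64)]
energy = teager_kaiser(tone)
expected = (amplitude * math.sin(omega)) ** 2
```

Because the operator needs only three neighbouring samples, it reacts quickly to local changes in amplitude or frequency, which is one plausible reason it is attractive for exposing synthesis artifacts.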
The experimental evaluation is comprehensive, utilizing a large-scale dataset (MixFake) that includes a variety of audio scenarios, including both single-source and mixed-source conditions. The results demonstrate a significant improvement over existing state-of-the-art models, particularly in challenging background detection tasks. The use of Equal Error Rate (EER) as a performance metric is appropriate for the domain, and the reported improvements in EER values substantiate the effectiveness of the proposed method. The ablation studies provide additional insights into the contributions of each component of the framework, reinforcing the validity of the results.
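For reference, the Equal Error Rate can be approximated by sweeping a decision threshold and finding where the false acceptance and false rejection rates meet. This sketch assumes that higher scores mean "more likely bona fide" and uses a simple nearest-crossing rule without interpolation; the scores are made up, and evaluation toolkits used in the field may differ in these details.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Sweep thresholds and return the error rate where FAR and FRR are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        # False rejection: bona fide samples scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed samples scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Hypothetical detector scores: one bona fide sample (0.4) falls below one spoof (0.6).
eer = equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1])
separable = equal_error_rate([0.9, 0.8], [0.2, 0.1])
```

Since EER is threshold-independent, an absolute improvement of several percentage points reflects a shift of the whole score distribution rather than a better choice of operating point.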
The paper provides sufficient implementation details, including the architecture of the model, training strategies, and evaluation metrics, which facilitates reproducibility. The availability of the dataset and code on GitHub is a strong point, as it allows other researchers to validate the findings and build upon the work. However, the paper could benefit from more detailed hyperparameter settings and training conditions to enhance reproducibility further.
One limitation of the study is the potential overfitting to the specific conditions of the MixFake dataset, as the performance may not generalize to all real-world scenarios. Additionally, while the dataset covers a range of SNR levels, it may not encompass all possible acoustic environments encountered in practice. The reliance on SSL models may also limit the approach's applicability in scenarios where labeled data is scarce or unavailable.
The implications of this research are significant, as it addresses a pressing issue in the realm of audio deepfake detection, which has far-reaching consequences in security, privacy, and misinformation. The development of robust detection methods can help mitigate the risks associated with malicious use of speech synthesis technologies. Furthermore, the MixFake dataset can serve as a valuable resource for future research in audio processing and deepfake detection, fostering advancements in the field.
Competitive music transcription models require large amounts of paired audio-score data, which is scarce due to collection costs, alignment difficulty, and copyright restrictions. Meanwhile, vast quantities of unpaired audio recordings and symbolic scores are freely available but have gone unused. We adopt a cycle-consistent translation framework in which a small amount of paired data acts as a minimal anchor, unlocking the full potential of the unpaired pool. We find that (i) unpaired data yields surprisingly large gains, especially under limited supervision; (ii) unpaired audio contributes more than unpaired scores; and (iii) incorporating unlabeled audio from a new instrument during training improves transcription for that instrument without any paired supervision. Together, these results suggest that scaling unpaired data offers a practical path toward high-quality transcription for instruments where labeled data remains scarce.
Primary: Cornell University
All Institutions: Cornell University
This paper significantly advances the field of music transcription by introducing a novel framework that effectively utilizes unpaired data, demonstrating its potential to enhance transcription quality across various instruments and settings. The comprehensive evaluation and innovative methodology position this work as a valuable contribution to the machine learning community.
The paper presents a cycle-consistent translation framework that leverages a minimal amount of paired data to effectively utilize a vast pool of unpaired audio and symbolic scores. This approach addresses the significant challenge of data scarcity in music transcription by demonstrating that unpaired audio provides a stronger learning signal than unpaired scores. The methodology is well-structured, employing a pre-trained variational autoencoder (VAE) to create a continuous latent space for cross-modal translation, which is a notable innovation in the field.
The experiments are comprehensive, utilizing multiple datasets (MAESTRO, GuitarSet, MusicNet-EM) to validate the proposed method across various scenarios, including low-resource settings and cross-instrument generalization. The results demonstrate significant improvements in transcription accuracy, particularly when unpaired data is incorporated, and highlight the robustness of the model against distribution shifts. The use of Frame F1 as a performance metric is appropriate for the task.
The paper provides sufficient implementation details, including the architecture of the models, training procedures, and datasets used. However, the lack of a direct link to a demo or interactive visualization limits the ease of reproducibility for external researchers who may want to validate the findings.
One limitation is the reliance on a small amount of paired data to anchor the training process, which may not always be feasible in practice. Additionally, while the results are promising, the model's performance still lags behind fully supervised methods, indicating room for improvement. The paper also does not explore the potential impact of varying the amount of unpaired data in more detail.
The findings have significant implications for music transcription, particularly in contexts where labeled data is scarce or unavailable. By unlocking the potential of unpaired data, this research could lead to advancements in automatic music transcription systems, making them more accessible and applicable to a wider range of instruments and styles. The approach could also inspire further research into unsupervised learning techniques in other domains.
In this technical report, we describe our submission for the WildSpoof Challenge TTS Track: Text-to-Speech with In-the-Wild Data. We introduce F5-TTS-DPS, a model built upon the F5-TTS architecture. Our approach integrates Exponential Moving Average (EMA) into supervised fine-tuning to stabilize training and improve generalization. To enhance synthesis fidelity, we leverage large language models (LLMs) and large audio language models (LALMs) for dual-scoring prompt selection, filtering reference audio and text prompts to ensure quality while addressing alignment issues in noisy datasets. Experimental evaluation demonstrates that F5-TTS-DPS achieves strong performance with UTMOS of 3.20 and speaker similarity of 0.51 on the development set. More importantly, our model achieves the best a-DCF scores of 0.1582, 0.5233, and 0.2562 across three advanced SASV systems among all submissions, indicating our synthesized speech is the most difficult to detect and exhibits the highest degree of naturalness and authenticity. Combined with competitive WER performance, these results validate the effectiveness of our approach in generating natural-sounding speech with strong spoofing capabilities.
Primary: Ant Group
All Institutions: Ant Group, The Chinese University of Hong Kong, Zhejiang University
The main contribution of this paper is the introduction of F5-TTS-DPS, a novel TTS model that effectively combines EMA for training stability and dual-scoring prompt selection for improved synthesis quality, demonstrating significant advancements in TTS technology for in-the-wild applications. This work is poised to influence future research in TTS and related fields, particularly in enhancing the robustness and naturalness of synthesized speech in diverse acoustic environments.
The paper introduces F5-TTS-DPS, which innovatively integrates Exponential Moving Average (EMA) into the fine-tuning process to enhance training stability and generalization in TTS models. The dual-scoring prompt selection mechanism, which utilizes both large language models (LLMs) and large audio language models (LALMs), is a significant methodological advancement that addresses the challenges of noisy in-the-wild datasets. This approach not only improves the quality of the synthesized speech but also ensures semantic coherence between audio and text prompts. The methodology is well-structured and demonstrates a clear understanding of the challenges associated with TTS in real-world scenarios.
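To make the two mechanisms described above concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the decay value, the `Prompt` fields, the score ranges, and both thresholds are hypothetical choices made only for illustration, and the paper's actual scoring prompts and selection rule are not reproduced here.

```python
from dataclasses import dataclass


def ema_update(shadow, weights, decay=0.999):
    """Move the shadow (averaged) weights toward the current weights.

    Standard exponential moving average: shadow <- decay * shadow + (1 - decay) * weights.
    The decay value is an assumed default, not taken from the paper.
    """
    return [decay * s + (1.0 - decay) * w for s, w in zip(shadow, weights)]


@dataclass
class Prompt:
    text: str
    llm_score: float   # hypothetical text-quality score from an LLM, assumed in [0, 1]
    lalm_score: float  # hypothetical audio/text consistency score from an LALM, assumed in [0, 1]


def select_prompts(prompts, min_llm=0.5, min_lalm=0.5):
    """Keep prompts that clear both (assumed) thresholds, best combined score first."""
    kept = [p for p in prompts if p.llm_score >= min_llm and p.lalm_score >= min_lalm]
    return sorted(kept, key=lambda p: p.llm_score + p.lalm_score, reverse=True)


if __name__ == "__main__":
    shadow = ema_update([0.0, 1.0], [1.0, 1.0], decay=0.9)
    print(shadow)
    candidates = [Prompt("a", 0.9, 0.8), Prompt("b", 0.4, 0.9), Prompt("c", 0.6, 0.95)]
    print([p.text for p in select_prompts(candidates)])
```

In this sketch a prompt must pass both scorers to be retained, which mirrors the idea of filtering on text quality and audio/text alignment jointly; a weighted sum or rank fusion would be an equally plausible reading of "dual scoring".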
The experimental setup is robust, with a comprehensive ablation study that evaluates the contributions of various components of the F5-TTS-DPS system. The authors provide detailed metrics, including UTMOS, speaker similarity, and a-DCF scores, which are critical for assessing the performance of TTS systems. The reported results indicate that the proposed model outperforms existing systems, achieving the best a-DCF scores across multiple advanced SASV systems, thereby validating the effectiveness of the proposed methods.
The paper provides sufficient details regarding the training configuration, including hyperparameters and dataset descriptions. However, the lack of a publicly accessible demo or project URL limits the reproducibility of the results. Future work could benefit from sharing code and models to facilitate independent verification of the findings.
While the paper presents strong results, it does not discuss potential limitations arising from the diversity of the training data, nor how well the model generalizes across languages and accents. Additionally, the reliance on large models for prompt selection may introduce computational overhead, which could be a barrier to practical deployment in resource-constrained environments.
The advancements in TTS synthesis presented in this paper have significant implications for various applications, including virtual assistants, audiobooks, and accessibility technologies. By improving the naturalness and robustness of synthesized speech, the work contributes to making TTS systems more effective in real-world scenarios, thereby enhancing user experience and accessibility.