Audio ML Papers

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang ... · arXiv

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence-arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

Institutional Affiliations

Primary: Columbia University

All Institutions: Columbia University

Demo · GitHub

ML Relevance Analysis (84)

The main contribution of this paper is the introduction of Sympatheia, a voice-native framework for emotionally aligned speech dialogue that integrates implicit and explicit affect conditioning. This work represents a significant advancement in the development of empathetic voice assistants, providing a comprehensive approach to generating emotionally appropriate responses in spoken dialogue systems. The combination of a novel dataset, robust methodology, and thorough evaluation underscores its importance in the field of machine learning and audio processing.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is robust and innovative, combining implicit affect inference from user speech with explicit valence-arousal (VA) conditioning. The authors construct a novel dataset (Sympatheia-18k) that allows for the training of a speech-to-speech dialogue system capable of generating emotionally appropriate responses. The use of continuous VA coordinates as a conditioning mechanism is a significant advancement over traditional discrete emotion categories, allowing for more nuanced emotional responses. The integration of multimodal emotion sensing modules adds further depth to the system, making it adaptable to various input types. The architecture follows a well-established speech-language model (GLM-4-Voice) but enhances it with emotional conditioning, which is a thoughtful approach to improving empathetic dialogue systems.

Experimental Evaluation

The experimental evaluation is comprehensive, utilizing both automated and human assessments to evaluate the empathetic response quality of the Sympatheia system. The authors employ a variety of metrics, including empathy scores from an audio-capable LLM and a human Emotion Mean Opinion Score (MOS) study, which provides a well-rounded view of the model's performance. The results indicate that Sympatheia significantly outperforms baseline models in generating emotionally appropriate responses, validating the effectiveness of the proposed methods. The use of both emotional and neutral splits in the dataset allows for a thorough examination of the model's capabilities across different emotional contexts.

Reproducibility

The paper provides implementation details, including training configurations and dataset generation processes, which enhance reproducibility. The availability of the project code on GitHub and the dataset on Hugging Face further supports replication by other researchers. However, the reliance on synthetic data for training may introduce variability that could affect reproducibility in real-world applications.

Limitations

The paper acknowledges several limitations, including the synthetic nature of the training data, which may not fully capture the complexity of real-world conversations. Additionally, the fixed VA anchors used for emotional conditioning may not universally apply across different cultures or individual expressions of emotion. The authors also note that the current evaluation primarily relies on automated assessments, which may miss nuanced failures in empathy and appropriateness.

Broader Impact

The potential applications of Sympatheia are significant, particularly in assistive technologies, education, and mental health support, where emotionally aware interactions can enhance user experience. However, the deployment of such systems raises ethical considerations regarding privacy and the potential for misuse in manipulative contexts. The authors emphasize the need for safeguards and responsible deployment practices to mitigate these risks.

Analysis: Full Paper • Full text: 50,026 characters

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen ... · arXiv

We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density of the relevant components in the representation while using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.

Institutional Affiliations

Primary: Norwegian University of Science and Technology

All Institutions: Norwegian University of Science and Technology, Tsinghua University

ML Relevance Analysis (83)

This paper presents a novel framework for using continuous normalizing flows in out-of-distribution detection, significantly advancing the understanding and application of generative models in high-dimensional data analysis. The methodology is innovative, addressing key challenges in the field, and the experimental results demonstrate its effectiveness in a practical application.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel Lagrangian sub-flow (LSF) framework for out-of-distribution (OOD) detection using continuous normalizing flows (CNFs). The methodology is well-grounded in fluid dynamics principles, allowing for localized analysis of high-dimensional data while maintaining global context. The approach effectively addresses the "likelihood paradox" by isolating relevant components in the data representation, which is a significant advancement in the field of generative models. The proposed geometric diagnostic signals and metrics for phoneme-level mispronunciation detection are innovative and provide a fresh perspective on OOD detection.

Experimental Evaluation

The experiments are robust, utilizing a real-world dataset (CMU Kids) for zero-shot phoneme-level mispronunciation detection. The results demonstrate the superiority of the proposed metrics over traditional likelihood-based methods, highlighting the effectiveness of the LSF framework. The evaluation metrics, including ROC-AUC, are appropriate for the task, although further validation across diverse datasets would strengthen the findings.
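
For readers unfamiliar with the ROC-AUC metric mentioned above, the following is a minimal sketch of how it can be computed for a detector that assigns one anomaly score per phoneme. It is not the paper's evaluation code: the function name `roc_auc`, the convention that label 1 means "mispronounced", and the toy scores are all assumptions made for illustration. The sketch uses the pairwise (Mann-Whitney) definition of AUC, which is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    Labels are assumed to be 1 (positive, e.g. mispronounced) or 0 (negative).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical anomaly scores for four phonemes (higher = more suspicious).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
print(roc_auc(toy_scores, toy_labels))
```

The quadratic pairwise loop is chosen for clarity; a rank-based implementation would be preferable for large evaluation sets.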

Reproducibility

The paper provides sufficient details on the experimental setup, including model training and evaluation processes. However, the lack of publicly available code or a demo limits reproducibility. Clear descriptions of the methods and metrics used contribute positively, but access to implementation details would enhance reproducibility.

Limitations

The study is primarily focused on a specific application in speech synthesis, which may limit the generalizability of the findings. The authors acknowledge the need for further validation across other domains, indicating that the framework's applicability is yet to be fully explored. Additionally, the complexity of the proposed methods may pose challenges for practical implementation in real-time systems.

Broader Impact

The proposed framework has the potential to significantly improve OOD detection in various applications beyond speech synthesis, such as computer vision and medical imaging. By enhancing the ability to detect mispronunciations and other anomalies, this work could lead to advancements in automated speech recognition and generative modeling, ultimately benefiting user experience and system reliability.

Analysis: Full Paper • Full text: 44,979 characters

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams ... · arXiv

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.

Institutional Affiliations

Primary: unknown

All Institutions: unknown

GitHub

ML Relevance Analysis (75)

The main contribution of this paper is the introduction of QuAP, a prototype system that integrates content-based audio retrieval and procedural sound generation, thereby addressing the fragmentation in current sound design workflows. This work represents a significant advancement in audio processing, combining innovative methodologies with practical applications, and highlights the importance of user-centered design in the development of creative tools.

Comprehensive Analysis

Methodology Assessment

The methodology employed in the development of QuAP is robust, integrating a hybrid retrieval system with procedural audio synthesis and an intelligent parameter assistant. The use of MobileNet for audio embeddings and the feature-driven bottleneck framework for optimizing synthesis parameters demonstrates a thoughtful approach to addressing the challenges in sound design workflows. However, the paper could benefit from a more detailed description of the implementation specifics and the exact parameters used in the optimization process.

Experimental Evaluation

The experimental evaluation is well-structured, utilizing a MUSHRA subjective evaluation to assess the quality of the synthesized audio and an ablation study to compare encoder architectures. The results indicate statistically significant improvements in sound quality for most models, which supports the effectiveness of the proposed system. However, the relatively small sample size in the user evaluation (16 participants) may limit the generalizability of the findings.

Reproducibility

While the paper provides a project URL and mentions the use of established datasets and frameworks, it lacks detailed implementation instructions or code availability, which could hinder reproducibility. More explicit documentation on the setup and execution of experiments would enhance this aspect.

Limitations

The study acknowledges limitations, particularly in the synthesis quality of certain models (e.g., Rocket and Jet) and the narrow scope of sound categories supported by QuAP. The reliance on subjective evaluations may also introduce biases, and the tool's performance in real-world scenarios remains to be fully validated.

Broader Impact

QuAP has the potential to significantly impact sound design practices by streamlining workflows and enhancing creative exploration. By unifying retrieval and synthesis, it could facilitate more efficient sound design processes across various industries, including film, gaming, and music production. The focus on maintaining creative agency while providing intelligent assistance is particularly relevant in the context of increasing automation in creative fields.

Analysis: Full Paper • Full text: 30,559 characters

Latent Space Disentanglement via Activation Steering for Interpretable Attribute Control in Symbolic Music Generation

Ioannis Prokopiou, Pantelis Vikatos, Maximos Kaliakatsos-Papakostas ... · EUSIPCO 2026 (34th European Signal Processing Conference)

Transformer-based architectures have significantly advanced the generation of complex symbolic sequences, yet a significant gap remains in achieving fine-grained, interpretable control over discrete signal attributes. This paper investigates the mechanistic interpretability of the Multitrack Music Transformer (MMT) and proposes a framework for deterministic attribute modulation without retraining to bridge this gap via inference-time activation steering. Utilizing the Difference-in-Means (DiffMean) methodology, we isolate latent directions for signal attributes, specifically Pitch and Duration, within the residual stream. We validate the Linear Representation Hypothesis in this domain, achieving high correlation between steering magnitude and attribute shift. To address the inherent feature entanglement in multi-attribute steering, we introduce a Dual Steering framework utilizing Gram-Schmidt Orthogonalization. Experimental results demonstrate that this geometric decoupling reduces conceptual interference and signal degradation compared to naive vector addition, enabling independent deterministic control even against strong autoregressive conditioning.

Institutional Affiliations

Primary: Athens University of Economics and Business

All Institutions: Athens University of Economics and Business, Orfium, Hellenic Mediterranean University, National Center for Scientific Research “Demokritos”

Demo

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a framework for deterministic attribute modulation in symbolic music generation through activation steering, which enhances interpretability and control without the need for retraining. This work is significant as it bridges the gap between complex generative models and user-driven control, paving the way for more interactive and user-friendly music generation systems.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel approach to activation steering in the Multitrack Music Transformer (MMT) by utilizing the Difference-in-Means (DiffMean) methodology to isolate latent directions for musical attributes. The introduction of a Dual Steering framework using Gram-Schmidt Orthogonalization is a significant advancement in addressing feature entanglement, allowing for independent control of attributes like Pitch and Duration. The methodology is well-structured, leveraging existing theories in mechanistic interpretability while innovatively applying them to symbolic music generation.
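
The two ideas named above, a Difference-in-Means direction and Gram-Schmidt orthogonalization of a second direction against the first, can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's implementation: the function names, the three-dimensional toy "activations", and the steering strengths are all hypothetical, and a real residual stream would be far higher-dimensional and read from a specific transformer layer.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty set of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def diff_mean(positive: Sequence[Sequence[float]],
              negative: Sequence[Sequence[float]]) -> Vector:
    """Difference-in-Means direction: mean(positive set) minus mean(negative set)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(vector: Sequence[float], reference: Sequence[float]) -> Vector:
    """One Gram-Schmidt step: subtract from `vector` its projection onto `reference`."""
    norm_sq = dot(reference, reference)
    if norm_sq == 0.0:
        return list(vector)
    scale = dot(vector, reference) / norm_sq
    return [v - scale * r for v, r in zip(vector, reference)]


def steer(hidden: Sequence[float],
          directions: Sequence[Sequence[float]],
          strengths: Sequence[float]) -> Vector:
    """Add each direction to the hidden state, scaled by its steering strength."""
    out = list(hidden)
    for direction, alpha in zip(directions, strengths):
        out = [h + alpha * d for h, d in zip(out, direction)]
    return out


# Hypothetical activations collected for contrasting attribute values.
high_pitch = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
low_pitch = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]
long_duration = [[1.0, 3.0, 0.0], [3.0, 5.0, 0.0]]
short_duration = [[0.0, 1.0, 0.0], [2.0, 3.0, 0.0]]

pitch_dir = diff_mean(high_pitch, low_pitch)
duration_dir = diff_mean(long_duration, short_duration)
# Remove the pitch component from the duration direction before dual steering.
duration_orth = orthogonalize(duration_dir, pitch_dir)
steered = steer([0.0, 0.0, 0.0], [pitch_dir, duration_orth], [1.0, 0.5])
```

In this toy case the raw duration direction overlaps with the pitch direction, so adding it naively would also shift the pitch coordinate; after orthogonalization the two steering terms act on separate axes.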

Experimental Evaluation

The experimental setup is robust, with clear definitions of the steering vectors and comprehensive evaluations across both unconditional and conditional generation paradigms. The use of statistical measures such as Pearson correlation coefficients and R² values provides a solid quantitative basis for the effectiveness of the steering methods. The results demonstrate a high degree of success in achieving the intended attribute shifts, with detailed analysis of steering dynamics across various layers of the transformer architecture.
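
As a rough illustration of the statistics mentioned above, the sketch below computes the Pearson correlation coefficient and the R² of a least-squares line relating steering magnitude to the observed attribute shift. The function name `pearson_and_r2` and the numbers in the example are assumptions made for illustration, not values from the paper.

```python
import math
from typing import Sequence, Tuple


def pearson_and_r2(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (Pearson r, R^2 of the least-squares line y = slope * x + intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("both sequences must vary")

    r = cov / math.sqrt(var_x * var_y)

    # Fit the line by ordinary least squares and measure the explained variance.
    slope = cov / var_x
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    r2 = 1.0 - ss_res / var_y
    return r, r2


# Hypothetical steering magnitudes and measured mean pitch shifts (in semitones).
alphas = [-2.0, -1.0, 0.0, 1.0, 2.0]
pitch_shifts = [-3.9, -2.2, 0.1, 1.8, 4.1]
r, r2 = pearson_and_r2(alphas, pitch_shifts)
print(f"r = {r:.3f}, R^2 = {r2:.3f}")
```

For a single-predictor linear fit with an intercept, R² equals the square of Pearson's r, so reporting both mainly conveys the sign of the relationship in addition to its strength.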

Reproducibility

The paper includes sufficient detail regarding the model architecture, data representation, and experimental procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the experiments. The URL provided for audio examples is a positive aspect, but a more comprehensive project URL would bolster reproducibility further.

Limitations

One limitation is the reliance on a single dataset (SOD), which may affect the generalizability of the findings. Additionally, while the paper addresses conceptual interference, the methods for dual steering may still encounter challenges in more complex musical contexts or with additional attributes. The paper could also benefit from a discussion on the computational efficiency of the proposed methods in real-time applications.

Broader Impact

This research has the potential to significantly impact the field of music generation and AI-driven creative tools, providing musicians and composers with more precise control over generated outputs. The findings could be applied in various applications, including algorithmic composition, interactive music systems, and educational tools for music theory. The focus on mechanistic interpretability also contributes to the broader discourse on transparency and explainability in AI systems.

Analysis: Full Paper • Full text: 18,814 characters

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

Guangyin Bao, Taiping Zeng, Jianfeng Feng ... · arXiv

Reconstructing continuous speech from non-invasive neural recordings is a fundamental problem for probing human auditory perception and building safe, scalable speech brain-computer interfaces. Despite recent progress, intelligible reconstruction remains elusive, as non-invasive recordings are inherently noisy, spatially blurred, and only partially preserve information about perceived speech. Existing methods directly map neural activity to entangled speech representations before synthesizing waveforms with neural vocoders, resulting in spectral-similar but unintelligible results. To overcome these limitations, we introduce MindVoice, a neuro-to-speech reconstruction framework that uses pretrained models to compensate for the incomplete semantic and acoustic information in neural recordings. MindVoice disentangles reconstruction into two complementary pathways: one recovers high-level semantic content, while the other estimates fine-grained acoustic attributes. These inferred representations are then fused with powerful speech generation models and in-context voice cloning to synthesize natural and intelligible utterances. Extensive experiments on EEG and MEG demonstrate that MindVoice substantially outperforms existing methods on various metrics. These results show that pretrained priors provide a principled way to bridge the gap between noisy neural recordings and natural speech, highlighting a promising attempt for auditory neuroscience research and non-invasive speech brain-computer interfaces.

Institutional Affiliations

Primary: Fudan University

All Institutions: Fudan University

ML Relevance Analysis (83)

The MindVoice framework represents a significant advancement in reconstructing intelligible speech from non-invasive neural signals, utilizing a novel dual-stream architecture that effectively leverages pretrained models to address the challenges posed by noisy and incomplete neural recordings. This work has the potential to significantly impact both auditory neuroscience and speech technology.

Comprehensive Analysis

Methodology Assessment

The proposed MindVoice framework introduces a dual-stream architecture that separates semantic and acoustic reconstruction, leveraging pretrained models to enhance the intelligibility of reconstructed speech from non-invasive neural signals. This approach is innovative as it addresses the inherent noise and spatial blurring of neural recordings by disentangling the reconstruction process into two complementary pathways. The use of pretrained models for both semantic and acoustic attributes is a significant methodological advancement, allowing the model to compensate for the incomplete information present in neural signals. The architecture's design is well-justified, and the integration of various neural network components, including CNNs and Transformers, is appropriate for the task.

Experimental Evaluation

The authors conduct extensive experiments on two datasets (Brennan EEG and Gwilliams MEG), demonstrating that MindVoice outperforms existing baselines across multiple metrics, including semantic accuracy and speech quality. The evaluation metrics employed, such as HuBERT representation similarity and BERTScore-F1, are robust and relevant for assessing the intelligibility and quality of reconstructed speech. The results indicate a clear improvement over previous methods, validating the effectiveness of the proposed framework. However, the paper could benefit from more detailed comparisons with additional baselines and a broader range of evaluation metrics.

Reproducibility

The implementation details are provided, including the architecture, training parameters, and preprocessing steps. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should consider releasing the code and models to facilitate further research and validation by the community.

Limitations

The study acknowledges limitations, including the model's tendency to produce generative hallucinations when neural signals do not provide sufficient information. The focus on semantic and timbre similarity may compromise fine-grained temporal fidelity, which is critical for certain applications. Additionally, the framework's applicability is currently limited to non-invasive neural signals related to auditory perception, leaving open questions about its performance on other types of neural signals.

Broader Impact

The research has significant implications for the development of non-invasive speech brain-computer interfaces, potentially enabling communication for individuals with speech impairments. It also contributes to our understanding of auditory processing in the brain, paving the way for future studies in auditory neuroscience. The framework's ability to reconstruct intelligible speech from neural signals could lead to advancements in assistive technologies and enhance our understanding of human cognition.

Analysis: Full Paper • Full text: 50,026 characters

Privacy-preserving Prosody Representation Learning

Kevin Everson, Mari Ostendorf · ACL 2026

Speech representations that capture prosodic information can be useful for both understanding and generation. However, speaker characteristics are reflected in acoustic-prosodic features (e.g., pitch). To address privacy concerns from the leakage of identity information, we propose a new self-supervised approach to learning prosody representations that incorporates speaker disentanglement strategies. We evaluate our encoder on three tasks to probe representation capabilities, including pitch reconstruction and detection of different prosodic events. Our encoder outperforms raw prosody and HuBERT-base baselines, achieving strong speaker disentanglement without adverse impact on prosody-related downstream tasks.

Institutional Affiliations

Primary: University of Washington

All Institutions: University of Washington

GitHub

ML Relevance Analysis (78)

The main contribution of this paper is the development of a self-supervised prosody encoder that successfully disentangles speaker characteristics while preserving prosodic information, addressing critical privacy concerns in speech processing. The technical contributions and innovative methodology position this work as a meaningful advancement in the field of audio processing, with potential applications in privacy-sensitive speech technologies.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is robust, leveraging self-supervised learning to create a prosody encoder that effectively disentangles speaker characteristics from prosodic features. The use of glottal source estimation as input is innovative, and the combination of adversarial training with speaker normalization is a thoughtful approach to mitigate privacy concerns while maintaining prosody representation quality. The architecture builds on existing models like HuBERT and ProsodyBERT, but introduces significant enhancements, particularly in the context of privacy-preserving applications.

Experimental Evaluation

The experimental evaluation is comprehensive, utilizing multiple tasks to assess the encoder's performance, including pitch reconstruction and prosodic event detection. The results demonstrate clear improvements over baseline models, indicating that the proposed methods effectively enhance prosody modeling without compromising speaker disentanglement. The use of extensive datasets, such as the GigaSpeech corpus, strengthens the validity of the findings.

Reproducibility

The paper provides detailed implementation information, including the training setup and the specific datasets used. However, the reliance on pseudo-labels for speaker normalization may affect reproducibility, as the effectiveness of the disentanglement strategies could vary with different labeling approaches. The GitHub repository linked in the paper aids in reproducibility, but the absence of publicly available code for some related works limits comparative evaluations.

Limitations

The paper acknowledges limitations, including the use of pseudo-labels instead of ground-truth speaker labels, which may hinder the effectiveness of the proposed methods. Additionally, the focus on local prosodic events could limit the generalizability of the findings to more complex paralinguistic tasks. The model's non-causal nature also restricts its application in real-time scenarios.

Broader Impact

The implications of this research are significant, particularly in the context of privacy-preserving speech technologies. By effectively disentangling speaker information from prosodic features, the proposed encoder can contribute to safer speech processing applications, such as AI assistants and voice synthesis systems, where user privacy is paramount. The approach could also inspire further research into privacy-preserving techniques across various domains of machine learning.

Analysis: Full Paper • Full text: 20,467 characters

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu ... · arXiv

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

Institutional Affiliations

Primary: National Taiwan University

All Institutions: National Taiwan University

ML Relevance Analysis (83)

This paper provides a unified taxonomy and empirical evaluation of jailbreak attacks and defenses for LALMs, contributing significantly to the understanding of vulnerabilities in audio-based models. The comprehensive approach and findings underscore the importance of considering multiple dimensions of safety and usability in the design of LALMs.

Comprehensive Analysis

Methodology Assessment

The paper presents a comprehensive taxonomy of jailbreak attacks and defenses in Large Audio Language Models (LALMs), categorizing them into semantic, acoustic, signal, and embedding-layer attacks, as well as guard-based, training-free, and training-based defenses. The methodology is robust, combining a structured survey with empirical evaluations across ten open-source LALMs, which allows for a fair comparison of various attack and defense strategies. The authors also introduce a cost-aware evaluation framework that considers not just attack success rates but also benign refusal and latency, which is a significant improvement over previous works that focused solely on success rates.
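The three quantities named above (attack success rate, benign refusal, and latency) can be tallied side by side. The following is only a minimal sketch: the `Trial` record, its field names, and the use of a simple mean latency are illustrative assumptions, not the paper's actual evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One model response. All field names are hypothetical."""
    harmful: bool      # True if the request came from the harmful set
    complied: bool     # True if the model answered instead of refusing
    latency_s: float   # wall-clock seconds spent producing the response


def cost_aware_summary(trials):
    """Report success, benign refusal, and latency together."""
    harmful = [t for t in trials if t.harmful]
    benign = [t for t in trials if not t.harmful]
    return {
        # Fraction of harmful requests the model complied with.
        "attack_success_rate": sum(t.complied for t in harmful) / len(harmful),
        # Fraction of benign requests the model wrongly refused.
        "benign_refusal_rate": sum(not t.complied for t in benign) / len(benign),
        # Average cost per request, over both sets.
        "mean_latency_s": sum(t.latency_s for t in trials) / len(trials),
    }


trials = [
    Trial(harmful=True, complied=True, latency_s=1.0),
    Trial(harmful=True, complied=False, latency_s=3.0),
    Trial(harmful=False, complied=False, latency_s=2.0),
    Trial(harmful=False, complied=True, latency_s=2.0),
]
summary = cost_aware_summary(trials)
```

Reporting the three numbers in one record makes the trade-off visible: a defense that lowers the first value while raising the second or third is not a free improvement.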

Experimental Evaluation

The experiments are well-structured, utilizing a controlled dataset from JailbreakBench with 100 harmful and 100 benign requests, allowing for a clear assessment of the effectiveness of various attacks and defenses. The results indicate that different attack strategies yield varying success rates, with the Acoustic Best-of-N attack exposing the most severe vulnerabilities. The empirical evaluation of defenses reveals a trade-off between robustness and usability, highlighting the complexity of ensuring safety in LALMs.

Reproducibility

The paper provides detailed descriptions of the experimental setup, including the datasets used, the models evaluated, and the specific attack and defense methods employed. However, the reliance on specific hardware and configurations may limit the reproducibility of results in different environments. The authors do not provide code or data access, which could further hinder reproducibility.

Limitations

The authors acknowledge several limitations, including model coverage restricted to ten open-source LALMs and the controlled nature of the dataset, which may not fully represent real-world scenarios. Additionally, the evaluation metrics used may not capture all aspects of deployment, such as user satisfaction with benign responses. The paper also does not explore all possible attack and defense categories outlined in the taxonomy.

Broader Impact

The findings of this paper have significant implications for the development of safe and robust LALMs, particularly in applications involving voice assistants and interactive systems. The emphasis on cost-aware evaluation and the identification of vulnerabilities across different modalities can guide future research in creating more resilient audio systems. The work also raises awareness about the potential for misuse of LALMs in bypassing safety mechanisms, highlighting the need for ongoing research into equitable and effective safety measures.

Analysis: Full Paper • Full text: 50,026 characters

ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Tiantian Feng, Anfeng Xu, Xuan Shi ... · arXiv

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Institutional Affiliations

Primary: University of Southern California

All Institutions: University of Southern California, The Ohio State University, University of California, Los Angeles, Harvard University, Boston University, University of Miami

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of the ChildVox benchmark, which systematically evaluates a wide range of child-centered audio and speech tasks, significantly advancing the field of child communication research. The comprehensive methodology, rigorous experimental design, and acknowledgment of limitations highlight the paper's significance and potential impact on future research and applications in audio processing for children.

Comprehensive Analysis

Methodology Assessment

The methodology presented in the paper is robust, as it introduces the ChildVox benchmark, which encompasses a wide range of child-centered audio and speech tasks. The integration of over 20 sub-tasks across 17 datasets is a significant advancement, allowing for a comprehensive evaluation of various audio and speech foundation models. The decision to define "voice" in children broadly, including physiological sounds and non-linguistic vocalizations, is innovative and necessary for understanding child communication. The evaluation of multiple model architectures, including self-supervised and ASR-oriented models, provides a well-rounded perspective on the capabilities of current technologies in this domain.

Experimental Evaluation

The experiments are thorough, with a clear structure that includes a variety of tasks and datasets. The benchmark results demonstrate that ChildVox provides high-performance models for recognizing a wide range of acoustic signals from children. The paper effectively compares the performance of different models on specific tasks, highlighting the strengths and weaknesses of each. The use of Macro-F1 scores for classification tasks and WER for ASR tasks is appropriate, ensuring that the evaluation metrics are relevant to the goals of the benchmark.
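For readers unfamiliar with the two metrics, here is a minimal pure-Python sketch of how Macro-F1 and WER are conventionally computed. The function names and example labels are illustrative; the benchmark's own text normalization and label sets are not described here and may differ.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),    # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


f1 = macro_f1(["cry", "cry", "laugh"], ["cry", "laugh", "laugh"])
wer = word_error_rate("the cat sat down", "the cat sit")
```

Macro averaging suits class-imbalanced tasks because every class contributes equally to the final score, regardless of how many examples it has.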

Reproducibility

The paper provides detailed information about the datasets, experimental setup, and model training parameters, which enhances reproducibility. However, the lack of publicly available code or models limits other researchers' ability to fully replicate the results. The authors mention plans to release models under a Responsible AI License, which is a positive step towards improving reproducibility in the future.

Limitations

The paper acknowledges several limitations, including the focus on English-language recordings, which may restrict generalizability to other languages and dialects. Additionally, the subjective nature of some tasks, such as affective vocalization classification, may introduce variability in annotation reliability. The authors also note that the benchmark does not cover all recent advancements in audio foundation models, which could limit its comprehensiveness.

Broader Impact

The ChildVox benchmark has significant implications for research in child development, speech therapy, and early childhood education. By providing a structured framework for evaluating child-centered audio processing, it can facilitate advancements in understanding children's communication and support the development of tools for monitoring and enhancing language skills. The potential applications in clinical settings for tracking speech production and language development are particularly noteworthy.

Analysis: Full Paper • Full text: 50,026 characters

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men ... · arXiv

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component represents only partially the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

Institutional Affiliations

Primary: Beijing University of Posts and Telecommunications

All Institutions: Beijing University of Posts and Telecommunications, University of Surrey

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of COMET, a novel framework for analyzing and mitigating the modality gap in audio-text multimodal contrastive embeddings, which significantly enhances the performance of zero-shot audio captioning tasks. The comprehensive analysis and innovative methodology position this work as a meaningful advancement in the field of multimodal machine learning.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel framework, COMET, utilizing Partial Least Squares Singular Value Decomposition (PLS-SVD) to analyze and mitigate the modality gap between audio and text embeddings in CLAP models. The methodology is well-structured, offering a fresh perspective on the decomposition of multimodal embeddings into interpretable concepts. The spectral truncation method proposed is innovative, allowing for effective dimensionality reduction while maintaining performance, which is a significant contribution to the field of multimodal contrastive learning.
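As a rough illustration of the general technique, the sketch below applies a textbook PLS-SVD to paired embeddings: it takes the SVD of the audio-text cross-covariance matrix and keeps only the top-k paired axes. The synthetic data, the function names, the centering step, and the choice of k are all assumptions made for this example; it is not the authors' implementation.

```python
import numpy as np


def fit_pls_svd(audio, text, k):
    """Fit paired axes from the SVD of the cross-covariance matrix."""
    mu_a, mu_t = audio.mean(axis=0), text.mean(axis=0)
    cross_cov = (audio - mu_a).T @ (text - mu_t) / (len(audio) - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    # Spectral truncation: keep only the k axes with the largest singular values.
    return mu_a, mu_t, u[:, :k], vt[:k].T, s[:k]


def project(x, mean, axes):
    """Center, project onto the retained axes, and L2-normalize each row."""
    z = (x - mean) @ axes
    return z / np.linalg.norm(z, axis=1, keepdims=True)


# Synthetic stand-in for paired embeddings: 4 shared factors in 16 dimensions.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 4))
audio = shared @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(200, 16))
# The constant offset mimics a mean shift between the two modalities.
text = shared @ rng.normal(size=(4, 16)) + 0.1 * rng.normal(size=(200, 16)) + 3.0

mu_a, mu_t, a_axes, t_axes, s = fit_pls_svd(audio, text, k=4)
za = project(audio, mu_a, a_axes)
zt = project(text, mu_t, t_axes)
similarity = za @ zt.T  # cosine similarities in the truncated concept space
```

Because the axes come from a single closed-form decomposition, no gradient training is involved, and the retained dimensionality k directly sets the size of the reduced embeddings.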

Experimental Evaluation

The experiments are comprehensive, utilizing standard datasets like Clotho and AudioCaps for evaluation. The results demonstrate that the proposed PLSHead method achieves comparable or improved performance over the original embeddings, validating the effectiveness of the approach. The paper provides detailed metrics for retrieval tasks, showcasing the robustness of the method across different scenarios, including in-domain and cross-domain evaluations.

Reproducibility

The paper lacks explicit implementation details or code availability, which could hinder reproducibility. While the methodology is clearly described, the absence of a publicly available codebase or demo limits other researchers' ability to replicate the findings.

Limitations

One limitation is the reliance on existing CLAP models, which may introduce biases based on their training data. Additionally, while the proposed methods show promise, the paper does not explore the potential impacts of varying the number of retained dimensions in the spectral truncation, which could affect generalization in different contexts.

Broader Impact

The findings have significant implications for audio understanding and generation tasks, particularly in zero-shot scenarios. By effectively bridging the modality gap, the proposed methods could enhance the performance of multimodal applications, making them more accessible and efficient. This work could pave the way for future research in multimodal learning and its applications in real-world scenarios.

Analysis: Full Paper • Full text: 50,026 characters

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha ... · arXiv

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain under-explored in ASR contexts. This paper analyzes three decoding schemes for DLM-based ASR: fixed-number, static confidence threshold, and dynamic confidence threshold. We propose measuring round-wise accuracy using Negative Log-Likelihood-based uncertainty as a proxy for decoding progress. Our results show that both threshold-based strategies significantly outperform fixed-number schemes in accuracy and speed. We attribute this to a property unique to ASR: most tokens reach high confidence early, allowing reliable ones to be harvested aggressively while leaving only difficult tokens for later rounds. Notably, the static-threshold strategy matches the accuracy of autoregressive decoding while offering superior efficiency.

Institutional Affiliations

Primary: KAIST

All Institutions: KAIST, Google DeepMind

ML Relevance Analysis (83)

The main contribution of this paper is the systematic evaluation of decoding strategies for DLM-based ASR, revealing that static and dynamic thresholding significantly enhance accuracy and speed compared to fixed-number decoding. This work provides a crucial step towards optimizing ASR systems, particularly in leveraging the unique properties of DLMs for improved performance.

Comprehensive Analysis

Methodology Assessment

The paper presents a systematic evaluation of decoding strategies for DLM-based ASR, comparing fixed-number, static threshold, and dynamic threshold approaches. The methodology is well-structured, utilizing Negative Log-Likelihood (NLL) as a measure of uncertainty, which is a novel approach in this context. The authors effectively analyze the performance of each strategy in terms of accuracy and speed, providing a clear rationale for their findings. However, the reliance on a single baseline model (Whisper-LLaDA) may limit the generalizability of the results.
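The difference between the schemes can be seen in a toy parallel-unmasking loop. Everything below is a hedged sketch: `toy_predict` is a hypothetical stand-in for the diffusion model, its confidence values are invented, and the linearly decaying threshold is only one possible reading of "dynamic", not the schedule used in the paper.

```python
def pick_fixed_number(conf, masked, n):
    """Fixed-number scheme: unmask the n most confident masked positions."""
    ranked = sorted(masked, key=lambda i: conf[i], reverse=True)
    return set(ranked[:n])


def pick_by_threshold(conf, masked, tau):
    """Threshold scheme: unmask every position with confidence >= tau."""
    chosen = {i for i in masked if conf[i] >= tau}
    # Always commit at least one token so decoding is guaranteed to finish.
    return chosen or {max(masked, key=lambda i: conf[i])}


def decode(predict, length, pick):
    """Iteratively fill a fully masked sequence; returns (tokens, rounds)."""
    tokens = [None] * length
    masked = set(range(length))
    rounds = 0
    while masked:
        proposals, conf = predict(tokens)
        chosen = pick(conf, masked, rounds)
        for i in chosen:
            tokens[i] = proposals[i]
        masked -= chosen
        rounds += 1
    return tokens, rounds


WORDS = ["the", "cat", "sat", "on", "mat"]
BASE_CONF = [0.99, 0.95, 0.97, 0.60, 0.98]  # invented values


def toy_predict(tokens):
    """Hypothetical model: confidence rises as more context is filled in."""
    filled = sum(t is not None for t in tokens)
    return WORDS, [min(1.0, c + 0.1 * filled) for c in BASE_CONF]


fixed = decode(toy_predict, 5, lambda c, m, r: pick_fixed_number(c, m, 1))
static = decode(toy_predict, 5, lambda c, m, r: pick_by_threshold(c, m, 0.9))
# Assumed schedule for illustration: the threshold relaxes each round.
dynamic = decode(
    toy_predict, 5,
    lambda c, m, r: pick_by_threshold(c, m, max(0.5, 0.99 - 0.2 * r)),
)
```

In this toy setting most positions are confident from the first round, so the threshold rules commit them all at once and spend later rounds only on the hard position, which mirrors the intuition the abstract gives for ASR.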

Experimental Evaluation

The experiments are comprehensive, utilizing the LibriSpeech dataset and focusing on various hyperparameters for each decoding strategy. The evaluation metrics, including Word Error Rate (WER) and Real-Time Factor (RTF), are appropriate for assessing the performance of ASR systems. The results indicate that threshold-based strategies significantly outperform fixed-number schemes, which is a valuable contribution to the field. However, the paper could benefit from additional experiments on diverse datasets to validate the findings further.

Reproducibility

The paper provides sufficient details on the experimental setup, including the training process and evaluation metrics. However, the absence of code or a project URL limits reproducibility. Future work should include sharing the implementation to facilitate validation by other researchers.

Limitations

The study is limited to clean read English speech from the LibriSpeech test-clean set, which may not fully represent the challenges of noisy or spontaneous speech. Additionally, the findings may not generalize to multilingual ASR systems, as the confidence distribution could vary significantly across different languages and contexts.

Broader Impact

The findings have significant implications for the development of more efficient ASR systems, particularly in applications requiring real-time processing. By demonstrating the effectiveness of threshold-based decoding strategies, this work could influence future research directions in ASR and related fields, potentially leading to advancements in speech technology and accessibility.

Analysis: Full Paper • Full text: 12,677 characters

Extracting accent features in spoken Brazilian Portuguese without sociolinguistic labels

Pedro H. L. Leite, Pedro Benevenuto Valadares, Luiz W. P. Biscainho · XLIV Brazilian Symposium on Telecommunications and Signal Processing (SBrT 2026)

Regional accent classification in Brazilian Portuguese (pt-BR) suffers from the need for reliable labeling. While large self-supervised learning (SSL) speech models are powerful, their training pipelines dilute sociophonetic information, since accent labels are generally not reliable or are not used in training objectives. This work introduces a novel workflow for feature extraction using only acoustic labels. By isolating explicit regional accent landmarks and using a phoneme-based forced aligner (ZIPA), our targeted feature set captures dialectal variance more effectively than utterance embeddings, demonstrating that localized features can outperform general-purpose architectures on accent-related tasks using minimal and objective data labels.

Institutional Affiliations

Primary: Faculdade de Engenharia Elétrica e Computação (FEEC)

All Institutions: Faculdade de Engenharia Elétrica e Computação (FEEC), CNPq, UFRJ, UNICAMP

GitHub

ML Relevance Analysis (83)

This paper presents a novel workflow for accent classification in Brazilian Portuguese, demonstrating that localized acoustic features can effectively capture dialectal variance without the need for sociolinguistic labels. The methodology and results contribute meaningfully to the field, showcasing the potential for improved speech processing techniques that are both interpretable and computationally efficient.

Comprehensive Analysis

Methodology Assessment

The methodology is innovative in its approach to accent classification by utilizing a purely audio-driven pipeline that relies on acoustic labels rather than sociolinguistic labels. The use of ZIPA for phoneme-based forced alignment to isolate accent markers is a significant methodological advancement. The authors effectively demonstrate the extraction of localized features that outperform general-purpose architectures, which is a novel contribution to the field of speech processing. The detailed description of the feature extraction process and the classification tasks is commendable, although the reliance on manual annotation may introduce bias.

Experimental Evaluation

The experimental evaluation is thorough, employing a variety of classifiers and a well-structured cross-validation protocol to assess the performance of the proposed features against established SSL models. The results indicate that the proposed method achieves competitive accuracy, which is a strong validation of the approach. However, the paper could benefit from more extensive comparisons with other state-of-the-art methods and a clearer presentation of results in tables.

Reproducibility

The paper provides sufficient detail regarding the methods and datasets used, which aids in reproducibility. However, the lack of publicly available code or datasets limits independent verification of results. The authors mention a companion webpage, which could potentially provide additional resources, but this needs to be explicitly linked.

Limitations

The study acknowledges that the accent markers used are not exhaustive for all Brazilian Portuguese accents, indicating a limitation in generalizability. The reliance on manual annotation for training data may also introduce biases that affect the model's performance. Additionally, the paper does not address potential challenges in real-world applications, such as variability in speaker accents and environmental noise.

Broader Impact

The work has significant implications for the field of speech recognition and sociolinguistics, particularly in regions with diverse dialects like Brazil. By demonstrating that reliable accent classification can be achieved without sociolinguistic labels, the research opens avenues for more inclusive and accessible speech technologies. This could enhance applications in automatic speech recognition, language learning, and sociophonetic research.

Analysis: Full Paper • Full text: 21,976 characters

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Bohan Li, Shi Lian, Hankun Wang ... · arXiv

Unified speech foundation models require a holistic tokenization space that is both learnable by language models and decodable into high-quality waveforms. Existing speech tokenizers, however, often fail to satisfy these requirements simultaneously, leading to increased architectural complexity and more involved training designs. We propose HoliTok, a continuous Holistic speech Tokenization model designed for unified generation-understanding modeling. HoliTok encodes 48 kHz speech into a compact 25 Hz sequence of 128-dimensional latents. It is trained with a progressive strategy that jointly preserves signal-level fidelity, incorporates semantic information, and maintains strong latent learnability. Based on this tokenization, we build a unified AR+DiT model for speech synthesis and recognition, where the same latent sequence supports both generation-specific and unified generation-understanding tasks. Experiments show that HoliTok achieves competitive reconstruction fidelity, improves generative learnability for high-quality and controllable synthesis, and, among the evaluated representations, is the only one that operates robustly in our unified generation-understanding architecture without additional optimization tricks. These results suggest that HoliTok serves as an effective speech tokenizer and a foundational representation interface for unified spoken language modeling. The code is available at: https://github.com/bovod-sjtu/HoliTok.

Institutional Affiliations

Primary: Shanghai Jiao Tong University

All Institutions: Shanghai Jiao Tong University, Xiaohongshu Inc

GitHub

ML Relevance Analysis (83)

The paper presents HoliTok, a continuous holistic tokenization model that effectively bridges the gap between speech generation and understanding tasks. Its innovative approach and strong experimental results position it as a significant contribution to the field of audio machine learning.

Comprehensive Analysis

Methodology Assessment

The proposed HoliTok model introduces a novel continuous tokenization approach that effectively balances the requirements of learnability and decodability for unified speech generation and understanding. The progressive training strategy enhances the model's ability to preserve signal fidelity while incorporating semantic information, which is a significant advancement over existing tokenization methods. The architecture's integration of a variational autoencoder with a temporal bottleneck and a downstream-aware supervision network is a thoughtful design choice that addresses the limitations of traditional tokenizers.
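The rates quoted in the abstract (48 kHz input, 25 Hz latents, 128 dimensions) fix the size of the latent sequence. The short calculation below works that out; the constant and function names are illustrative, and flooring partial frames is an assumption, since the model's padding behaviour is not described here.

```python
SAMPLE_RATE_HZ = 48_000   # input waveform rate, from the abstract
FRAME_RATE_HZ = 25        # latent frame rate, from the abstract
LATENT_DIM = 128          # values per latent frame, from the abstract

# Waveform samples summarized by each latent frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ

# Latent scalars produced per second of audio.
latent_values_per_second = FRAME_RATE_HZ * LATENT_DIM


def latent_shape(duration_s):
    """Shape of the latent sequence for a clip, assuming partial frames are dropped."""
    return int(duration_s * FRAME_RATE_HZ), LATENT_DIM


ten_second_clip = latent_shape(10.0)
```

The short sequence length is what matters for an autoregressive model, since its cost grows with the number of steps rather than with the width of each latent.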

Experimental Evaluation

The experiments conducted demonstrate the model's competitive performance in reconstruction fidelity, speech synthesis, and unified generation-understanding tasks. The evaluation metrics used, including PESQ, STOI, and WER, provide a robust framework for assessing the quality of the generated outputs. The results indicate that HoliTok not only outperforms existing methods but also maintains a compact latent representation, which is crucial for practical applications in speech technology.

Reproducibility

The paper provides a clear description of the model architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of detailed hyperparameter settings and specific training configurations in the main text may pose challenges for full replication. The availability of the code on GitHub is a positive aspect that aids in reproducibility efforts.

Limitations

The study primarily focuses on speech generation and understanding, leaving out broader audio applications such as environmental sounds and music. The evaluation is limited to a specific architecture (AR+DiT), which may not capture the full potential of the proposed tokenizer across various modeling paradigms. Future work should explore these areas to validate the generalizability of the approach.

Broader Impact

The advancements presented in this paper have the potential to significantly enhance speech synthesis and recognition technologies, making them more efficient and effective. The model's ability to serve as a unified interface for both tasks could lead to improvements in applications such as virtual assistants, automated transcription services, and interactive voice response systems. The implications for accessibility and user interaction with technology are substantial, as improved speech models can facilitate better communication for individuals with speech impairments.

Analysis: Full Paper • Full text: 42,108 characters

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Sung-Lin Yeh, Wei Zhou, Gil Keren ... · arXiv

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks. To address this limitation, we introduce a discrete latent variable model on mel spectrograms that jointly optimizes the encoder and the speech language model. Joint optimization not only brings improvements over codec-based and other mel-spectrogram-based baselines on zero-shot Text-to-Speech (TTS) and Speech-to-Text (STT) tasks, but also effectively alleviates common issues in autoregressive mel-spectrogram modeling, such as prolonged silence generation and word omissions.

Institutional Affiliations

Primary: University of Edinburgh

All Institutions: University of Edinburgh, Google DeepMind, Meta Superintelligence Labs

ML Relevance Analysis (83)

The main contribution of this work is the introduction of MELD, a joint optimization framework for speech language modeling that effectively integrates discrete latent variables to enhance TTS and STT performance. This approach represents a significant advancement in the field, addressing key limitations of existing methods and paving the way for future research in multimodal speech processing.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel approach to speech language modeling by introducing MELD, which integrates discrete latent variables into the autoregressive modeling of mel-spectrograms. This joint optimization of the encoder and autoregressive model addresses limitations of previous two-stage methods, particularly in preserving task-relevant information. The methodology is well-structured, leveraging variational inference to optimize a lower bound on the log likelihood, and effectively incorporates both TTS and STT tasks within a single framework. The use of discrete latent variables to suppress silence generation is a significant innovation, enhancing the model's performance over existing methods.
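
For orientation, the lower bound referred to above has the following generic shape for a latent variable model. This is the textbook evidence lower bound, not the paper's exact objective: the symbols $x$ (mel-spectrogram), $z$ (discrete latent sequence), $q_\phi$ (encoder) and $p_\theta$ (prior and decoder) are assumptions made here for illustration, and MELD's actual factorization over time steps and conditioning text may differ.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
\;-\; \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big),
\qquad
\mathrm{KL}\big(q \,\|\, p\big) = \sum_{z} q(z)\,\log\frac{q(z)}{p(z)} .
```

Because $z$ is discrete, the KL term is a finite sum over latent values rather than an integral. Optimizing both $\phi$ and $\theta$ against this one bound is what distinguishes joint training from a two-stage pipeline in which the encoder is frozen first.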

Experimental Evaluation

The experiments are comprehensive, utilizing the 960-hour subset of the LibriSpeech dataset for training and evaluation. The authors compare MELD against several baselines, including codec-based models and other mel-spectrogram-based approaches, demonstrating clear improvements in both TTS and STT tasks. The evaluation metrics include both subjective (MOS, speaker similarity) and objective (WER) assessments, providing a well-rounded view of the model's performance. The results indicate that MELD outperforms its competitors, particularly in reducing silence and improving word error rates.
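
Since WER carries much of the objective evidence here, a minimal sketch of how it is commonly computed may help: word-level edit distance divided by the reference length. The function name is hypothetical, and the text normalization the authors apply before scoring is not known from this summary, so plain whitespace splitting is assumed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One omitted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A word omission of the kind the abstract mentions shows up as a deletion, so each dropped word adds one error to the numerator.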

Reproducibility

The paper provides detailed implementation specifics, including model architecture, training configurations, and evaluation protocols. However, the authors acknowledge challenges in reproducing results from related work (e.g., MELLE), which may affect the perceived reliability of their comparisons. The use of specific datasets and training strategies is well-documented, but the lack of a public code repository or demo limits reproducibility.

Limitations

The authors note several limitations, including the difficulty in making fair comparisons between codec-based and mel-spectrogram-based methods due to differences in representation mapping. Additionally, while the joint optimization framework is promising, the paper does not explore its application to other speech tasks beyond TTS and STT. The potential for overfitting or collapsing solutions in the discrete latent space is also mentioned, although not observed in their experiments.

Broader Impact

The proposed model has significant implications for real-world applications in speech synthesis and recognition, particularly in enhancing the quality and efficiency of TTS systems. The ability to jointly model TTS and STT tasks could streamline workflows in various applications, such as virtual assistants and automated transcription services. However, ethical considerations regarding the misuse of speech generation technologies, such as voice cloning, must be addressed to ensure responsible use.

Analysis: Full Paper • Full text: 40,389 characters

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

Heejoon Koo, Yoon Tae Kim, Miika Toikkanen ... · arXiv

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.

Institutional Affiliations

Primary: University of Illinois Urbana-Champaign

All Institutions: University of Illinois Urbana-Champaign, Wonkwang University

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a causality-inspired multimodal federated domain generalization framework for respiratory sound classification, which effectively mitigates stethoscope-induced biases and enhances model robustness across heterogeneous devices. The technical contributions are substantial, offering a new lens through which to view the challenges of audio classification in medical contexts, thereby advancing the field significantly.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces a novel federated domain generalization framework specifically tailored for respiratory sound classification, addressing the critical issue of inter-stethoscope variability. The integration of a causality-inspired device style intervention network, counterfactual text augmentation, and gradient alignment represents a significant advancement in the field, as it not only tackles the entanglement of device style and disease content but also enhances the robustness of the model across heterogeneous devices. The approach is well-structured, leveraging causal inference principles to inform data augmentation strategies, which is a fresh perspective in the context of audio classification.

Experimental Evaluation

The experimental setup is robust, utilizing two well-defined datasets (ICBHI and SPRSound) and employing leave-one-device-out validation to rigorously assess the model's performance. The results demonstrate that the proposed method consistently outperforms conventional data augmentation and federated learning baselines, indicating its effectiveness in improving cross-device generalization. The ablation studies further substantiate the contributions of each component of the framework, providing clear evidence for the importance of the causality-inspired interventions.
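
The leave-one-device-out protocol can be sketched as follows. This is a minimal illustration of the splitting logic only, assuming each recording carries a device label; the record layout and the function name are hypothetical and are not taken from the paper or from the ICBHI or SPRSound tooling.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

Record = Tuple[str, str]  # (recording_id, device_name); hypothetical layout


def leave_one_device_out(
    records: List[Record],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_device, train_ids, test_ids), one fold per device."""
    by_device: Dict[str, List[str]] = defaultdict(list)
    for recording_id, device in records:
        by_device[device].append(recording_id)
    devices = sorted(by_device)
    for held_out in devices:
        # Training data comes only from the devices that are not held out.
        train_ids = [
            rid for dev in devices if dev != held_out for rid in by_device[dev]
        ]
        test_ids = list(by_device[held_out])
        yield held_out, train_ids, test_ids


if __name__ == "__main__":
    demo = [("r1", "A"), ("r2", "B"), ("r3", "A"), ("r4", "C")]
    for device, train_ids, test_ids in leave_one_device_out(demo):
        print(device, train_ids, test_ids)
```

Every fold tests on a device the model never saw during training, which is what makes the protocol a measure of cross-device generalization rather than in-distribution accuracy.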

Reproducibility

While the paper mentions that code will be released upon publication, the absence of a current project URL limits immediate reproducibility. The methodology is described in sufficient detail to allow for replication, but access to the code and datasets would be essential for full verification of results.

Limitations

One limitation is the reliance on specific datasets, which may not fully capture the diversity of respiratory sound recordings across different clinical settings. Additionally, the paper acknowledges the need for future work to address privacy concerns and computational efficiency in federated learning settings, which are critical for real-world applications.

Broader Impact

The framework has significant potential implications for telemedicine and automated pulmonary disease detection, particularly in enhancing the reliability of AI-driven diagnostics across various healthcare environments. By addressing device-induced biases, the work contributes to the broader goal of equitable healthcare access and improved patient outcomes.

Analysis: Full Paper • Full text: 22,303 characters

State-Anchored Complete-View Distillation for Robust Conversational Multimodal Emotion Recognition

Zhaoyan Pan, Xiangdong Li, Wenke Wu ... · arXiv

Conversational multimodal emotion recognition (MER) requires reliable prediction when language, acoustic, or visual observations are missing or unreliable. Many missing-modality methods reconstruct absent inputs, yet such recovery can be non-unique in dialogue context, and nonverbal cues may conflict with the target utterance. To this end, we propose CoRe-KD (Complete-view Reference-guided Knowledge Distillation), a state-anchored, conflict-regularized complete-view distillation framework for robust conversational MER. A complete-view teacher provides structured references, including prediction-level references, fused states, and modality-specific states. Complete-view State Anchoring (CSA) aligns incomplete-view student predictions and states with these references, while Nonverbal Conflict Exposure (NCE) trains on target-preserving nonverbal conflict views to reduce donor-label bias. Experiments on IEMOCAP and MELD, with CMU-MOSEI as a supplementary utterance-level check, show consistent gains under fixed- and random-missing protocols. Comprehensive ablation studies and further analyses support the role of CSA and the complementary effect of NCE.

Institutional Affiliations

Primary: Zhejiang University

All Institutions: Zhejiang University

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of CoRe-KD, a structured complete-view distillation framework that significantly enhances the robustness of conversational multimodal emotion recognition under incomplete observations. The methodology effectively addresses key challenges in the field, and the experimental results validate its effectiveness, marking a meaningful advancement in multimodal learning.

Comprehensive Analysis

Methodology Assessment

The proposed CoRe-KD framework innovatively addresses the challenges of multimodal emotion recognition (MER) under incomplete observations. It introduces two key components: Complete-view State Anchoring (CSA) and Nonverbal Conflict Exposure (NCE), which enhance the robustness of emotion recognition by aligning incomplete-view predictions with structured references from a complete-view teacher. The methodology is well-structured, leveraging knowledge distillation effectively while avoiding the pitfalls of input reconstruction, which is a common issue in existing methods. The use of Gaussian-inspired states for modality fusion is a notable technical contribution that adds precision to the alignment process.
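
The prediction-level part of aligning a student with a complete-view teacher can be illustrated with the standard distillation loss. The sketch below is generic temperature-scaled knowledge distillation, not CoRe-KD's actual objective: the state-level anchoring and the conflict views are not modeled, and the function names and the default temperature are assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(
    teacher_logits: List[float],
    student_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # logits from all modalities (made-up values)
    student = [1.0, 1.0, -0.5]  # logits with a modality missing (made-up values)
    print(distillation_kl(teacher, student))
```

The teacher sees every modality while the student sees an incomplete view, so minimizing this term pulls the student's emotion distribution toward what the complete view would have predicted without reconstructing the missing input.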

Experimental Evaluation

The experiments are comprehensive, utilizing established datasets (IEMOCAP, MELD, and CMU-MOSEI) to validate the effectiveness of CoRe-KD under both fixed- and random-missing protocols. The results demonstrate consistent improvements in accuracy and F1 scores compared to various baselines, indicating the robustness of the proposed method. The inclusion of ablation studies further strengthens the findings by elucidating the contributions of each component within the framework.

Reproducibility

The paper provides detailed implementation specifics, including training protocols, hyperparameters, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly accessible code repository limits the ease with which other researchers can replicate the results.

Limitations

One significant limitation is that CoRe-KD requires complete multimodal observations for training the teacher model, which may not be feasible in all real-world scenarios. Additionally, the NCE module relies on controlled conflict views that might not comprehensively cover all possible real-world misalignments or corruptions in multimodal data.

Broader Impact

The advancements in robust conversational MER have implications for various applications, including human-computer interaction, sentiment analysis, and affective computing. By improving the reliability of emotion recognition systems in the presence of missing or unreliable modalities, this work could enhance user experience in applications such as virtual assistants, mental health monitoring, and interactive entertainment.

Analysis: Full Paper • Full text: 50,026 characters

The WER Trap: Shattering the Illusion of Unified Tokens in Speech Language Models

Xiangyu Zhang, Yuxin Li, Haoyang Zhang ... · arXiv

The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the definitive proxy for representation quality. This fosters the assumption that low-WER tokens inherently preserve the information necessary for intelligible acoustic synthesis. We argue this is fundamentally deceptive. While high-frequency tokens succeed in generation tasks due to implicit information leakage, isolating pure semantic information at ultra-low frame rates strips away the fine-grained articulation and micro-dynamics essential for ODE-based generation. Empirically validating this requires extreme compression without sacrificing WER -- a methodological bottleneck, as standard fixed-stride downsampling arbitrarily truncates phonetic boundaries. To overcome this, we develop a dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, achieving ultra-low frame rates with exceptionally low WER. Using these isolated "pure" semantic tokens, we expose the WER trap: when conditioning generative models -- even with oracle duration alignments -- the reconstructed speech suffers from severe articulation blur and is rendered acoustically unintelligible. Our findings demonstrate that semantic categorization rewarded by low WER is inherently orthogonal to the continuous phonetic trajectories required for synthesis, shattering the illusion of the unified token and advocating for explicitly decoupled speech representations.

Institutional Affiliations

Primary: The University of New South Wales

All Institutions: The University of New South Wales, Nanyang Technological University

ML Relevance Analysis (83)

The paper exposes a fundamental flaw in the assumption that low WER tokens can universally serve both speech understanding and generation. It rigorously demonstrates that while these tokens may excel in comprehension tasks, they fail to preserve the necessary micro-dynamics for intelligible speech synthesis, advocating for decoupled representations in future speech models.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel dynamic compression tokenizer that intelligently aligns representations with semantic boundaries, addressing the methodological bottleneck of fixed-stride downsampling that corrupts phonetic boundaries. This approach is innovative as it allows for extreme compression while maintaining low WER, enabling a rigorous evaluation of the unified token hypothesis through the Dual-Probing Protocol. The methodology is well-structured, leveraging existing frameworks while introducing significant improvements in tokenization for speech synthesis.
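
The bottleneck described above, fixed-stride windows cutting through phonetic units, can be made concrete with a toy comparison. This is an illustrative sketch only: the boundaries are supplied by hand here, whereas the paper's tokenizer is described as learning its alignment, and all function names and numbers are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def fixed_stride_segments(num_frames: int, stride: int) -> List[Segment]:
    """Split [0, num_frames) into consecutive windows of `stride` frames."""
    return [(s, min(s + stride, num_frames)) for s in range(0, num_frames, stride)]


def boundary_aligned_segments(num_frames: int, boundaries: List[int]) -> List[Segment]:
    """Use the unit boundaries themselves as window edges."""
    inner = sorted(b for b in boundaries if 0 < b < num_frames)
    edges = [0] + inner + [num_frames]
    return list(zip(edges[:-1], edges[1:]))


def straddling_segments(segments: List[Segment], boundaries: List[int]) -> int:
    """Count windows that contain a unit boundary strictly inside them."""
    return sum(any(start < b < end for b in boundaries) for start, end in segments)


if __name__ == "__main__":
    frames, unit_boundaries = 20, [3, 9, 14]  # made-up phone boundaries
    fixed = fixed_stride_segments(frames, stride=5)
    aligned = boundary_aligned_segments(frames, unit_boundaries)
    print(len(fixed), straddling_segments(fixed, unit_boundaries))
    print(len(aligned), straddling_segments(aligned, unit_boundaries))
```

In this toy case both schemes emit four tokens for twenty frames, but three of the four fixed-stride windows mix two adjacent units while none of the aligned windows do. Matching the compression rate while removing the straddling is what allows the comparison to isolate the effect of the token content itself.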

Experimental Evaluation

The experiments are comprehensive, utilizing large-scale multilingual datasets and employing a dual-probing protocol to assess both discriminative understanding and generative viability. The results demonstrate that while the dynamic tokens achieve high performance in understanding tasks, they fail in generating intelligible speech, effectively illustrating the WER trap. The evaluation metrics, including CER and AVQA accuracy, are appropriate and provide a clear picture of the model's performance.

Reproducibility

The paper provides detailed architectural specifications, hyperparameter configurations, and training methodologies, which enhance reproducibility. However, the absence of a public code repository limits the ease with which others can replicate the results. The thoroughness of the experimental setup and the clear delineation of methods contribute positively to reproducibility.

Limitations

The study acknowledges its limitations, particularly that the generative probe employs a single synthesis paradigm, which may not generalize across different architectures. Additionally, the focus on Mandarin as the sole language for evaluation may restrict the applicability of findings to other languages with different phonetic structures. The paper also notes that while it identifies a critical flaw in the unified token approach, it does not propose a concrete solution for decoupled representations.

Broader Impact

The findings have significant implications for the development of speech language models, challenging the prevailing assumption that a single token can suffice for both understanding and generation. This work advocates for a separation of semantic and acoustic representations, which could lead to more effective and intelligible speech synthesis systems. The insights gained from this research could influence future designs in multimodal AI systems, particularly in improving the quality of synthesized speech.

Analysis: Full Paper • Full text: 41,558 characters

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

S. Sutharya, Remya K. Sasi · arXiv

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

Institutional Affiliations

Primary: Cochin University of Science and Technology (CUSAT)

All Institutions: Cochin University of Science and Technology (CUSAT)

GitHub

ML Relevance Analysis (82)

The paper presents CAFNet, a novel architecture for audio deepfake detection that effectively addresses the challenges of ternary classification and temporal localization of half-truth audio. The methodology is sound, and the experimental results demonstrate significant advancements over existing models, particularly in a multilingual context.

Comprehensive Analysis

Methodology Assessment

The proposed CAFNet architecture is innovative in its approach to jointly address the challenges of ternary classification and temporal boundary localization for half-truth audio deepfake detection. The use of cross-attentive feature fusion and depthwise-separable convolutions enhances the model's ability to process multiple acoustic features effectively. The integration of BiLSTM for boundary prediction is a well-justified choice, given the temporal nature of the task. However, the paper could benefit from a more detailed discussion on the design choices for the architecture and the rationale behind the specific feature sets used.

Experimental Evaluation

The experiments are robust, utilizing a comprehensive dataset (MLADDC) that covers a diverse range of languages and audio conditions. The performance metrics reported, including accuracy, AUC, and MAE for boundary localization, are convincing and demonstrate the effectiveness of CAFNet compared to existing models. The cross-dataset generalization study adds significant value, revealing critical insights into the limitations of current training paradigms in deepfake detection.
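
The two boundary localisation statistics can be computed as in the sketch below. It assumes that the errors for start and end boundaries are pooled into one list, which this summary does not confirm; the function name and the example spans are hypothetical.

```python
from statistics import mean, median
from typing import List, Tuple

Span = Tuple[float, float]  # (start_seconds, end_seconds) of the synthesised region


def boundary_errors(predicted: List[Span], reference: List[Span]) -> List[float]:
    """Absolute error in seconds for every start and every end boundary."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    errors: List[float] = []
    for (p_start, p_end), (r_start, r_end) in zip(predicted, reference):
        errors.append(abs(p_start - r_start))
        errors.append(abs(p_end - r_end))
    return errors


if __name__ == "__main__":
    predicted = [(1.0, 2.0), (0.5, 1.5)]  # made-up model outputs
    reference = [(1.1, 2.0), (0.5, 1.2)]  # made-up ground truth
    errors = boundary_errors(predicted, reference)
    print("MAE:", mean(errors), "median:", median(errors))
```

A mean that sits above the median, as with the reported 0.075 s against 0.052 s, is consistent with most boundaries being located closely and a smaller number of larger misses pulling the average up.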

Reproducibility

The authors provide sufficient details regarding the implementation, including hyperparameters, training protocols, and the architecture of CAFNet. The availability of code and trained models on GitHub enhances reproducibility. However, the paper lacks detailed information on the specific datasets used for training and evaluation, which could hinder full reproducibility.

Limitations

One notable limitation is confusion between the half-truth and real classes: a significant number of half-truth samples are misclassified as real. The model therefore handles fully fake audio well but struggles to separate partially manipulated audio from genuine audio, which is the distinction that matters most for practical applications. Additionally, the study highlights the challenge of catastrophic forgetting during domain adaptation, suggesting that the current approach may not be robust across different datasets.

Broader Impact

The findings of this research have significant implications for audio forensics and the detection of manipulated media, especially in contexts where misinformation can have serious consequences. The ability to localize manipulations within audio clips enhances the forensic value of detection systems, making them more actionable for users. As deepfake technology continues to evolve, advancements in detection methods like CAFNet will be critical in maintaining trust in audio communications.

Analysis: Full Paper • Full text: 20,024 characters

FiPA-SR -- FiLM-Conditioned Perceptually Informed Audio Super-Resolution

Wallace Abreu, Luiz W. P. Biscainho · arXiv

Audio bandwidth extension aims to reconstruct missing high-frequency content from bandlimited signals. This paper proposes FiPA-SR, a GAN-based perceptual architecture capable of handling different input bandwidths within a single model. Building upon the previous $\textrm{AEROMamba}_\textrm{P}$ framework, the proposed model incorporates FiLM layers to adapt the reconstruction process according to the respective bandwidth. Experiments on the MUSDB dataset show that FiPA-SR outperforms the state-of-the-art AudioSR model across 8, 20, and 32 kHz input sampling rates. Moreover, the proposed architecture uses approximately 3$\times$ less GPU memory and performs inference more than 60$\times$ faster than the diffusion-based baseline.

Institutional Affiliations

Primary: PEE/COPPE, UFRJ

All Institutions: PEE/COPPE, UFRJ, Carlos Chagas Filho Foundation for Research Support in the State of Rio de Janeiro, National Council for Scientific and Technological Development, CAPES

ML Relevance Analysis (82)

This paper presents FiPA-SR, a GAN-based model for audio bandwidth extension, demonstrating significant improvements in reconstruction quality and computational efficiency. The innovative use of FiLM layers to adaptively handle multiple bandwidths marks a notable advancement in the field of audio super-resolution.

Comprehensive Analysis

Methodology Assessment

The methodology is robust, leveraging a GAN-based architecture with FiLM layers to adaptively handle different bandwidths. The use of perceptual metrics and a well-defined training procedure enhances the model's ability to generalize across various input configurations. The innovative approach of combining upsampling with conditional modulation through FiLM layers is a significant advancement over previous models.
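
The conditional modulation mentioned above follows the general FiLM recipe: each feature channel is scaled and shifted by parameters that depend on the conditioning signal, here the input bandwidth. The sketch below shows only that affine step. The lookup table stands in for a learned conditioning network, and every name and number in it is hypothetical rather than taken from FiPA-SR.

```python
from typing import Dict, List, Tuple

FeatureMap = List[List[float]]  # channels x time

# Hypothetical stand-in for a learned conditioning network:
# input sampling rate in kHz -> per-channel (gamma, beta).
FILM_PARAMS: Dict[int, Tuple[List[float], List[float]]] = {
    8: ([2.0, 0.5], [0.0, 1.0]),
    20: ([1.5, 0.8], [0.1, 0.5]),
    32: ([1.0, 1.0], [0.0, 0.0]),
}


def film(features: FeatureMap, gamma: List[float], beta: List[float]) -> FeatureMap:
    """Apply y[c][t] = gamma[c] * x[c][t] + beta[c] to every channel c."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("need one gamma and one beta per channel")
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two time steps
    gamma, beta = FILM_PARAMS[8]
    print(film(x, gamma, beta))
```

Because only gamma and beta change with the bandwidth, the same convolutional weights serve every input rate, which is how a single model can cover several configurations.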

Experimental Evaluation

The experiments are thorough, utilizing the MUSDB dataset and comparing against state-of-the-art models. The use of objective metrics like Log-Spectral Distance and ViSQOL provides a solid foundation for evaluating performance. However, the paper could benefit from more qualitative assessments, such as user studies or listening tests, to complement the objective metrics.
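
For readers unfamiliar with Log-Spectral Distance, one common formulation is sketched below: the root-mean-square difference of log power spectra across frequency, averaged over frames. Definitions vary between papers in the log base and in whether power or magnitude is used, so this should be read as an assumption rather than the exact metric used in the evaluation; the function name and the epsilon value are likewise hypothetical.

```python
import math
from typing import List

Spectrogram = List[List[float]]  # frames x frequency bins, magnitude values


def log_spectral_distance(
    reference: Spectrogram, estimate: Spectrogram, eps: float = 1e-10
) -> float:
    """Mean over frames of the RMS log10 power difference over frequency."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("spectrograms must be non-empty with equal frame counts")
    per_frame = []
    for ref_frame, est_frame in zip(reference, estimate):
        if len(ref_frame) != len(est_frame) or not ref_frame:
            raise ValueError("frames must be non-empty with equal bin counts")
        squared = [
            (math.log10(r * r + eps) - math.log10(e * e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        per_frame.append(math.sqrt(sum(squared) / len(squared)))
    return sum(per_frame) / len(per_frame)


if __name__ == "__main__":
    ref = [[1.0, 1.0], [1.0, 1.0]]
    est = [[10.0, 10.0], [1.0, 1.0]]  # first frame is 20 dB too loud
    print(log_spectral_distance(ref, est))
```

Lower values mean the reconstructed spectrum is closer to the reference. The metric says nothing about how audible the remaining difference is, which is why listening tests would complement it.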

Reproducibility

The paper provides sufficient details regarding the architecture, training setup, and evaluation metrics, which should enable other researchers to replicate the results. However, the absence of a publicly available code repository limits accessibility.

Limitations

The study is limited to specific bandwidth configurations and does not explore the model's performance across a broader range of frequencies. Additionally, while the results are promising, the reliance on objective metrics alone may not fully capture perceptual audio quality.

Broader Impact

The proposed model has significant implications for audio processing applications, particularly in telecommunications and music production, where bandwidth limitations are prevalent. The ability to reconstruct high-frequency content efficiently could enhance audio quality in various consumer and professional settings.

Analysis: Full Paper • Full text: 17,529 characters

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

Jiahao Mei, Heinrich Dinkel, Yadong Niu ... · arXiv

Audio generation has long been fragmented, with speech, music, and sound effects produced by domain-specific models that fail to jointly generate coherent audio scenes from a single description. The key obstacles are insufficient fine-grained supervision for real-world mixed audio and limited acoustic representations for modeling concurrent audio components. We present Dasheng AudioGen, a unified framework for generating general mixed-audio scenes from text. Dasheng AudioGen introduces structured multi-view captions, which explicitly decouple complex acoustic scenes into complementary description views, thereby enabling fine-grained control over audio layers. Furthermore, we employ a high-dimensional unified semantic-acoustic representation as the shared latent space. It injects semantic priors that facilitate cross-modal training convergence, while its high-dimensional feature space provides sufficient capacity to disentangle and fuse concurrent audio components effectively. With these designs, a simple flow-matching DiT achieves high-quality end-to-end audio scene generation. We also establish a comprehensive evaluation pipeline for audio scene generation. Experiments demonstrate that Dasheng AudioGen achieves performance approaching real-world recordings in mixed-audio categories, while remaining competitive with specialized models in single-type generation tasks. Demos are available at https://nieeim.github.io/Dasheng-AudioGen-Web/.

Institutional Affiliations

Primary: Shanghai Jiao Tong University

All Institutions: Shanghai Jiao Tong University, Xiaomi Inc.

Demo

ML Relevance Analysis (86)

Dasheng AudioGen represents a substantial advancement in unified audio generation, combining multiple audio types into coherent scenes from textual descriptions. The innovative methodology and comprehensive evaluation contribute significantly to the field, setting a new standard for future research in audio generation.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel framework, Dasheng AudioGen, which effectively integrates multiple audio generation tasks into a single model using structured multi-view captions and a unified semantic-acoustic representation. This approach addresses the fragmentation in audio generation by allowing for coherent mixed-audio scene generation from text, which is a significant advancement in the field. The methodology is well-structured, leveraging a flow-matching DiT architecture and a unique conditioning framework that enhances control over audio components. The use of high-dimensional latent spaces for audio representation is particularly innovative, as it allows for better modeling of overlapping audio elements.
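
The flow-matching objective that the DiT is trained with can be sketched in its simplest form. The example assumes a straight-line path between a noise sample and a data latent, which is one common choice; the path, time sampling, and conditioning actually used by Dasheng AudioGen are not specified in this summary, and all names below are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def flow_matching_pair(noise: Vector, data: Vector, t: float) -> Tuple[Vector, Vector]:
    """Return (x_t, target_velocity) on the straight path from noise to data."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(noise) != len(data):
        raise ValueError("noise and data must have the same length")
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    velocity = [d - n for n, d in zip(noise, data)]  # constant along a straight path
    return x_t, velocity


def flow_matching_loss(predicted_velocity: Vector, target_velocity: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    diffs = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    noise, data = [0.0, 2.0], [4.0, 2.0]  # made-up two-dimensional latents
    x_t, target = flow_matching_pair(noise, data, t=0.25)
    print(x_t, target, flow_matching_loss([3.0, 0.0], target))
```

During training the network receives x_t, t, and the text condition and is penalized for deviating from the target velocity. At inference, integrating the predicted velocity from t = 0 to t = 1 carries a noise sample to an audio latent.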

Experimental Evaluation

The experiments conducted are comprehensive, utilizing a large-scale dataset (ACAVCaps) and a robust evaluation pipeline that includes both objective and subjective metrics. The results demonstrate that Dasheng AudioGen outperforms existing specialized models in mixed-audio generation while maintaining competitive performance in single-type tasks. The introduction of the MECAT benchmark for mixed-audio evaluation is a valuable contribution, providing a new standard for assessing model performance in this area.

Reproducibility

The paper mentions limitations in reproducibility due to reliance on a private dataset, which may hinder others from replicating the results. However, the detailed methodology and experimental setup provide a clear path for future researchers to build upon this work. The authors should consider releasing their dataset or providing a public version to enhance reproducibility.

Limitations

Key limitations include the model's restriction to generating 10-second audio clips and the lack of advanced speaker control in TTS applications. Additionally, the performance in terms of speech intelligibility lags behind specialized TTS systems, indicating room for improvement. The reliance on a private dataset also poses challenges for reproducibility and broader accessibility.

Broader Impact

The implications of this work are significant, as it paves the way for more integrated audio generation systems that can produce realistic and contextually coherent audio scenes. This could have applications in various fields, including film production, gaming, virtual reality, and assistive technologies. The ability to generate complex audio scenes from simple text prompts could also enhance user experiences in interactive media.

Analysis: Full Paper • Full text: 50,026 characters

VoiceGiraffe: A Benchmark for Extreme Long-Context Audio-Language Understanding

Jashin Ye, Dongxiao Wang, Yixuan Ye ... · arXiv

While large audio language models (LALMs) have achieved remarkable progress in audio processing at the second- or minute-level scale, understanding hour-level audio remains a fundamental bottleneck. Existing benchmarks predominantly rely on short clips or artificially concatenated segments, failing to faithfully assess LALM capacity for long-range information comprehension in real-world scenarios such as podcasts and lengthy speeches. To address this gap, we introduce VoiceGiraffe, a novel benchmark designed to rigorously evaluate LALMs across diverse real-world scenarios, modalities, and languages under long-context settings. It comprises 1500 curated triplets structured into a dual-level taxonomy of single-hop perception and multi-hop reasoning. We evaluate a broad suite of open-source and proprietary LALMs against human performance. Results underscore three fundamental findings. First, VoiceGiraffe remains highly challenging and far from saturation. Second, we show that no single inference paradigm universally dominates: end-to-end (E2E) inference benefits models with native long-context audio understanding, cascaded caption aggregation stabilizes small models overwhelmed by hour-scale audio, and reasoning-enhanced cascading with an external LLM helps weaker models but can bottleneck stronger proprietary systems. Third, we reveal long-range memory persistence as a key bottleneck: LALMs are better at answering questions that require connecting salient causal cues than those requiring sustained tracking of sparse events across long audio, whereas humans show the opposite pattern. These findings position VoiceGiraffe as a challenging and diagnostic testbed for long-form audio understanding, highlighting the need for LALMs with persistent memory and robust long-range aggregation.

Institutional Affiliations

Primary: Future Living Lab, Alibaba

All Institutions: Future Living Lab, Alibaba

ML Relevance Analysis (86)

The paper presents VoiceGiraffe, a pioneering benchmark for evaluating hour-scale audio understanding in LALMs, addressing critical gaps in existing evaluation protocols. The comprehensive methodology and experimental results underscore the pressing need for advancements in long-context audio processing and reasoning, positioning this work as a significant contribution to the field.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel benchmark, VoiceGiraffe, designed specifically for evaluating large audio language models (LALMs) on long-context audio in realistic scenarios. The methodology is robust, employing a dual-level taxonomy for question generation that captures both single-hop perception and multi-hop reasoning tasks. The data curation process is thorough, involving a multi-stage pipeline that includes voice activity detection, hierarchical captioning, and collaborative verification by human annotators. This rigorous approach ensures high-quality data for evaluation, addressing the limitations of existing benchmarks that rely on short clips or concatenated segments.

Experimental Evaluation

The experimental evaluation is comprehensive, benchmarking a wide range of LALMs against human performance across various tasks and inference paradigms. The results reveal significant challenges in long-context understanding, with only one proprietary model surpassing human performance. The findings highlight the limitations of current models in memory persistence and reasoning capabilities, providing valuable insights into areas for future research. The use of multiple inference settings (E2E, cascaded caption aggregation, and reasoning-enhanced cascading) allows for a nuanced understanding of model performance.

Reproducibility

While the paper outlines a detailed methodology and experimental setup, it lacks specific implementation details or links to code repositories that would facilitate reproducibility. The absence of a project URL or demo limits the ability of other researchers to replicate the study or build upon the findings.

Limitations

The primary limitations include the lack of a publicly available dataset or benchmark for other researchers to use, which could hinder wider adoption and validation of the proposed methods. Additionally, the paper acknowledges that even human annotators found the tasks challenging, indicating that the benchmark may be too difficult for current models. There is also a potential bias in language performance, as the models exhibited varying capabilities across English and Chinese inputs.

Broader Impact

The introduction of VoiceGiraffe has the potential to significantly advance the field of audio-language understanding by providing a rigorous evaluation framework that addresses real-world challenges. This benchmark can guide future research towards developing models with improved long-context reasoning and memory capabilities, which are essential for applications in audio assistants, automated transcription, and multimedia content analysis.

Analysis: Full Paper • Full text: 41,471 characters

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Changhao Pan, Rui Yang, Han Wang ... · ACL 2026 (Findings)

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is indispensable for two reasons: 1) existing test scenarios are often confined to limited domains, creating a significant gap with the diverse downstream applications; 2) existing metrics overlook critical long-text factors such as consistency and coherence, failing to generalize reliably. To this end, we propose SwanBench-Speech, a comprehensive benchmark that decomposes long-form speech quality into specific, disentangled dimensions. SwanBench-Speech has three key properties. 1) Rich speech scenarios: Focusing on long-form speech generation and dialog generation, SwanBench-Speech covers acoustics, semantics, and expressiveness challenges, and consists of 1,101 samples spanning 17 common speech scenarios; 2) Comprehensive evaluation dimensions: Along the acoustics, semantics, and expressiveness axes, SwanBench-Speech defines an automated evaluation protocol with seven metrics to provide a comprehensive, accurate, and standardized assessment; 3) Valuable insights: Through extensive experiments, we reveal that current models still struggle in highly expressive scenarios and exhibit a notable gap in consistency and hierarchy compared to real recordings.

Institutional Affiliations

Primary: Zhejiang University

All Institutions: Zhejiang University, Bytedance

Demo

ML Relevance Analysis (83)

The paper presents a comprehensive benchmarking framework for long-form speech generation, addressing critical gaps in existing evaluation methodologies. Its innovative approach, rigorous methodology, and extensive experimental validation contribute significantly to the advancement of the field, providing a valuable resource for future research.

Comprehensive Analysis

Methodology Assessment

The paper introduces SwanBench-Speech, a comprehensive benchmark for evaluating long-form speech generation models. It effectively addresses the limitations of existing evaluation methods by proposing a multi-dimensional framework that includes seven disentangled metrics across three core challenges: acoustics, semantics, and expressiveness. The methodology is well-structured, with a clear focus on real-world applications and the incorporation of human-aligned metrics, which enhances the relevance of the evaluation. The use of diverse scenarios and a rigorous data collection process further strengthens the methodology.

Experimental Evaluation

The experiments are extensive, involving over 20 models evaluated across 1,101 samples in 17 scenarios. The results provide valuable insights into the performance gaps of current models compared to human recordings, particularly in expressiveness and consistency. The use of both objective metrics and human evaluations adds robustness to the findings. However, while the experiments are thorough, the paper could benefit from more detailed statistical analyses to quantify the significance of the results.

Reproducibility

The paper provides a clear description of the data collection and evaluation processes, along with the metrics used. The open-sourcing of the benchmark and the availability of evaluation scripts enhance reproducibility. However, the reliance on specific models for evaluation may limit the generalizability of the findings to other systems.

Limitations

The study acknowledges limitations, including a narrow linguistic scope (only Chinese and English) and a lack of robustness in assessing emotional and stylistic transitions. Additionally, the dataset's speaker diversity is limited, which may introduce bias in evaluations. Future work should address these gaps to enhance the benchmark's applicability.

Broader Impact

This work has significant implications for the field of speech synthesis, particularly in enhancing the evaluation of long-form speech generation systems. By establishing a standardized benchmark, it paves the way for future research and development in this area, potentially leading to more immersive and expressive speech synthesis applications. The focus on real-world scenarios and human-aligned metrics also suggests potential applications in education, entertainment, and customer service.

Analysis: Full Paper • Full text: 50,026 characters

EigeNet: Geometry-Informed Multi-Modal Learning for Few-shot Novel View RIR Prediction

Chong Jing, Zitong Lan, Junan Zhang ... · arXiv

Predicting spatially varying Room Impulse Responses (RIRs) from sparse observations is a critical but highly challenging inverse problem for immersive spatial audio rendering. In this work, we present EigeNet, a geometry-informed multi-modal framework for few-shot novel view RIR prediction. At its core is a Cross-view Alternate-attention Transformer that iteratively refines local intra-view acoustic structures and global cross-view spatial relationships. We empirically demonstrate that this architecture is capable of making full use of the multi-view multi-modal context while performing spatial-temporal reasoning for RIR prediction. Inspired by acoustic ray tracing, we design a geometry-informed modulation block to formulate the connection between geometric features and the RIR power spectrum. In addition, an auxiliary loss is introduced to transform the single-target waveform prediction into a multi-task learning framework. Through ablation studies, we demonstrate that this design yields consistent performance gains regardless of the underlying backbone, thereby confirming its foundational utility and architecture-agnostic generalizability for the RIR prediction task. Evaluated on both simulated and real-world benchmarks, EigeNet achieves state-of-the-art performance in both few-shot novel view RIR prediction and sim-to-real generalization. Code and checkpoints are available at https://github.com/FEAfeatherTHER/EigeNet.

Institutional Affiliations

Primary: University of Pennsylvania

All Institutions: University of Pennsylvania, The Chinese University of Hong Kong

GitHub

ML Relevance Analysis (83)

This paper presents EigeNet, a novel geometry-informed multi-modal learning framework that significantly advances few-shot novel view RIR prediction through innovative architectural designs and empirical validation. The comprehensive approach to integrating geometric features with acoustic modeling represents a meaningful contribution to the field of spatial audio rendering.

Comprehensive Analysis

Methodology Assessment

The proposed methodology introduces a Cross-view Alternate-attention Transformer (CVAT) that effectively captures both local intra-view and global cross-view relationships, addressing the challenges of few-shot Room Impulse Response (RIR) prediction. The integration of a geometry-informed modulation block enhances the model's ability to leverage geometric features, which is a significant advancement over existing methods. The auxiliary loss for multi-task learning further strengthens the model's performance by promoting generalizability across different architectures.

Experimental Evaluation

The experiments are robust, utilizing both simulated and real-world datasets, and demonstrate state-of-the-art performance across various metrics. The ablation studies provide clear evidence of the contributions of each component, validating the effectiveness of the proposed architecture. The quantitative results indicate substantial improvements over baseline methods, particularly in sparse reference scenarios.

Reproducibility

The paper provides sufficient implementation details, including architecture specifications and training configurations, which should facilitate reproducibility. The availability of code and checkpoints on GitHub enhances this aspect, although specific hyperparameters and training procedures could be elaborated further for clarity.

Limitations

While the model shows impressive performance, it may still be limited by the quality of the input data and the assumptions made regarding room geometry. The reliance on geometric features may not generalize well to all acoustic environments, particularly those with complex or unconventional geometries.

Broader Impact

The advancements in few-shot learning for RIR prediction have significant implications for immersive audio applications in AR/VR and spatial audio rendering, potentially enhancing user experiences in virtual environments. The methodology could inspire further research into integrating geometric and acoustic modeling in other domains.

Analysis: Full Paper • Full text: 42,257 characters

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Zhisheng Zhang, Xiang Li, Yixuan Zhou ... · arXiv

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in high-dimensional continuous latents, which increases the modeling burden of Diffusion Transformers (DiTs) for generation. We propose LoSATok, a low-dimensional audio tokenizer for cross-domain audio understanding and generation. Motivated by the observation that 1280-dimensional semantic encoder features are compressible, we introduce a Semantic Bottleneck (SemBo) that compresses them into 128 dimensions, regularized by the proposed time-relation loss for temporal feature consistency. We further design a dual-level semantic supervision method that leverages both high- and low-dimensional semantic signals, enabling the tokenizer to jointly capture semantics and acoustic details within a compact latent space. Experiments on speech, music, and general audio show that SemBo preserves strong low-dimensional semantic capacity and LoSATok retains competitive understanding performance compared with several semantic representations, while consistently improving DiT modeling performance on speech, music, and audio generation. These results demonstrate that LoSATok's low-dimensional representations can effectively support audio understanding and generation. Our code is provided at https://github.com/wxzyd123/LoSATok.
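The abstract gives the bottleneck dimensions (1280 to 128) and names a time-relation loss for temporal feature consistency, but does not define the loss. As a rough illustration, the sketch below assumes one plausible form: matching the frame-by-frame cosine-similarity matrix of the compressed features to that of the original features. The loss form, the random linear projection, and all function names are hypothetical and may differ from the released LoSATok code.

```python
import numpy as np

def cosine_similarity_matrix(feats):
    """Frame-by-frame cosine similarity; feats has shape (frames, dim)."""
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-8)      # guard against zero-norm frames
    return unit @ unit.T

def time_relation_loss(high, low):
    """ASSUMED loss form: MSE between the two temporal similarity matrices."""
    diff = cosine_similarity_matrix(high) - cosine_similarity_matrix(low)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
high = rng.standard_normal((50, 1280))                    # toy 1280-dim encoder features
proj = rng.standard_normal((1280, 128)) / np.sqrt(1280)   # hypothetical linear bottleneck
low = high @ proj                                         # compressed 128-dim features
loss = time_relation_loss(high, low)
```

Because the loss compares frames-by-frames matrices, it is well defined even though the two feature sets have different dimensionality, which is the property a bottleneck regularizer of this kind needs.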

Institutional Affiliations

Primary: Shenzhen International Graduate School, Tsinghua University

All Institutions: Shenzhen International Graduate School, Tsinghua University, ModelBest Inc.

GitHub

ML Relevance Analysis (83)

The paper presents LoSATok, a unified low-dimensional tokenizer that enhances audio understanding and generation by effectively compressing high-dimensional semantic representations while preserving essential acoustic details. The methodology and results demonstrate its potential to significantly impact the field of audio processing and generation.

Comprehensive Analysis

Methodology Assessment

The paper introduces a novel low-dimensional audio tokenizer, LoSATok, which effectively compresses high-dimensional semantic representations while maintaining semantic richness and acoustic details. The methodology includes the Semantic Bottleneck (SemBo) for dimensionality reduction, and a dual-level semantic supervision strategy that enhances the learning process. The proposed time-relation loss is a significant innovation that ensures temporal consistency in the representations. Overall, the methodology is well-structured and addresses a critical gap in current audio modeling approaches.

Experimental Evaluation

The experiments are comprehensive, covering various audio tasks across speech, music, and general audio domains. The results demonstrate that LoSATok achieves competitive performance in understanding tasks and outperforms existing models in generation tasks, particularly in terms of efficiency and quality. The use of objective metrics (e.g., FAD, CLAP) alongside subjective evaluations strengthens the findings. However, the paper could benefit from more extensive comparisons with state-of-the-art methods in a broader range of tasks.

Reproducibility

The paper provides a GitHub repository with the code, which is essential for reproducibility. However, specific implementation details, such as hyperparameter choices and training setups, could be more clearly outlined to facilitate replication by other researchers.

Limitations

The authors acknowledge that LoSATok sacrifices some reconstruction fidelity for improved semantic organization and generative performance. Additionally, while it shows promise in understanding tasks, it does not fully reach the performance of high-dimensional semantic representations. Future work is needed to optimize the balance between semantics, acoustics, and generation.

Broader Impact

The proposed tokenizer has significant implications for audio understanding and generation, potentially enhancing applications in speech recognition, music generation, and audio synthesis. By enabling more efficient models, it could lead to advancements in real-time audio processing and interactive applications. The research also opens avenues for further exploration of low-dimensional representations in multimodal contexts.

Analysis: Full Paper • Full text: 50,026 characters

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Xin Cheng ... · arXiv

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

Institutional Affiliations

Primary: Renmin University of China

All Institutions: Renmin University of China

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of PlanAudio, a unified framework for generating complex audio compositions from free-form text prompts, which significantly advances the state-of-the-art in audio synthesis by integrating semantic understanding with acoustic generation. The methodology is innovative, the experiments are rigorous, and the potential applications are broad, marking a meaningful contribution to the field of machine learning and audio generation.

Comprehensive Analysis

Methodology Assessment

The proposed methodology, PlanAudio, introduces a novel framework for generating unified audio from free-form text prompts, leveraging an autoregressive LLM architecture and a semantic latent Chain-of-Thought (CoT) mechanism. This approach is innovative as it avoids traditional text encoders and explicit text rewriting, which are common in existing models. The integration of semantic planning in the latent space before audio synthesis is a significant advancement, allowing for better alignment between high-level semantics and low-level audio generation. The methodology is well-structured, with clear phases for semantic planning and acoustic generation, which enhances the model's ability to produce coherent audio outputs.

Experimental Evaluation

The experiments are comprehensive, evaluating PlanAudio across multiple scenarios (sound, speech, and composite) using both objective metrics (FAD, KL divergence, WER) and subjective assessments (human ratings on acoustic quality, temporal correctness, etc.). The results demonstrate that PlanAudio outperforms existing pipeline and unified models, showcasing its versatility and effectiveness. The creation of PlanAudio-Bench as a specialized benchmark for composite audio scenarios adds value to the evaluation process, providing a structured way to assess the model's performance in real-world applications.
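Of the objective metrics listed, word error rate (WER) is the simplest to state exactly: the word-level edit distance between the reference transcript and the recognized transcript, divided by the number of reference words. The sketch below is a generic implementation for illustration; it is not the evaluation script used in the paper, and the example sentences are made up.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# One word dropped from a six-word reference gives a WER of 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

For generated speech, the hypothesis is typically the output of an ASR system run on the synthesized audio, so the score reflects both synthesis intelligibility and recognizer errors.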

Reproducibility

The paper provides detailed implementation details, including the datasets used, training procedures, and evaluation metrics. However, the lack of a publicly available demo or project URL limits the reproducibility of the results. While the methodology is clearly described, access to the code and trained models would enhance the ability of other researchers to replicate the findings.

Limitations

One limitation is the potential for the model to struggle with highly complex prompts that require intricate audio interactions, as indicated by the slight performance drop in speech generation compared to specialized models. Additionally, the reliance on the quality of the training data and the inherent challenges in synthesizing audio from free-form text prompts may introduce variability in performance across different contexts.

Broader Impact

The implications of this research are significant for various applications, including content creation, game development, and assistive technologies for individuals with speech impairments. By enabling the generation of coherent audio from natural language prompts, this work could facilitate new forms of human-computer interaction and enhance multimedia experiences.

Analysis: Full Paper • Full text: 26,183 characters

DEMON: Diffusion Engine for Musical Orchestrated Noise

Ryan Fosdick · arXiv

We present DEMON, a real-time diffusion engine that makes the denoising process playable as a live musical instrument: a control surface both broad (many parameters shaped per-frame across the output) and responsive (each control taking effect as fast as its place in the denoising loop allows). Built on ACE-Step 1.5 and StreamDiffusion's ring-buffer architecture with TensorRT acceleration, it sustains up to 12.3 decoder completions per second for 60-second music on a single consumer GPU (RTX 5090), or 11.3 generations per second at our production ring-depth of 4. At these rates denoising parameters become viable as live performance controls, but the ring buffer propagates per-request changes only at its drain rate, a floor of S denoising steps. We contribute four mechanisms. (1) Per-slot heterogeneous denoise scheduling: each ring-buffer slot owns its timestep schedule, so a moving denoise slider is tracked without wiping the in-flight queue, where the upstream global-schedule design must rebuild and discard it. (2) Shared mutable per-step state, giving any parameter consulted at every solver step next-tick effect, bypassing ring-buffer drain. (3) Per-frame source blending: a sampling-time control on the standard SDE re-noise step, giving a framewise transformation-strength axis that complements scalar denoise scheduling. (4) Windowed VAE decode exploiting receptive-field analysis for an 8.0x decode speedup. Together these separate streaming-diffusion parameters into four propagation classes, by onset and convergence latency.
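Mechanism (1) can be pictured with a toy ring buffer in which every slot stores its own timestep schedule, so moving the denoise slider only changes the schedule given to newly admitted requests. The sketch below is a simplified model under stated assumptions: the class, the `make_schedule` mapping, and the one-request-per-tick admission rule are hypothetical, and no actual diffusion model or DEMON code is involved.

```python
def make_schedule(denoise, max_steps=4):
    """Hypothetical mapping from denoise strength in (0, 1] to a timestep list."""
    steps = max(1, round(denoise * max_steps))
    return [denoise * (steps - i) / steps for i in range(steps)]

class PerSlotRing:
    """Toy ring buffer in which every slot carries its own schedule."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = []        # in-flight requests
        self.completed = []    # denoise strength of each finished request
        self.discarded = 0     # stays 0: slider moves never flush the queue

    def tick(self, denoise):
        # Admit one new request, scheduled from the slider value at admission.
        if len(self.slots) < self.depth:
            self.slots.append({"denoise": denoise,
                               "schedule": make_schedule(denoise),
                               "pos": 0})
        # Advance every in-flight request by one step of its OWN schedule.
        for slot in self.slots:
            slot["pos"] += 1   # a real engine would run one solver step here
        # Retire requests whose schedule is exhausted.
        done = [s for s in self.slots if s["pos"] >= len(s["schedule"])]
        self.slots = [s for s in self.slots if s["pos"] < len(s["schedule"])]
        self.completed.extend(s["denoise"] for s in done)

ring = PerSlotRing(depth=4)
for slider in [1.0, 1.0, 0.5, 0.5, 0.5]:   # slider moves after the second tick
    ring.tick(slider)
```

In this run the two requests admitted at strength 1.0 finish on their original four-step schedules after the slider drops to 0.5, while the shorter 0.5 requests complete alongside them. A single global schedule would instead have to rebuild and discard the in-flight queue when the slider moves, which is the behavior the abstract attributes to the upstream design.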

Institutional Affiliations

Primary: Daydream

All Institutions: Daydream

Demo · GitHub

ML Relevance Analysis (82)

The main contribution of this paper is the introduction of DEMON, a real-time diffusion engine that allows for interactive control of audio generation, significantly enhancing the responsiveness and flexibility of music production tools. The technical contributions are robust, addressing key challenges in real-time audio processing and demonstrating a clear advancement in the field of machine learning for audio.

Comprehensive Analysis

Methodology Assessment

The methodology presented in the paper is innovative, leveraging a real-time diffusion engine that transforms the denoising process into a playable musical instrument. The authors introduce several mechanisms that enhance the responsiveness and control of audio generation, including per-slot heterogeneous denoise scheduling, shared mutable per-step state, per-frame source blending, and a windowed VAE decode. These contributions are well-structured and address significant challenges in real-time audio generation, particularly in maintaining high throughput while allowing for fine-grained control over audio parameters.

Experimental Evaluation

The experimental evaluation is thorough, with a focus on latency, output quality, and responsiveness of parameter changes. The authors provide empirical results that substantiate their claims regarding the effectiveness of their proposed mechanisms, including quantitative comparisons with existing systems. The use of various audio sources and the detailed reporting of metrics such as CLAP and SNR demonstrate a rigorous approach to validating the system's performance.

Reproducibility

The paper includes sufficient detail regarding the architecture and implementation of the DEMON system, including the use of TensorRT for acceleration and the specific configurations used for experiments. However, the absence of a detailed description of the datasets and the evaluation metrics used may pose challenges for complete reproducibility. The provided URLs for the project and demo enhance accessibility to the code and results.

Limitations

One limitation of the paper is the reliance on a specific hardware setup (NVIDIA RTX 5090) for performance metrics, which may not generalize across different systems. Additionally, while the authors address the latency of their system, the practical implications of the onset latency in live performance contexts could be further explored. The paper does not discuss potential limitations in the quality of audio generated under varying conditions or the scalability of the system.

Broader Impact

The work has significant implications for the fields of music generation and real-time audio processing, particularly for live performances. By enabling musicians to manipulate denoising parameters in real-time, DEMON opens up new avenues for creative expression and interaction with AI-generated music. The integration of machine learning into musical instruments could lead to innovative performance practices and new genres of music.

Analysis: Full Paper • Full text: 50,026 characters

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Yucheng Wang, Jing Peng, Hanqi Li ... · arXiv

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use to determining when agentic evidence acquisition genuinely benefits audio understanding. We propose Audio-Mind, an auditable and pluggable framework for conditional evidence acquisition in audio understanding. Audio-Mind dynamically combines a strong frontend with planner-guided tool use, preserving frontend judgment when initial evidence is sufficient while acquiring bounded external evidence for questions with unresolved evidence gaps. Experiments on MMAR and MSU-Bench show that Audio-Mind outperforms prior audio-agent baselines, reaching 80.4% accuracy on MMAR and 82.8% accuracy on MSU-Bench. A matched-backbone comparison highlights why this design matters: under strong audio frontends, agentic decomposition can become an orchestration bottleneck when the workflow does not preserve the frontend's holistic audio-grounded judgment. Beyond accuracy, Audio-Mind produces higher-quality, auditable reasoning traces that expose uncertainty, tool evidence, and answer rationales, offering a potential basis for more reliable audio-QA annotation and error analysis.
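The conditional evidence acquisition described above can be sketched as a small control loop: keep the frontend's answer when the planner sees no evidence gap, otherwise call a bounded number of tools and log every step. All names here (`answer_with_conditional_evidence`, `find_gap`, `revise`, the `max_tool_calls` budget) are hypothetical placeholders; the abstract does not specify Audio-Mind's actual interfaces.

```python
from typing import Callable, Dict, List, Optional, Tuple

Answer = Tuple[str, float]  # (answer text, confidence in [0, 1]); an assumed shape


def answer_with_conditional_evidence(
    question: str,
    frontend: Callable[[str], Answer],
    find_gap: Callable[[str, Answer, List[str]], Optional[str]],
    tools: Dict[str, Callable[[str], str]],
    revise: Callable[[str, Answer, List[str]], Answer],
    max_tool_calls: int = 2,
) -> Tuple[str, List[str]]:
    """Return the final answer plus a human-readable trace of every decision."""
    trace: List[str] = []
    current = frontend(question)
    trace.append(f"frontend -> {current[0]!r} (confidence {current[1]:.2f})")
    evidence: List[str] = []
    for _ in range(max_tool_calls):  # bounded external evidence
        tool_name = find_gap(question, current, evidence)
        if tool_name is None:
            # No unresolved gap: preserve the current judgment as-is.
            trace.append("planner -> evidence sufficient, keeping current answer")
            break
        result = tools[tool_name](question)
        evidence.append(result)
        trace.append(f"tool {tool_name} -> {result!r}")
        current = revise(question, current, evidence)
        trace.append(f"revised -> {current[0]!r} (confidence {current[1]:.2f})")
    return current[0], trace
```

The trace list is what makes the loop auditable in this sketch: every tool call and every decision to stop is recorded alongside the answer.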

Institutional Affiliations

Primary: unknown

All Institutions: unknown

ML Relevance Analysis (75)

The main contribution of this paper is the introduction of the Audio-Mind framework, which enhances audio understanding through dynamic evidence acquisition and improved reasoning processes. This work is significant as it addresses key challenges in the field and proposes a method that could lead to more reliable audio question answering systems.

Comprehensive Analysis

Methodology Assessment

The proposed Audio-Mind framework introduces a novel approach to audio understanding by integrating a strong frontend with planner-guided tool use. This method allows for dynamic evidence acquisition, which is a significant improvement over existing audio-agent baselines. The framework's ability to preserve the frontend's judgment while addressing evidence gaps is a noteworthy contribution to the field, as it enhances the overall reasoning process in audio question answering.

Experimental Evaluation

The experiments conducted on MMAR and MSU-Bench demonstrate the effectiveness of Audio-Mind, achieving impressive accuracy scores of 80.4% and 82.8%, respectively. The matched-backbone comparison further validates the framework's design by highlighting the orchestration bottleneck in agentic decomposition under strong audio frontends. However, the paper lacks detailed descriptions of the datasets and evaluation metrics used, which could enhance the transparency and reproducibility of the results.

Reproducibility

The paper does not provide sufficient implementation details or code availability, which raises concerns about reproducibility. Without access to the framework or clear guidelines on how to replicate the experiments, it is challenging for other researchers to validate the findings.

Limitations

One limitation is the potential complexity introduced by the planner-guided tool use, which may not generalize well to all audio understanding tasks. Additionally, the framework's reliance on strong frontends could limit its applicability in scenarios where such models are not available.

Broader Impact

The Audio-Mind framework has the potential to significantly impact the field of audio understanding and question answering by providing a more reliable and auditable reasoning process. Its contributions could lead to advancements in audio-QA annotation and error analysis, making it a valuable tool for researchers and practitioners in the domain.

Analysis: Full Paper • Full text: 720 characters

CFMDCTCodec: A Low-Bitrate Neural Speech Codec with Noise-Prior-aware Conditional Flow Matching for MDCT-Spectral Enhancement

Xiao-Hang Jiang, Yang Ai, Hui-Peng Du ... · IEEE Transactions on Audio, Speech and Language Processing

High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge, we propose CFMDCTCodec, a low-bitrate neural speech codec that operates entirely in the modified discrete cosine transform (MDCT) domain. CFMDCTCodec integrates a lightweight encoder-quantizer-decoder-style MDCT-spectral codec with a noise-prior-aware, conditional-flow-matching (CFM)-based MDCT-spectral enhancer. Within this framework, the codec serves as a base module that compactly discretizes the MDCT spectrum extracted from speech and produces an initial coarse reconstruction, while the enhancer further restores fine-grained spectral details. The enhancer improves the decoded MDCT spectrum by integrating a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver, under the guidance of an MDCT-derived magnitude-adaptive noise prior, aiming to emphasize perceptually significant high-energy regions while stabilizing low-energy and silent regions. Finally, the enhanced MDCT spectrum is reconstructed into the decoded speech using the inverse MDCT. When optimizing CFMDCTCodec, we adopt a unified non-adversarial training strategy that jointly combines reconstruction, quantization and CFM objectives. Both objective and subjective evaluations show that CFMDCTCodec outperforms competitive baselines in low-bitrate regimes, e.g., 0.65 kbps, while approaching the perceptual quality of large-scale codecs with significantly fewer parameters and computations.
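To make the enhancer's sampling step more concrete, here is a minimal sketch of flow-matching inference with a fixed-step Euler ODE solver. The `magnitude_adaptive_noise` prior (noise standard deviation growing with the coarse bin magnitude, with a small floor for silent bins) is only a guess at what a magnitude-adaptive prior might look like, and `toy_velocity` is a closed-form stand-in for the trained velocity-field network; neither is taken from the paper.

```python
import random
from typing import Callable, List

Vector = List[float]


def magnitude_adaptive_noise(
    coarse: Vector, rng: random.Random, floor: float = 0.01, scale: float = 0.5
) -> Vector:
    """Hypothetical prior: larger noise in high-energy bins, near-zero in silence."""
    return [rng.gauss(0.0, floor + scale * abs(c)) for c in coarse]


def euler_solve(
    x0: Vector, velocity: Callable[[Vector, float], Vector], num_steps: int
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a trained network: the straight-line field that reaches `target` at t = 1."""
    def v(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return v


rng = random.Random(0)
coarse = [0.0, 1.0, -2.0]          # coarse decoded MDCT bins (toy values)
fine = [0.1, 0.9, -1.8]            # what an ideal enhancer would recover (toy values)
x0 = magnitude_adaptive_noise(coarse, rng)
enhanced = euler_solve(x0, toy_velocity(fine), num_steps=8)
```

With the straight-line toy field, Euler integration lands on the target regardless of the starting noise; a learned field would only approximate this, which is where the choice of prior and step count starts to matter.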

Institutional Affiliations

Primary: University of Science and Technology of China

All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, Tsinghua University

Demo

ML Relevance Analysis (83)

The main contribution of this paper is the development of CFMDCTCodec, a low-bitrate neural speech codec that effectively enhances spectral quality through a novel conditional flow matching approach, demonstrating significant improvements in speech quality while maintaining low computational complexity. This work represents a meaningful advancement in the field of speech coding, particularly for applications requiring efficient bandwidth usage without compromising audio fidelity.

Comprehensive Analysis

Methodology Assessment

The proposed CFMDCTCodec introduces a novel architecture for low-bitrate speech coding that operates entirely in the MDCT domain, integrating a single-codebook quantization strategy with a noise-prior-aware conditional flow matching (CFM) enhancement mechanism. This approach effectively addresses the limitations of existing codecs by enhancing the spectral quality of decoded speech without increasing bitrate, utilizing a joint training strategy that simplifies the learning process. The methodology is well-structured, with clear descriptions of the encoder, decoder, and enhancer components, and the pairing of a conditional MDCT velocity-field filter with an ordinary differential equation (ODE) solver in the enhancer is particularly innovative.

Experimental Evaluation

The experimental setup is robust, utilizing two different speech corpora and multiple bitrate settings to evaluate the codec's performance. The paper provides both objective and subjective evaluation metrics, including MUSHRA tests and various objective measures (STOI, SI-SDR, etc.), which demonstrate the codec's superiority over competitive baselines. The results indicate significant improvements in speech quality at low bitrates, validating the effectiveness of the proposed enhancements.

Reproducibility

The paper includes detailed descriptions of the experimental setup, including hyperparameters, training configurations, and evaluation metrics, which facilitate reproducibility. However, the absence of a publicly available code repository limits the ease of replication for other researchers.

Limitations

One limitation is the reliance on a single-codebook quantization strategy, which may not capture the full diversity of speech signals as effectively as multi-codebook approaches. Additionally, while the results are promising, further testing across a wider range of speech datasets and real-world scenarios would strengthen the findings.

Broader Impact

The CFMDCTCodec has significant potential applications in bandwidth-constrained environments such as satellite communications, teleconferencing, and mobile applications, where high-quality speech transmission is critical. Its lightweight design and efficient processing could facilitate broader adoption in various speech processing applications, contributing to advancements in telecommunications and accessibility technologies.

Analysis: Full Paper • Full text: 50,026 characters

OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants

Xudong Lu, Xueying Li, Annan Wang ... · arXiv

We introduce OmniInteract, a streaming benchmark for real-time omnimodal large language models evaluated through native online inference over audio-visual streams. Unlike offline video understanding or text-prompted streaming QA, OmniInteract preserves the original audio-visual stream and requires models to process it online, without access to future content. User queries and ambient sounds are embedded in the audio track, requiring models to detect multimodal triggers, decide when to respond, and answer while the stream unfolds. OmniInteract contains 250 videos with 1,430 temporally grounded response slots: 1,062 1Q1A slots across real-time, proactive, and nested scenarios, and 368 1QnA slots for continuous task monitoring and step guidance. Each slot includes a trigger, response window, and target answer. We evaluate response correctness, timing, invalid outputs, interruption handling, and context continuity using Interaction-Aware Quality-Timeliness F1, Interruption Diagnostic Suite, and Nested Chain Completion Score. Experiments show that current models remain weak in streaming interaction, with the best overall IA-QTF1 reaching only 0.368 and the best 1QnA IA-QTF1 only 0.052. Further study on mathematical reasoning in full-duplex settings shows that offline capability does not necessarily transfer to online interaction. Code and datasets will be made publicly accessible at https://github.com/Lucky-Lance/OmniInteract.
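The slot structure described above (a trigger, a response window, and a target answer) can be pictured as a small record type. The field names and the `is_timely` helper below are hypothetical and do not reproduce the benchmark's actual schema or the IA-QTF1 metric; the slot counts are the ones reported in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseSlot:
    trigger_time: float   # seconds into the stream where the trigger occurs
    window_start: float   # earliest acceptable response time (seconds)
    window_end: float     # latest acceptable response time (seconds)
    target_answer: str
    kind: str             # "1Q1A" or "1QnA"

    def is_timely(self, response_time: float) -> bool:
        """True when a response lands inside this slot's response window."""
        return self.window_start <= response_time <= self.window_end


# Slot counts as reported in the abstract.
SLOT_COUNTS = {"1Q1A": 1062, "1QnA": 368}
TOTAL_SLOTS = sum(SLOT_COUNTS.values())

slot = ResponseSlot(
    trigger_time=12.0,
    window_start=12.0,
    window_end=17.0,
    target_answer="The kettle is boiling.",
    kind="1Q1A",
)
```

The two splits sum to the 1,430 slots stated in the abstract, and a response at 14.5 s would count as timely for the example slot while one at 20 s would not.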

Institutional Affiliations

Primary: CUHK MMLab

All Institutions: CUHK MMLab, SJTU, NTU, McMaster, CityUHK, JUFE

GitHub

ML Relevance Analysis (83)

The paper presents OmniInteract, a benchmark for evaluating omnimodal large language models in real-time audio-visual interactions, significantly advancing the assessment of AI capabilities in dynamic environments. The innovative methodology and comprehensive experimental evaluations highlight critical gaps in current models, paving the way for future research and development in this area.

Comprehensive Analysis

Methodology Assessment

The methodology introduces a novel interaction slot formulation that captures real-time, multimodal interactions in a continuous audio-visual stream. This approach is innovative as it shifts the evaluation paradigm from static question-answer pairs to dynamic, temporally grounded interactions, allowing for a more realistic assessment of model capabilities in real-time settings. The proposed metrics (IA-QTF1, IDS, NCCS) are well-defined and tailored to the unique challenges of streaming interactions, effectively measuring not just correctness but also timing and context management.

Experimental Evaluation

The experiments are comprehensive, evaluating multiple state-of-the-art omnimodal models under the new benchmark. The results reveal significant gaps in current models' abilities to handle real-time interactions, particularly in continuous task monitoring and nested query scenarios. The use of a diverse dataset of 250 videos with 1,430 response slots provides a solid foundation for the evaluations, although the performance scores indicate that there is considerable room for improvement in the models tested.

Reproducibility

The paper mentions that the code and datasets will be made publicly accessible, which is crucial for reproducibility. However, details on the exact implementation of the models tested and the specific evaluation protocols could be elaborated upon to enhance reproducibility further.

Limitations

The paper acknowledges limitations such as the narrow focus on specific interaction types and the reliance on synthesized speech for the 1QnA split. Additionally, the benchmark currently covers only Chinese and English scenarios, which may limit its applicability across different languages and cultures. The analysis is also limited to a small number of models, which may not represent the full landscape of omnimodal systems.

Broader Impact

The introduction of OmniInteract has the potential to significantly advance the field of real-time human-AI interaction by providing a standardized benchmark for evaluating omnimodal models. This can lead to improved AI assistants that are more capable of understanding and responding to user queries in real-time, enhancing applications in accessibility, education, and everyday tasks. The focus on real-time interaction also raises important considerations regarding privacy and the ethical deployment of always-on systems.

Analysis: Full Paper • Full text: 50,026 characters

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Bowen Li, Shaotong Guo, Zhen Wang ... · arXiv

Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
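For readers less familiar with the reported metrics, WER and CER are edit-distance ratios between a reference text and a transcript of the synthesized speech. The sketch below shows the standard definition only; it assumes whitespace tokenization for WER and ignores the ASR model and text normalization that a benchmark such as Seed-TTS Eval would apply before scoring.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, with spaces removed before comparison."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, list(hypothesis.replace(" ", ""))) / len(ref_chars)
```

For example, scoring the hypothesis "the cat sit on the mat" against the reference "the cat sat on mat" involves one substitution and one insertion over five reference words.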

Institutional Affiliations

Primary: Amap, Alibaba Group

All Institutions: Amap, Alibaba Group, The Chinese University of Hong Kong, Shenzhen

GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of PilotTTS, a lightweight and competitive TTS system that leverages rigorous data engineering and a disciplined modular architecture to achieve state-of-the-art performance with significantly less training data than existing systems. This work is significant as it addresses the barriers faced by resource-constrained teams in the field of speech synthesis, providing a practical solution that maintains high performance while promoting reproducibility and accessibility.

Comprehensive Analysis

Methodology Assessment

The methodology is robust, featuring a well-structured multi-stage data processing pipeline that enhances data quality and a compact autoregressive architecture that effectively decouples speaker identity from style. The use of Q-Former-based conditioning and cross-sample paired training is innovative and addresses common challenges in TTS systems.

Experimental Evaluation

The experiments are comprehensive, utilizing the Seed-TTS Eval benchmark to demonstrate superior performance in terms of WER, CER, and speaker similarity. The inclusion of human evaluations for emotion control and paralinguistic synthesis adds depth to the assessment of the system's capabilities.

Reproducibility

The paper emphasizes reproducibility by providing a complete data processing pipeline built from publicly available tools, along with pretrained weights and code. This transparency enhances the likelihood of other researchers replicating the results.

Limitations

The paper acknowledges limitations such as insufficient explicit style modeling and the constraints of single-codebook quantization, which may hinder performance in more complex scenarios. Additionally, the reliance on mel-spectrograms could introduce reconstruction artifacts.

Broader Impact

The potential applications of PilotTTS are significant, particularly for resource-constrained teams seeking to develop competitive TTS systems. Its modular approach and open-source nature could democratize access to high-quality speech synthesis technology.

Analysis: Full Paper • Full text: 33,619 characters

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Yang Xiao, Siyi Wang, Han Yin ... · arXiv

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, and the underlying mechanisms of representation and retrieval are still unclear. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and identify the root causes of failures at the representation (i.e., latent embeddings) and retrieval levels (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode, while showing that attention allocation plays a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, shedding light on future data and training design for robust acoustic memory modeling.

Institutional Affiliations

Primary: The University of Melbourne

All Institutions: The University of Melbourne, The University of Auckland, UNSW Sydney, KAIST

ML Relevance Analysis (83)

The paper provides a systematic investigation into the mechanisms underlying acoustic memory in long-context audio-language models, revealing critical insights into representational drift and attention dynamics that can inform future research and model design.

Comprehensive Analysis

Methodology Assessment

The methodology is robust, introducing the EnvMem framework to systematically analyze the retention of acoustic information in multi-turn interactions. The authors employ a combination of controlled experiments, linear probing, and attention analysis to dissect the representation and retrieval mechanisms in LALMs. The use of synthetic dialogues and a clear structure for the evaluation tasks enhances the clarity of the experimental design. However, the reliance on synthetic data may limit the generalizability of the findings to real-world scenarios.
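As a toy picture of what linear probing can reveal, the sketch below fits a logistic-regression probe on frozen "early-turn" embeddings and then scores it on the same embeddings after a uniform shift. The 2-D points, the shift vector, and the training settings are invented for illustration; they are not EnvMem data and do not reproduce the paper's probing protocol.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(xs: List[Point], ys: List[int],
                epochs: int = 200, lr: float = 0.5) -> Tuple[List[float], float]:
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w: List[float], b: float, xs: List[Point], ys: List[int]) -> float:
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in zip(xs, ys)
    )
    return hits / len(xs)


# Toy frozen embeddings for two sound classes at an early turn.
early = [(1.0, 1.0), (0.8, 1.2), (1.2, 0.8), (-1.0, -1.0), (-0.8, -1.2), (-1.2, -0.8)]
labels = [1, 1, 1, 0, 0, 0]
# Hypothetical "drift": the same embeddings after a uniform shift at a later turn.
late = [(x + 2.5, y + 2.5) for x, y in early]

w, b = train_probe(early, labels)
early_acc = accuracy(w, b, early, labels)
late_acc = accuracy(w, b, late, labels)
```

In this toy setup the class information is still perfectly separable after the shift, yet the probe trained on early-turn embeddings no longer reads it out, which is one simple way a representational drift can look like forgetting.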

Experimental Evaluation

The experiments are comprehensive, evaluating multiple LALMs across various context lengths. The results demonstrate a clear performance gap between semantic and acoustic memory, with detailed analyses of representational drift and attention allocation. The use of metrics like accuracy and relative degradation provides a solid basis for comparison, although the paper could benefit from additional qualitative assessments of model outputs.

Reproducibility

The paper provides detailed descriptions of the experimental setup, including dataset construction and evaluation protocols. However, the lack of publicly available code or datasets limits reproducibility. Future work should consider releasing the EnvMem benchmark and associated models to facilitate further research in this area.

Limitations

The primary limitation is the use of synthetic data, which may not capture the complexities of natural conversations. Additionally, the interventions are post-hoc and may not translate to practical solutions for improving acoustic memory in deployed models. The study also acknowledges potential ethical concerns regarding privacy and surveillance in real-world applications.

Broader Impact

This research has significant implications for the development of more robust audio language models, particularly in applications requiring persistent awareness of environmental sounds. By highlighting the representational bottlenecks in LALMs, the findings can guide future training strategies and benchmark designs, ultimately improving the integration of acoustic memory in multimodal systems.

Analysis: Full Paper • Full text: 42,440 characters

MERIT: Learning Disentangled Music Representations for Audio Similarity

Abhinaba Roy, Junyi Liang, Dorien Herremans · arXiv

Current music similarity models typically compute a single, monolithic score, entangling distinct musical dimensions like melody, rhythm, and timbre. This limits user control and interpretability, making it impossible to execute nuanced queries. We introduce MERIT, a framework for learning disentangled, factor-specific music representations tailored to these three core dimensions. To overcome the lack of isolated musical variations in real-world audio, we use a novel training strategy that uses conditional audio generation and source-separated stems to strongly encourage single-factor variation in training data. Our evaluations demonstrate strong factor-wise disentanglement. Each head responds strongly to its intended perceptual dimension while remaining near chance on the others, a representational property that holds across both the synthetic training domain and independent real-world audio.

Institutional Affiliations

Primary: University

All Institutions: Company, Department of Computer Science, International Laboratories, University

GitHub

ML Relevance Analysis (82)

The main contribution of this paper is the introduction of MERIT, a framework that effectively disentangles musical dimensions for improved audio similarity assessment. This work significantly advances the state of music representation learning by providing a novel approach that enhances interpretability and user control in music similarity queries.

Comprehensive Analysis

Methodology Assessment

The methodology presented in MERIT is innovative, focusing on disentangled representations of music based on melody, rhythm, and timbre. The use of a frozen MERT backbone combined with a novel triplet construction strategy allows for effective training on isolated musical dimensions without manual labeling. The approach of leveraging generative models for creating training data is particularly noteworthy, as it addresses the challenge of entangled real-world audio data. The Circle Loss optimization technique further enhances the training process by focusing on hard negatives, which is a sound choice for improving representation quality.
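For reference, the standard Circle Loss formulation over cosine similarities can be sketched in a few lines. The `margin` and `gamma` values below are arbitrary illustrative choices, and how MERIT applies the loss to its factor-specific heads and triplets is not specified here, so this should be read as the generic loss only.

```python
import math
from typing import Sequence


def circle_loss(
    pos_sims: Sequence[float],
    neg_sims: Sequence[float],
    margin: float = 0.25,   # illustrative value, not MERIT's setting
    gamma: float = 32.0,    # illustrative scale factor, not MERIT's setting
) -> float:
    """Circle Loss for one anchor, given its positive and negative similarities."""
    o_p, o_n = 1.0 + margin, -margin          # similarity optima
    delta_p, delta_n = 1.0 - margin, margin   # decision margins
    # Each pair is re-weighted by how far it is from its optimum, so poorly
    # optimized pairs (hard positives and hard negatives) dominate the loss.
    pos_term = sum(
        math.exp(-gamma * max(o_p - s, 0.0) * (s - delta_p)) for s in pos_sims
    )
    neg_term = sum(
        math.exp(gamma * max(s - o_n, 0.0) * (s - delta_n)) for s in neg_sims
    )
    return math.log1p(pos_term * neg_term)


easy = circle_loss(pos_sims=[0.9], neg_sims=[0.1])   # well-separated triplet
hard = circle_loss(pos_sims=[0.3], neg_sims=[0.7])   # negative closer than positive
```

The self-paced weighting is what gives the emphasis on hard negatives mentioned above: a negative with high similarity contributes exponentially more than one that is already far from the anchor.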

Experimental Evaluation

The experiments are well-structured, utilizing both internal and external evaluations to assess the model's performance. The use of zero-shot probes on independent datasets demonstrates the generalizability of the learned representations. The results indicate strong factor-wise disentanglement, with high accuracy in distinguishing between the different musical dimensions. The human evaluation of triplet quality adds a valuable subjective perspective to the findings, reinforcing the model's effectiveness. Overall, the experimental design is robust and provides compelling evidence of the framework's capabilities.

Reproducibility

The paper provides sufficient details regarding the architecture, training procedures, and datasets used, which supports reproducibility. The authors have made the code and pre-trained models publicly available, further facilitating replication of their results. However, the reliance on specific datasets like MoisesDB and the generative model JASCO may limit reproducibility if these resources are not accessible to all researchers.

Limitations

Some limitations are acknowledged, such as the focus on only three musical dimensions (melody, rhythm, and timbre), which may overlook other important aspects like harmony and dynamics. Additionally, the operationalization of timbre at the instrument-class level may not capture within-class variations adequately. The authors also mention potential biases from the training data that could affect the model's performance in real-world scenarios.

Broader Impact

The implications of MERIT are significant for music information retrieval, recommendation systems, and music analysis tools. By enabling users to query music based on specific dimensions, it enhances user control and interpretability, which can lead to more personalized music experiences. The framework could also inspire further research into disentangled representations in other domains, potentially influencing broader applications in audio processing and machine learning.

Analysis: Full Paper • Full text: 25,629 characters

PitchBench: Measuring Pitch Hearing in Audio-Language Models

Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith ... · arXiv

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.

Institutional Affiliations

Primary: University of California, Berkeley

All Institutions: University of California, Berkeley, Thoughtful Lab

GitHub

ML Relevance Analysis (86)

The main contribution of this paper is the introduction of PitchBench, a systematic evaluation suite for measuring pitch hearing in audio-language models, which significantly enhances the understanding of how these models perceive musical pitch. This work represents a critical step toward improving the reliability and effectiveness of ALMs in real-world audio applications.

Comprehensive Analysis

Methodology Assessment

The methodology presented in PitchBench is robust and systematic, focusing on a hierarchical evaluation of pitch perception in audio-language models (ALMs). The paper introduces a comprehensive framework that includes 28 experiments designed to isolate and assess various aspects of pitch hearing, such as absolute and relative pitch perception. The use of controlled synthetic stimuli allows for precise measurement of model performance across different acoustic conditions and response formats. This structured approach is a significant improvement over existing benchmarks, which often fail to directly evaluate the fundamental ability to perceive pitch.
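
To make the idea of controlled synthetic stimuli concrete, the following is a minimal sketch of how a single pitch stimulus could be generated deterministically. It is not PitchBench's own generator: the function names, the 12-tone equal temperament tuning with A4 = 440 Hz, and the pure sine timbre are all assumptions made for illustration.

```python
import math

# Sharp-based note spelling; a benchmark could equally use flats or solfege.
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def midi_to_hz(midi_note: int, a4_hz: float = 440.0) -> float:
    """Equal-tempered frequency of a MIDI note number (69 is A4)."""
    return a4_hz * 2.0 ** ((midi_note - 69) / 12.0)


def midi_to_name(midi_note: int) -> str:
    """Scientific pitch notation, using the convention that MIDI 60 is C4."""
    return f"{NOTE_NAMES[midi_note % 12]}{midi_note // 12 - 1}"


def sine_tone(midi_note: int, duration_s: float = 0.5,
              sample_rate: int = 16000, amplitude: float = 0.5) -> list:
    """Render a pure sine tone as a list of float samples in [-1, 1]."""
    num_samples = int(round(duration_s * sample_rate))
    freq = midi_to_hz(midi_note)
    return [amplitude * math.sin(2.0 * math.pi * freq * i / sample_rate)
            for i in range(num_samples)]


if __name__ == "__main__":
    note = 60
    samples = sine_tone(note)
    print(midi_to_name(note), round(midi_to_hz(note), 2), len(samples))
```

Because the output depends only on the arguments, the same note, duration, and loudness always yield the same waveform, which is the property that makes this kind of stimulus easy to regenerate and to vary along one acoustic factor at a time.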

Experimental Evaluation

The experimental evaluation is thorough, involving six frontier ALMs across a wide range of tasks that assess pitch perception under varying conditions. The results reveal significant performance variability among models, highlighting specific failure modes that are not captured by higher-level benchmarks. The detailed analysis of model performance, including the effects of acoustic variations and response modalities, provides valuable insights into the strengths and weaknesses of current ALMs in pitch perception.

Reproducibility

The paper emphasizes reproducibility by providing a Python package that includes the evaluation data and generation tools. The authors detail the deterministic generation of stimuli, ensuring that other researchers can replicate the experiments. The inclusion of metadata and standardized output formats further supports reproducibility.

Limitations

While PitchBench offers a significant advancement in evaluating pitch perception, it relies entirely on algorithmically synthesized stimuli, which may not fully capture the complexities of real-world audio. The current instrument selection is limited to General MIDI instruments, and the benchmark does not address non-Western musical traditions or more complex rhythmic reasoning tasks. Future work is needed to incorporate real recordings and broaden the diversity of the instrument pool.

Broader Impact

The implications of PitchBench are substantial for the development of audio-language models, particularly in applications requiring reliable musical understanding, such as music tutoring, transcription, and recommendation systems. By providing a diagnostic tool for evaluating pitch perception, this work lays the groundwork for future advancements in multimodal AI systems that integrate audio understanding with other sensory inputs.

Analysis: Full Paper • Full text: 40,992 characters

A Multimodal Framework for Dementia Detection via Linguistic and Acoustic Representation Learning

Loukas Ilias, Dimitris Askounis · arXiv

Alzheimer's disease (AD) is a progressive neurodegenerative disorder and the leading cause of dementia, affecting memory, reasoning, communication, and daily functioning. Early diagnosis is particularly important, as timely intervention may help slow cognitive decline and improve patient care. Recent studies have demonstrated that spontaneous speech contains valuable linguistic and acoustic biomarkers associated with dementia. However, existing approaches often rely on independently trained modality-specific models, feature concatenation strategies, ensemble methods, or attention-based fusion mechanisms that do not explicitly maximize the dependency between speech and transcript representations. In this work, we propose a multimodal deep learning framework for automatic dementia detection that jointly exploits speech and transcript information in an end-to-end trainable manner. Specifically, speech recordings are divided into 10-second segments and passed through a pre-trained HuBERT model to extract contextualized acoustic representations. To better capture informative temporal speech characteristics, attentive statistics pooling is employed to aggregate frame-level acoustic embeddings. For the textual modality, transcripts are encoded using a pre-trained BERT model, where the [CLS] token representation is used as the linguistic embedding. The acoustic and textual representations are subsequently combined using an attention-based Audio-Text Fusion (AT-Fusion) mechanism. In addition, we introduce a MINE objective to maximize the mutual information between modalities and improve multimodal representation alignment. The fused multimodal representation is finally used for dementia classification. Experiments conducted on the publicly available ADReSS Challenge and PROCESS-2 dataset demonstrate the effectiveness and robustness of the proposed approach for speech-based dementia assessment.

Institutional Affiliations

Primary: National Technical University of Athens

All Institutions: National Technical University of Athens

ML Relevance Analysis (83)

This paper presents a multimodal deep learning framework for dementia detection that effectively combines acoustic and linguistic features, showcasing innovative methods and robust experimental validation. The technical contributions are significant, addressing critical gaps in existing approaches and offering a promising direction for future research in automatic dementia assessment.

Comprehensive Analysis

Methodology Assessment

The proposed methodology employs a novel multimodal deep learning framework that integrates both acoustic and linguistic representations for dementia detection. The use of HuBERT for acoustic representation and BERT for textual representation, combined with attentive statistics pooling and an innovative Audio-Text Fusion mechanism, demonstrates a sophisticated approach to capturing the nuances of speech relevant to cognitive decline. The introduction of the Mutual Information Neural Estimation (MINE) objective to enhance cross-modal representation alignment is particularly noteworthy, as it addresses a significant gap in existing multimodal approaches.
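
As an illustration of the pooling step, here is a minimal sketch of attentive statistics pooling over frame-level embeddings. It is not the authors' implementation: in a trained model the attention scores come from a small learned network, whereas here they are passed in as plain numbers, and the function name and the list-based representation are assumptions.

```python
import math


def attentive_stats_pool(frames, scores):
    """Pool T frame vectors (each of length D) into one vector of length 2*D.

    frames: list of T lists of floats (frame-level embeddings).
    scores: list of T unnormalized attention scores, one per frame
            (assumed to be produced by a learned scorer elsewhere).
    Returns the attention-weighted mean followed by the weighted std.
    """
    # Softmax over time, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    # Weighted variance as E[x^2] - E[x]^2, clamped at zero against round-off.
    var = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) - mean[d] ** 2
           for d in range(dim)]
    std = [math.sqrt(max(v, 0.0)) for v in var]
    return mean + std


if __name__ == "__main__":
    frames = [[0.0, 2.0], [2.0, 2.0]]
    print(attentive_stats_pool(frames, [0.0, 0.0]))    # equal attention
    print(attentive_stats_pool(frames, [100.0, 0.0]))  # first frame dominates
```

With equal scores the result reduces to the ordinary mean and standard deviation over time; unequal scores let informative frames contribute more to both statistics.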

Experimental Evaluation

The experiments are well-structured, utilizing two publicly available datasets (ADReSS Challenge and PROCESS-2) to validate the proposed framework. The results indicate competitive performance compared to state-of-the-art methods, with detailed metrics provided for accuracy, recall, and specificity. The ablation studies further strengthen the findings by demonstrating the effectiveness of various components of the proposed framework, such as pooling strategies and fusion methods.
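
For readers less familiar with the reported metrics, the sketch below shows how accuracy, recall, and specificity follow from binary predictions. The label convention (1 = dementia, 0 = control) and the example labels are hypothetical and are not results from the paper.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, recall (sensitivity), and specificity for 0/1 labels.

    Assumes 1 marks the positive class (here: dementia) and 0 the control class.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Share of true positive-class speakers that were detected.
        "recall": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of control speakers that were correctly left unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


if __name__ == "__main__":
    # Made-up labels for eight hypothetical speakers.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
    print(binary_metrics(y_true, y_pred))
```

Reporting recall and specificity alongside accuracy matters in screening settings, because a model can reach a reasonable accuracy while missing many positive cases or flagging many controls.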

Reproducibility

The paper provides a clear description of the methodology and experimental setup, including details on the datasets and evaluation metrics. However, there is no mention of code availability or a repository for others to reproduce the results, which limits the reproducibility aspect.

Limitations

One limitation is the reliance on specific datasets, which may not fully represent the diversity of speech patterns in broader populations. Additionally, while the framework shows promising results, the performance on different demographic groups or in real-world settings remains untested. The absence of a demo or project URL also hinders practical application and further exploration by the community.

Broader Impact

The framework has significant implications for early diagnosis and intervention in Alzheimer's disease, potentially improving patient care and outcomes. By leveraging speech analysis, the approach could facilitate non-invasive and efficient screening methods, which are crucial given the increasing prevalence of dementia globally. The integration of multimodal learning in this context also opens avenues for future research in cognitive health monitoring and related fields.

Analysis: Full Paper • Full text: 41,790 characters

CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

Junyang Chen, Yuhang Jia, Hui Wang ... · arXiv

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.

Institutional Affiliations

Primary: Nankai University

All Institutions: Nankai University

Demo

ML Relevance Analysis (83)

The paper presents CosyEdit2, a novel framework that enhances speech editing and zero-shot TTS through innovative reinforcement learning techniques and a well-structured methodology. The contributions are significant, addressing key limitations in the field and paving the way for future advancements in audio processing technologies.

Comprehensive Analysis

Methodology Assessment

The paper introduces CosyEdit2, a two-stage post-training framework that innovatively combines supervised fine-tuning with reinforcement learning (GRPO) to enhance speech editing capabilities while also improving zero-shot TTS performance. The methodology is well-structured, addressing the limitations of previous approaches by eliminating the need for imperfect paired data and optimizing through editing-specific rewards. The architecture leverages a unified text-speech language model and a conditional flow-matching model, showcasing a novel integration of LLMs with audio processing.
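
GRPO scores each sampled output relative to the other samples drawn for the same prompt, rather than relying on a learned value function. The sketch below shows only that group-relative normalization step. The reward values are made up, the paper's editing-specific reward is not reproduced, and the use of the population standard deviation is an assumption (implementations differ on this detail).

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean, unit scale.

    rewards: scalar rewards for G candidate outputs of the same prompt.
    Returns one advantage per candidate: (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four candidate edits of one utterance.
    rewards = [0.9, 0.4, 0.7, 0.2]
    for reward, adv in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:.2f} advantage={adv:+.3f}")
```

Candidates that beat their group's average receive positive advantages and are reinforced; when every candidate in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.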

Experimental Evaluation

The experiments are extensive, utilizing multiple benchmarks for both speech editing and zero-shot TTS. The results demonstrate significant improvements over existing models, particularly in terms of acoustic consistency and editing fidelity. The use of both objective and subjective evaluation metrics strengthens the findings, providing a comprehensive assessment of the model's performance.

Reproducibility

The paper provides detailed training and evaluation setups, including data sources, training parameters, and model architectures, which facilitate reproducibility. However, access to the datasets used for training and evaluation may be a limiting factor for complete reproducibility.

Limitations

The authors acknowledge limitations in the design space of the reward formulation and the language coverage of the framework, which is currently constrained to a few languages. Additionally, broader acoustic editing capabilities remain unexplored, suggesting areas for future research.

Broader Impact

The advancements in speech editing and zero-shot TTS have significant implications for applications in accessibility, multimedia production, and human-computer interaction. However, the potential for misuse in voice impersonation and misinformation propagation raises ethical concerns that need to be addressed through responsible deployment practices.

Analysis: Full Paper • Full text: 50,026 characters

Subspace Track-before-Detect for Passive Multi-Target Tracking with Unknown Emitted Signals

Nobutaka Ito, Yoshiaki Bando · arXiv

Passive multi-target tracking (MTT) aims to infer the kinematic states of multiple targets from noisy sensor data in which contributions from unknown target-emitted signals are superposed. Track-before-detect (TBD) methods improve robustness to noise by operating directly on raw sensor data without relying on a preceding detection stage. However, many existing TBD methods assume that each target's contribution to the sensor data is determined solely by its kinematic state. This assumption limits their applicability to passive MTT, where each target's contribution depends on both its kinematic state and the unknown emitted signal. We propose subspace TBD, a passive multi-target TBD method based on a likelihood derived from the complex Bingham distribution that does not require explicit modeling or estimation of the unknown emitted signals. In a particle filter (PF) framework, each multi-target hypothesis is mapped to a low-dimensional subspace spanned by the steering vectors corresponding to the hypothesized target states. The likelihood is then used to evaluate the alignment of the normalized multichannel sensor data with this subspace. Preliminary experiments with simulated acoustic measurements and a given target activity pattern show that the proposed method can track two moving targets emitting unknown signals at a signal-to-noise ratio (SNR) of -10 dB, whereas a conventional TBD baseline yields substantially larger tracking errors.

Institutional Affiliations

Primary: National Institute of Advanced Industrial Science and Technology (AIST)

All Institutions: National Institute of Advanced Industrial Science and Technology (AIST)

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a novel subspace track-before-detect methodology for passive multi-target tracking that effectively addresses the challenges posed by unknown emitted signals. This work represents a significant advancement in the field of audio signal processing and multi-target tracking, offering a robust solution for low-SNR environments and paving the way for future research in more complex scenarios.

Comprehensive Analysis

Methodology Assessment

The proposed methodology, subspace track-before-detect (TBD), innovatively addresses the challenges of passive multi-target tracking (MTT) in environments where the emitted signals from targets are unknown. By leveraging the complex Bingham distribution to model the observation likelihood without requiring explicit estimation of the emitted signals, the authors effectively circumvent a significant limitation of conventional TBD methods. The use of a particle filter framework to implement this approach allows for robust tracking of multiple targets in low signal-to-noise ratio (SNR) conditions, which is a notable advancement in the field.
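
The core idea, scoring a hypothesis by how well the normalized sensor snapshot aligns with the subspace spanned by the hypothesized steering vectors, can be sketched as follows. This is an illustration under stated assumptions, not the authors' likelihood: the uniform linear array geometry, the narrowband far-field steering vector, the concentration value `kappa`, and all function names are hypothetical, and the normalizing term of the complex Bingham density is ignored.

```python
import cmath
import math


def steering_vector(angle_rad, num_mics, spacing_over_wavelength=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    return [cmath.exp(-2j * math.pi * spacing_over_wavelength * m * math.sin(angle_rad))
            for m in range(num_mics)]


def inner(u, v):
    """Hermitian inner product u^H v."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt orthonormalization of the hypothesized steering vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = inner(q, w)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(inner(w, w).real)
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def subspace_alignment(x, hypothesized_angles):
    """Fraction of the energy of snapshot x inside the hypothesized subspace."""
    basis = orthonormal_basis([steering_vector(a, len(x)) for a in hypothesized_angles])
    return sum(abs(inner(q, x)) ** 2 for q in basis) / inner(x, x).real


def log_weight(x, hypothesized_angles, kappa=50.0):
    """Unnormalized Bingham-style log-weight kappa * z^H P z, with z = x / ||x||."""
    return kappa * subspace_alignment(x, hypothesized_angles)


if __name__ == "__main__":
    mics = 6
    a1, a2 = steering_vector(0.3, mics), steering_vector(-0.7, mics)
    # Two sources with arbitrary complex amplitudes (the "unknown emitted signals").
    x = [2.0 * u + (1.0 - 1.0j) * v for u, v in zip(a1, a2)]
    print(log_weight(x, [0.3, -0.7]), log_weight(x, [1.0, -1.2]))
```

Because the score depends only on the projection of the normalized snapshot onto the subspace, rescaling or rephasing the source amplitudes leaves it unchanged, which is why the emitted signals never need to be estimated. In a particle filter, a quantity of this kind would be accumulated per particle and exponentiated to obtain importance weights.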

Experimental Evaluation

The experiments conducted are well-structured, utilizing simulated acoustic measurements to validate the proposed method. The comparison against a conventional deterministic-contribution baseline highlights the effectiveness of the subspace TBD approach, particularly in low SNR scenarios. The results demonstrate a significant improvement in tracking accuracy, with lower root mean square errors (RMSE) across various conditions, reinforcing the practical applicability of the method.

Reproducibility

The paper provides sufficient details regarding the experimental setup, including the simulation parameters and the configuration of the particle filter. However, the lack of a publicly accessible code repository or demo limits the reproducibility of the results. Future work should include sharing the implementation to facilitate validation and further exploration by the research community.

Limitations

One limitation of the study is the reliance on simulated data, which may not fully capture the complexities of real-world scenarios. The paper also assumes a fixed activity pattern for the targets, which may not be realistic in dynamic environments. Additionally, the method's performance in more complex acoustic settings, such as reverberant environments or with more than two targets, remains to be evaluated.

Broader Impact

The proposed subspace TBD method has significant potential applications in various fields, including surveillance, autonomous vehicles, and audio signal processing. By improving the robustness of multi-target tracking in noisy environments, this research could enhance systems that rely on accurate target localization and tracking, thereby contributing to advancements in safety and efficiency in real-time applications.

Analysis: Full Paper • Full text: 21,905 characters

Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control

Wangzixi Zhou, Bagus Tris Atmaja, Sakriani Sakti · Proc. 2025 28th Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardisation of Speech Databases and Assessment Techniques (O-COCOSDA), pp. 1-6, 2025

While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-quality, fine-grained annotations, which restricts a model's ability to precisely control NV generation. To address this limitation, we propose a novel approach for fine-grained non-verbal expression synthesis. We curate and reprocess female NV utterances from the EARS corpus, develop a new annotation scheme using tags to encode NV types, frequencies, and durations, and build an emotional TTS benchmark to demonstrate its effectiveness. Our evaluation shows that while our NV approach leads to minor trade-offs in perceived naturalness, it significantly improves expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%). Emotion-specific analysis further reveals that NV cues are highly effective for high-arousal emotions like happy (82.5%) and fear (82.7%), and almost perfectly convey sadness (98.3%).

Institutional Affiliations

Primary: Nara Institute of Science and Technology

All Institutions: Nara Institute of Science and Technology

Demo

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of a fine-grained non-verbal expression dataset and a corresponding TTS system that significantly enhances emotional expressiveness in synthesized speech. This work represents a meaningful advancement in the field of emotional TTS synthesis, addressing critical gaps in existing methodologies and datasets.

Comprehensive Analysis

Methodology Assessment

The methodology presented in this paper is robust, focusing on the development of a fine-grained non-verbal expression dataset and a corresponding TTS system. The authors effectively address the limitations of existing datasets by introducing a novel annotation scheme that allows for precise control over non-verbal vocalizations. The use of Grad-TTS as the backbone model, enhanced with an emotion encoder, demonstrates a thoughtful integration of emotional embeddings into the synthesis process. The segmentation and transcription processes are well-detailed, showcasing a clear understanding of audio processing and the importance of high-quality data in training TTS systems.

Experimental Evaluation

The experimental evaluation is comprehensive, involving subjective assessments of naturalness and emotional expressiveness, as well as emotion recognition accuracy. The use of a diverse set of evaluation metrics, including eMOS and nMOS, provides a nuanced understanding of the model's performance. The results indicate a significant improvement in expressiveness with the fine-grained NV approach, although there is a minor trade-off in perceived naturalness. The emotion-specific analysis adds depth to the findings, illustrating the effectiveness of NV cues in conveying various emotional states.

Reproducibility

The paper provides sufficient detail regarding the dataset construction, model architecture, and evaluation procedures, which enhances reproducibility. However, the absence of a publicly available code repository limits other researchers' ability to fully replicate the study. The authors could improve reproducibility by sharing their code and trained models.

Limitations

One limitation is the focus on female NV utterances, which may not generalize well to male voices or other demographics. Additionally, the minor trade-off in naturalness when incorporating NVs could be a concern for practical applications. The subjective nature of the evaluations may also introduce variability, as individual preferences for emotional expression can differ widely.

Broader Impact

This research has significant implications for the development of more emotionally intelligent conversational AI systems. By enhancing the expressiveness of TTS systems through the integration of non-verbal vocalizations, the work contributes to creating more engaging and human-like interactions in various applications, including virtual assistants, gaming, and mental health support systems.

Analysis: Full Paper • Full text: 19,042 characters

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

Hui-Peng Du, Yang Ai, Xiao-Hang Jiang ... · IEEE/ACM Transactions on Audio, Speech, and Language Processing

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and quantization instability. To this end, we propose FMelCodec, an ultra-low-bitrate neural speech codec in the mel-spectrogram domain, cast as a three-stage coding-refinement-reconstruction (CRR) framework that can operate at as low as 250 bps. In the CRR framework, the front-end mel-spectrogram coding stage employs a highly aggressive 640x compression/decompression encoder-decoder structure with a single 1024-entry VQ codebook, coupled with an online clustering strategy that reassigns underused codewords to prevent codebook collapse and preserve codebook diversity. The subsequent conditional flow matching (CFM)-based mel-spectrogram refinement stage leverages a lightweight velocity-field estimator and CFM-based solver to refine the codec-degraded mel-spectrogram produced by the preceding decoder, and adopts a self-consistency training scheme that supports fewer iterative inference steps for the purpose of reducing computational overhead. Finally, the vocoding-driven waveform reconstruction stage employs a HiFi-GAN vocoder to faithfully reconstruct waveform from the refined mel-spectrogram. Experiments conducted on two datasets spanning two sampling rates show that, under ultra-low-bitrate constraints of 250 bps for 16 kHz and 750 bps for 48 kHz, both objective and subjective evaluations consistently demonstrate that FMelCodec achieves higher speech reconstruction quality and speaker similarity, while incurring lower computational and model complexity.

Institutional Affiliations

Primary: University of Science and Technology of China

All Institutions: University of Science and Technology of China, National Institute of Informatics, Baidu Speech Department, National Engineering Research Center of Speech and Language Information Processing

Demo · GitHub

ML Relevance Analysis (83)

The main contribution of this paper is the introduction of FMelCodec, a novel ultra-low-bitrate speech codec that effectively balances compression efficiency and speech quality through a sophisticated three-stage framework, demonstrating significant advancements in the field of neural speech coding. The methodology and results presented have the potential to influence future developments in audio processing and communication technologies.

Comprehensive Analysis

Methodology Assessment

The paper introduces FMelCodec, a novel three-stage coding-refinement-reconstruction (CRR) framework for ultra-low-bitrate speech coding that operates in the mel-spectrogram domain. The methodology is well-structured, leveraging a single-codebook vector quantization approach combined with conditional flow matching (CFM) for refinement and a HiFi-GAN vocoder for reconstruction. The online clustering strategy for codebook management is particularly innovative, addressing codebook collapse effectively. The self-consistency training scheme enhances computational efficiency, allowing fewer inference steps while maintaining quality.
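
The bitrates quoted in the abstract can be checked with a short calculation. The sketch below assumes that the 640x compression factor is counted relative to waveform samples and that each token indexes a single 1024-entry codebook; that reading is an assumption, though it reproduces the 250 bps and 750 bps figures. The function name is hypothetical.

```python
import math


def codec_bitrate_bps(sample_rate_hz, compression_factor, codebook_size, num_codebooks=1):
    """Bitrate of a VQ codec: tokens per second times bits per token."""
    tokens_per_second = sample_rate_hz / compression_factor
    bits_per_token = num_codebooks * math.log2(codebook_size)
    return tokens_per_second * bits_per_token


if __name__ == "__main__":
    for sample_rate in (16000, 48000):
        bps = codec_bitrate_bps(sample_rate, compression_factor=640, codebook_size=1024)
        print(f"{sample_rate} Hz -> {bps:.0f} bps")
```

Under these assumptions a 16 kHz signal yields 25 tokens per second at 10 bits each, and a 48 kHz signal yields 75 tokens per second, so the same architecture lands at 250 bps and 750 bps respectively without changing the codebook.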

Experimental Evaluation

The experiments are robust, utilizing two datasets (LibriTTS and VCTK) across different sampling rates. The evaluation metrics include both objective and subjective assessments, showcasing FMelCodec's superiority in reconstruction quality and speaker similarity at ultra-low bitrates. The results are statistically significant, demonstrating the codec's effectiveness compared to existing baselines, which is crucial for validating the proposed approach.

Reproducibility

The paper provides detailed implementation configurations, including model architectures, training procedures, and hyperparameters, which enhances reproducibility. The availability of code and trained models on GitHub further supports this aspect, allowing other researchers to replicate the results.

Limitations

While the proposed method shows promising results, the reliance on a single codebook may limit flexibility in representing diverse speech characteristics. Additionally, the computational efficiency, although improved, may still be a concern in extremely resource-constrained environments. The paper does not extensively discuss the scalability of the approach to other languages or dialects, which could be a limitation in broader applications.

Broader Impact

FMelCodec has significant implications for bandwidth-constrained communication systems, such as satellite communications and mobile devices, where low-bitrate speech coding is essential. Its potential applications extend to telecommunication, voice-over-IP services, and assistive technologies for individuals with speech impairments. The advancements in neural speech coding could also influence future research in audio processing and machine learning.

Analysis: Full Paper • Full text: 50,026 characters

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models

Wangzixi Zhou, Takuma Okamoto, Yamato Ohtani ... · Proc. ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 17012-17016, 2026

Most neural vocoders are limited to one type: either GAN or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited performance in multi-speaker settings. Moreover, diffusion models, despite training faster than GANs, have slow CPU inference. In this paper, we introduce WaveNeXt 2, a unified ConvNeXt-based framework compatible with both GAN and diffusion vocoders. Its core innovation is residual denoising and sub-modeling, where each sub-model progressively refines the waveform. Experimental results on a multi-speaker dataset demonstrate the effectiveness of our approach: (1) GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit, and (2) Diff-WaveNeXt 2 also delivers much faster inference and competitive synthesis quality compared with FastDiff with 4 steps. Diff-WaveNeXt 2 is also very training-efficient, training in only 32 hours, making it ideal for resource-constrained applications.

Institutional Affiliations

Primary: Nara Institute of Science and Technology

All Institutions: Nara Institute of Science and Technology, National Institute of Information and Communications Technology

Demo

ML Relevance Analysis (83)

WaveNeXt 2 represents a significant step forward in the development of neural vocoders, providing a unified framework that enhances performance and efficiency in both GAN and diffusion contexts. The comprehensive methodology, rigorous experimental evaluation, and potential for real-world applications underscore its importance in the field of machine learning and audio processing.

Comprehensive Analysis

Methodology Assessment

The proposed WaveNeXt 2 framework introduces a novel architecture that integrates ConvNeXt-based residual denoising and sub-modeling, allowing it to function effectively in both GAN and diffusion vocoder contexts. This dual compatibility is a significant advancement, as it addresses the limitations of existing models that are typically confined to one framework. The methodology is well-structured, with clear delineation between the GAN and diffusion approaches, and the use of sub-models for noise-level conditioning is a clever adaptation that enhances performance and efficiency. The authors provide a comprehensive description of the architecture, training strategies, and inference processes, which demonstrates a solid understanding of the challenges in neural vocoding.

Experimental Evaluation

The experiments are robust, utilizing a substantial dataset (LibriTTS-R) and employing both subjective (MOS) and objective (UTMOS, NISQA, MCD, log F0 RMSE) evaluation metrics. The results indicate that GAN-WaveNeXt 2 runs much faster than HiFi-GAN and WaveFit, and that Diff-WaveNeXt 2 runs much faster than 4-step FastDiff while reaching competitive synthesis quality. The comparative analysis with baseline models is thorough, providing clear evidence of the proposed models' advantages. However, the paper could benefit from more extensive ablation studies to further validate the contributions of individual components.

Reproducibility

The authors provide sufficient implementation details, including the use of PyTorch and specific training configurations, which aids reproducibility. However, the absence of a public code repository limits the ease with which other researchers can replicate the results. Including a link to the implementation or a GitHub repository would enhance reproducibility significantly.

Limitations

While the paper presents strong results, it acknowledges that the increased model size due to sub-modeling could be a drawback for deployment in resource-constrained environments. Additionally, the reliance on specific architectures may limit the generalizability of the findings to other vocoder designs. The paper could also explore the trade-offs between model complexity and performance in more depth.

Broader Impact

The advancements presented in WaveNeXt 2 have significant implications for real-time speech synthesis applications, particularly in multi-speaker scenarios and resource-constrained environments. The ability to unify GAN and diffusion frameworks could lead to more versatile and efficient vocoders, potentially enhancing the quality of synthesized speech in various applications, including virtual assistants, audiobooks, and gaming. The work could inspire further research into hybrid models that leverage the strengths of both GANs and diffusion processes.

Analysis: Full Paper • Full text: 21,150 characters

Continual Speaker Identity Unlearning with Minimal Interference

Jinju Kim, Yunsung Kang, Gyeong-Moon Park ... · arXiv

Machine unlearning removes designated concepts or knowledge from pre-trained models. Recent work has extended this paradigm to speaker identity unlearning in zero-shot text-to-speech (ZS-TTS), the task of selectively erasing a model's ability to replicate a speaker's voice. Existing methods, however, quietly assume all unlearning requests arrive at once, an unrealistic assumption, since privacy-motivated removals arrive sequentially over time. We show this assumption breaks state-of-the-art methods: unlearning each new speaker fully revives previously unlearned speakers, reintroducing the very privacy risk unlearning was meant to eliminate. We present Cumulative ORThogonal Identity Suppression (CORTIS), the first framework for continual speaker identity unlearning in ZS-TTS that requires no access to previously-unlearned speaker data. CORTIS combines Fisher-information-based parameter masking, which localizes updates to speaker-relevant weights, with orthogonal projection against subspaces spanned by prior unlearning updates. With VoiceBox, CORTIS unlearns each requested speaker while keeping previously unlearned speakers forgotten across long request sequences, substantially outperforming sequential application of prior methods. The demo is available at https://cumulativeortis.github.io/.

Institutional Affiliations

Primary: Sungkyunkwan University

All Institutions: Sungkyunkwan University, Korea University

Demo

ML Relevance Analysis (82)

The paper presents CORTIS, a novel framework for continual speaker identity unlearning in zero-shot text-to-speech systems, effectively addressing privacy concerns while maintaining model performance. The integration of advanced techniques in machine unlearning and continual learning marks a significant contribution to the field, with strong experimental validation and practical implications for privacy in AI.

Comprehensive Analysis

Methodology Assessment

The proposed CORTIS framework innovatively addresses the problem of continual speaker identity unlearning in zero-shot text-to-speech systems. By combining Fisher-information-based parameter masking with orthogonal projection, it effectively prevents catastrophic re-learning of previously unlearned speakers while maintaining the quality of the remaining speakers. This dual approach is a significant advancement over previous methods that assumed simultaneous unlearning requests and failed to account for sequential requests, which is a more realistic deployment scenario. The methodology is well-justified and grounded in the principles of continual learning and machine unlearning, showcasing a thoughtful integration of concepts from both fields.
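The two ingredients named above can be illustrated on a flat parameter vector. The following is a minimal sketch, not the authors' implementation: it assumes a diagonal empirical Fisher estimate (mean squared per-sample gradient), a top-fraction mask, and Gram-Schmidt projection applied after masking. The function names, the keep_fraction parameter, and the order of masking and projection are all hypothetical choices made for illustration.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fisher_mask(per_sample_grads: List[Vector], keep_fraction: float) -> List[int]:
    """Assumed masking rule: keep the parameters with the largest diagonal
    empirical Fisher values (mean of squared per-sample gradients)."""
    n_params = len(per_sample_grads[0])
    fisher = [
        sum(g[i] ** 2 for g in per_sample_grads) / len(per_sample_grads)
        for i in range(n_params)
    ]
    k = max(1, round(keep_fraction * n_params))
    top = sorted(range(n_params), key=lambda i: fisher[i], reverse=True)[:k]
    mask = [0] * n_params
    for i in top:
        mask[i] = 1
    return mask


def remove_components(v: Vector, basis: List[Vector]) -> Vector:
    """Subtract from v its components along each orthonormal basis vector."""
    w = list(v)
    for b in basis:
        c = dot(w, b)
        w = [wi - c * bi for wi, bi in zip(w, b)]
    return w


def orthonormalize(vectors: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt basis for the span of the prior unlearning updates."""
    basis: List[Vector] = []
    for v in vectors:
        w = remove_components(v, basis)
        norm = dot(w, w) ** 0.5
        if norm > eps:  # skip vectors already inside the span
            basis.append([wi / norm for wi in w])
    return basis


def constrained_update(raw: Vector, mask: List[int], prior: List[Vector]) -> Vector:
    """Mask the raw update, then project it onto the orthogonal complement
    of the subspace spanned by earlier updates (assumed order of operations)."""
    masked = [u * m for u, m in zip(raw, mask)]
    return remove_components(masked, orthonormalize(prior))
```

By construction the returned update has zero inner product with every stored prior update, which is the property the orthogonal projection is meant to provide. Note that in this sketch projecting after masking can reintroduce small values outside the mask; how the paper handles that interaction is not described in the text above.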

Experimental Evaluation

The experiments are robust, utilizing a well-defined evaluation scenario with clear metrics for assessing both retention of previously learned speakers and the quality of the generated speech. The results demonstrate that CORTIS outperforms existing methods in maintaining speaker identity suppression across multiple requests, with quantitative metrics supporting the claims made. The use of a controlled backbone (VoiceBox) ensures fair comparisons, and the detailed ablation studies provide insights into the contributions of each component of the proposed method.

Reproducibility

The paper provides comprehensive implementation details, including the architecture of the backbone model and the specific configurations used for training and evaluation. This level of detail enhances reproducibility, allowing other researchers to replicate the experiments effectively. However, the reliance on specific datasets and models may limit broader applicability without further validation across different architectures.

Limitations

The paper acknowledges limitations such as the lack of adversarial robustness and the focus on a single backbone model (VoiceBox). Additionally, while the proposed method is effective, the computational overhead introduced by the CORTIS framework may pose challenges for real-time applications. Future work could explore the scalability of the method and its performance across various architectures and datasets.

Broader Impact

The implications of this work are significant, particularly in the context of privacy and data protection regulations like GDPR and CCPA. By providing a mechanism for continual speaker identity unlearning, the research contributes to the responsible deployment of zero-shot text-to-speech systems, which can have far-reaching effects on user privacy and consent in AI applications. The framework could be adapted for other domains requiring similar unlearning capabilities, thus broadening its impact.

Analysis: Full Paper • Full text: 47,456 characters

cSTMM: A Unified Complex Spherical Student's $t$ Mixture Model for Directional Statistics in Mask-Based Blind Speech Separation

Nobutaka Ito · arXiv

Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations on the complex unit sphere, without explicitly extracting phase and level difference features based on the plane-wave or spherical-wave assumptions. However, prior studies have mostly compared a small number of separately defined directional statistical mixture models, whereas a broader distribution family would enable a more systematic study of how density profiles affect separation performance. We propose the complex spherical Student's t mixture model (cSTMM), a directional mixture model that connects the complex angular central Gaussian mixture model (cACGMM), complex Bingham mixture model (cBMM), and complex Watson mixture model (cWMM) through the degrees-of-freedom parameter $ν$. We also derive a generalized minorization-maximization (MM) based procedure for parameter estimation. A no-restart evaluation on noise-free LibriSpeech mixtures reverberated with measured room impulse responses shows that a single development-selected value $ν^\ast=1$ achieved higher test-set mean signal-to-distortion ratio improvements (SDRi) than the cACGMM-equivalent setting $ν=M$ in all acoustic conditions, with an average condition-wise gain of 0.25 dB. The experiments also verify numerically that the proposed formulation recovers the cACGMM, cBMM, and cWMM cases.

Institutional Affiliations

Primary: Artificial Intelligence Research Center, AIST, Japan

All Institutions: Artificial Intelligence Research Center, AIST, Japan

ML Relevance Analysis (78)

The main contribution of this paper is the introduction of the cSTMM, which unifies existing directional statistical models for blind speech separation and demonstrates its effectiveness through rigorous experimental evaluation. Overall, the paper makes a meaningful contribution to the field of audio signal processing, particularly in enhancing the performance of mask-based speech separation techniques.

Comprehensive Analysis

Methodology Assessment

The paper introduces the complex spherical Student's t mixture model (cSTMM), which unifies several existing directional statistical mixture models (cACGMM, cBMM, cWMM) under a single framework. The methodology is robust, employing a generalized minorization-maximization (MM) procedure for parameter estimation, which is a significant contribution to the field. The approach allows for systematic exploration of how different density profiles impact speech separation performance, addressing a gap in prior research that focused on isolated models. The derivation of the model and the updates for parameter estimation are well-articulated, showing a clear understanding of the underlying statistical principles.
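To make the "complex unit sphere" setting concrete, the sketch below normalizes a multichannel time-frequency observation to unit norm and evaluates the squared inner product |a^H z|^2, the quantity that Watson-type directional densities depend on. It is only an illustration of the input representation, not the cSTMM density or its MM updates; the function names are hypothetical.

```python
import cmath
from math import sqrt
from typing import List

ComplexVector = List[complex]


def normalize(x: ComplexVector) -> ComplexVector:
    """Project a multichannel observation onto the complex unit sphere."""
    norm = sqrt(sum(abs(v) ** 2 for v in x))
    return [v / norm for v in x]


def alignment(a: ComplexVector, z: ComplexVector) -> float:
    """|a^H z|^2 for unit-norm a and z; lies in [0, 1] and ignores global phase."""
    inner = sum(ai.conjugate() * zi for ai, zi in zip(a, z))
    return abs(inner) ** 2


# A source observed at two TF points with different gain and global phase
# maps to the same direction, so the alignment stays at 1.
x = [1 + 1j, 2 - 1j]
x_scaled = [3.0 * cmath.exp(0.7j) * v for v in x]
print(alignment(normalize(x), normalize(x_scaled)))
```

Because scaling and global phase are discarded, clustering on these normalized vectors uses only inter-channel level and phase relations, which is why no explicit phase or level difference features are needed.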

Experimental Evaluation

The experiments are well-structured, utilizing the LibriSpeech dataset and a variety of acoustic conditions to evaluate the performance of the proposed model. The results show a consistent gain in mean signal-to-distortion ratio improvement (SDRi) across all tested acoustic conditions (0.25 dB on average), with a clear methodology for selecting hyperparameters on a development set. The inclusion of model recovery tests further strengthens the experimental validation, confirming that the cSTMM can effectively recover the properties of the models it encompasses.

Reproducibility

The paper provides sufficient detail regarding the experimental setup, including the choice of datasets, evaluation metrics, and parameter settings. However, the absence of a publicly available implementation or code repository limits reproducibility. Future work should consider making the model and experiments accessible to facilitate validation by other researchers.

Limitations

While the paper presents a novel model and shows promising results, the improvements in SDRi are modest (averaging 0.25 dB), which may not be substantial enough to warrant a shift from existing methods in practical applications. Additionally, the model's performance in noisy or real-world environments remains untested, which could be a significant limitation for its applicability.

Broader Impact

The cSTMM has the potential to advance the field of blind speech separation, particularly in scenarios where supervised learning is impractical. By providing a unified framework for directional statistics, it could lead to more robust speech separation systems, benefiting applications in telecommunications, hearing aids, and automatic speech recognition. The systematic exploration of density profiles may also inspire further research into adaptive signal processing techniques.

Analysis: Full Paper • Full text: 13,287 characters

Score-Agnostic Structure Analysis in Large-Scale Performance Datasets

Patricia Hu, Silvan Peter, Gerhard Widmer · Music Encoding Conference (MEC) 2026

In recent years, thanks to advances in automatic music transcription (AMT), several large-scale datasets of automatically transcribed piano solo music have been released. While these datasets undoubtedly offer extensive material for performance studies, they vary substantially in quality. In the case of classical music, performances often differ not only in expressive aspects such as tempo, but also in their structural interpretation of the score (including repeat patterns and edition-specific variants). To meaningfully use large-scale transcribed datasets for performance research, transcriptions of the same piece must be grouped according to their underlying structural realisation to support valid comparison. We address this by applying sequence-to-sequence alignment followed by hierarchical clustering: we create pairwise alignments for all pairs of transcriptions of a given piece, and use the alignment cost and (dis)similarity of performed sequence lengths to resolve structural mismatches as features for grouping. We propose this approach as a first step towards automatically evaluating large-scale transcribed datasets that lack ground-truth score and/or audio, shifting the evaluation criterion from truth-based accuracy to musical coherence and plausibility. We demonstrate our score-agnostic approach on around 1,500 transcriptions of 88 compositions from a recently published large-scale transcribed piano performance dataset.

Institutional Affiliations

Primary: Johannes Kepler University Linz

All Institutions: Johannes Kepler University Linz, LIT AI Lab, Linz Institute of Technology

Demo · GitHub

ML Relevance Analysis (78)

The paper presents a novel approach to automatically align and cluster transcriptions of musical performances based on structural interpretations. It significantly contributes to the field by providing a scalable, reference-free method for evaluating large-scale transcribed datasets, which is essential as the volume of available music data continues to grow.

Comprehensive Analysis

Methodology Assessment

The proposed methodology effectively combines sequence-to-sequence alignment using Dynamic Time Warping (DTW) with hierarchical clustering to address the challenge of grouping transcriptions based on structural interpretations. The use of a custom distance metric that balances harmonic similarity and timing differences is innovative and tailored to the nuances of musical performance. The two-step approach, which includes both alignment and clustering, is well-structured and demonstrates a clear understanding of the complexities involved in music performance analysis. However, the paper could benefit from a more detailed discussion on the choice of parameters and their impact on the results.
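The align-then-cluster pipeline can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the mpteval implementation: the local DTW cost is a 0/1 pitch mismatch rather than the paper's custom metric, the pairwise distance is an equal-weight mix of normalized alignment cost and relative length gap, and clustering is average-linkage with a hand-picked threshold. All names and parameter values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def dtw_cost(a: Sequence[int], b: Sequence[int]) -> float:
    """Classic DTW with a 0/1 pitch-mismatch cost, normalized by len(a) + len(b)."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)


def pair_distance(a: Sequence[int], b: Sequence[int], w: float = 0.5) -> float:
    """Mix alignment cost with the relative difference in sequence length."""
    length_gap = abs(len(a) - len(b)) / max(len(a), len(b))
    return w * dtw_cost(a, b) + (1.0 - w) * length_gap


def cluster(items: List[Sequence[int]], threshold: float) -> List[List[int]]:
    """Average-linkage agglomerative clustering; stops once the closest
    pair of clusters is farther apart than the threshold."""
    n = len(items)
    dist: Dict[Tuple[int, int], float] = {
        (i, j): pair_distance(items[i], items[j])
        for i in range(n) for j in range(i + 1, n)
    }

    def linkage(c1: List[int], c2: List[int]) -> float:
        total = sum(dist[(min(i, j), max(i, j))] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters)) for q in range(p + 1, len(clusters))
        )
        if best > threshold:
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters


theme = [60, 62, 64, 65]  # toy MIDI pitch sequence
performances = [
    theme,                         # no repeat
    [60, 62, 63, 65],              # no repeat, one transcription error
    theme * 2,                     # repeat taken
    theme + [60, 62, 63, 65],      # repeat taken, one transcription error
]
print(cluster(performances, threshold=0.2))
```

In this toy example the performances with and without the repeat end up in separate groups, while a single wrong note does not split a group. That mirrors the stated goal of separating structural differences from transcription artifacts.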

Experimental Evaluation

The experiments conducted on the ATEPP dataset are comprehensive, covering a significant number of transcriptions and compositions. The evaluation metrics used, such as homogeneity, completeness, and V-Measure, are appropriate for assessing clustering performance. The results indicate that the proposed method is robust against structural differences and transcription artifacts, which is a critical aspect of the research. However, the paper could enhance its impact by providing more comparative analyses with existing methods beyond the baseline score-dependent repeat estimator.
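For readers unfamiliar with the clustering metrics mentioned here, the following sketch computes homogeneity, completeness, and V-Measure from their entropy-based definitions, using only the standard library. It follows the standard formulation (homogeneity = 1 - H(C|K)/H(C), completeness = 1 - H(K|C)/H(K), V = harmonic mean) and is not taken from the paper's code; the label lists in the example are made up.

```python
from collections import Counter
from math import log
from typing import Hashable, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target: Sequence[Hashable], given: Sequence[Hashable]) -> float:
    """H(target | given) estimated from label co-occurrence counts."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum((c / n) * log(c / given_counts[g]) for (g, _), c in joint.items())


def v_measure(truth: Sequence[Hashable], pred: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (homogeneity, completeness, V-Measure)."""
    h_truth, h_pred = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - conditional_entropy(truth, pred) / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - conditional_entropy(pred, truth) / h_pred
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v


# Over-segmentation: every cluster is pure (homogeneity 1.0), but each
# structural group is split in two (completeness 0.5).
print(v_measure(truth=[0, 0, 1, 1], pred=[0, 1, 2, 3]))
```

Homogeneity penalizes clusters that mix different structural realisations, while completeness penalizes splitting one realisation across several clusters, so reporting both shows which kind of error dominates.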

Reproducibility

The paper provides a link to the implementation in the Python library mpteval, which is a positive aspect for reproducibility. However, the details regarding the parameter settings and the specific configurations used in the experiments could be more explicitly stated to facilitate replication. Additionally, providing a sample dataset or a more detailed description of the data preprocessing steps would further enhance reproducibility.

Limitations

One limitation is that the method relies heavily on the quality of the transcriptions, which can vary significantly due to the nature of automatic music transcription. The paper acknowledges this but does not explore potential solutions or mitigations for low-quality transcriptions. Furthermore, the focus on classical music may limit the generalizability of the approach to other genres or forms of music, which could be a point of consideration for future work.

Broader Impact

The approach has significant implications for the field of music performance analysis, particularly in automating the evaluation of large-scale datasets that lack ground-truth scores. This can lead to more efficient curation and maintenance of music collections, enabling researchers to focus on higher-level analyses rather than manual quality control. The method could also inspire further research into score-agnostic evaluation techniques across various musical genres and applications.

Analysis: Full Paper • Full text: 10,374 characters

Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

Yang Xiao, Siyi Wang, Eun-Jung Holden ... · arXiv

Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centered perspective, and introduce a new taxonomy that organizes CL according to how underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and finally outline a set of open challenges and future research directions.

Institutional Affiliations

Primary: University of Melbourne

All Institutions: University of Melbourne

GitHub

ML Relevance Analysis (83)

This paper makes a meaningful contribution by proposing a representation-centric approach to continual learning in speech and audio, addressing the unique challenges posed by the dynamic nature of acoustic environments. The framework established in this work has the potential to guide future research and development in the field, although empirical validation and implementation details are needed to fully realize its impact.

Comprehensive Analysis

Methodology Assessment

The paper presents a novel representation-centric taxonomy for continual learning (CL) in speech and audio, addressing the unique challenges posed by the non-stationary nature of acoustic environments. The authors effectively categorize CL scenarios based on representational evolution, which is a significant advancement over traditional task-based taxonomies. The methodology is well-structured, clearly articulating the need for preserving representational geometry in modern speech systems, and it proposes a comprehensive framework for understanding the interaction between representation dynamics and adaptation mechanisms.

Experimental Evaluation

While the paper does not present empirical experiments or quantitative results, it offers a thorough analysis of existing CL methods and their limitations in the context of speech and audio. The authors identify gaps in current methodologies and suggest future research directions, which is valuable for guiding subsequent empirical studies. The lack of experimental validation is a notable gap, as it limits the ability to assess the practical effectiveness of the proposed taxonomy.

Reproducibility

The paper does not provide specific implementation details or datasets, which could hinder reproducibility. However, it does reference existing methods and frameworks, suggesting that future work could build upon these established techniques. The inclusion of a GitHub repository for related resources is a positive step towards facilitating reproducibility.

Limitations

A key limitation of the paper is the absence of experimental validation, which makes it difficult to assess the practical applicability of the proposed taxonomy. Additionally, while the authors identify several open problems, they do not provide concrete solutions or methodologies to address these challenges, leaving a gap for future exploration.

Broader Impact

The implications of this work are significant for the fields of speech processing and continual learning. By reframing CL in the context of speech and audio, the authors highlight the need for new strategies that accommodate the complexities of acoustic representations. This work could influence the development of more robust and adaptable speech systems, with applications in areas such as automatic speech recognition, speaker verification, and emotion recognition.

Analysis: Full Paper • Full text: 20,539 characters
