Recent advancements in large multimodal models (LMMs) have shown strong capabilities in audio understanding. However, most systems rely solely on end-to-end reasoning, limiting interpretability and accuracy for tasks that require structured knowledge or specialized signal analysis. In this work, we present Audio-Maestro -- a tool-augmented audio reasoning framework that enables audio-language models to autonomously call external tools and integrate their timestamped outputs into the reasoning process. This design allows the model to analyze, transform, and interpret audio signals through specialized tools rather than relying solely on end-to-end inference. Experiments show that Audio-Maestro consistently improves general audio reasoning performance: Gemini-2.5-flash's average accuracy on MMAU-Test rises from 67.4% to 72.1%, DeSTA-2.5 from 58.3% to 62.8%, and GPT-4o from 60.8% to 63.9%. To our knowledge, Audio-Maestro is the first framework to integrate structured tool output into the large audio language model reasoning process.
Primary: National Taiwan University
All Institutions: National Taiwan University, ASUS Open Cloud Infrastructure Software Center
Audio-Maestro introduces a novel framework for tool-augmented reasoning in audio-language models, significantly enhancing their interpretability and accuracy. The comprehensive evaluation of its methodology, experimental results, and broader implications underscores its potential to advance the field of audio understanding.
The methodology presented in Audio-Maestro is innovative, focusing on tool-augmented reasoning in the audio domain. The two-phase design allows the model to decide when to invoke external tools, which is a significant advancement over traditional end-to-end models. The integration of structured outputs from tools into the reasoning process is well-articulated, and the decision-making mechanism is grounded in both semantic understanding and acoustic cues, enhancing interpretability.
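To make the described two-phase, tool-augmented design concrete, below is a minimal sketch of such a loop: the model first decides whether a tool is needed, the tool's timestamped output is then folded back into the reasoning prompt. All function names, the event format, and the decision heuristic are hypothetical illustrations, not the Audio-Maestro implementation.

```python
# Minimal sketch of a tool-augmented audio reasoning loop (hypothetical names,
# not the authors' implementation). Phase 1 decides whether to call a tool;
# phase 2 reasons over the tool's timestamped output.

from dataclasses import dataclass

@dataclass
class ToolResult:
    tool: str
    events: list  # e.g. [{"start": 1.2, "end": 2.4, "label": "dog_bark"}]

def decide_tool(question: str) -> str | None:
    """Stand-in for the model's tool-selection step (phase 1)."""
    if "when" in question.lower() or "timestamp" in question.lower():
        return "sound_event_detector"
    return None

def run_tool(tool: str, audio_path: str) -> ToolResult:
    """Placeholder for invoking an external analyzer on the audio file."""
    return ToolResult(tool, events=[{"start": 1.2, "end": 2.4, "label": "dog_bark"}])

def build_prompt(question: str, result: ToolResult | None) -> str:
    """Phase 2: reasoning prompt that cites the timestamped tool output."""
    prompt = f"Question: {question}\n"
    if result is not None:
        lines = [f"- {e['label']} from {e['start']:.1f}s to {e['end']:.1f}s" for e in result.events]
        prompt += f"Tool `{result.tool}` reported:\n" + "\n".join(lines) + "\n"
    return prompt + "Answer using the evidence above."

tool = decide_tool("When does the dog bark?")
evidence = run_tool(tool, "clip.wav") if tool else None
print(build_prompt("When does the dog bark?", evidence))
```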
The experiments are robust, utilizing the MMAU benchmark to evaluate the framework's performance across multiple models. The results clearly demonstrate improvements in accuracy when using the tool-augmented approach, with detailed comparisons against baseline models. The analysis of error cases provides valuable insights into the limitations of the current approach, particularly regarding tool output errors.
The paper provides a GitHub repository for the complete codebase, which is essential for reproducibility. However, the details regarding the implementation of the tools and the exact experimental setup could be more thoroughly documented to facilitate easier replication of the results.
The paper acknowledges two main limitations: increased inference time due to tool integration and the dependency on the accuracy of external tools. These factors could hinder real-time applications and indicate areas for future improvement, particularly in enhancing tool robustness.
The implications of this work are significant for various applications in audio processing, including speech recognition, music analysis, and environmental sound understanding. By bridging high-level reasoning with low-level acoustic analysis, Audio-Maestro opens avenues for more interpretable and accurate audio-language models, potentially impacting fields such as human-computer interaction and automated audio analysis.
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link.
Primary: University of Seoul
All Institutions: University of Seoul, Korea Advanced Institute of Science and Technology
The main contribution of this paper is the introduction of Diffusion-Link, a novel diffusion-based module that successfully bridges the audio-text modality gap, leading to significant improvements in Automatic Audio Captioning tasks. This work represents a meaningful advancement in multimodal machine learning, showcasing the potential of diffusion models in enhancing the integration of different data modalities.
The methodology presented in the paper is innovative, introducing Diffusion-Link as a lightweight diffusion-based module that effectively bridges the audio-text modality gap. The use of residual MLP blocks for this purpose is a clever design choice that balances complexity and performance. The authors provide a clear explanation of how the module operates, leveraging generative mapping of audio embeddings to text distributions. However, the paper could benefit from a more detailed discussion on the training process and hyperparameter tuning, which are crucial for replicating the results.
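For readers who want a concrete picture of the bridging network, here is a minimal sketch of a module built from three residual MLP blocks, as described in the abstract. The layer widths, normalization choice, and the omission of diffusion timestep conditioning and the noise schedule are all assumptions; this is not the released Diffusion-Link code.

```python
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    """One residual MLP block; widths and norm choice are assumptions."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, x):
        return x + self.net(x)

class BridgingNetwork(nn.Module):
    """Three residual MLP blocks mapping audio embeddings toward the text space."""
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.blocks = nn.Sequential(*[ResidualMLPBlock(dim, hidden) for _ in range(3)])

    def forward(self, audio_emb):
        return self.blocks(audio_emb)

bridge = BridgingNetwork()
mapped = bridge(torch.randn(4, 512))  # batch of frozen-encoder audio embeddings
print(mapped.shape)  # torch.Size([4, 512])
```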
The experimental evaluation is robust, with the authors conducting a thorough modality-gap analysis and demonstrating significant improvements in Automatic Audio Captioning (AAC) tasks. The results show that Diffusion-Link outperforms previous methods, achieving state-of-the-art performance on the AudioCaps dataset in both zero-shot and fully supervised settings. The reported relative gains are substantial, indicating the effectiveness of the proposed method. Nevertheless, the paper could enhance its credibility by including additional datasets or benchmarks to validate the generalizability of the results.
The authors mention that code will be released upon acceptance, which is a positive step toward ensuring reproducibility. However, the paper lacks detailed implementation specifics, such as the exact architecture of the multimodal encoder used and the training configurations. Providing these details would greatly aid other researchers in replicating the study and building upon the work.
One limitation of the study is the reliance on a single dataset (AudioCaps) for evaluating the performance of Diffusion-Link. While the results are impressive, they may not fully represent the method's effectiveness across diverse audio-text tasks. Additionally, the paper does not address potential scalability issues or computational costs associated with deploying the proposed model in real-world applications.
The potential applications of this research are significant, particularly in fields such as multimedia content generation, accessibility technologies, and human-computer interaction. By effectively bridging the audio-text modality gap, the proposed method could enhance the capabilities of multimodal systems, leading to more intuitive and responsive applications. The findings suggest a promising direction for future research in multimodal representation learning.
Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio representations: representation quality is jointly influenced by variables such as audio length, embedding dimensionality, model depth, model architecture, and data volume, many of which are difficult to isolate or express analytically. In this work, we present a systematic study of scaling laws for general audio representations by utilizing embedding effective rank (RankMe) as a unifying metric that encapsulates the impact of diverse variables on representation quality. RankMe enables a label-free, information-theoretic quantification of audio embeddings, allowing us to examine scaling behaviors across a wide hyper-parameter space, including model size, training data volume, computational budget, and architectural configurations. Our empirical findings reveal a consistent power-law relationship between RankMe and representation quality, suggesting that embedding effective rank serves as a reliable proxy for assessing and predicting model performance in audio representation learning. This work not only validates the applicability of classical scaling principles to the general audio domain but also offers a theoretically grounded and empirically robust framework for guiding future model scaling strategies in audio foundation models.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology
The paper presents a novel approach to understanding scaling laws in audio representation learning through the lens of embedding effective rank. This contribution is significant as it not only validates classical scaling principles in a new domain but also provides a framework for future research to optimize model performance without requiring labeled data.
The methodology presented in the paper is robust, employing the concept of embedding effective rank (RankMe) as a unifying metric for analyzing the scaling laws in audio representation learning. The authors systematically investigate how various hyper-parameters influence representation quality, which is a significant advancement in the field. The use of a masked autoencoding self-supervised learning framework is appropriate and aligns well with current trends in audio representation. The paper effectively integrates theoretical foundations with empirical analysis, demonstrating a clear power-law relationship between RankMe and representation quality.
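As a point of reference, embedding effective rank (RankMe) is conventionally defined as the exponential of the entropy of the normalized singular-value distribution of the embedding matrix. The sketch below follows that standard definition; the epsilon value and sample size are assumptions, and the paper's exact computation may differ in detail.

```python
import numpy as np

def rankme(embeddings: np.ndarray, eps: float = 1e-7) -> float:
    """Effective rank of an (n_samples, dim) embedding matrix:
    exp of the entropy of the normalized singular-value distribution."""
    s = np.linalg.svd(embeddings, compute_uv=False)
    p = s / (s.sum() + eps) + eps
    return float(np.exp(-(p * np.log(p)).sum()))

# Embeddings with rapidly decaying spectrum have a low effective rank.
z = np.random.randn(1024, 256) @ np.diag(np.linspace(1.0, 0.01, 256))
print(f"RankMe ~ {rankme(z):.1f} (out of at most {min(z.shape)})")
```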
The experiments are well-designed, utilizing a large-scale dataset (approximately 100 million audio clips) and a variety of model architectures. The authors conduct extensive evaluations across different hyper-parameter settings, providing a comprehensive analysis of the impacts of data volume, model size, and computational budget on audio representation quality. The results are clearly presented, with figures illustrating the relationships between RankMe and various performance metrics, which strengthens the validity of their findings.
The paper provides sufficient implementation details, including model architectures, training procedures, and hyper-parameter settings, which are crucial for reproducibility. However, the absence of a publicly available code repository or demo limits the ability for others to directly replicate the experiments. Including a link to a GitHub repository or similar would enhance reproducibility significantly.
One limitation of the study is that while RankMe serves as a unifying metric, it may still be a coarse-grained indicator of performance, as noted by the authors. Additionally, the findings are based on empirical observations, and the theoretical underpinnings of the scaling laws could be further explored. The paper also does not address potential biases in the datasets used, which could affect the generalizability of the results.
The implications of this research are significant for the field of audio representation learning, as it provides a systematic framework for understanding how various factors influence model performance. This work could guide future research in scaling audio models and optimizing their architectures, potentially leading to advancements in applications such as speech recognition, music analysis, and environmental sound classification.
Audio-visual speech enhancement (AVSE) has been found to be particularly useful at low signal-to-noise ratios (SNRs) due to the immunity of the visual features to acoustic noise. However, a significant gap exists in AVSE methods tailored to enhance spatial audio under low-SNR conditions, which is of growing interest for augmented reality applications. To address this gap, we present a multi-channel AVSE framework based on VisualVoice that leverages spatial cues from microphone arrays and visual information for enhancing the target speaker in noisy environments. We also introduce MAVe, a novel database containing multi-channel audio-visual signals in controlled, reproducible room conditions across a wide range of SNR levels. Experiments demonstrate that the proposed method consistently achieves significant gains in SI-SDR, STOI, and PESQ, particularly at low SNRs. Binaural signal analysis further confirms the preservation of spatial cues and intelligibility.
Primary: Ben-Gurion University of the Negev
All Institutions: Ben-Gurion University of the Negev, Deutsche Forschungsgemeinschaft (DFG)
The paper presents a novel multi-channel AVSE framework that effectively enhances speech intelligibility in low-SNR conditions by leveraging visual cues. This contribution is significant as it addresses a critical gap in the field, with potential applications in immersive media and augmented reality.
The proposed methodology introduces a multi-channel audio-visual speech enhancement (AVSE) framework that effectively integrates visual cues from a target speaker's lip movements with spatial audio captured by microphone arrays. The architecture builds upon the existing VisualVoice framework, adapting it to maintain spatial integrity while processing each channel independently. This approach is innovative as it addresses a significant gap in existing literature regarding low-SNR conditions for spatial audio, thus providing a novel contribution to the field.
The experimental setup is robust, utilizing the newly introduced MAVe database, which is well-designed to simulate realistic low-SNR environments. The performance metrics (SI-SDR, STOI, PESQ) demonstrate significant improvements over baseline models, particularly in challenging conditions. The results are comprehensive, showcasing the effectiveness of the proposed method across various SNR levels, and the inclusion of binaural signal analysis adds depth to the evaluation.
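For context on the headline metric, SI-SDR (scale-invariant signal-to-distortion ratio) projects the estimate onto the reference and measures the residual energy in dB. The sketch below is a standard reference implementation of that formula, not the authors' evaluation code.

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB between a 1-D estimate and reference waveform."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference            # scaled projection onto the reference
    noise = estimate - target             # everything not explained by the reference
    return 10.0 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

ref = np.random.randn(16000)
est = ref + 0.1 * np.random.randn(16000)
print(f"SI-SDR: {si_sdr(est, ref):.1f} dB")
```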
While the paper outlines the methodology and experimental setup in detail, it lacks specific implementation details or a publicly accessible code repository, which could hinder reproducibility. The authors mention that the MAVe database will be made publicly available, which is a positive step towards enabling further research.
The paper does not address potential limitations of the proposed approach, such as the dependency on high-quality visual data and the challenges of generalizing the model to different environments or speaker variations. Additionally, the absence of a demo or project URL limits the accessibility of the research for practical applications.
The research has significant implications for augmented reality and virtual reality applications, where enhancing speech intelligibility in noisy environments is crucial. By improving AVSE methods, this work could lead to better communication experiences in immersive media, benefiting fields such as telecommunication, gaming, and assistive technologies.
Effective human-AI collaboration on complex reasoning tasks requires that users understand and interact with the model's process, not just receive an output. However, the monolithic text from methods like Chain-of-Thought (CoT) prevents this, as current interfaces lack real-time verbalization and robust user barge-in. We present AsyncVoice Agent, a system whose asynchronous architecture decouples a streaming LLM backend from a conversational voice frontend. This design allows narration and inference to run in parallel, empowering users to interrupt, query, and steer the model's reasoning process at any time. Objective benchmarks show this approach reduces interaction latency by more than 600x compared to monolithic baselines while ensuring high fidelity and competitive task accuracy. By enabling a two-way dialogue with a model's thought process, AsyncVoice Agent offers a new paradigm for building more effective, steerable, and trustworthy human-AI systems for high-stakes tasks.
Primary: unknown
All Institutions: unknown
The AsyncVoice Agent introduces a novel approach to real-time interaction with LLMs, transforming passive consumption of AI reasoning into an active, collaborative dialogue. This innovative architecture and its implications for user engagement and trust in AI systems mark a significant advancement in the field of human-AI interaction.
The methodology presented in AsyncVoice Agent is innovative, focusing on an asynchronous architecture that separates the reasoning backend from the voice interface. This design allows for real-time interaction, enabling users to interrupt and steer the model's reasoning process. The use of a modular Model Context Protocol (MCP) for backend reasoning and a multi-threaded speech processing pipeline demonstrates a thoughtful approach to enhancing user experience. However, the paper could benefit from clearer descriptions of the integration of components and the specific algorithms used in the reasoning and speech processing.
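To illustrate the decoupled, barge-in-capable design, here is a minimal asyncio sketch in which narration consumes reasoning chunks concurrently with inference and halts on an interruption event. The structure and names are hypothetical and are not drawn from the AsyncVoice Agent codebase.

```python
import asyncio

async def reasoning_stream():
    """Stand-in for a streaming LLM backend emitting reasoning chunks."""
    for i in range(5):
        await asyncio.sleep(0.2)  # simulated token latency
        yield f"step {i}: ..."

async def narrate(chunks: asyncio.Queue, barge_in: asyncio.Event):
    """Voice frontend: speaks queued chunks, stops as soon as the user barges in."""
    while not barge_in.is_set():
        chunk = await chunks.get()
        if chunk is None:
            break
        print(f"[TTS] {chunk}")  # replace with a real TTS call

async def main():
    chunks: asyncio.Queue = asyncio.Queue()
    barge_in = asyncio.Event()
    speaker = asyncio.create_task(narrate(chunks, barge_in))

    async for chunk in reasoning_stream():
        await chunks.put(chunk)        # narration and inference run in parallel
        if chunk.startswith("step 3"):
            barge_in.set()             # simulated user interruption
            break
    await chunks.put(None)             # unblock the narrator if it is waiting
    await speaker

asyncio.run(main())
```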
The experimental evaluation is robust, comparing the AsyncVoice Agent against two well-defined baselines. The metrics used—responsiveness, reasoning quality, and process fidelity—are appropriate for assessing the system's performance. The results show significant improvements in latency, which is a critical factor for real-time applications. However, the paper could provide more detailed statistical analysis of the results and discuss how variations in task complexity might affect performance.
The paper mentions an automated evaluation framework and provides a GitHub link to the foundational RealtimeVoiceChat project, which aids in reproducibility. However, it lacks detailed instructions on how to replicate the experiments, such as specific configurations or datasets used, which could hinder full reproducibility.
The paper acknowledges limitations such as TTS prosody and unidirectional reasoning flow but frames these as engineering challenges rather than fundamental issues. Additionally, the dependency on a single-pass reasoning process may restrict the depth of explanations provided, which could impact user understanding in complex scenarios.
The AsyncVoice Agent has the potential to significantly enhance human-AI collaboration, particularly in high-stakes environments where understanding the reasoning process is crucial. By facilitating real-time dialogue and interaction, it could improve trust and usability in AI systems across various applications, including education, healthcare, and customer service.
Unmanned Aerial Vehicles (UAVs), or drones, are increasingly used in search and rescue missions to detect human presence. Existing systems primarily leverage vision-based methods, which are prone to fail under low visibility or occlusion. Drone-based audio perception offers promise but suffers from extreme ego-noise that masks sounds indicating human presence. Existing datasets are either limited in diversity or synthetic, lacking real acoustic interactions, and there are no standardized setups for drone audition. To this end, we present DroneAudioset, a comprehensive drone audition dataset (publicly available at https://huggingface.co/datasets/ahlab-drone-project/DroneAudioSet/ under the MIT license) featuring 23.5 hours of annotated recordings, covering a wide range of signal-to-noise ratios (SNRs) from -57.2 dB to -2.5 dB, across various drone types, throttles, microphone configurations, and environments. The dataset enables development and systematic evaluation of noise suppression and classification methods for human-presence detection under challenging conditions, while also informing practical design considerations for drone audition systems, such as microphone placement trade-offs and drone noise-aware audio processing. This dataset is an important step towards enabling the design and deployment of drone audition systems.
Primary: Unknown
All Institutions: Unknown
The paper presents DroneAudioset, a comprehensive audio dataset tailored for drone-based search and rescue, addressing critical gaps in existing datasets. The technical contribution is significant, as it lays the groundwork for advancing audio processing techniques in challenging environments, although further experimental validation is needed to fully realize its potential impact.
The methodology is centered around the creation of a novel audio dataset specifically designed for drone-based search and rescue operations. The authors have taken into account various factors such as signal-to-noise ratios, drone types, and environmental conditions, which are crucial for real-world applications. However, while the dataset is comprehensive, the paper could benefit from a more detailed explanation of the data collection process and the specific algorithms used for annotation and classification.
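For orientation, the dB SNR figures quoted for the dataset follow the standard power-ratio definition. The sketch below computes it under the assumption that separate signal and ego-noise waveforms are available; the paper's exact annotation protocol is not specified here, so this is illustrative only.

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray, eps: float = 1e-12) -> float:
    """Signal-to-noise ratio in dB from separate signal and noise waveforms."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10((p_signal + eps) / (p_noise + eps))

speech = 0.01 * np.random.randn(48000)    # faint human-presence sound
ego_noise = 1.0 * np.random.randn(48000)  # dominant drone ego-noise
print(f"SNR ~ {snr_db(speech, ego_noise):.1f} dB")  # strongly negative, as in the dataset
```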
The evaluation section discusses the potential applications of the dataset in developing noise suppression and classification methods, but it lacks concrete experimental results demonstrating the effectiveness of these methods. Including preliminary results or benchmarks would strengthen the paper significantly.
The dataset is publicly available, which is a positive aspect for reproducibility. However, the paper does not provide enough detail regarding the experimental setup or the algorithms used for processing the audio data, which may hinder full reproducibility of results.
One limitation is the potential bias in the dataset due to the specific environments and conditions under which the audio was recorded. Additionally, the dataset may not cover all possible scenarios encountered in search and rescue missions, limiting its generalizability.
The development of the DroneAudioset has significant implications for improving drone-based search and rescue operations, particularly in challenging conditions where visual methods fail. This could enhance emergency response efforts and save lives by enabling more effective detection of human presence.
The performance of deep learning models for music source separation heavily depends on training data quality. However, datasets are often corrupted by difficult-to-detect artifacts such as audio bleeding and label noise. Since the type and extent of contamination are typically unknown, cleaning methods targeting specific corruptions are often impractical. This paper proposes and evaluates two distinct, noise-agnostic data cleaning methods to address this challenge. The first approach uses data attribution via unlearning to identify and filter out training samples that contribute the least to producing clean outputs. The second leverages the Fréchet Audio Distance to measure and remove samples that are perceptually dissimilar to a small and trusted clean reference set. On a dataset contaminated with a simulated distribution of real-world noise, our unlearning-based methods produced a cleaned dataset and a corresponding model that outperforms both the original contaminated data and the small clean reference set used for cleaning. This result closes approximately 66.7% of the performance gap between the contaminated baseline and a model trained on the same dataset without any contamination. Unlike methods tailored for specific artifacts, our noise-agnostic approaches offer a more generic and broadly applicable solution for curating high-quality training data.
Primary: unknown
All Institutions: unknown
The paper presents a significant advancement in blind data cleaning for music source separation, proposing two innovative methods that enhance model performance by effectively addressing data quality issues. The comprehensive evaluation of these methods demonstrates their potential impact on the field and beyond.
The paper introduces two innovative noise-agnostic data cleaning methods for music source separation: an unlearning-based approach and a distributional metric-based approach using Fréchet Audio Distance. The unlearning method is particularly noteworthy as it flips the traditional influence estimation, focusing on the impact of clean samples to filter out noisy ones. This methodological shift is a significant contribution to the field, offering a more generalized solution to data cleaning that can be applied across various domains. The use of distributional metrics to assess perceptual dissimilarity adds robustness to the approach, allowing for effective cleaning without prior knowledge of specific corruptions.
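As background on the second method, the Fréchet Audio Distance between two embedding sets is the Fréchet distance between Gaussians fitted to them. The sketch below computes that standard quantity, assuming per-sample embeddings (e.g., from an audio encoder such as VGGish) are already available; ranking candidate samples against a trusted clean reference set then reduces to comparing such distances. This is not the authors' filtering pipeline.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

clean_ref = np.random.randn(200, 128)        # trusted clean reference embeddings
candidate = np.random.randn(200, 128) + 0.5  # embeddings of a candidate training subset
print(f"FAD ~ {frechet_audio_distance(candidate, clean_ref):.2f}")
```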
The experiments are well-structured, utilizing a semi-synthetic dataset to simulate real-world noise and contamination. The results demonstrate that the proposed methods outperform both the contaminated baseline and a small clean reference set, effectively closing the performance gap with a model trained on entirely clean data. The comparative analysis with baseline methods, including a classifier-based approach, highlights the advantages of the proposed noise-agnostic methods in terms of generalizability and robustness.
The paper provides sufficient detail on the methodologies and experimental setup, including the model architecture, training processes, and evaluation metrics. However, the lack of specific URLs for code or datasets limits the ease of reproducibility. Future work could benefit from making the code and datasets publicly available to enhance reproducibility.
The study acknowledges limitations in determining the optimal filtering ratio, which requires retraining and could be automated in future work. Additionally, the methods were only tested on a specific model architecture (Open-Unmix), which may limit their applicability to more complex models. The performance on unseen audio effects also indicates that while the methods are robust, they may not generalize perfectly across all types of noise.
The proposed methods have significant implications for improving data quality in machine learning applications beyond music source separation, such as speech enhancement and sound event detection. By providing a more generic solution to data cleaning, this research could facilitate the development of more robust models in various audio processing tasks.
Text-to-audio (TTA) is rapidly advancing, with broad potential in virtual reality, accessibility, and creative media. However, evaluating TTA quality remains difficult: human ratings are costly and limited, while existing objective metrics capture only partial aspects of perceptual quality. To address this gap, we introduce AudioEval, the first large-scale TTA evaluation dataset, containing 4,200 audio samples from 24 systems with 126,000 ratings across five perceptual dimensions, annotated by both experts and non-experts. Based on this resource, we propose Qwen-DisQA, a multimodal scoring model that jointly processes text prompts and generated audio to predict human-like quality ratings. Experiments show its effectiveness in providing reliable and scalable evaluation. The dataset will be made publicly available to accelerate future research.
Primary: Nankai University
All Institutions: Nankai University
The main contribution of this paper is the introduction of AudioEval, the first large-scale dataset for evaluating text-to-audio generation, along with the Qwen-DisQA model that predicts human-like quality ratings across multiple dimensions. This work represents a substantial step forward in addressing the challenges of evaluating TTA systems, with the potential to influence future research and applications in the field.
The methodology presented in the paper is robust, introducing the AudioEval dataset as a pioneering resource for TTA evaluation, which is a significant advancement in the field. The dual-perspective annotation by both experts and non-experts enhances the dataset's reliability and applicability. The Qwen-DisQA model effectively integrates multimodal inputs, addressing the complexities of audio generation and evaluation. The approach of predicting a distribution of ratings rather than a single score is innovative, allowing for a more nuanced understanding of quality.
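To make the distribution-prediction idea concrete, here is a minimal head that maps a fused text+audio feature to a softmax over discrete rating bins and reads out an expected score. The feature dimension, bin count, and architecture are assumptions, not Qwen-DisQA's actual design.

```python
import torch
import torch.nn as nn

class RatingDistributionHead(nn.Module):
    """Maps a fused text+audio feature to a distribution over rating bins (1..5)."""
    def __init__(self, feat_dim: int = 768, num_bins: int = 5):
        super().__init__()
        self.proj = nn.Linear(feat_dim, num_bins)
        self.register_buffer("bin_values", torch.arange(1, num_bins + 1, dtype=torch.float))

    def forward(self, fused_feat: torch.Tensor):
        probs = self.proj(fused_feat).softmax(dim=-1)      # predicted rating distribution
        expected = (probs * self.bin_values).sum(dim=-1)   # human-like mean opinion score
        return probs, expected

head = RatingDistributionHead()
probs, mos = head(torch.randn(2, 768))
print(probs.shape, mos)
```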
The experiments are well-structured, with a clear delineation of training, validation, and testing protocols. The comparative analysis against existing models demonstrates the effectiveness of Qwen-DisQA, showing significant improvements in correlation and reliability. The use of multiple evaluation metrics, including MSE and PCC, provides a comprehensive assessment of model performance. However, the paper could benefit from a more detailed discussion of the specific datasets used for comparison.
While the paper outlines the training configuration and evaluation metrics, it lacks detailed implementation specifics such as hyperparameter settings, code availability, or links to the dataset. This omission could hinder reproducibility for other researchers looking to validate or build upon the findings.
One limitation is the reliance on subjective ratings, which, despite being mitigated by dual perspectives, can still introduce variability. Additionally, the dataset's size and diversity, while substantial, may not encompass all possible audio generation scenarios, potentially limiting the model's generalizability. The paper does not address the potential biases in expert versus non-expert ratings.
The work has significant implications for various fields, including virtual reality, accessibility, and creative media, where TTA systems are increasingly relevant. By providing a reliable evaluation framework, the research can accelerate advancements in TTA technology, fostering innovation and enhancing user experiences across applications.
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. Relevant resources will be open-sourced.
Primary: Nankai University
All Institutions: Nankai University, Microsoft Research Asia
The paper presents a comprehensive framework for interpretable and generalizable speech quality evaluation using LLMs, addressing key challenges in the field and demonstrating strong experimental results across multiple tasks and languages.
The paper introduces a novel approach called SpeechLLM-as-Judges, which leverages large language models (LLMs) for structured and interpretable speech quality evaluation. The methodology includes the development of a large-scale dataset (SpeechEval) with diverse tasks and languages, and the creation of SQ-LLM, a speech-quality-aware LLM trained using chain-of-thought reasoning and reward optimization. This dual-stage training enhances the model's interpretability and generalization capabilities across various evaluation tasks, a significant improvement over traditional scalar scoring methods.
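As a purely illustrative example of what a structured, explanation-based judging setup can look like for the pairwise-comparison task, here is a hypothetical prompt builder. The wording and fields are not the paper's template.

```python
def build_pairwise_prompt(desc_a: str, desc_b: str, dimension: str) -> str:
    """Hypothetical structured judging prompt for pairwise speech-quality comparison."""
    return (
        f"You are a speech quality judge. Compare clip A and clip B on {dimension}.\n"
        f"Clip A observations: {desc_a}\n"
        f"Clip B observations: {desc_b}\n"
        "First reason step by step about artifacts, prosody, and intelligibility, "
        "then answer with exactly one of 'A', 'B', or 'tie', followed by a one-sentence justification."
    )

print(build_pairwise_prompt("clear but robotic prosody", "natural but slightly noisy", "naturalness"))
```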
The experimental setup is robust, featuring comprehensive evaluations across multiple tasks, including quality assessment, comparison, improvement suggestion, and deepfake detection. The results demonstrate that SQ-LLM outperforms existing models in terms of accuracy and interpretability, with strong correlations to human judgments. The inclusion of multilingual data and diverse evaluation metrics adds to the rigor of the experiments, showcasing the model's versatility.
The paper provides detailed implementation details, including dataset construction, model architecture, and training protocols. However, the absence of a publicly accessible code repository limits full reproducibility. The authors mention that relevant resources will be open-sourced, which is promising but not yet realized.
The study acknowledges limitations, such as the focus on only four languages and a fixed set of tasks. Future work could expand to include more languages and additional evaluation scenarios, like emotional expressiveness. The model's performance in certain languages is also noted to be less robust, indicating areas for improvement.
This research has significant implications for the field of speech technology, particularly in enhancing the quality evaluation of synthetic speech. The ability to provide interpretable and actionable feedback could advance the development of more reliable speech generation systems, impacting applications in voice assistants, conversational AI, and content creation.
Self-supervised models such as WavLM have demonstrated strong performance for neural speaker diarization. However, these models are typically pre-trained on single-channel recordings, limiting their effectiveness in multi-channel scenarios. Existing diarization systems built on these models often rely on DOVER-Lap to combine outputs from individual channels. Although effective, this approach incurs substantial computational overhead and fails to fully exploit spatial information. In this work, building on DiariZen, a pipeline that combines WavLM-based local end-to-end neural diarization with speaker embedding clustering, we introduce a lightweight approach to make pre-trained WavLM spatially aware by inserting channel communication modules into the early layers. Our method is agnostic to both the number of microphone channels and array topologies, ensuring broad applicability. We further propose to fuse multi-channel speaker embeddings by leveraging spatial attention weights. Evaluations on five public datasets show consistent improvements over single-channel baselines and demonstrate superior performance and efficiency compared with DOVER-Lap. Our source code is publicly available at https://github.com/BUTSpeechFIT/DiariZen.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to multi-channel speaker diarization by enhancing WavLM with spatially aware mechanisms, significantly improving performance and efficiency. The methodology is well-structured and the experimental results are promising, suggesting a meaningful contribution to the field of audio processing and machine learning.
The methodology presented in this paper introduces a novel approach to enhance the WavLM model for multi-channel speaker diarization by incorporating channel communication modules. This allows the model to leverage spatial information effectively, which is a significant advancement over traditional methods that typically process single-channel inputs. The use of spatial attention weights for fusing speaker embeddings is also a noteworthy innovation that enhances the model's performance without requiring additional training. The approach is well-structured, and the authors provide a clear explanation of their modifications to the existing DiariZen framework, making it accessible for replication and further research.
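Below is a minimal sketch of attention-weighted fusion of per-channel speaker embeddings, illustrating why the approach is agnostic to the number of microphones. The scoring MLP and dimensions are assumptions; the released DiariZen code should be consulted for the actual fusion.

```python
import torch
import torch.nn as nn

class SpatialAttentionFusion(nn.Module):
    """Fuses per-channel speaker embeddings with attention weights over channels.
    Works for any number of channels; the scoring MLP is an assumption."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, 128), nn.Tanh(), nn.Linear(128, 1))

    def forward(self, channel_embs: torch.Tensor) -> torch.Tensor:
        # channel_embs: (num_channels, num_speakers, emb_dim)
        weights = self.score(channel_embs).softmax(dim=0)   # attention over channels
        return (weights * channel_embs).sum(dim=0)          # (num_speakers, emb_dim)

fusion = SpatialAttentionFusion()
fused = fusion(torch.randn(8, 3, 256))  # 8 microphones, 3 speakers
print(fused.shape)  # torch.Size([3, 256])
```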
The experimental evaluation is robust, utilizing five diverse public datasets to validate the proposed method. The results show consistent improvements over single-channel baselines and outperform the DOVER-Lap method, which is a common approach in the field. The paper includes a comprehensive analysis of performance metrics, particularly the diarization error rate (DER), and provides insights into the efficiency of the proposed methods compared to existing approaches. However, the paper could benefit from more detailed comparisons with state-of-the-art methods beyond DOVER-Lap to contextualize its contributions further.
The paper provides a link to the source code, which is a positive aspect for reproducibility. The authors describe their experimental setup and configurations, which aids in understanding how to replicate their results. However, the lack of detailed descriptions of the datasets and specific hyperparameters used could hinder full reproducibility for other researchers.
One limitation of the study is the reliance on five specific datasets, which may not cover the full spectrum of real-world scenarios encountered in multi-channel speaker diarization. Additionally, while the proposed method shows improvements in efficiency, it may still be computationally intensive compared to simpler models, which could limit its adoption in resource-constrained environments. The paper also does not address the potential impact of varying microphone configurations on performance, which could be significant in practical applications.
The proposed method has significant implications for real-world applications in areas such as meeting transcription, video conferencing, and any scenario where speaker diarization is critical. By improving the efficiency and accuracy of multi-channel diarization systems, this work could enhance communication technologies and accessibility tools, making them more effective in diverse environments.
Music is both an auditory and an embodied phenomenon, closely linked to human motion and naturally expressed through dance. However, most existing audio representations neglect this embodied dimension, limiting their ability to capture rhythmic and structural cues that drive movement. We propose MotionBeat, a framework for motion-aligned music representation learning. MotionBeat is trained with two newly proposed objectives: the Embodied Contrastive Loss (ECL), an enhanced InfoNCE formulation with tempo-aware and beat-jitter negatives to achieve fine-grained rhythmic discrimination, and the Structural Rhythm Alignment Loss (SRAL), which ensures rhythm consistency by aligning music accents with corresponding motion events. Architecturally, MotionBeat introduces bar-equivariant phase rotations to capture cyclic rhythmic patterns and contact-guided attention to emphasize motion events synchronized with musical accents. Experiments show that MotionBeat outperforms state-of-the-art audio encoders in music-to-dance generation and transfers effectively to beat tracking, music tagging, genre and instrument classification, emotion recognition, and audio-visual retrieval. Our project demo page: https://motionbeat2025.github.io/.
Primary: The University of Sydney
All Institutions: The University of Sydney
MotionBeat presents a novel framework for motion-aligned music representation learning that integrates embodied aspects of music with advanced contrastive learning techniques. This work significantly contributes to the field by addressing the gap between audio representations and human motion, paving the way for more intuitive and synchronized interactions in music-related applications.
The methodology presented in MotionBeat is innovative, introducing two novel loss functions—Embodied Contrastive Loss (ECL) and Structural Rhythm Alignment Loss (SRAL)—that specifically target the alignment of music and motion. The architectural innovations, such as bar-equivariant phase rotations and contact-guided attention, are well-conceived to capture the cyclic nature of rhythm and enhance the model's ability to focus on motion events. The approach effectively combines contrastive learning with structural alignment, which is a significant advancement over traditional methods that do not consider the embodied aspect of music.
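To ground the discussion of ECL, the sketch below shows the general form of an InfoNCE loss in which extra hard negatives (e.g., tempo-shifted or beat-jittered motion embeddings) are appended to the in-batch negatives for each anchor. It follows the standard contrastive formulation, not the paper's exact ECL.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(music: torch.Tensor,
                                         motion: torch.Tensor,
                                         hard_neg: torch.Tensor,
                                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE over matched (music, motion) pairs, with K extra hard negatives
    (e.g. tempo-shifted / beat-jittered motion) appended per anchor."""
    music = F.normalize(music, dim=-1)            # (B, D)
    motion = F.normalize(motion, dim=-1)          # (B, D)
    hard_neg = F.normalize(hard_neg, dim=-1)      # (B, K, D)

    in_batch = music @ motion.t()                                 # (B, B): diagonal = positives
    hard = torch.einsum("bd,bkd->bk", music, hard_neg)            # (B, K): hard-negative logits
    logits = torch.cat([in_batch, hard], dim=1) / temperature
    targets = torch.arange(music.size(0), device=music.device)
    return F.cross_entropy(logits, targets)

loss = contrastive_loss_with_hard_negatives(torch.randn(16, 128),
                                            torch.randn(16, 128),
                                            torch.randn(16, 4, 128))
print(loss.item())
```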
The experimental evaluation is comprehensive, covering a wide range of tasks including music-to-dance generation, beat tracking, music tagging, genre classification, instrument classification, emotion recognition, and audio-visual retrieval. The results demonstrate that MotionBeat consistently outperforms state-of-the-art audio encoders, providing strong evidence of its effectiveness. The use of ablation studies to analyze the contributions of ECL and SRAL further strengthens the findings, showcasing the importance of each component in achieving superior performance.
The paper provides sufficient implementation details, including architecture specifications, training protocols, and datasets used, which are crucial for reproducibility. However, the absence of a public code repository limits the ability of other researchers to directly replicate the results, which is a common expectation in contemporary ML research.
One limitation is the reliance on a specific dataset (AIST++), which may affect the generalizability of the findings to other types of music or motion data. Additionally, while the model shows strong performance in various tasks, the paper does not extensively discuss potential challenges or failures in scenarios where music and motion may not align as expected.
The implications of this work extend beyond academic research; it has potential applications in fields such as interactive entertainment, dance choreography, and rehabilitation, where understanding the relationship between music and movement can enhance user experiences. The framework could also inspire further research into embodied learning in other multimodal contexts.
Aligning pretrained audio encoders and Large Language Models (LLMs) offers a promising, parameter-efficient path to building powerful multimodal agents. However, existing methods often require costly full-model finetuning or rely on static adapters that may lack expressive power. Drawing inspiration from the Platonic Representation Hypothesis, we introduce SteerMoE, a novel and modular framework for audio-language alignment. SteerMoE freezes both the audio encoder and the LLM decoder, training only a lightweight steering module integrated within the encoder's layers. This module uses a Mixture-of-Experts (MoE) router to dynamically select and apply learned steering vectors, progressively transforming continuous audio representations into a space comprehensible to the LLM. By operating entirely in the continuous embedding space, our approach requires no modifications to the LLM's vocabulary and preserves its advanced reasoning and agentic capabilities. We demonstrate through experiments on ASR, audio understanding, and a qualitative function-calling task that SteerMoE achieves strong performance while remaining highly modular and computationally efficient, offering a robust new paradigm for developing sophisticated audio-language systems.
Primary: The University of Hong Kong
All Institutions: The University of Hong Kong
The main contribution of this paper is the introduction of SteerMoE, a novel framework that efficiently aligns audio and language representations using a dynamic steering module, demonstrating strong performance across various tasks while preserving the capabilities of pretrained models. This work represents a meaningful advancement in the field of multimodal AI, providing a robust and flexible approach to audio-language integration.
The methodology presented in SteerMoE is innovative, leveraging a Mixture-of-Experts (MoE) steering module to dynamically adjust audio representations within a frozen audio encoder. This approach allows for parameter-efficient audio-language alignment without the need for extensive finetuning or complex architectural modifications, which is a significant advancement over existing methods. The use of a shared router for expert steering vectors enhances the expressiveness of the model while maintaining modularity, making it adaptable to various pretrained components. The integration of continuous prompting further streamlines the interaction between audio and language modalities.
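The sketch below illustrates the core idea of a MoE steering module: a router produces gates over a bank of learned steering vectors, and their weighted sum is added to the frozen encoder's hidden states. The expert count, top-k gating, and dimensions are assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class MoESteering(nn.Module):
    """Adds a router-weighted mix of learned steering vectors to frozen
    audio hidden states. Expert count and top-k are assumptions."""
    def __init__(self, hidden_dim: int = 1024, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(hidden_dim, num_experts)
        self.steering_vectors = nn.Parameter(torch.zeros(num_experts, hidden_dim))
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, time, hidden_dim) from a frozen encoder layer
        scores = self.router(hidden)                                   # (B, T, E)
        topk = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter_(-1, topk.indices, topk.values.softmax(dim=-1))
        steer = gates @ self.steering_vectors                          # (B, T, hidden_dim)
        return hidden + steer                                          # encoder weights stay frozen

module = MoESteering()
print(module(torch.randn(2, 50, 1024)).shape)  # torch.Size([2, 50, 1024])
```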
The experiments conducted on standard benchmarks such as LibriSpeech and AISHELL-2 for ASR, as well as Clotho-AQA for audio understanding, demonstrate the effectiveness of the SteerMoE framework. The results indicate competitive performance compared to existing models, validating the proposed method's ability to align audio and language representations effectively. The qualitative analysis of the model's agentic capabilities adds depth to the evaluation, showcasing its potential in real-world applications.
The paper provides sufficient implementation details, including the choice of pretrained models, training parameters, and data preprocessing steps. However, the absence of a publicly available code repository limits the reproducibility of the results. Future work should consider releasing the code to facilitate further research and validation of the findings.
The paper acknowledges several limitations, including the model's performance in noisy environments and the constraints imposed by the data scale and sequence length during training. Additionally, while the qualitative experiments demonstrate the model's capabilities, further testing on more complex interactions is necessary to fully assess its robustness and versatility.
The SteerMoE framework has significant implications for the development of multimodal AI systems, particularly in enhancing human-computer interaction through audio-language alignment. Its modular design allows for easy integration with existing models, potentially accelerating advancements in applications such as speech recognition, audio understanding, and interactive agents.
Recent advances in unified multimodal models indicate a clear trend towards comprehensive content generation. However, the auditory domain remains a significant challenge, with music and speech often developed in isolation, hindering progress towards universal audio synthesis. This separation stems from inherent task conflicts and severe data imbalances, which impede the development of a truly unified audio generation model. To address this challenge, we propose UniMoE-Audio, a unified speech and music generation model within a novel Dynamic-Capacity Mixture-of-Experts (MoE) framework. Architecturally, UniMoE-Audio introduces a Top-P routing strategy for dynamic expert number allocation, and a hybrid expert design comprising routed experts for domain-specific knowledge, shared experts for domain-agnostic features, and null experts for adaptive computation skipping. To tackle data imbalance, we introduce a three-stage training curriculum: 1) Independent Specialist Training leverages original datasets to instill domain-specific knowledge into each "proto-expert" without interference; 2) MoE Integration and Warmup incorporates these specialists into the UniMoE-Audio architecture, warming up the gate module and shared expert using a subset of balanced dataset; and 3) Synergistic Joint Training trains the entire model end-to-end on the fully balanced dataset, fostering enhanced cross-domain synergy. Extensive experiments show that UniMoE-Audio not only achieves state-of-the-art performance on major speech and music generation benchmarks, but also demonstrates superior synergistic learning, mitigating the performance degradation typically seen in naive joint training. Our findings highlight the substantial potential of specialized MoE architecture and curated training strategies in advancing the field of universal audio generation. Homepage: https://mukioxun.github.io/Uni-MoE-site/home.html
Primary: Harbin Institute of Technology
All Institutions: Harbin Institute of Technology, Shenzhen Loop Area Institute
The main contribution of this paper is the introduction of UniMoE-Audio, a novel framework that effectively integrates speech and music generation through a Dynamic-Capacity Mixture-of-Experts architecture, demonstrating state-of-the-art performance and addressing critical challenges in the field. The technical contributions, particularly the innovative routing strategy and training curriculum, position this work as a significant advancement in universal audio generation.
The proposed methodology of UniMoE-Audio is innovative, utilizing a Dynamic-Capacity Mixture-of-Experts (MoE) framework that effectively addresses the challenges of data imbalance and task conflicts in audio generation. The introduction of a Top-P routing strategy for dynamic expert allocation is particularly noteworthy, as it allows the model to adaptively select the number of experts based on the input, enhancing efficiency and performance. The three-stage training curriculum is well-structured, ensuring that domain-specific knowledge is effectively integrated into the model while also promoting cross-domain synergy. However, the complexity of the architecture may pose challenges in terms of implementation and understanding.
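To illustrate the Top-P routing idea, the sketch below keeps, per token, the smallest set of experts whose cumulative router probability exceeds p and renormalizes the kept gates. Null and shared experts are omitted, and the exact normalization in UniMoE-Audio may differ; this is an assumption-laden illustration.

```python
import torch

def top_p_route(router_logits: torch.Tensor, p: float = 0.8) -> torch.Tensor:
    """Per token, select the smallest expert set whose cumulative probability
    exceeds p, then renormalize the kept gates. (Null/shared experts omitted.)"""
    probs = router_logits.softmax(dim=-1)                      # (tokens, experts)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    keep = (cum - sorted_p) < p                                # always keeps the top expert
    gates_sorted = sorted_p * keep
    gates = torch.zeros_like(probs).scatter(-1, sorted_idx, gates_sorted)
    return gates / gates.sum(dim=-1, keepdim=True)

gates = top_p_route(torch.randn(4, 6), p=0.8)
print((gates > 0).sum(dim=-1))  # number of activated experts varies per token
```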
The experiments conducted demonstrate a thorough evaluation of the model against major benchmarks in speech and music generation. The results indicate that UniMoE-Audio achieves state-of-the-art performance, showcasing its effectiveness in both domains. The paper provides sufficient details on the datasets used and the evaluation metrics, which strengthens the credibility of the findings. However, it would benefit from additional comparative analysis with other recent models to further contextualize its performance.
The paper lacks detailed implementation specifics, such as hyperparameter settings and training configurations, which are crucial for reproducibility. While the methodology is described, the absence of a publicly available code repository limits the ability of other researchers to replicate the results. The provided demo URL offers some insight into the model's capabilities, but a comprehensive code release would significantly enhance reproducibility.
One limitation of the proposed model is its complexity, which may hinder practical deployment in real-world applications. Additionally, while the model addresses data imbalance, the effectiveness of the three-stage training curriculum in various real-world scenarios remains to be fully validated. The reliance on a balanced dataset in the latter training stages may not always be feasible, and the performance in highly imbalanced scenarios could be a concern.
The potential applications of UniMoE-Audio are significant, spanning various fields such as entertainment, education, and accessibility. The ability to generate high-quality speech and music in a unified framework could lead to advancements in audio synthesis technologies, enhancing user experiences in multimedia applications. Furthermore, the insights gained from this research could inform future studies on multimodal learning and audio generation.
This paper introduces a new paradigm for generative error correction (GER) framework in audio-visual speech recognition (AVSR) that reasons over modality-specific evidences directly in the language space. Our framework, DualHyp, empowers a large language model (LLM) to compose independent N-best hypotheses from separate automatic speech recognition (ASR) and visual speech recognition (VSR) models. To maximize the effectiveness of DualHyp, we further introduce RelPrompt, a noise-aware guidance mechanism that provides modality-grounded prompts to the LLM. RelPrompt offers the temporal reliability of each modality stream, guiding the model to dynamically switch its focus between ASR and VSR hypotheses for an accurate correction. Under various corruption scenarios, our framework attains up to 57.7% error rate gain on the LRS2 benchmark over standard ASR baseline, contrary to single-stream GER approaches that achieve only 10% gain. To facilitate research within our DualHyp framework, we release the code and the dataset comprising ASR and VSR hypotheses at https://github.com/sungnyun/dualhyp.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the DualHyp framework for audio-visual speech error correction, which significantly outperforms traditional single-stream approaches. This work represents a meaningful advancement in the field of audio-visual speech recognition, combining innovative methodologies with practical implications for future research and applications.
The proposed DualHyp framework innovatively integrates modality-specific evidence from both ASR and VSR systems to enhance error correction in audio-visual speech recognition. By leveraging a large language model to generate independent hypotheses and employing the RelPrompt mechanism for noise-aware guidance, the methodology showcases a thoughtful approach to addressing the challenges of modality dependency and error correction. However, the complexity of the architecture and the reliance on high-quality upstream models may limit its practical applicability.
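As a rough illustration of the dual-hypothesis idea, the following sketch serializes ASR and VSR N-best lists together with per-segment reliability scores into a single correction prompt; the prompt wording, segment granularity, and score format are assumptions, not the authors' RelPrompt implementation.

```python
# Schematic construction of a DualHyp-style correction prompt (hypothetical format).
from typing import List

def build_dualhyp_prompt(asr_nbest: List[str], vsr_nbest: List[str],
                         asr_rel: List[float], vsr_rel: List[float]) -> str:
    lines = ["Correct the transcript using both hypothesis streams."]
    lines.append("ASR hypotheses (audio stream):")
    lines += [f"  A{i + 1}: {h}" for i, h in enumerate(asr_nbest)]
    lines.append("VSR hypotheses (visual stream):")
    lines += [f"  V{i + 1}: {h}" for i, h in enumerate(vsr_nbest)]
    # RelPrompt-style guidance: temporal reliability of each modality stream,
    # telling the model which hypotheses to trust at each segment.
    rel = ", ".join(f"seg{t}: audio={a:.2f}, visual={v:.2f}"
                    for t, (a, v) in enumerate(zip(asr_rel, vsr_rel)))
    lines.append(f"Temporal reliability: {rel}")
    lines.append("Corrected transcript:")
    return "\n".join(lines)

print(build_dualhyp_prompt(
    asr_nbest=["the whether is nice", "the weather is ice"],
    vsr_nbest=["the weather is nice", "the feather is nice"],
    asr_rel=[0.3, 0.9], vsr_rel=[0.8, 0.4]))
```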
The experiments conducted on the LRS2 benchmark demonstrate a significant improvement in error rates compared to standard ASR baselines, achieving a 57.7% error rate gain. The evaluation under various corruption scenarios provides a robust assessment of the framework's effectiveness. However, the paper could benefit from a more detailed analysis of the experimental setup, including the specific metrics used and comparisons with other state-of-the-art methods.
The authors have made the code and dataset publicly available, which is a positive step towards reproducibility. However, the paper lacks detailed implementation instructions and hyperparameter settings, which may hinder other researchers from replicating the results easily.
The paper acknowledges two primary limitations: the dependency on the quality of the upstream speech recognition models and the computational latency introduced by the multi-module structure. These limitations highlight challenges in real-time applications and adaptation to multilingual contexts, which could restrict the framework's broader applicability.
The proposed framework has significant potential applications in improving speech recognition systems, particularly in scenarios where audio and visual cues can be leveraged for better accuracy. Its implications extend to assistive technologies, communication aids, and enhancing user experiences in various interactive systems. However, the identified limitations may pose challenges in scaling the approach to diverse languages and real-time environments. The main contribution of this paper is the introduction of the DualHyp framework for audio-visual speech error correction, which significantly outperforms traditional single-stream approaches. This work represents a meaningful advancement in the field of audio-visual speech recognition, combining innovative methodologies with practical implications for future research and applications.
Recent advances in diffusion-based generative models have enabled high-quality text-to-audio synthesis, but fine-grained acoustic control remains a significant challenge in open-source research. We present Audio Palette, a diffusion transformer (DiT)-based model that extends the Stable Audio Open architecture to address this "control gap" in controllable audio generation. Unlike prior approaches that rely solely on semantic conditioning, Audio Palette introduces four time-varying control signals (loudness, pitch, spectral centroid, and timbre) for precise and interpretable manipulation of acoustic features. The model is efficiently adapted for the nuanced domain of Foley synthesis using Low-Rank Adaptation (LoRA) on a curated subset of AudioSet, requiring only 0.85 percent of the original parameters to be trained. Experiments demonstrate that Audio Palette achieves fine-grained, interpretable control of sound attributes. Crucially, it accomplishes this novel controllability while maintaining high audio quality and strong semantic alignment to text prompts, with performance on standard metrics such as Fréchet Audio Distance (FAD) and LAION-CLAP scores remaining comparable to the original baseline model. We provide a scalable, modular pipeline for audio research, emphasizing sequence-based conditioning, memory efficiency, and a three-scale classifier-free guidance mechanism for nuanced inference-time control. This work establishes a robust foundation for controllable sound design and performative audio synthesis in open-source settings, enabling a more artist-centric workflow.
Primary: New York University
All Institutions: New York University
Audio Palette introduces a diffusion transformer-based model for controllable audio generation, significantly advancing the state of the art in Foley synthesis. The combination of multi-signal conditioning and efficient fine-tuning strategies positions this work as a valuable resource for both academic research and practical applications in sound design.
The methodology presented in Audio Palette is robust, leveraging a diffusion transformer architecture with a novel multi-signal conditioning framework that integrates four time-varying acoustic control signals. The use of Low-Rank Adaptation (LoRA) for fine-tuning enhances efficiency, allowing the model to adapt to the specific domain of Foley synthesis with minimal computational overhead. The integration of a three-scale classifier-free guidance mechanism for controlling different aspects of audio generation is particularly innovative, providing users with a nuanced interface for sound design.
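To illustrate what a three-scale guidance rule can look like, the sketch below combines an unconditional, a text-conditioned, and a fully conditioned (text plus control curves) denoiser output with two independent guidance weights; the exact decomposition and weights used by Audio Palette are not reproduced here and should be treated as assumptions.

```python
# Hypothetical multi-scale classifier-free guidance over denoiser outputs.
import torch

def guided_noise(eps_uncond, eps_text, eps_full, w_text=4.0, w_ctrl=2.0):
    """eps_full is conditioned on text plus the time-varying controls
    (loudness, pitch, spectral centroid, timbre); eps_text on text only."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)   # push toward the text prompt
            + w_ctrl * (eps_full - eps_text))    # push toward the control curves

eps_u, eps_t, eps_f = torch.randn(3, 1, 64, 256)  # toy denoiser outputs
print(guided_noise(eps_u, eps_t, eps_f).shape)
```

Separating the text and control terms is what gives the user independent knobs over semantic adherence and acoustic controllability at inference time.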
The experiments conducted are thorough, utilizing a well-curated dataset from AudioSet specifically focused on Foley sounds. The evaluation metrics, including Fréchet Audio Distance (FAD) and LAION-CLAP scores, are appropriate for assessing audio quality and semantic alignment. The quantitative results indicate that while there is a slight trade-off in audio quality and semantic adherence, the model achieves significant gains in controllability, which is the primary goal of the research.
The paper provides sufficient implementation details, including the architecture, dataset, and training procedures, which would allow other researchers to reproduce the results. However, the absence of a publicly available demo or project URL limits immediate accessibility for practical application and experimentation by the community.
The model's reliance on reference audio for control signal extraction is a notable limitation, as it restricts the model's ability to generate audio purely from text descriptions. Additionally, the potential for artifacts with extreme guidance values suggests that user tuning is necessary, which could complicate the user experience.
The work has significant implications for the fields of sound design and audio synthesis, particularly in professional settings such as film and game production. By bridging the gap between traditional Foley artistry and modern machine learning techniques, Audio Palette enables a more artist-centric workflow that could enhance creative processes in audio production. Audio Palette introduces a diffusion transformer-based model for controllable audio generation, significantly advancing the state of the art in Foley synthesis. The combination of multi-signal conditioning and efficient fine-tuning strategies positions this work as a valuable resource for both academic research and practical applications in sound design.
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.
Primary: ByteDance Inc
All Institutions: ByteDance Inc
The paper introduces DiSTAR, a zero-shot TTS framework that effectively combines autoregressive and diffusion modeling in a discrete code space, achieving state-of-the-art performance while addressing critical issues in previous approaches. The innovative methodology and comprehensive experimental validation position this work as a significant contribution to the field of speech synthesis.
The paper presents a novel approach to zero-shot text-to-speech (TTS) synthesis by integrating an autoregressive (AR) language model with a masked diffusion model, operating entirely within a discrete residual vector quantization (RVQ) code space. This method effectively addresses the limitations of previous continuous-latent models, such as exposure bias and robustness under distribution shifts. The architecture is designed to allow for block-level parallelism and intra-frame depth modeling, which enhances the quality and coherence of the generated speech. The use of RVQ enables explicit control during inference, a significant improvement over traditional methods that rely on continuous representations.
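The generation loop can be summarized schematically as "draft a block autoregressively, then infill it in parallel"; in the sketch below both model calls are stubs, and the block size, RVQ depth, and vocabulary are illustrative assumptions.

```python
# Schematic blockwise AR-draft + masked-diffusion-infill loop (stub models).
import torch

VOCAB, DEPTH, BLOCK = 1024, 4, 16     # RVQ codebook size, #layers, tokens/block

def ar_draft(context: torch.Tensor) -> torch.Tensor:
    """Stub AR language model: drafts the coarse-layer codes of the next block."""
    return torch.randint(0, VOCAB, (BLOCK,))

def masked_infill(draft: torch.Tensor) -> torch.Tensor:
    """Stub masked diffusion model: fills all RVQ layers of the block in
    parallel, conditioned on the AR draft."""
    block = torch.randint(0, VOCAB, (DEPTH, BLOCK))
    block[0] = draft                           # keep the drafted coarse layer
    return block

def synthesize(num_blocks: int = 4) -> torch.Tensor:
    codes, context = [], torch.empty(0, dtype=torch.long)
    for _ in range(num_blocks):
        draft = ar_draft(context)              # sequential across blocks
        block = masked_infill(draft)           # parallel within a block
        codes.append(block)
        context = torch.cat([context, block[0]])
    return torch.cat(codes, dim=-1)            # (DEPTH, num_blocks * BLOCK)

print(synthesize().shape)
```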
The authors conducted extensive experiments on standard zero-shot TTS benchmarks, demonstrating that their model outperforms existing state-of-the-art systems in terms of robustness, naturalness, and speaker/style consistency. The evaluation metrics included both objective measures (like Word Error Rate and speaker similarity) and subjective assessments (mean opinion scores), providing a comprehensive view of the model's performance. The results indicate a strong scaling behavior with model capacity, suggesting that the proposed method is not only effective but also scalable.
The paper includes detailed implementation specifics, including the architecture, training procedures, and optimization techniques used. It specifies the use of 64 NVIDIA A100 GPUs and provides insights into the training dataset and evaluation metrics. However, the absence of a publicly available code repository limits the reproducibility of the results, as other researchers may find it challenging to replicate the experiments without access to the code.
The study acknowledges that its results are based solely on an English corpus of approximately 50k hours, which may limit the generalizability of the findings to multilingual or multi-style settings. Additionally, while the model shows promise in robustness and quality, the potential for misuse in generating impersonations and disinformation is a significant ethical concern that the authors briefly address.
The advancements in high-fidelity zero-shot TTS have significant implications for accessibility, education, and creative industries. However, the ability to closely mimic speaker timbre raises ethical issues, including risks of impersonation and misuse. The authors advocate for responsible deployment practices, including consent-first policies and audio watermarking to mitigate potential harms. The paper introduces DiSTAR, a zero-shot TTS framework that effectively combines autoregressive and diffusion modeling in a discrete code space, achieving state-of-the-art performance while addressing critical issues in previous approaches. The innovative methodology and comprehensive experimental validation position this work as a significant contribution to the field of speech synthesis.
Recent attempts to interleave autoregressive (AR) sketchers with diffusion-based refiners over continuous speech representations have shown promise, but they remain brittle under distribution shift and offer limited levers for controllability. We introduce DISTAR, a zero-shot text-to-speech framework that operates entirely in a discrete residual vector quantization (RVQ) code space and tightly couples an AR language model with a masked diffusion model, without forced alignment or a duration predictor. Concretely, DISTAR drafts block-level RVQ tokens with an AR language model and then performs parallel masked-diffusion infilling conditioned on the draft to complete the next block, yielding long-form synthesis with blockwise parallelism while mitigating classic AR exposure bias. The discrete code space affords explicit control at inference: DISTAR produces high-quality audio under both greedy and sample-based decoding using classifier-free guidance, supports trade-offs between robustness and diversity, and enables variable bit-rate and controllable computation via RVQ layer pruning at test time. Extensive experiments and ablations demonstrate that DISTAR surpasses state-of-the-art zero-shot TTS systems in robustness, naturalness, and speaker/style consistency, while maintaining rich output diversity. Audio samples are provided on https://anonymous.4open.science/w/DiSTAR_demo.
Primary: ByteDance Inc
All Institutions: ByteDance Inc
The paper presents DiSTAR, a zero-shot TTS framework that innovatively integrates autoregressive and diffusion models in a discrete code space, achieving state-of-the-art performance while addressing significant challenges in the field. The comprehensive evaluation and methodological rigor position this work as a noteworthy contribution to the advancement of speech synthesis technologies.
The paper introduces DiSTAR, a novel zero-shot text-to-speech (TTS) framework that innovatively combines an autoregressive (AR) language model with a masked diffusion model in a discrete residual vector quantization (RVQ) code space. This approach addresses challenges such as exposure bias and the need for explicit duration predictors, which are common in traditional TTS systems. The methodology is well-structured, leveraging a patch-wise factorization strategy that allows for block-level parallelism and efficient synthesis. The use of classifier-free guidance and RVQ layer pruning for controllability further enhances the model's flexibility and robustness.
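One of the controllability levers mentioned here, test-time RVQ layer pruning, amounts to decoding fewer residual layers; the sketch below illustrates the idea with placeholder depths and codes rather than the paper's configuration.

```python
# Illustrative test-time RVQ layer pruning for variable bit-rate decoding.
import torch

def prune_rvq(codes: torch.Tensor, keep_layers: int) -> torch.Tensor:
    """codes: (num_rvq_layers, time). Keeping fewer layers lowers bit-rate and
    decoding cost at some cost in fidelity."""
    return codes[:keep_layers]

codes = torch.randint(0, 1024, (8, 200))     # 8 RVQ layers, 200 frames
for k in (8, 4, 2):
    print(k, prune_rvq(codes, k).shape)
```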
The experiments are comprehensive, comparing DiSTAR against state-of-the-art zero-shot TTS systems across multiple benchmarks. The results demonstrate significant improvements in robustness, naturalness, and speaker/style consistency, with detailed metrics provided for both objective (WER, SIM, UTMOS) and subjective (CMOS, SMOS) evaluations. The ablation studies effectively highlight the contributions of various components, reinforcing the robustness of the proposed architecture.
The paper provides detailed implementation specifics, including training configurations, model architecture, and optimization strategies. However, the absence of a public code repository may hinder full reproducibility, as external researchers would need to replicate the training environment and dataset preparation independently.
The study acknowledges limitations in its evaluation scope, particularly the reliance on a single language corpus (English) and the potential need for further validation in multilingual and multi-style contexts. Additionally, while the model shows promise, its performance may vary based on the RVQ depth and codebook design, which could limit its generalizability.
The advancements in zero-shot TTS have substantial implications for accessibility, education, and creative industries. However, the ability to closely mimic speaker timbre raises ethical concerns regarding impersonation and misuse. The authors suggest responsible deployment strategies, including consent-first policies and audio watermarking, to mitigate potential risks associated with the technology. The paper presents DiSTAR, a zero-shot TTS framework that innovatively integrates autoregressive and diffusion models in a discrete code space, achieving state-of-the-art performance while addressing significant challenges in the field. The comprehensive evaluation and methodological rigor position this work as a noteworthy contribution to the advancement of speech synthesis technologies.
Unified architectures in multimodal large language models (MLLM) have shown promise in handling diverse tasks within a single framework. In the text-to-speech (TTS) task, current MLLM-based approaches rely on discrete token representations, which disregard the inherently continuous nature of speech and can lead to loss of fine-grained acoustic information. In this work, we investigate TTS within the MLLM paradigm using continuous speech representations. We design a dual-head architecture and implement two complementary training strategies for a robust model. (1) A diffusion head generating continuous speech representations is added on the MLLM, which operates at the frame level and is strictly autoregressive. (2) The original language model head is retained to preserve multitask capability and to control the start and end of speech synthesis. (3) Masked training is employed to address exposure bias in autoregressive decoding. (4) To stabilize optimization, we propose a two-stage scheme where the LM is frozen in the second stage, ensuring the diffusion head learns from a fixed input distribution. Evaluations on LibriSpeech (PC) test-clean show that our approach achieves state-of-the-art autoregressive performance, with a WER of 1.95%, speaker similarity of 0.54, and UTMOS of 4.00. The two-stage training yields a 46% relative WER reduction over the one-stage training baseline. These results highlight the effectiveness of combining autoregressive modeling with continuous-token diffusion, supported by a two-stage training procedure.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach to TTS by integrating continuous-token diffusion into an autoregressive framework, achieving state-of-the-art results while addressing key challenges in the field. The contributions are significant, with implications for future research and applications in multimodal language models.
The paper introduces a dual-head architecture that integrates a continuous-token diffusion head with an autoregressive language model (LM) for text-to-speech (TTS) applications. The methodology is innovative as it addresses the limitations of discrete token representations in TTS by utilizing continuous speech representations, which are inherently more aligned with the nature of speech. The proposed two-stage training strategy is a significant contribution, as it stabilizes the optimization process and improves the model's performance. The use of masked training to mitigate exposure bias is also a noteworthy enhancement to the training methodology.
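A minimal sketch of the two-stage scheme is shown below: stage 1 updates both the language model and the diffusion head, while stage 2 freezes the LM so the head learns from a fixed hidden-state distribution. The module sizes are placeholders, and a plain MSE stands in for the actual diffusion objective.

```python
# Two-stage optimization sketch: freeze the LM in stage 2 (hypothetical sizes).
import torch
import torch.nn as nn

lm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2)
diffusion_head = nn.Sequential(nn.Linear(256, 256), nn.GELU(),
                               nn.Linear(256, 80))      # continuous frame output

def make_optimizer(stage: int) -> torch.optim.Optimizer:
    if stage == 1:
        params = list(lm.parameters()) + list(diffusion_head.parameters())
    else:                                   # stage 2: LM frozen
        for p in lm.parameters():
            p.requires_grad_(False)
        params = list(diffusion_head.parameters())
    return torch.optim.AdamW(params, lr=1e-4)

opt = make_optimizer(stage=2)
x = torch.randn(2, 50, 256)                 # (batch, frames, hidden)
target = torch.randn(2, 50, 80)
# MSE is a stand-in for the diffusion head's actual training objective.
loss = nn.functional.mse_loss(diffusion_head(lm(x)), target)
loss.backward()
opt.step()
print(float(loss))
```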
The experimental evaluation is robust, utilizing the LibriSpeech dataset for testing, which is a standard benchmark in the TTS domain. The reported results demonstrate state-of-the-art performance with a WER of 1.95%, showcasing the effectiveness of the proposed approach. The paper includes a thorough comparison with existing methods, highlighting the advantages of the proposed model in terms of both performance and efficiency, given its relatively smaller parameter count.
The paper provides detailed implementation information, including the architecture, training configurations, and evaluation metrics. However, the lack of a publicly available code repository or demo URL limits the reproducibility of the results. The methodology is described comprehensively, which aids in understanding the approach, but actual implementation details would be necessary for full reproducibility.
One limitation is the absence of a direct comparison with more recent models that may have emerged after the paper's submission. Additionally, while the two-stage training approach is effective, it may increase the complexity of the training process, which could be a barrier for practical applications. The paper also does not discuss the computational resources required for training, which could be significant given the model's architecture.
The integration of continuous-token diffusion in TTS has the potential to significantly enhance the naturalness and fidelity of synthesized speech, making it applicable in various domains such as virtual assistants, audiobooks, and accessibility technologies. The methodology could pave the way for more unified multimodal models that can handle diverse tasks, thus impacting the broader field of artificial intelligence and human-computer interaction. The paper presents a novel approach to TTS by integrating continuous-token diffusion into an autoregressive framework, achieving state-of-the-art results while addressing key challenges in the field. The contributions are significant, with implications for future research and applications in multimodal language models.
This paper presents the Deep learning-based Perceptual Audio Quality metric (DeePAQ) for evaluating general audio quality. Our approach leverages metric learning together with the music foundation model MERT, guided by surrogate labels, to construct an embedding space that captures distortion intensity in general audio. To the best of our knowledge, DeePAQ is the first in the general audio quality domain to leverage weakly supervised labels and metric learning for fine-tuning a music foundation model with Low-Rank Adaptation (LoRA), a direction not yet explored by other state-of-the-art methods. We benchmark the proposed model against state-of-the-art objective audio quality metrics across listening tests spanning audio coding and source separation. Results show that our method surpasses existing metrics in detecting coding artifacts and generalizes well to unseen distortions such as source separation, highlighting its robustness and versatility.
Primary: unknown
All Institutions: unknown
DeePAQ introduces a novel perceptual audio quality metric that leverages weakly supervised learning and metric learning to adapt a music foundation model for audio quality assessment. The innovative methodology, robust experimental validation, and potential applications highlight its significance in advancing the field of audio quality evaluation.
The methodology presented in DeePAQ is innovative, leveraging weakly supervised learning and metric learning to adapt a music foundation model (MERT) for perceptual audio quality assessment. The use of Low-Rank Adaptation (LoRA) to fine-tune the model is particularly noteworthy, as it allows for effective adaptation with a limited number of parameters. The approach of using surrogate labels for training is a clever solution to the scarcity of subjective ratings in the audio domain, although the potential noise introduced by these labels is acknowledged. The RnC loss function is a significant contribution that enables the model to learn a quality-related embedding space effectively. However, the paper could benefit from a more thorough exploration of the limitations of the surrogate labels and their impact on the model's performance.
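Because the LoRA adaptation is central to the efficiency claim, a generic sketch of the technique is given below: the pretrained layer stays frozen and only a low-rank update is trained. This is not MERT or the authors' code, and the rank and scaling values are arbitrary.

```python
# Generic LoRA wrapper around a frozen pretrained linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # frozen pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")
```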
The experimental evaluation is robust, with a comprehensive benchmarking against state-of-the-art audio quality metrics across multiple listening tests. The results demonstrate that DeePAQ consistently outperforms existing metrics, particularly in detecting coding artifacts and generalizing to unseen distortions. The use of diverse datasets and listening tests enhances the credibility of the findings. However, the paper could provide more detailed statistical analysis of the results to further substantiate the claims of superiority over existing methods.
The paper provides a detailed description of the training setup, including the dataset composition, training parameters, and evaluation metrics, which aids in reproducibility. However, the lack of a publicly available implementation or code repository limits the ability for others to replicate the results independently. Including a link to a GitHub repository or similar would significantly enhance reproducibility.
The primary limitation identified is the reliance on surrogate labels, which may introduce noise and affect the model's performance. Additionally, while the model shows strong performance in audio coding tasks, its effectiveness in source separation is less pronounced, indicating potential areas for improvement. The paper also does not address the scalability of the approach to larger datasets or different audio contexts.
The development of DeePAQ has significant implications for the field of audio quality assessment, particularly in applications where subjective evaluation is impractical. The ability to effectively evaluate audio quality using a computational metric can enhance various domains, including music streaming, audio coding, and telecommunications. The approach could also pave the way for further research into the use of foundation models in other audio-related tasks. DeePAQ introduces a novel perceptual audio quality metric that leverages weakly supervised learning and metric learning to adapt a music foundation model for audio quality assessment. The innovative methodology, robust experimental validation, and potential applications highlight its significance in advancing the field of audio quality evaluation.
Deepfake speech attribution remains challenging for existing solutions. Classifier-based solutions often fail to generalize to domain-shifted samples, and watermarking-based solutions are easily compromised by distortions like codec compression or malicious removal attacks. To address these issues, we propose FakeMark, a novel watermarking framework that injects artifact-correlated watermarks associated with deepfake systems rather than pre-assigned bitstring messages. This design allows a detector to attribute the source system by leveraging both injected watermark and intrinsic deepfake artifacts, remaining effective even if one of these cues is elusive or removed. Experimental results show that FakeMark improves generalization to cross-dataset samples where classifier-based solutions struggle and maintains high accuracy under various distortions where conventional watermarking-based solutions fail.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of FakeMark, a novel watermarking framework that enhances deepfake speech attribution by leveraging artifact-correlated watermarks, significantly improving robustness and generalization in challenging scenarios. The comprehensive analysis of the technical contributions, methodology, and significance to the field underscores the potential of this approach to advance the state of deepfake detection and attribution.
The methodology presented in FakeMark is innovative, utilizing a dual approach of watermarking correlated with deepfake artifacts to enhance attribution robustness. The integration of both watermarking and artifact detection is a significant advancement over traditional classifier-based approaches, which struggle with domain shifts. The detailed description of the pipeline, including the generator and detector architectures, is well-articulated, showcasing a thoughtful design that balances robustness and perceptual quality. The use of multiple training objectives to align watermark embeddings with artifact patterns is particularly noteworthy.
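The joint objective can be pictured as balancing watermark imperceptibility against attribution accuracy; the sketch below uses stub generator and detector modules and an arbitrary loss weighting, so it should be read as a schematic of the training signal rather than the paper's pipeline.

```python
# Schematic joint generator/detector objective for artifact-correlated watermarking.
import torch
import torch.nn as nn

N_SYSTEMS = 10                                   # hypothetical deepfake systems

generator = nn.Conv1d(1, 1, kernel_size=9, padding=4)        # stub watermarker
detector = nn.Sequential(nn.Conv1d(1, 16, 9, padding=4), nn.ReLU(),
                         nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                         nn.Linear(16, N_SYSTEMS))            # stub attributor

audio = torch.randn(4, 1, 16000)                 # deepfake speech, 1 s @ 16 kHz
system_id = torch.randint(0, N_SYSTEMS, (4,))    # source system labels

watermarked = audio + 0.01 * generator(audio)    # small additive watermark
fidelity = nn.functional.l1_loss(watermarked, audio)          # keep it inaudible
attribution = nn.functional.cross_entropy(detector(watermarked), system_id)
loss = attribution + 10.0 * fidelity             # weighting is an assumption
loss.backward()
print(float(attribution), float(fidelity))
```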
The experimental evaluation is comprehensive, employing both in-domain and cross-dataset assessments to validate the effectiveness of FakeMark against various distortions and removal attacks. The results convincingly demonstrate that FakeMark outperforms existing methods, particularly in challenging scenarios, indicating a strong generalization capability. The use of diverse datasets and a thorough evaluation of attribution accuracy and audio quality metrics adds robustness to the findings.
While the paper provides a detailed description of the methodologies and experimental setups, the absence of specific implementation details, such as code availability or links to datasets, limits reproducibility. The paper mentions the use of certain architectures and training procedures but does not provide sufficient information for independent verification.
The primary limitation noted is the focus on fully seen architectures during training and evaluation, which restricts the applicability of the proposed method to unseen deepfake systems. Additionally, the trade-off between watermark robustness and speech quality suggests that further refinement is needed to optimize both aspects.
This research has significant implications for the fields of audio forensics and digital media integrity, as it addresses the growing concern of deepfake technologies and their potential misuse. By enhancing attribution methods, FakeMark could aid in the development of more secure and reliable systems for identifying synthetic speech, thus contributing to the fight against misinformation and copyright violations. The main contribution of this paper is the introduction of FakeMark, a novel watermarking framework that enhances deepfake speech attribution by leveraging artifact-correlated watermarks, significantly improving robustness and generalization in challenging scenarios. The comprehensive analysis of the technical contributions, methodology, and significance to the field underscores the potential of this approach to advance the state of deepfake detection and attribution.
Deepfake speech attribution remains challenging for existing solutions. Classifier-based solutions often fail to generalize to domain-shifted samples, and watermarking-based solutions are easily compromised by distortions like codec compression or malicious removal attacks. To address these issues, we propose FakeMark, a novel watermarking framework that injects artifact-correlated watermarks associated with deepfake systems rather than pre-assigned bitstring messages. This design allows a detector to attribute the source system by leveraging both injected watermark and intrinsic deepfake artifacts, remaining effective even if one of these cues is elusive or removed. Experimental results show that FakeMark improves generalization to cross-dataset samples where classifier-based solutions struggle and maintains high accuracy under various distortions where conventional watermarking-based solutions fail.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of FakeMark, a novel watermarking framework that leverages artifact-correlated watermarks for robust deepfake speech attribution. This work significantly advances the state of the art in deepfake detection, providing a promising avenue for future research and application in audio forensics.
The methodology presented in FakeMark is innovative, focusing on a dual approach that combines watermarking with deepfake artifact correlation. The proposed framework allows for robust attribution even when either the watermark or the artifacts are compromised. The architecture is well-defined, with clear stages for both watermark generation and detection, and the use of perceptual loss functions enhances the quality of the output. However, the paper could benefit from a more detailed explanation of the training process and the specific architectures used for the generator and detector.
The experimental evaluation is thorough, utilizing multiple datasets and a variety of distortions to test the robustness of the FakeMark framework. The results demonstrate significant improvements in attribution accuracy compared to baseline models, particularly under challenging conditions. The paper provides a clear comparison with existing methods, highlighting the advantages of FakeMark. However, the lack of evaluation on unseen architectures limits the generalizability of the findings.
The paper includes sufficient implementation details and references to existing models, which aids reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. The training and evaluation protocols are well-documented, but access to the actual implementation would enhance reproducibility.
Key limitations include the focus on seen architectures during training and evaluation, which may not reflect real-world scenarios where unseen deepfake systems are prevalent. Additionally, the trade-off between watermark robustness and speech quality is acknowledged but not fully addressed, suggesting that further research is needed to optimize this balance.
The proposed FakeMark framework has significant implications for the field of deepfake detection and attribution, particularly in applications related to copyright protection and misinformation mitigation. By enhancing the robustness of attribution methods, this work contributes to the ongoing efforts to combat the malicious use of synthetic speech technologies. The main contribution of this paper is the introduction of FakeMark, a novel watermarking framework that leverages artifact-correlated watermarks for robust deepfake speech attribution. This work significantly advances the state of the art in deepfake detection, providing a promising avenue for future research and application in audio forensics.
Contrastive audio-language pretraining yields powerful joint representations, yet a persistent audio-text modality gap limits the benefits of coupling multimodal encoders with large language models (LLMs). We present Diffusion-Link, a diffusion-based modality-bridging module that generatively maps audio embeddings into the text-embedding distribution. The module is trained on the output embeddings of the frozen multimodal encoder and implemented as a lightweight network with three residual MLP blocks. To assess the effect of Diffusion-Link on multimodal encoder-LLM coupling, we evaluate on Automatic Audio Captioning (AAC); to our knowledge, this is the first application of diffusion-based modality bridging to AAC. We report two results. (1) Modality-gap analysis: on similarity and geometric criteria, Diffusion-Link reduces the modality gap the most among prior diffusion-based methods and shows a collective migration of audio embeddings toward the text distribution. (2) Downstream AAC: attaching Diffusion-Link to the same multimodal LLM baseline achieves state-of-the-art on AudioCaps in both zero-shot and fully supervised captioning without external knowledge, with relative gains of up to 52.5% and 7.5%, respectively. These findings show that closing the modality gap is pivotal for effective coupling between multimodal encoders and LLMs, and diffusion-based modality bridging offers a promising direction beyond knowledge-retrieval-centric designs. Code will be released upon acceptance at https://github.com/DevKiHyun/Diffusion-Link.
Primary: University of Seoul
All Institutions: University of Seoul, Korea Advanced Institute of Science and Technology
The main contribution of this paper is the introduction of Diffusion-Link, a novel diffusion-based module that successfully bridges the audio-text modality gap, leading to significant improvements in Automatic Audio Captioning tasks. This work represents a meaningful advancement in multimodal machine learning, showcasing the potential of diffusion models in enhancing the integration of different data modalities.
The methodology presented in the paper is innovative, introducing Diffusion-Link as a lightweight diffusion-based module that effectively bridges the audio-text modality gap. The use of residual MLP blocks for this purpose is a clever design choice that balances complexity and performance. The authors provide a clear explanation of how the module operates, leveraging generative mapping of audio embeddings to text distributions. However, the paper could benefit from a more detailed discussion on the training process and hyperparameter tuning, which are crucial for replicating the results.
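To make the "lightweight network with three residual MLP blocks" description concrete, the structural sketch below shows one plausible layout; the hidden sizes, timestep conditioning, and normalization choices are assumptions rather than the released architecture.

```python
# Structural sketch of a residual-MLP bridging network for audio-to-text embeddings.
import torch
import torch.nn as nn

class ResidualMLPBlock(nn.Module):
    def __init__(self, dim: int = 512, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, hidden),
                                 nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

class BridgeNet(nn.Module):
    """Maps a (noised) audio embedding plus a diffusion-step embedding toward
    the text-embedding distribution."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.step_embed = nn.Embedding(1000, dim)
        self.blocks = nn.Sequential(*[ResidualMLPBlock(dim) for _ in range(3)])
        self.out = nn.Linear(dim, dim)

    def forward(self, audio_emb: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.out(self.blocks(audio_emb + self.step_embed(t)))

net = BridgeNet()
audio_emb = torch.randn(8, 512)           # embeddings from the frozen encoder
t = torch.randint(0, 1000, (8,))          # diffusion timesteps
print(net(audio_emb, t).shape)            # predicted text-like embeddings
```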
The experimental evaluation is robust, with the authors conducting a thorough modality-gap analysis and demonstrating significant improvements in Automatic Audio Captioning (AAC) tasks. The results show that Diffusion-Link outperforms previous methods, achieving state-of-the-art performance on the AudioCaps dataset in both zero-shot and fully supervised settings. The reported relative gains are substantial, indicating the effectiveness of the proposed method. Nevertheless, the paper could enhance its credibility by including additional datasets or benchmarks to validate the generalizability of the results.
The authors mention that code will be released upon acceptance, which is a positive step toward ensuring reproducibility. However, the paper lacks detailed implementation specifics, such as the exact architecture of the multimodal encoder used and the training configurations. Providing these details would greatly aid other researchers in replicating the study and building upon the work.
One limitation of the study is the reliance on a single dataset (AudioCaps) for evaluating the performance of Diffusion-Link. While the results are impressive, they may not fully represent the method's effectiveness across diverse audio-text tasks. Additionally, the paper does not address potential scalability issues or computational costs associated with deploying the proposed model in real-world applications.
The potential applications of this research are significant, particularly in fields such as multimedia content generation, accessibility technologies, and human-computer interaction. By effectively bridging the audio-text modality gap, the proposed method could enhance the capabilities of multimodal systems, leading to more intuitive and responsive applications. The findings suggest a promising direction for future research in multimodal representation learning. The main contribution of this paper is the introduction of Diffusion-Link, a novel diffusion-based module that successfully bridges the audio-text modality gap, leading to significant improvements in Automatic Audio Captioning tasks. This work represents a meaningful advancement in multimodal machine learning, showcasing the potential of diffusion models in enhancing the integration of different data modalities.
Human communication is multimodal, with speech and gestures tightly coupled, yet most computational methods for generating speech and gestures synthesize them sequentially, weakening synchrony and prosody alignment. We introduce Gelina, a unified framework that jointly synthesizes speech and co-speech gestures from text using interleaved token sequences in a discrete autoregressive backbone, with modality-specific decoders. Gelina supports multi-speaker and multi-style cloning and enables gesture-only synthesis from speech inputs. Subjective and objective evaluations demonstrate competitive speech quality and improved gesture generation over unimodal baselines.
Primary: Autonomous Systems and Software Program (WASP)
All Institutions: Autonomous Systems and Software Program (WASP), Knut and Alice Wallenberg Foundation, GENCI-IDRIS
Gelina represents a significant advancement in the field of multimodal synthesis, offering a unified framework that effectively integrates speech and gesture generation. The innovative methodology, combined with strong experimental validation, positions this work as a valuable contribution to the ongoing research in human behavior synthesis and multimodal communication.
The methodology presented in Gelina is innovative, introducing an interleaved token autoregressive architecture that allows for simultaneous synthesis of speech and gestures. This approach addresses the limitations of traditional sequential synthesis methods, enhancing synchrony and prosody alignment. The use of a conditional flow-matching decoder to improve gesture quality further demonstrates a sophisticated understanding of multimodal generation. The training strategy that leverages large unimodal datasets for generalization under sparse paired data is particularly commendable, showcasing a clever adaptation to existing data limitations.
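The interleaved-token formulation can be illustrated with a toy merger of the two streams; the fixed block size and the vocabulary-offset trick below are assumptions for illustration, not Gelina's actual tokenization.

```python
# Toy interleaving of speech and gesture tokens into one AR sequence.
import torch

SPEECH_VOCAB, GESTURE_VOCAB, BLOCK = 1024, 512, 4

def interleave(speech: torch.Tensor, gesture: torch.Tensor) -> torch.Tensor:
    """Alternate fixed-size blocks of speech and gesture tokens; gesture ids
    are offset so the two vocabularies do not collide."""
    gesture = gesture + SPEECH_VOCAB
    chunks = []
    for i in range(0, speech.numel(), BLOCK):
        chunks.append(speech[i:i + BLOCK])
        chunks.append(gesture[i:i + BLOCK])
    return torch.cat(chunks)

speech = torch.randint(0, SPEECH_VOCAB, (12,))
gesture = torch.randint(0, GESTURE_VOCAB, (12,))
print(interleave(speech, gesture).shape)   # single stream for the AR backbone
```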
The experimental evaluation is robust, utilizing both objective metrics and subjective user studies to assess the performance of Gelina. The choice of datasets, including GigaSpeech and BEAT2, is appropriate for the tasks at hand, and the comparative analysis against strong unimodal baselines highlights the effectiveness of the proposed model. The results indicate that Gelina achieves competitive performance in both speech quality and gesture generation, which is a significant achievement given the complexity of joint synthesis.
The paper provides detailed implementation specifics, including architecture choices, training procedures, and evaluation metrics, which enhance reproducibility. However, the absence of a publicly available code repository limits the ability for others to fully replicate the results without additional effort.
The primary limitations identified include the model's focus on body gestures, excluding finer details such as finger movements and facial expressions. Additionally, the speech quality is constrained by the tokenizer used, which could be improved in future iterations. The reliance on large-scale datasets may also pose challenges in terms of accessibility for researchers with limited resources.
The implications of Gelina extend to various applications in human-computer interaction, such as embodied conversational agents and social robotics, where natural multimodal communication is essential. The ability to generate synchronized speech and gestures could significantly enhance user experience and engagement in these domains. Gelina represents a significant advancement in the field of multimodal synthesis, offering a unified framework that effectively integrates speech and gesture generation. The innovative methodology, combined with strong experimental validation, positions this work as a valuable contribution to the ongoing research in human behavior synthesis and multimodal communication.
Recent advances in the audio language modeling (ALM) domain tackle audio understanding and text-to-audio generation as separate tasks. Very few studies attempt to unify these tasks -- an essential step toward advanced multimodal reasoning. This paper introduces the Unified Audio Language Model (UALM), which aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. To achieve this goal, we first present UALM-Gen, a text-to-audio language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion-based models. We then demonstrate, using proper data blending, training recipes, and inference techniques, that our single UALM model matches the quality of state-of-the-art specialized models in audio understanding, text-to-audio generation, and text reasoning. Furthermore, we present UALM-Reason, a multimodal reasoning model that utilizes both text and audio in the intermediate thinking steps to facilitate complex generation tasks. To our knowledge, this is the first demonstration in audio research of cross-modal generative reasoning, with its effectiveness confirmed by subjective evaluations.
Primary: NVIDIA
All Institutions: CMU, NVIDIA, UMD
The paper presents the Unified Audio Language Model (UALM), which innovatively integrates audio understanding, text-to-audio generation, and multimodal reasoning into a single framework, marking a significant advancement in the field of audio language modeling. The comprehensive methodology and rigorous evaluation demonstrate the model's potential to enhance audio AI applications significantly.
The methodology presented in this paper is robust and innovative, focusing on the unification of audio understanding, text-to-audio generation, and multimodal reasoning into a single model. The authors effectively leverage a combination of techniques such as data blending, classifier-free guidance, and a structured training recipe to enhance the model's performance across multiple tasks. The introduction of rich captions as an intermediate representation for audio generation is a novel approach that facilitates nuanced understanding and generation, showcasing a thoughtful integration of cognitive processes into the model's architecture.
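Since classifier-free guidance is named as one of the inference techniques, a generic sketch of how it applies to next-token logits in an audio-token language model is given below; the guidance form and scale are illustrative and not claimed to match UALM's recipe.

```python
# Generic classifier-free guidance over next-token logits (illustrative).
import torch

def cfg_logits(cond_logits: torch.Tensor, uncond_logits: torch.Tensor,
               scale: float = 2.0) -> torch.Tensor:
    # scale = 1.0 recovers ordinary conditional decoding.
    return uncond_logits + scale * (cond_logits - uncond_logits)

cond = torch.randn(1, 4096)     # logits given the text / rich-caption prompt
uncond = torch.randn(1, 4096)   # logits with the prompt dropped
next_token = cfg_logits(cond, uncond).softmax(-1).argmax(-1)
print(int(next_token))
```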
The experimental evaluation is comprehensive, utilizing both objective and subjective metrics to assess the performance of UALM and its components. The authors provide detailed comparisons with state-of-the-art models, demonstrating competitive results in audio generation and understanding tasks. The use of large-scale datasets and rigorous evaluation protocols strengthens the credibility of the findings, although the reliance on subjective evaluations for reasoning capabilities introduces some variability in results.
The paper includes sufficient implementation details, including architecture specifications, training configurations, and data sources, which are crucial for reproducibility. However, the absence of explicit links to datasets used for training and evaluation may hinder full reproducibility for external researchers.
One limitation is the potential overfitting to the specific datasets used, particularly given the large scale of the training data. Additionally, while the subjective evaluations provide insights into the model's reasoning capabilities, they may not fully capture the nuances of audio generation quality. The paper also acknowledges the need for better quality assessment methods for synthetic audio captions, indicating an area for improvement.
The work has significant implications for the development of advanced audio AI systems, particularly in applications such as music composition, sound design, and interactive audio experiences. By unifying understanding, generation, and reasoning, UALM paves the way for more intelligent and responsive audio applications, enhancing user interaction and creativity in audio production. The paper presents the Unified Audio Language Model (UALM), which innovatively integrates audio understanding, text-to-audio generation, and multimodal reasoning into a single framework, marking a significant advancement in the field of audio language modeling. The comprehensive methodology and rigorous evaluation demonstrate the model's potential to enhance audio AI applications significantly.
Scaling laws have profoundly shaped our understanding of model performance in computer vision and natural language processing, yet their application to general audio representation learning remains underexplored. A key challenge lies in the multifactorial nature of general audio representation: representation quality is jointly influenced by variables such as audio length, embedding dimensionality, model depth, model architecture, and data volume, many of which are difficult to isolate or express analytically. In this work, we present a systematic study of scaling laws for general audio representations by utilizing embedding effective rank (RankMe) as a unifying metric that encapsulates the impact of diverse variables on representation quality. RankMe enables a label-free, information-theoretic quantification of audio embeddings, allowing us to examine scaling behaviors across a wide hyper-parameter space, including model size, training data volume, computational budget, and architectural configurations. Our empirical findings reveal a consistent power-law relationship between RankMe and representation quality, suggesting that embedding effective rank serves as a reliable proxy for assessing and predicting model performance in audio representation learning. This work not only validates the applicability of classical scaling principles to the general audio domain but also offers a theoretically grounded and empirically robust framework for guiding future model scaling strategies in audio foundation models.
Primary: National University of Defense Technology
All Institutions: National University of Defense Technology
The paper presents a novel approach to understanding scaling laws in audio representation learning through the lens of embedding effective rank. This contribution is significant as it not only validates classical scaling principles in a new domain but also provides a framework for future research to optimize model performance without requiring labeled data.
The methodology presented in the paper is robust, employing the concept of embedding effective rank (RankMe) as a unifying metric for analyzing the scaling laws in audio representation learning. The authors systematically investigate how various hyper-parameters influence representation quality, which is a significant advancement in the field. The use of a masked autoencoding self-supervised learning framework is appropriate and aligns well with current trends in audio representation. The paper effectively integrates theoretical foundations with empirical analysis, demonstrating a clear power-law relationship between RankMe and representation quality.
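For readers unfamiliar with the metric, RankMe is the exponentiated Shannon entropy of the normalized singular-value distribution of the embedding matrix, which can be computed without labels as in the sketch below (sample counts and dimensions are arbitrary).

```python
# Label-free effective rank (RankMe) of an embedding matrix.
import torch

def rankme(embeddings: torch.Tensor, eps: float = 1e-7) -> float:
    """embeddings: (num_samples, dim) matrix of audio embeddings."""
    s = torch.linalg.svdvals(embeddings)
    p = s / (s.sum() + eps) + eps
    return float(torch.exp(-(p * p.log()).sum()))

z_collapsed = torch.randn(2048, 1) @ torch.randn(1, 512)   # near rank-1 embeddings
z_spread = torch.randn(2048, 512)                          # high effective rank
print(rankme(z_collapsed), rankme(z_spread))
```

A collapsed representation scores close to 1 while a well-spread one approaches the embedding dimensionality, which is why the metric can stand in for representation quality across hyper-parameter sweeps.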
The experiments are well-designed, utilizing a large-scale dataset (approximately 100 million audio clips) and a variety of model architectures. The authors conduct extensive evaluations across different hyper-parameter settings, providing a comprehensive analysis of the impacts of data volume, model size, and computational budget on audio representation quality. The results are clearly presented, with figures illustrating the relationships between RankMe and various performance metrics, which strengthens the validity of their findings.
The paper provides sufficient implementation details, including model architectures, training procedures, and hyper-parameter settings, which are crucial for reproducibility. However, the absence of a publicly available code repository or demo limits the ability for others to directly replicate the experiments. Including a link to a GitHub repository or similar would enhance reproducibility significantly.
One limitation of the study is that while RankMe serves as a unifying metric, it may still be a coarse-grained indicator of performance, as noted by the authors. Additionally, the findings are based on empirical observations, and the theoretical underpinnings of the scaling laws could be further explored. The paper also does not address potential biases in the datasets used, which could affect the generalizability of the results.
The implications of this research are significant for the field of audio representation learning, as it provides a systematic framework for understanding how various factors influence model performance. This work could guide future research in scaling audio models and optimizing their architectures, potentially leading to advancements in applications such as speech recognition, music analysis, and environmental sound classification. The paper presents a novel approach to understanding scaling laws in audio representation learning through the lens of embedding effective rank. This contribution is significant as it not only validates classical scaling principles in a new domain but also provides a framework for future research to optimize model performance without requiring labeled data.
Audio-Visual Embodied Navigation aims to enable agents to autonomously navigate to sound sources in unknown 3D environments using auditory cues. While current AVN methods excel on in-distribution sound sources, they exhibit poor cross-source generalization: navigation success rates plummet and search paths become excessively long when agents encounter unheard sounds or unseen environments. This limitation stems from the lack of explicit alignment mechanisms between auditory signals and corresponding visual regions. Policies tend to memorize spurious "acoustic fingerprint-scenario" correlations during training, leading to blind exploration when exposed to novel sound sources. To address this, we propose the AGVP framework, which transforms sound from policy-memorable acoustic fingerprint cues into spatial guidance. The framework first extracts global auditory context via audio self-attention, then uses this context as queries to guide visual feature attention, highlighting sound-source-related regions at the feature level. Subsequent temporal modeling and policy optimization are then performed. This design, centered on interpretable cross-modal alignment and region reweighting, reduces dependency on specific acoustic fingerprints. Experimental results demonstrate that AGVP improves both navigation efficiency and robustness while achieving superior cross-scenario generalization on previously unheard sounds.
Primary: Xinjiang University
All Institutions: Xinjiang University, Tsinghua University, Tianjin University of Technology
The paper presents a significant contribution to the field of audio-visual navigation by introducing the AGVP framework, which enhances navigation efficiency and robustness through innovative cross-modal alignment techniques. The methodology and experimental results indicate a strong potential for real-world applications, although further work is needed to improve reproducibility and address limitations in dynamic environments.
The proposed AGVP framework introduces a novel approach to audio-visual navigation by leveraging audio self-attention to guide visual feature attention, which is a significant advancement over existing methods that rely on late-stage fusion. The methodology emphasizes cross-modal alignment and region reweighting, effectively addressing the limitations of memorization-based policies that struggle with unseen sounds. The use of self-attention mechanisms for both audio and visual modalities enhances the model's ability to capture long-range dependencies and spatial correlations, which is a robust design choice.
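To illustrate the core mechanism, the following is a minimal PyTorch sketch of audio-guided visual attention in the spirit of AGVP: audio self-attention produces a global context vector that queries flattened visual regions and reweights them. The dimensions, pooling choice, and module structure are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AudioGuidedVisualAttention(nn.Module):
    """Audio context queries attend over flattened visual regions and reweight
    them; a minimal sketch under assumed dimensions, not the paper's code."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.audio_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_tokens, visual_regions):
        # audio_tokens: (B, Ta, dim) audio frames; visual_regions: (B, HW, dim)
        ctx, _ = self.audio_self_attn(audio_tokens, audio_tokens, audio_tokens)
        audio_query = ctx.mean(dim=1, keepdim=True)              # (B, 1, dim) global context
        attended, weights = self.cross_attn(audio_query, visual_regions, visual_regions)
        # weights: (B, 1, HW) -- region reweighting highlighting sound-related areas
        reweighted = visual_regions * weights.transpose(1, 2)     # (B, HW, dim)
        return attended, reweighted

# toy usage with a 7x7 visual grid flattened to 49 regions
m = AudioGuidedVisualAttention()
attended, reweighted = m(torch.randn(2, 10, 256), torch.randn(2, 49, 256))
```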
The experimental evaluation is comprehensive, utilizing two well-known datasets (Replica and Matterport3D) and comparing the AGVP framework against multiple state-of-the-art methods. The results demonstrate significant improvements in navigation efficiency and robustness, particularly in scenarios involving unheard sounds. The metrics used (SPL, SR, SNA) are appropriate for the task and provide a clear picture of the framework's performance. The ablation studies further validate the importance of the proposed components, strengthening the claims made by the authors.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as hyperparameters, training protocols, and code availability, which are critical for reproducibility. The absence of a project URL or demo also limits the ability of other researchers to replicate the findings.
One limitation is the reliance on specific datasets, which may not fully capture the diversity of real-world environments. Additionally, while the AGVP framework improves generalization to unseen sounds, the paper does not address how it performs in highly dynamic environments or with rapidly changing sound sources. The potential computational complexity of the proposed model may also pose challenges for real-time applications.
The advancements in audio-visual navigation have significant implications for robotics, virtual reality, and assistive technologies, where understanding and responding to auditory cues in complex environments is crucial. The AGVP framework could enhance the capabilities of autonomous agents in various applications, including search and rescue operations, navigation in smart homes, and interactive gaming. The paper presents a significant contribution to the field of audio-visual navigation by introducing the AGVP framework, which enhances navigation efficiency and robustness through innovative cross-modal alignment techniques. The methodology and experimental results indicate a strong potential for real-world applications, although further work is needed to improve reproducibility and address limitations in dynamic environments.
Autoregressive (AR) frameworks have recently achieved remarkable progress in zero-shot text-to-speech (TTS) by leveraging discrete speech tokens and large language model techniques. Despite their success, existing AR-based zero-shot TTS systems face two critical limitations: (i) an inherent speed-quality trade-off, as sequential token generation either reduces frame rates at the cost of expressiveness or enriches tokens at the cost of efficiency, and (ii) a text-oriented supervision mismatch, as cross-entropy loss penalizes token errors uniformly without considering the fine-grained acoustic similarity among adjacent tokens. To address these challenges, we propose BridgeTTS, a novel AR-TTS framework built upon the dual speech representation paradigm BridgeCode. BridgeTTS reduces AR iterations by predicting sparse tokens while reconstructing rich continuous features for high-quality synthesis. Joint optimization of token-level and feature-level objectives further enhances naturalness and intelligibility. Experiments demonstrate that BridgeTTS achieves competitive quality and speaker similarity while significantly accelerating synthesis. Speech demos are available at https://test1562.github.io/demo/.
Primary: South China University of Technology
All Institutions: South China University of Technology
The paper presents BridgeTTS, a novel autoregressive framework that leverages a dual speech representation paradigm to improve the efficiency and quality of zero-shot text-to-speech synthesis. The technical contributions are substantial, addressing critical limitations in existing methods and demonstrating competitive performance through rigorous experimentation.
The proposed methodology introduces a dual speech representation paradigm, BridgeCode, which innovatively combines sparse tokens and dense continuous features to address the speed-quality trade-off inherent in autoregressive text-to-speech synthesis. The architecture includes two bridging modules that facilitate bidirectional conversion, enhancing the model's efficiency and synthesis quality. The joint optimization of token-level and feature-level objectives is a significant advancement, providing a more nuanced training approach that considers acoustic similarities, which is often overlooked in traditional methods. This dual representation and the bridging mechanism are well-conceived and demonstrate a solid understanding of the limitations of existing approaches.
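A minimal sketch of the joint objective described above, assuming cross-entropy on the sparse tokens plus an L1 term on the reconstructed continuous features; the specific loss functions and weighting are illustrative assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def joint_bridge_loss(token_logits, token_targets, feat_pred, feat_target,
                      lambda_feat: float = 1.0):
    """Token-level cross-entropy plus feature-level L1 reconstruction,
    as a hedged stand-in for the paper's joint optimization."""
    ce = F.cross_entropy(token_logits.transpose(1, 2), token_targets)  # (B, V, T) vs (B, T)
    feat = F.l1_loss(feat_pred, feat_target)
    return ce + lambda_feat * feat

# toy shapes: batch 2, 50 steps, 1024-entry codebook, 80-dim continuous features
logits = torch.randn(2, 50, 1024)
targets = torch.randint(0, 1024, (2, 50))
loss = joint_bridge_loss(logits, targets, torch.randn(2, 50, 80), torch.randn(2, 50, 80))
```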
The experiments are robust, utilizing the LibriTTS dataset, which is a standard benchmark in TTS research. The paper presents both subjective and objective evaluations, including Mean Opinion Scores for naturalness and similarity, and demonstrates competitive performance against state-of-the-art methods. The inclusion of an ablation study strengthens the findings by isolating the contributions of key components, such as the feature loss and the BridgeCode architecture. However, the paper could benefit from more detailed statistical analysis of the results to substantiate claims of superiority over existing methods.
The implementation details are sufficiently described, including training parameters and dataset specifics, which aids in reproducibility. However, the lack of a publicly available code repository limits the ability for other researchers to fully replicate the work. Providing access to the trained models or code would enhance reproducibility significantly.
One limitation is the reliance on a specific dataset (LibriTTS), which may not generalize to other languages or dialects. Additionally, while the model shows improvements in speed and quality, the paper does not address potential trade-offs in other aspects, such as the model's ability to handle diverse speech styles or emotional tones. The subjective evaluation also relies on a relatively small number of raters, which may introduce variability in the results.
The advancements in zero-shot TTS synthesis have significant implications for applications in voice synthesis, accessibility technologies, and personalized voice assistants. By improving the efficiency and quality of TTS systems, this research could enhance user experiences in various domains, including entertainment, education, and assistive technologies. The potential for real-time applications could revolutionize how we interact with machines through speech. The paper presents BridgeTTS, a novel autoregressive framework that leverages a dual speech representation paradigm to improve the efficiency and quality of zero-shot text-to-speech synthesis. The technical contributions are substantial, addressing critical limitations in existing methods and demonstrating competitive performance through rigorous experimentation.
Cross-lingual emotional text-to-speech (TTS) aims to produce speech in one language that captures the emotion of a speaker from another language while maintaining the target voice's timbre. This process of cross-lingual emotional speech synthesis presents a complex challenge, necessitating flexible control over emotion, timbre, and language. However, emotion and timbre are highly entangled in speech signals, making fine-grained control challenging. To address this issue, we propose EMM-TTS, a novel two-stage cross-lingual emotional speech synthesis framework based on perturbed self-supervised learning (SSL) representations. In the first stage, the model explicitly and implicitly encodes prosodic cues to capture emotional expressiveness, while the second stage restores the timbre from perturbed SSL representations. We further investigate the effect of different speaker perturbation strategies (formant shifting and speaker anonymization) on the disentanglement of emotion and timbre. To strengthen speaker preservation and expressive control, we introduce Speaker Consistency Loss (SCL) and Speaker-Emotion Adaptive Layer Normalization (SEALN) modules. Additionally, we find that incorporating explicit acoustic features (e.g., F0, energy, and duration) alongside pretrained latent features improves voice cloning performance. Comprehensive multi-metric evaluations, including both subjective and objective measures, demonstrate that EMM-TTS achieves superior naturalness, emotion transferability, and timbre consistency across languages.
Primary: Tianjin University
All Institutions: Tianjin University, Institute of Artificial Intelligence (TeleAI), Duke Kunshan University, Northwestern Polytechnical University, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
The main contribution of this paper is the development of EMM-TTS, a two-stage cross-lingual emotional TTS system that effectively disentangles emotion and timbre through innovative modeling techniques. This work significantly advances the field of speech synthesis by addressing the complex challenges associated with emotional expressiveness and speaker identity preservation in multilingual contexts.
The proposed EMM-TTS framework introduces a two-stage modeling approach that effectively disentangles emotion and timbre in cross-lingual emotional TTS. The first stage focuses on capturing emotional expressiveness through explicit and implicit prosodic cues, while the second stage restores timbre from perturbed self-supervised representations. The introduction of Speaker Consistency Loss (SCL) and Speaker-Emotion Adaptive Layer Normalization (SEALN) modules enhances the model's ability to maintain speaker identity while transferring emotional characteristics. The methodology is well-structured and addresses significant challenges in the field, particularly the entanglement of emotion and timbre.
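To ground the SEALN idea, here is a hedged PyTorch sketch in which a speaker embedding concatenated with an emotion embedding predicts the scale and shift applied after a parameter-free layer normalization; all dimensions are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class SEALN(nn.Module):
    """Sketch of a Speaker-Emotion Adaptive Layer Normalization: the conditioning
    vector (speaker embedding + emotion embedding) produces per-channel scale and
    shift. Hidden sizes are assumptions for illustration."""
    def __init__(self, hidden: int = 256, spk_dim: int = 192, emo_dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(spk_dim + emo_dim, 2 * hidden)

    def forward(self, x, spk_emb, emo_emb):
        # x: (B, T, hidden); spk_emb: (B, spk_dim); emo_emb: (B, emo_dim)
        cond = torch.cat([spk_emb, emo_emb], dim=-1)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

layer = SEALN()
y = layer(torch.randn(2, 100, 256), torch.randn(2, 192), torch.randn(2, 64))
```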
The experimental evaluation is comprehensive, utilizing both subjective and objective metrics to assess the performance of the EMM-TTS model against baseline models. The datasets used are relevant and diverse, including both monolingual and cross-lingual scenarios. The results indicate that EMM-TTS outperforms existing models in terms of naturalness, emotion transferability, and timbre consistency, demonstrating the effectiveness of the proposed methods. The ablation studies provide valuable insights into the contributions of various components of the model.
The paper provides sufficient details regarding the experimental setup, including datasets, model configurations, and evaluation metrics, which would enable other researchers to reproduce the results. However, the lack of a publicly available code repository limits full reproducibility.
One limitation is the reliance on specific datasets, which may not generalize across all languages and emotional expressions. Additionally, while the model shows promise in cross-lingual scenarios, the performance degradation when synthesizing speech across different languages suggests that further improvements are needed in this area.
The proposed EMM-TTS framework has significant implications for applications in multilingual speech synthesis, virtual assistants, and emotional AI. By enabling more natural and emotionally expressive speech across languages, this work could enhance human-computer interaction and accessibility in various domains. The main contribution of this paper is the development of EMM-TTS, a two-stage cross-lingual emotional TTS system that effectively disentangles emotion and timbre through innovative modeling techniques. This work significantly advances the field of speech synthesis by addressing the complex challenges associated with emotional expressiveness and speaker identity preservation in multilingual contexts.
Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited -- they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) -- a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.
Primary: Wuhan University
All Institutions: Tencent AI Lab, Wuhan University
This paper introduces VCB Bench, a comprehensive evaluation benchmark for audio-grounded large language models, significantly enhancing the assessment of Chinese voice conversational agents. The methodology is thorough, the experimental evaluation is rigorous, and the implications for future research are substantial, marking a meaningful contribution to the field.
The methodology is robust, presenting a comprehensive evaluation framework (VCB Bench) that addresses the limitations of existing benchmarks by focusing on real human speech and encompassing multiple evaluation dimensions (instruction following, knowledge understanding, and robustness). The dataset construction is well-detailed, ensuring high-quality data through various sources and rigorous quality checks.
The experiments are extensive, evaluating state-of-the-art LALMs across multiple tasks and providing a thorough analysis of their performance. The results highlight significant performance gaps and areas for improvement, contributing valuable insights for future research.
The paper emphasizes reproducibility by providing a standardized methodology and making code and data available on GitHub, which is crucial for the research community to validate and build upon the findings.
The paper acknowledges limitations such as the need for continuous updates to the benchmark to include newly released models and the potential inadequacy of prompts used in experiments to fully exploit model capabilities.
The proposed benchmark has the potential to significantly advance the field of audio-grounded conversational agents, particularly for Chinese language models, addressing a critical gap in the evaluation landscape and fostering improvements in model robustness and adaptability. This paper introduces VCB Bench, a comprehensive evaluation benchmark for audio-grounded large language models, significantly enhancing the assessment of Chinese voice conversational agents. The methodology is thorough, the experimental evaluation is rigorous, and the implications for future research are substantial, marking a meaningful contribution to the field.
Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.
Primary: Shanghai AI Laboratory
All Institutions: Shanghai AI Laboratory
The main contribution of this paper is the introduction of MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. This dataset addresses a critical gap in existing resources, enabling a variety of foundational tasks that can significantly impact the development of immersive audio technologies.
The methodology for creating the MRSAudio dataset is robust, involving a well-structured approach to data collection across various real-world scenarios. The inclusion of synchronized binaural and ambisonic audio, alongside exocentric and egocentric video, provides a comprehensive framework for spatial audio research. The refined annotations, including phoneme boundaries and transcripts, enhance the dataset's utility for multiple tasks. The establishment of five foundational tasks demonstrates a clear application of the dataset, although the paper could benefit from a more detailed explanation of the methodology used for data collection and annotation processes.
The experimental evaluation is well-defined, showcasing the dataset's capabilities through the proposed foundational tasks. The results indicate that MRSAudio supports high-quality spatial modeling, which is crucial for advancements in spatial audio technologies. However, the paper lacks quantitative metrics or comparisons with existing datasets to substantiate the claims of superiority or uniqueness. More detailed experimental results, including performance benchmarks, would strengthen the findings.
The paper provides a link to the dataset and demos, which is a positive aspect for reproducibility. However, it does not include sufficient implementation details or code repositories to allow for complete reproducibility of the experiments. Future work should consider providing code or detailed guidelines for researchers to replicate the experiments.
One limitation is the lack of comparative analysis with existing multimodal datasets, which would help contextualize the contributions of MRSAudio. Additionally, the paper does not address potential biases in the dataset or the limitations of the tasks defined. The scope of the dataset may also be limited to specific environments, which could affect generalizability.
The MRSAudio dataset has significant potential applications in immersive technologies such as VR/AR, gaming, and assistive technologies. By providing a rich resource for spatial audio research, it can facilitate advancements in sound source localization, audio rendering, and multimodal interaction. The implications for enhancing user experiences in virtual environments are substantial, making this dataset a valuable contribution to the field. The main contribution of this paper is the introduction of MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. This dataset addresses a critical gap in existing resources, enabling a variety of foundational tasks that can significantly impact the development of immersive audio technologies.
Persian Language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages. The ParsVoice dataset is publicly available at ParsVoice (https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice).
Primary: University of Tehran
All Institutions: University of Tehran
The main contribution of this paper is the introduction of ParsVoice, the largest high-quality Persian speech corpus specifically designed for TTS applications, alongside a novel automated pipeline for its creation. This work addresses a critical gap in the availability of resources for Persian language processing and sets a precedent for future efforts in low-resource language datasets.
The methodology presented in this paper is robust, featuring a comprehensive automated pipeline for transforming raw audiobook data into a high-quality TTS corpus. The use of a BERT-based sentence completion detector and a binary search boundary optimization method demonstrates innovation in addressing specific challenges related to Persian language processing. The multi-dimensional quality assessment frameworks tailored for Persian further enhance the methodology's depth and applicability. However, while the pipeline is well-structured, the paper could benefit from more detailed descriptions of the algorithms used in the quality assessment phases.
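The binary-search boundary optimization can be pictured with the sketch below. The paper names the idea but this review does not reproduce its exact procedure, so the match_score callback (e.g., similarity between an ASR transcript of the candidate segment and the target sentence), the monotonicity assumption, and the tolerances are all hypothetical.

```python
# Hedged sketch of binary-search refinement of a segment's end boundary for
# audio-text alignment. `match_score` is a hypothetical callback supplied by
# the caller; thresholds and tolerances are illustrative.
from typing import Callable

def refine_boundary(lo_ms: int, hi_ms: int,
                    match_score: Callable[[int], float],
                    tol_ms: int = 20) -> int:
    """Return the earliest end boundary (ms) whose segment already covers the
    target sentence, assuming match_score is non-decreasing in the boundary."""
    target = match_score(hi_ms)  # score of the widest candidate segment
    while hi_ms - lo_ms > tol_ms:
        mid = (lo_ms + hi_ms) // 2
        if match_score(mid) >= 0.98 * target:  # segment already contains the sentence
            hi_ms = mid
        else:
            lo_ms = mid
    return hi_ms
```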
The authors processed a substantial amount of data, resulting in 1,804 hours of high-quality speech from over 470 speakers. The dataset's size and speaker diversity are significant improvements over existing Persian datasets. The paper includes quantitative metrics regarding the quality of the audio and text, which are crucial for validating the dataset's usability for TTS applications. However, the paper lacks detailed experimental results demonstrating the effectiveness of the TTS models trained on this dataset, which would provide a clearer picture of its impact.
The authors have made the dataset publicly available, which is a positive step towards reproducibility. However, the paper does not provide sufficient details on the implementation of the automated pipeline or the specific configurations used for the models, which could hinder reproducibility for researchers looking to replicate or build upon this work.
One limitation is the reliance on a single source for audiobook content, which may introduce biases in the dataset. Additionally, while the quality assessment frameworks are comprehensive, the paper does not discuss potential limitations in the algorithms used, such as the impact of noise in the original recordings or the effectiveness of the sentence completion model across different dialects of Persian.
The introduction of ParsVoice is a significant advancement for Persian language technologies, potentially accelerating research and development in TTS systems for Persian and other low-resource languages. The dataset can serve as a template for creating similar resources for other underrepresented languages, helping to bridge the digital divide in language technology. The main contribution of this paper is the introduction of ParsVoice, the largest high-quality Persian speech corpus specifically designed for TTS applications, alongside a novel automated pipeline for its creation. This work addresses a critical gap in the availability of resources for Persian language processing and sets a precedent for future efforts in low-resource language datasets.
Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for text-to-speech (TTS) applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and audio-text quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. To validate the dataset, we fine-tuned XTTS for Persian, achieving a naturalness Mean Opinion Score (MOS) of 3.6/5 and a Speaker Similarity Mean Opinion Score (SMOS) of 4.0/5, demonstrating ParsVoice's effectiveness for training multi-speaker TTS systems. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
Primary: University of Tehran
All Institutions: University of Tehran
The main contribution of this paper is the introduction of ParsVoice, the largest high-quality Persian speech corpus for TTS applications, along with a comprehensive automated pipeline for its creation. This work significantly enhances the resources available for Persian language technology, addressing a critical gap in the field and paving the way for future advancements in TTS systems for low-resource languages.
The paper presents a well-structured methodology for creating a large-scale Persian speech corpus, addressing the unique challenges of TTS data generation in low-resource languages. The automated pipeline includes innovative components such as a BERT-based sentence completion model, a binary search boundary optimization method, and a comprehensive quality assessment framework tailored to Persian. The segmentation and alignment techniques are particularly noteworthy, as they ensure high-quality audio-text pairs essential for TTS applications. However, the methodology could benefit from a clearer explanation of the integration of these components and their individual contributions to the overall quality of the dataset.
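For concreteness, here is a minimal sketch of a BERT-based sentence-completion detector framed as a binary classifier over a Persian BERT encoder; the checkpoint id, decision threshold, and class convention are assumptions, and the classification head would still need fine-tuning on labeled complete/incomplete sentences before use.

```python
# Hedged sketch: binary sentence-completion detection with a Persian BERT encoder.
# Checkpoint name and threshold are assumptions, not the paper's configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CKPT = "HooshvareLab/bert-fa-base-uncased"  # assumed Persian BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(CKPT, num_labels=2)

def is_complete_sentence(text: str, threshold: float = 0.5) -> bool:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)
    return probs[0, 1].item() >= threshold  # class 1 assumed to mean "complete"
```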
The authors conducted a thorough evaluation of the ParsVoice dataset by fine-tuning the XTTS model, achieving competitive MOS scores that demonstrate the dataset's effectiveness for TTS applications. The results indicate that the corpus is capable of producing natural-sounding speech and maintaining speaker similarity, which are critical metrics for TTS systems. However, the evaluation could be strengthened by including more detailed comparisons with existing datasets and models to contextualize the performance of ParsVoice.
The paper provides a clear description of the dataset creation process and the evaluation methodology, which aids in reproducibility. The availability of the dataset on Hugging Face is a significant step towards ensuring that other researchers can replicate the findings and build upon the work. However, more information on the specific configurations and hyperparameters used during model training would enhance reproducibility.
One limitation of the study is the reliance on a single data source (IranSeda), which may introduce biases in speaker representation and content diversity. Additionally, while the dataset is large, the quality assessment metrics may not capture all nuances of speech quality, and further subjective evaluations could provide a more comprehensive understanding of the dataset's strengths and weaknesses.
The introduction of ParsVoice has the potential to significantly advance Persian speech technologies, enabling researchers and developers to create more robust TTS systems. This work not only addresses the data scarcity issue for Persian but also sets a precedent for similar efforts in other low-resource languages. The public availability of the dataset encourages collaboration and innovation in the field of speech synthesis. The main contribution of this paper is the introduction of ParsVoice, the largest high-quality Persian speech corpus for TTS applications, along with a comprehensive automated pipeline for its creation. This work significantly enhances the resources available for Persian language technology, addressing a critical gap in the field and paving the way for future advancements in TTS systems for low-resource languages.
General-purpose ASR underperforms for atypical speakers, such as L2 learners, reinforcing bias and limiting use in education and accessibility. Using the CEFR-graded Speak and Improve corpus, we show that naive fine-tuning of Whisper reduces average WER but simultaneously widens disparities and disproportionately harms lower-level learners. To address this, we propose two strategies: (i) proficiency-aware multitask learning, jointly optimizing ASR with proficiency classification, and (ii) targeted augmentation, applying spectrogram masking to low-proficiency speech to counter imbalance. These approaches reduce WER by up to 29.4 percent (relative) and insertion/deletion errors by as much as 58.6 percent (relative). Crucially, despite the severe imbalance of the dataset reflecting real-world distributions, both strategies consistently narrow proficiency gaps, advancing equitable ASR for L2 learners.
Primary: Indiana University
All Institutions: Indiana University
The paper makes a substantial contribution by addressing the performance disparities in ASR for L2 learners through innovative methodologies. Its findings underscore the importance of proficiency awareness in developing equitable ASR systems, which could have far-reaching impacts on education and accessibility.
The paper presents a well-structured methodology that includes proficiency-aware multitask learning and targeted data augmentation. The use of the Speak & Improve corpus is particularly relevant, as it allows for a nuanced understanding of how ASR performance varies with proficiency levels. The authors effectively apply Low Rank Adaptation (LoRA) to fine-tune the Whisper model, which is a sound choice for maintaining model integrity while adapting to new data. The introduction of spectrogram masking for low-proficiency speech is innovative and addresses the class imbalance effectively. However, the methodology could benefit from more detailed descriptions of hyperparameter settings and training procedures to enhance reproducibility.
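The targeted-augmentation component can be sketched as SpecAugment-style masking applied only to low-proficiency utterances, as below; the mask widths and the CEFR bucket treated as "low proficiency" are illustrative assumptions, not the paper's settings.

```python
# Sketch of proficiency-targeted augmentation: frequency/time masking applied
# only to low-proficiency utterances to counter class imbalance.
import torch
import torchaudio.transforms as T

freq_mask = T.FrequencyMasking(freq_mask_param=27)
time_mask = T.TimeMasking(time_mask_param=50)

def augment_if_low_proficiency(mel_spec: torch.Tensor, cefr_level: str) -> torch.Tensor:
    """mel_spec: (n_mels, frames) log-mel features; cefr_level: e.g. 'A2', 'B1'."""
    if cefr_level in {"A1", "A2", "B1"}:          # assumed low-proficiency bucket
        return time_mask(freq_mask(mel_spec))
    return mel_spec

augmented = augment_if_low_proficiency(torch.randn(80, 300), "A2")
```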
The experiments are comprehensive, utilizing a robust dataset that reflects real-world distributions of L2 learners. The results demonstrate clear improvements in WER across proficiency groups, particularly for lower-proficiency speakers, which is a significant contribution to the field. The statistical analysis, including paired sign tests, adds rigor to the evaluation of results. However, the paper could enhance its findings by including more diverse evaluation metrics beyond WER, such as user satisfaction or real-world applicability.
While the methodology is sound, the paper lacks detailed implementation specifics, such as code availability or hyperparameter settings, which are crucial for reproducibility. Providing a link to a code repository or supplementary materials would greatly enhance the ability of other researchers to replicate the study.
The paper acknowledges the limitations of class imbalance in the dataset, particularly for lower-proficiency groups. Additionally, the reliance on a single ASR model (Whisper) may limit the generalizability of the findings. Future work should explore the applicability of these methods across different ASR architectures and datasets.
The research has significant implications for educational technologies and accessibility tools, as improving ASR for L2 learners can enhance language learning experiences and provide equitable access to technology. The focus on proficiency-aware adaptation addresses a critical gap in ASR research, potentially leading to more inclusive applications in various domains. The paper makes a substantial contribution by addressing the performance disparities in ASR for L2 learners through innovative methodologies. Its findings underscore the importance of proficiency awareness in developing equitable ASR systems, which could have far-reaching impacts on education and accessibility.
Humans rely on multisensory integration to perceive spatial environments, where auditory cues enable sound source localization in three-dimensional space. Despite the critical role of spatial audio in immersive technologies such as VR/AR, most existing multimodal datasets provide only monaural audio, which limits the development of spatial audio generation and understanding. To address these challenges, we introduce MRSAudio, a large-scale multimodal spatial audio dataset designed to advance research in spatial audio understanding and generation. MRSAudio spans four distinct components: MRSLife, MRSSpeech, MRSMusic, and MRSSing, covering diverse real-world scenarios. The dataset includes synchronized binaural and ambisonic audio, exocentric and egocentric video, motion trajectories, and fine-grained annotations such as transcripts, phoneme boundaries, lyrics, scores, and prompts. To demonstrate the utility and versatility of MRSAudio, we establish five foundational tasks: audio spatialization, spatial text-to-speech, spatial singing voice synthesis, spatial music generation, and sound event localization and detection. Results show that MRSAudio enables high-quality spatial modeling and supports a broad range of spatial audio research. Demos and dataset access are available at https://mrsaudio.github.io.
Primary: Unknown (not stated in the provided text)
All Institutions: Unknown (the text names only funding programs: National Key R&D Program of China, National Natural Science Foundation of China)
The main contribution of this paper is the introduction of MRSAudio, a comprehensive multimodal dataset that significantly advances the field of spatial audio understanding and generation. This work addresses a critical gap in existing research by providing a large-scale resource that supports a variety of foundational tasks, thereby fostering further exploration and innovation in spatial audio technologies.
The methodology presented in MRSAudio is robust, as it details the creation of a large-scale multimodal dataset that integrates various audio and visual components. The inclusion of synchronized binaural and ambisonic audio, along with exocentric and egocentric video, is particularly noteworthy. The dataset's design allows for a comprehensive exploration of spatial audio, addressing a significant gap in existing datasets that typically focus on monaural audio. The establishment of foundational tasks further demonstrates the dataset's versatility and potential applications in spatial audio research.
The paper outlines five foundational tasks that utilize the MRSAudio dataset, showcasing its applicability in real-world scenarios. While the results indicate that MRSAudio enables high-quality spatial modeling, the paper could benefit from more detailed quantitative results and comparisons with existing datasets to substantiate its claims. The experiments should ideally include baseline comparisons to highlight the advantages of using MRSAudio over other datasets.
The paper mentions that demos and dataset access are available online, which is a positive aspect for reproducibility. However, it lacks detailed implementation specifics, such as the exact methodologies used for data collection and annotation processes. Providing code or scripts for dataset generation and task execution would enhance reproducibility.
One limitation is the potential bias in the dataset due to the specific real-world scenarios chosen for recording. Additionally, the paper does not address the challenges of generalizing the findings across different environments or the limitations of the audio capture technology used. Furthermore, while the dataset is extensive, it may not cover all possible spatial audio scenarios, which could limit its applicability.
The MRSAudio dataset has significant implications for the fields of virtual reality (VR) and augmented reality (AR), as it provides a rich resource for developing spatial audio technologies. The ability to localize sound sources in three-dimensional space can enhance user experiences in immersive environments. Moreover, the dataset can facilitate advancements in machine learning models for audio processing, potentially leading to innovations in sound design, gaming, and assistive technologies for the hearing impaired. The main contribution of this paper is the introduction of MRSAudio, a comprehensive multimodal dataset that significantly advances the field of spatial audio understanding and generation. This work addresses a critical gap in existing research by providing a large-scale resource that supports a variety of foundational tasks, thereby fostering further exploration and innovation in spatial audio technologies.
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios, yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of MARS-Sep, a multimodal-aligned reinforced sound separation framework that significantly improves the semantic quality of separated audio through innovative reinforcement learning techniques. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in sound separation and offering a robust methodology for future research.
The paper introduces MARS-Sep, a novel reinforcement learning framework that reformulates sound separation as a decision-making problem, optimizing a factorized Beta mask policy. The use of multimodal rewards derived from an audio-text-vision encoder is innovative, as it directly incentivizes semantic consistency with user queries. The progressive alignment strategy for fine-tuning the encoder enhances cross-modal discriminability, which is a significant methodological advancement. The approach effectively addresses the limitations of traditional signal-level metrics by incorporating semantic fidelity into the training process.
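To make the optimization concrete, below is a hedged sketch of a clipped surrogate with group-relative advantage normalization and an entropy bonus; the log-probabilities would come from the factorized Beta mask policy (current and frozen old), and all coefficients and shapes are illustrative, not the paper's exact formulation.

```python
import torch

def clipped_policy_loss(logp_new, logp_old, rewards, entropy,
                        clip_eps: float = 0.2, ent_coef: float = 0.01):
    """Clipped trust-region surrogate with group-relative advantages.
    Shapes: (G, K) for G prompts with K sampled masks each (illustrative)."""
    # group-relative advantages: normalize rewards within each prompt's group
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / \
          (rewards.std(dim=1, keepdim=True) + 1e-8)
    ratio = (logp_new - logp_old).exp()            # importance ratio vs frozen old policy
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    return -(torch.min(unclipped, clipped).mean() + ent_coef * entropy.mean())

# toy usage: 4 prompts, 8 sampled masks each
G, K = 4, 8
loss = clipped_policy_loss(torch.randn(G, K), torch.randn(G, K),
                           torch.randn(G, K), torch.rand(G, K))
```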
The experiments are comprehensive, utilizing multiple benchmarks (VGGSound and MUSIC) to validate the effectiveness of MARS-Sep. The results demonstrate consistent improvements across various metrics, including SDR, SIR, SAR, and CLAP scores, indicating that the proposed method not only enhances signal quality but also improves semantic alignment. The rigorous evaluation against strong baselines provides a solid foundation for the claims made in the paper.
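For readers unfamiliar with the signal metrics, a plain signal-to-distortion ratio can be computed as below; this is a simple reference formula, not the paper's exact evaluation code, which may use scale-invariant or BSS-eval variants (and the CLAP score is a separate embedding-similarity measure).

```python
import torch

def sdr_db(reference: torch.Tensor, estimate: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Signal-to-distortion ratio in dB for (..., samples) waveforms."""
    noise = reference - estimate
    ratio = (reference.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps)
    return 10 * torch.log10(ratio)

ref = torch.randn(1, 16000)
print(sdr_db(ref, ref + 0.1 * torch.randn_like(ref)))
```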
The paper provides detailed implementation information, including training parameters, dataset descriptions, and evaluation metrics. The availability of code and audio samples enhances reproducibility, although the lack of specific institutional affiliations may limit the accessibility of the research for some readers.
One limitation is the potential complexity of the proposed method, which may hinder its application in real-time scenarios or on devices with limited computational resources. Additionally, while the paper addresses the metric dilemma, the subjective nature of semantic quality could lead to variability in user experiences across different contexts.
The advancements in sound separation have significant implications for various applications, including speech recognition, sound event detection, and augmented reality. By improving the semantic fidelity of separated audio, MARS-Sep can enhance user experiences in multimedia applications and contribute to the development of more robust audio processing systems. The main contribution of this paper is the introduction of MARS-Sep, a multimodal-aligned reinforced sound separation framework that significantly improves the semantic quality of separated audio through innovative reinforcement learning techniques. This work represents a meaningful advancement in the field of audio processing, addressing critical challenges in sound separation and offering a robust methodology for future research.
The automated analysis of phonocardiograms is vital for the early diagnosis of cardiovascular disease, yet supervised deep learning is often constrained by the scarcity of expert-annotated data. In this paper, we propose the Self-Supervised Dual-Path Prototypical Network (SS-DPPN), a foundation model for cardiac audio representation and classification from unlabeled data. The framework introduces a dual-path contrastive-learning architecture that simultaneously processes 1D waveforms and 2D spectrograms using a novel hybrid loss. For the downstream task, a metric-learning approach based on a Prototypical Network enhances sensitivity and produces well-calibrated, trustworthy predictions. SS-DPPN achieves state-of-the-art performance on four cardiac audio benchmarks. The framework demonstrates exceptional data efficiency, matching a fully supervised model while using a three-fold reduction in labeled data. Finally, the learned representations generalize successfully across lung sound classification and heart rate estimation. Our experiments and findings validate SS-DPPN as a robust, reliable, and scalable foundation model for physiological signals.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of SS-DPPN, a self-supervised dual-path foundation model that effectively learns cardiac audio representations from unlabeled data, demonstrating state-of-the-art performance and significant data efficiency. The innovative methodology and promising experimental results position this work as a valuable addition to the field of machine learning in medical audio analysis.
The proposed SS-DPPN framework introduces a self-supervised dual-path architecture that effectively processes both 1D waveforms and 2D spectrograms. The combination of contrastive learning and a hybrid loss function is innovative, particularly in the context of medical audio analysis where labeled data is scarce. The use of a Prototypical Network for metric learning enhances the model's ability to generalize across different tasks, which is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the hybrid loss function and how it specifically contributes to the model's performance.
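The Prototypical-Network step for the downstream task can be sketched as follows: class prototypes are mean support embeddings, and queries are scored by negative squared Euclidean distance to each prototype. Episode sizes and the distance choice are assumptions for illustration.

```python
import torch

def prototypical_logits(support_emb, support_labels, query_emb, n_classes: int):
    """Prototypes are per-class means of support embeddings; query logits are
    negative squared Euclidean distances to each prototype."""
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0)
                          for c in range(n_classes)])           # (C, dim)
    dists = torch.cdist(query_emb, protos, p=2) ** 2             # (Q, C)
    return -dists                                                # softmax-ready logits

# toy episode: 2 classes, 5 support embeddings each, 3 query embeddings of dim 128
emb = torch.randn(10, 128)
labels = torch.tensor([0] * 5 + [1] * 5)
logits = prototypical_logits(emb, labels, torch.randn(3, 128), n_classes=2)
pred = logits.argmax(dim=1)
```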
The experiments are robust, demonstrating state-of-the-art performance across four cardiac audio benchmarks. The paper provides detailed metrics, including validation accuracy and confusion matrices, which offer insights into the model's performance on imbalanced datasets. The inclusion of AUROC and AUPRC curves is commendable, as these metrics are crucial for evaluating performance in medical applications. However, the paper lacks a comparative analysis with other existing models, which would strengthen the claims of superiority.
The paper does not provide sufficient details regarding the implementation of the SS-DPPN model, such as hyperparameters, training procedures, or data preprocessing steps. This lack of transparency may hinder reproducibility. Including a supplementary material section with code or detailed methodologies would significantly enhance the reproducibility of the results.
One identified limitation is the potential overfitting to the specific datasets used for training and validation, particularly given the high performance metrics reported. Additionally, the model's performance on noisy or real-world data is not thoroughly evaluated, which is critical for practical applications in clinical settings. The reliance on synthetic data for some experiments may not fully represent real-world scenarios.
The SS-DPPN model has significant implications for the field of cardiovascular diagnostics, particularly in improving access to automated analysis of phonocardiograms. By reducing the need for labeled data, this approach could enable broader deployment of machine learning solutions in healthcare, potentially leading to earlier diagnosis and better patient outcomes. The generalizability of the model to other physiological signals also opens avenues for further research and application in related fields. The main contribution of this paper is the introduction of SS-DPPN, a self-supervised dual-path foundation model that effectively learns cardiac audio representations from unlabeled data, demonstrating state-of-the-art performance and significant data efficiency. The innovative methodology and promising experimental results position this work as a valuable addition to the field of machine learning in medical audio analysis.