Accurate modeling of spatial acoustics is critical for immersive and intelligible audio in confined, resonant environments such as car cabins. Current tuning methods are manual, hardware-intensive, and static, failing to account for frequency-selective behaviors and dynamic changes like passenger presence or seat adjustments. To address this issue, we propose INFER: Implicit Neural Frequency Response fields, a frequency-domain neural framework that is jointly conditioned on source and receiver positions and orientations to directly learn complex-valued frequency response fields inside confined, resonant environments like car cabins. We introduce three key innovations over current neural acoustic modeling methods: (1) a novel end-to-end frequency-domain forward model that directly learns the frequency response field and frequency-specific attenuation in 3D space; (2) perceptual and hardware-aware spectral supervision that emphasizes critical auditory frequency bands and deemphasizes unstable crossover regions; and (3) a physics-based Kramers-Kronig consistency constraint that regularizes frequency-dependent attenuation and delay. We evaluate our method on real-world data collected in multiple car cabins. Our approach significantly outperforms time- and hybrid-domain baselines on both simulated and real-world automotive datasets, cutting average magnitude and phase reconstruction errors by over 39% and 51%, respectively. INFER sets a new state of the art for neural acoustic modeling in automotive spaces.
Primary: University of Maryland
All Institutions: University of Maryland, Dolby Laboratories
The main contribution of this paper is the introduction of INFER, a novel frequency-domain neural framework for modeling complex acoustic environments in confined spaces, which significantly advances the state-of-the-art in neural acoustic modeling. The comprehensive analysis of the technical contributions, innovative methodology, and substantial experimental validation underscores its significance to the field of machine learning and audio processing.
The proposed INFER framework introduces a novel end-to-end frequency-domain neural model that learns complex-valued frequency response fields, addressing the limitations of existing acoustic modeling methods. The methodology is well-grounded in physical principles, utilizing Kramers-Kronig relations to ensure causality and consistency between amplitude and phase responses. The incorporation of perceptual and hardware-aware spectral supervision is a significant advancement, allowing the model to prioritize critical auditory frequency bands while downweighting less stable regions. The approach's reliance on implicit neural representations (INRs) to model acoustic fields in confined spaces is innovative, particularly in its ability to capture frequency-dependent behaviors and dynamic changes.
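A minimal sketch of how a Kramers-Kronig style consistency term could be computed, assuming the penalty compares the predicted unwrapped phase against the minimum-phase response implied by the predicted log-magnitude; the function and variable names are illustrative, not taken from the paper.

```python
# Illustrative Kramers-Kronig style consistency penalty (not the paper's exact loss).
# Assumes the frequency response is sampled on a uniform grid and that the
# minimum-phase component can be approximated with a Hilbert transform.
import numpy as np
from scipy.signal import hilbert

def kk_consistency_penalty(pred_log_mag: np.ndarray, pred_phase: np.ndarray) -> float:
    # Minimum phase implied by the magnitude: phi_min(w) ~ -Hilbert{ ln|H(w)| }
    phase_min = -np.imag(hilbert(pred_log_mag))
    # Penalize only the mismatch between the predicted phase and the implied
    # minimum phase; a separate pure-delay term could absorb any excess phase.
    return float(np.mean((np.unwrap(pred_phase) - phase_min) ** 2))
```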
The experiments are robust, involving both simulated and real-world datasets collected from various car cabins. The evaluation metrics are comprehensive, covering both magnitude and phase reconstruction errors, and the results demonstrate significant improvements over state-of-the-art methods. The paper provides clear quantitative results, showing reductions in reconstruction errors by over 39% and 51%, respectively. Qualitative assessments further validate the model's performance, showcasing its ability to accurately reproduce complex acoustic phenomena in confined spaces.
The paper includes detailed implementation information, including model architecture, training procedures, and data collection methods. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Clear guidelines for replicating the experiments would enhance the paper's impact.
While the proposed method shows promise, it may face challenges in generalizing to highly variable acoustic environments beyond car cabins. The reliance on specific hardware configurations for data collection might also limit the applicability of the findings. Additionally, the complexity of the model may pose challenges in real-time applications, which are critical for automotive audio systems.
The INFER framework has significant implications for the automotive audio industry, potentially enhancing the quality of in-vehicle audio experiences. Its applications extend to adaptive noise cancellation, spatial audio rendering, and personalized audio experiences, which are increasingly relevant in modern vehicles. The methodology could also inspire further research in acoustic modeling for other confined environments, such as theaters or small auditoriums.
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing AQA metrics such as BLEU, METEOR, and BERTScore, mostly adapted from NLP and audio captioning, rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of AQEval and the AURA score, which together provide a comprehensive framework for evaluating open-ended responses in audio question answering. This work addresses critical shortcomings in existing evaluation metrics and sets a new standard for future research in the field.
The paper introduces AQEval, a novel benchmark for Audio Question Answering (AQA) metrics, which is a significant advancement in evaluating open-ended responses in audio contexts. The methodology employs a combination of human annotations and a new metric, AURA, which integrates reasoning capabilities of large language models (LLMs) with an audio entailment component. This dual approach allows for a more nuanced evaluation of responses, addressing the limitations of existing metrics that primarily focus on surface-level similarity.
The experimental setup is robust, utilizing a dataset of 10k annotated responses that allows for systematic benchmarking of AQA metrics. The authors provide a comprehensive analysis of existing metrics, demonstrating their weak correlation with human judgments, particularly for longer answers. AURA is shown to outperform traditional metrics significantly, achieving state-of-the-art correlation with human ratings. The ablation studies further validate the effectiveness of the proposed methodology.
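A small sketch of the kind of metric-to-human correlation analysis described above, assuming per-response metric scores and human ratings are available as parallel sequences; the function name and data layout are assumptions.

```python
# Illustrative benchmarking of an AQA metric against human judgments,
# in the spirit of the AQEval analysis; data loading is assumed to happen elsewhere.
from scipy.stats import spearmanr, pearsonr

def correlate_with_humans(metric_scores, human_ratings):
    """Both inputs are equal-length sequences of per-response scores."""
    rho, rho_p = spearmanr(metric_scores, human_ratings)
    r, r_p = pearsonr(metric_scores, human_ratings)
    return {"spearman": rho, "spearman_p": rho_p,
            "pearson": r, "pearson_p": r_p}
```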
The paper includes detailed descriptions of the dataset construction, annotation process, and experimental setup, which enhances reproducibility. However, the reliance on specific LLMs for scoring may limit the generalizability of the results to other models or contexts.
While the paper addresses significant gaps in AQA evaluation, it does not explore the potential biases in human annotations or the limitations of the LLMs used. Additionally, the performance of AURA in real-world applications remains to be fully validated.
The introduction of AQEval and AURA has the potential to significantly influence future research in audio-language models and their evaluation. By providing a more accurate assessment of model responses, this work can lead to improvements in the development of ALMs and their applications in various domains, including accessibility, education, and content creation.
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LM training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Tsinghua University
The main contribution of this paper is the introduction of the Siren framework, which effectively bridges the gap between language models and diffusion models in text-to-audio generation. The comprehensive analysis of the methodology and experimental results highlights its potential to reshape the landscape of audio synthesis, making it a notable advancement in the field.
The proposed methodology introduces a novel framework, Siren, which utilizes multiple isolated transformers with causal conditioning and anti-causal alignment. This approach effectively addresses the limitations of existing RVQ tokenizers in T2A generation by mitigating gradient conflicts and enhancing audio reconstruction fidelity. The use of reinforcement learning for alignment is innovative, although the complexity of the architecture may pose challenges for implementation and scalability.
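A toy sketch of the "multiple isolated transformers with causal conditioning" idea, assuming one small transformer per RVQ layer whose conditioning accumulates the embedded tokens of all shallower layers; the module, shapes, and hyperparameters are assumptions for illustration, and the reinforcement-learning alignment is not shown.

```python
# Toy sketch of per-RVQ-layer transformers with shallow-to-deep (causal) conditioning.
import torch
import torch.nn as nn

class LayerwiseRVQPredictor(nn.Module):
    def __init__(self, num_rvq_layers: int = 4, vocab: int = 1024, d: int = 256):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(vocab, d) for _ in range(num_rvq_layers)])
        self.blocks = nn.ModuleList([
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
                num_layers=2,
            )
            for _ in range(num_rvq_layers)
        ])
        self.heads = nn.ModuleList([nn.Linear(d, vocab) for _ in range(num_rvq_layers)])

    def forward(self, cond: torch.Tensor, tokens: torch.Tensor):
        # cond:   (B, T, d) per-frame text/semantic conditioning, assumed length-aligned
        # tokens: (B, T, num_rvq_layers) ground-truth codes, used teacher-forced
        logits = []
        for i, (block, head) in enumerate(zip(self.blocks, self.heads)):
            logits.append(head(block(cond)))
            # Deeper layers additionally see the shallower layer's (embedded) tokens.
            cond = cond + self.embeds[i](tokens[..., i])
        return logits  # one (B, T, vocab) tensor per RVQ layer
```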
The experiments are extensive and demonstrate that Siren outperforms both existing LM-based and diffusion-based systems, achieving state-of-the-art results. However, the paper mentions the use of a curated dataset smaller than those in prior work, which raises questions about the generalizability of the results. The evaluation metrics, particularly in terms of fidelity, are well-defined, but further comparisons with a broader range of benchmarks would strengthen the findings.
The paper provides a GitHub repository link for the implementation, which is crucial for reproducibility. However, details on the training process, hyperparameters, and specific datasets used are somewhat limited, which could hinder replication efforts by other researchers.
The authors acknowledge several limitations, including training efficiency due to the sequential training of transformer modules, the trade-off between model size and semantic richness, and the need for larger, more diverse datasets. Addressing these limitations in future work will be essential for advancing the field.
The work has significant implications for multi-modal generation frameworks, potentially enabling more cohesive integration of audio and text. By repositioning LMs as competitive in T2A tasks, it opens pathways for applications in content creation, gaming, and accessibility technologies.
Artificial Intelligence (AI) for music generation is undergoing rapid developments, with recent symbolic models leveraging sophisticated deep learning and diffusion model algorithms. One drawback with existing models is that they lack structural cohesion, particularly on harmonic-melodic structure. Furthermore, such existing models are largely "black-box" in nature and are not musically interpretable. This paper addresses these limitations via a novel generative music framework that incorporates concepts of Schenkerian analysis (SchA) in concert with a diffusion modeling framework. This framework, which we call ProGress (Prolongation-enhanced DiGress), adapts state-of-the-art deep models for discrete diffusion (in particular, the DiGress model of Vignac et al., 2023) for interpretable and structured music generation. Concretely, our contributions include 1) novel adaptations of the DiGress model for music generation, 2) a novel SchA-inspired phrase fusion methodology, and 3) a framework allowing users to control various aspects of the generation process to create coherent musical compositions. Results from human experiments suggest superior performance to existing state-of-the-art methods.
Primary: Duke University
All Institutions: Duke University
The main contribution of this paper is the introduction of ProGress, a novel framework for structured music generation that combines graph diffusion modeling with Schenkerian analysis, significantly enhancing the interpretability and coherence of AI-generated music. This work represents a meaningful advancement in the intersection of machine learning and music theory, addressing critical limitations of existing generative models.
The proposed methodology, ProGress, integrates Schenkerian analysis with a discrete graph diffusion model to enhance music generation. The use of a structured approach to music composition, grounded in music theory, is a significant advancement over traditional black-box models. The methodology is well-detailed, providing a clear workflow from phrase generation through to fusion based on harmonic principles. However, the reliance on specific theoretical frameworks may limit its applicability across diverse musical genres.
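A toy illustration of the phrase-fusion control point described above, assuming fusion joins two generated phrases only when the harmonies at the seam agree; this is not the paper's Schenkerian procedure, only a sketch of the kind of constraint it exposes to the user.

```python
# Toy phrase fusion: join two phrases when the harmony at the seam matches.
# Phrases are represented as lists of (chord_label, melody_note) pairs;
# this representation is an assumption for illustration only.
def fuse_phrases(phrase_a, phrase_b):
    if not phrase_a or not phrase_b:
        raise ValueError("both phrases must be non-empty")
    last_chord_a = phrase_a[-1][0]
    first_chord_b = phrase_b[0][0]
    if last_chord_a != first_chord_b:
        raise ValueError(f"cannot fuse: seam harmonies {last_chord_a} and {first_chord_b} differ")
    # Drop the duplicated boundary chord so the seam is heard only once.
    return phrase_a + phrase_b[1:]

# Example: fuse a phrase ending on the dominant with one beginning on the dominant.
fused = fuse_phrases([("I", "C4"), ("V", "G4")], [("V", "B4"), ("I", "C5")])
```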
The experiments are robust, including human evaluations and ablation studies that validate the effectiveness of the proposed model. The results indicate that ProGress outperforms existing models in terms of musical coherence and user enjoyment, which is a strong indicator of its practical utility. The sample size of 45 participants, while reasonable, may benefit from a larger cohort for more generalizable results.
The paper provides sufficient implementation details, including model architecture and training parameters, which enhances reproducibility. The availability of code on GitHub is a positive aspect, allowing other researchers to replicate and build upon the work. However, the paper could benefit from clearer documentation on the dataset used and the specific preprocessing steps taken.
One limitation is the potential overfitting to the specific structures derived from Schenkerian analysis, which might not generalize well to all musical styles. Additionally, the subjective nature of musical enjoyment can introduce variability in human evaluations, which may not fully capture the model's capabilities across different listener demographics.
The integration of music theory with AI has the potential to transform music composition, making it more accessible to non-experts while also providing tools for professional composers. The implications extend to various applications in music technology, including automated composition tools for film, gaming, and educational purposes.
In real-world scenarios, speech signals are inevitably corrupted by various types of interference, making speech enhancement (SE) a critical task for robust speech processing. However, most existing SE methods only handle a limited range of distortions, such as additive noise, reverberation, or band limitation, while the study of SE under multiple simultaneous distortions remains limited. This gap affects the generalization and practical usability of SE methods in real-world environments. To address this gap, this paper proposes a novel Universal Discrete-domain SE model called UDSE. Unlike regression-based SE models that directly predict clean speech waveform or continuous features, UDSE redefines SE as a discrete-domain classification task, instead predicting the clean discrete tokens quantized by the residual vector quantizer (RVQ) of a pre-trained neural speech codec. Specifically, UDSE first extracts global features from the degraded speech. Guided by these global features, the clean token prediction for each VQ follows the rules of RVQ, where the prediction of each VQ relies on the results of the preceding ones. Finally, the predicted clean tokens from all VQs are decoded to reconstruct the clean speech waveform. During training, the UDSE model employs a teacher-forcing strategy and is optimized with cross-entropy loss. Experimental results confirm that the proposed UDSE model can effectively enhance speech degraded by various conventional and unconventional distortions, e.g., additive noise, reverberation, band limitation, clipping, phase distortion, and compression distortion, as well as their combinations. These results demonstrate the superior universality and practicality of UDSE compared to advanced regression-based SE methods.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, National Engineering Research Center of Speech and Language Information Processing, National Institute of Informatics, National University of Defense Technology, Baidu Speech Department
The paper presents a novel approach to speech enhancement that effectively addresses the limitations of existing models by leveraging discrete-domain classification techniques. This advancement holds substantial promise for enhancing the robustness and adaptability of speech processing technologies in real-world applications.
The proposed UDSE model innovatively redefines speech enhancement as a discrete-domain classification task, utilizing a pre-trained neural speech codec and a residual vector quantizer (RVQ). This approach contrasts with traditional regression-based models, allowing for enhanced adaptability to various distortions. The methodology is well-structured, with clear delineation of the global feature extraction, token prediction, and speech decoding processes. The use of teacher-forcing during training is a notable strength, as it mitigates error propagation and enhances learning efficiency.
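A compact sketch of the layer-by-layer token prediction described above, assuming a classifier per VQ layer whose input is the degraded-speech feature plus the (teacher-forced) embeddings of the preceding layers' clean tokens; the module is illustrative, not the paper's architecture.

```python
# Illustrative chain of per-VQ-layer token classifiers trained with teacher forcing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RVQTokenPredictor(nn.Module):
    def __init__(self, num_vq: int = 8, vocab: int = 1024, d: int = 256):
        super().__init__()
        self.embeds = nn.ModuleList([nn.Embedding(vocab, d) for _ in range(num_vq)])
        self.classifiers = nn.ModuleList([nn.Linear(d, vocab) for _ in range(num_vq)])

    def forward(self, degraded_feat: torch.Tensor, clean_tokens: torch.Tensor):
        # degraded_feat: (B, T, d) features extracted from the degraded speech
        # clean_tokens:  (B, T, num_vq) ground-truth clean codec tokens
        total_loss, state = 0.0, degraded_feat
        for i, clf in enumerate(self.classifiers):
            logits = clf(state)                                 # (B, T, vocab)
            total_loss = total_loss + F.cross_entropy(
                logits.transpose(1, 2), clean_tokens[..., i])   # CE over the vocab dimension
            # Teacher forcing: feed the ground-truth tokens of this VQ to the next stage.
            state = state + self.embeds[i](clean_tokens[..., i])
        return total_loss / len(self.classifiers)
```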
The experiments are comprehensive, evaluating UDSE against multiple baseline models across various conventional and unconventional distortion types. The authors provide both objective and subjective metrics, demonstrating UDSE's superior performance in restoring speech quality. The dataset construction is robust, incorporating diverse speech distortions, which strengthens the validity of the results. However, the paper could benefit from a more detailed discussion of the statistical significance of the results across all tasks.
The implementation details are sufficiently described, including the architecture, training strategy, and dataset construction. The authors provide a demo URL and mention the use of a specific neural speech codec, which aids in reproducibility. However, the paper would benefit from sharing code or detailed instructions for replicating the experiments.
While the UDSE model shows promise, it may still struggle with certain types of distortions that were not extensively covered in the experiments. Additionally, the reliance on a pre-trained neural speech codec may limit the model's applicability in scenarios where such codecs are not available or feasible. The subjective evaluation could also be influenced by listener biases, which may not be fully accounted for.
The UDSE model has significant implications for real-world applications in speech processing, including telecommunications, hearing aids, and automatic speech recognition systems. Its ability to handle a wide range of distortions enhances its practicality in diverse environments, potentially leading to improved communication clarity in challenging acoustic settings.
Automatic Speech Recognition (ASR) has undergone a profound transformation over the past decade, driven by advances in deep learning. This survey provides a comprehensive overview of the modern era of ASR, charting its evolution from traditional hybrid systems, such as Gaussian Mixture Model-Hidden Markov Models (GMM-HMMs) and Deep Neural Network-HMMs (DNN-HMMs), to the now-dominant end-to-end neural architectures. We systematically review the foundational end-to-end paradigms: Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), which established the groundwork for fully integrated speech-to-text systems. We then detail the subsequent architectural shift towards Transformer and Conformer models, which leverage self-attention to capture long-range dependencies with high computational efficiency. A central theme of this survey is the parallel revolution in training paradigms. We examine the progression from fully supervised learning, augmented by techniques like SpecAugment, to the rise of self-supervised learning (SSL) with foundation models such as wav2vec 2.0, which drastically reduce the reliance on transcribed data. Furthermore, we analyze the impact of large-scale, weakly supervised models like Whisper, which achieve unprecedented robustness through massive data diversity. The paper also covers essential ecosystem components, including key datasets and benchmarks (e.g., LibriSpeech, Switchboard, CHiME), standard evaluation metrics (e.g., Word Error Rate), and critical considerations for real-world deployment, such as streaming inference, on-device efficiency, and the ethical imperatives of fairness and robustness. We conclude by outlining open challenges and future research directions.
Primary: Delhi Technological University (DTU)
All Institutions: Delhi Technological University (DTU), National University of Bangladesh
This survey paper provides a structured overview of the advancements in Automatic Speech Recognition, detailing the evolution of architectures and training paradigms while addressing critical ethical considerations. Its comprehensive analysis of the current state of ASR makes it a significant contribution to the field, offering valuable insights for researchers and practitioners alike.
The paper provides a comprehensive survey of the evolution of Automatic Speech Recognition (ASR) systems, detailing the transition from traditional hybrid models to modern end-to-end architectures. It systematically reviews key methodologies, including Connectionist Temporal Classification (CTC), attention-based encoder-decoder models, and the Recurrent Neural Network Transducer (RNN-T), culminating in the discussion of Transformer and Conformer models. The authors effectively highlight the significance of self-supervised learning and large-scale weak supervision in reducing reliance on transcribed data, showcasing a clear understanding of the current landscape and its challenges.
While the paper does not present original experimental results, it synthesizes existing literature and benchmarks, providing a valuable overview of performance metrics such as Word Error Rate (WER) and latency. The discussion of datasets like LibriSpeech and Whisper's performance in zero-shot settings adds depth to the evaluation of current ASR systems. However, the lack of original experiments limits the ability to assess the proposed methodologies' effectiveness directly.
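For reference, Word Error Rate is the word-level edit distance between hypothesis and reference divided by the number of reference words; a minimal implementation, not tied to any specific toolkit mentioned in the survey, is shown below.

```python
# Minimal Word Error Rate: Levenshtein distance over words / number of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

# Example: one substitution over four reference words gives WER = 0.25.
assert word_error_rate("the cat sat down", "the cat sat up") == 0.25
```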
The paper mentions several open-source toolkits that facilitate reproducibility in ASR research, such as Kaldi and ESPnet. However, it does not provide specific implementation details or code repositories for the methodologies discussed, which could enhance reproducibility for readers interested in practical applications.
One limitation of the paper is its reliance on existing literature without presenting new experimental findings or novel methodologies. Additionally, while it addresses ethical considerations, the discussion could be expanded to include more specific strategies for mitigating bias and ensuring fairness in ASR systems.
The survey highlights the potential of ASR technology to improve human-computer interaction across various applications, emphasizing the importance of robustness and fairness. The authors call attention to the ethical implications of ASR systems, particularly regarding demographic bias and data privacy, which are critical as ASR becomes more integrated into everyday life.
Text-to-audio (TTA) generation with fine-grained control signals, e.g., precise timing control or intelligible speech content, has been explored in recent works. However, constrained by data scarcity, their generation performance at scale is still compromised. In this study, we recast controllable TTA generation as a multi-task learning problem and introduce a progressive diffusion modeling approach, ControlAudio. Our method adeptly fits distributions conditioned on more fine-grained information, including text, timing, and phoneme features, through a step-by-step strategy. First, we propose a data construction method spanning both annotation and simulation, augmenting condition information in the sequence of text, timing, and phoneme. Second, at the model training stage, we pretrain a diffusion transformer (DiT) on large-scale text-audio pairs, achieving scalable TTA generation, and then incrementally integrate the timing and phoneme features with unified semantic representations, expanding controllability. Finally, at the inference stage, we propose progressively guided generation, which sequentially emphasizes more fine-grained information, aligning inherently with the coarse-to-fine sampling nature of DiT. Extensive experiments show that ControlAudio achieves state-of-the-art performance in terms of temporal accuracy and speech clarity, significantly outperforming existing methods on both objective and subjective evaluations. Demo samples are available at: https://control-audio.github.io/Control-Audio.
Primary: University of Science and Technology of China
All Institutions: University of Science and Technology of China, Tsinghua University, Monash University
The main contribution of this paper is the introduction of ControlAudio, a progressive diffusion modeling approach that significantly enhances text-to-audio generation by integrating fine-grained control signals for timing and intelligibility, thereby setting a new standard for performance in this domain. The comprehensive analysis of the technical contributions, methodology, and significance to the field underscores the potential of this work to advance the state-of-the-art in audio generation.
The methodology presented in ControlAudio is innovative in its approach to tackle the challenges of text-to-audio generation with fine-grained control over timing and intelligibility. The authors effectively recast the problem as a multi-task learning scenario and utilize a progressive diffusion model that integrates various control signals in a structured manner. The data construction method is particularly noteworthy, as it combines both annotation and simulation to create a rich dataset for training, which addresses the data scarcity issue prevalent in previous works. The structured prompt design for encoding text, timing, and phoneme features is a significant advancement that enhances the model's ability to generate coherent and contextually relevant audio outputs.
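A sketch of what a structured prompt of this kind might look like when serialized for the model, assuming the three condition tiers (text, timing, phoneme) are concatenated as tagged text fields; the tag names and layout are assumptions, not the paper's exact format.

```python
# Illustrative serialization of text / timing / phoneme conditions into one prompt string.
def build_structured_prompt(caption, events, phonemes=None):
    """caption: str; events: list of (label, onset_s, offset_s); phonemes: list of str or None."""
    parts = [f"<text> {caption}"]
    if events:
        timing = "; ".join(f"{label} from {on:.2f}s to {off:.2f}s" for label, on, off in events)
        parts.append(f"<timing> {timing}")
    if phonemes:
        parts.append("<phoneme> " + " ".join(phonemes))
    return " ".join(parts)

# Example usage with a caption, one timed event, and a short phoneme sequence.
prompt = build_structured_prompt(
    "a man speaks while a dog barks",
    [("dog bark", 1.50, 2.75)],
    ["HH", "AH", "L", "OW"],
)
```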
The experiments conducted are extensive and robust, demonstrating the effectiveness of ControlAudio across various benchmarks. The paper reports both objective and subjective evaluations, showing significant improvements in temporal accuracy and speech clarity compared to existing methods. The use of multiple datasets for evaluation strengthens the findings, and the ablation studies provide insights into the contributions of different components of the model. However, the lack of a clearly defined baseline for comparison in some cases may limit the interpretability of the results.
The paper provides a detailed description of the model architecture, training procedures, and datasets used, which is essential for reproducibility. However, the absence of a publicly available code repository limits the ability of other researchers to replicate the results fully. The authors mention the use of various datasets but do not provide explicit access to all datasets used, which could hinder reproducibility efforts.
The paper acknowledges several limitations, including the lack of explicit mechanisms to manipulate stylistic attributes such as emotion and prosody. Additionally, the model's performance is constrained by the availability of high-quality, richly annotated datasets, which are still scarce. The potential trade-off between generating high-quality general audio versus intelligible speech is another concern that may affect the model's versatility in complex scenarios.
The advancements made in controllable TTA generation have significant implications for various applications, including film production, gaming, and virtual reality, where high-quality audio generation is crucial. However, the potential for misuse in creating deceptive content or voice impersonations raises ethical concerns that need to be addressed through robust detection methods and responsible AI governance. The work highlights the importance of developing technologies that balance innovation with ethical considerations.
Blind speech separation (BSS) aims to recover multiple speech sources from multi-channel, multi-speaker mixtures under unknown array geometry and room impulse responses. In the unsupervised setup, where clean target speech is not available for model training, UNSSOR proposes a mixture consistency (MC) loss for training deep neural networks (DNNs) on over-determined training mixtures to realize unsupervised speech separation. However, when the number of microphones of the training mixtures decreases, the MC constraint weakens and the separation performance falls dramatically. To address this, we propose VM-UNSSOR, augmenting the observed training mixture signals recorded by a limited number of microphones with several higher-SNR virtual-microphone (VM) signals, which are obtained by applying linear spatial demixers (such as IVA and spatial clustering) to the observed training mixtures. As linear projections of the observed mixtures, the virtual-microphone signals can typically increase the SNR of each source and can be leveraged to compute extra MC losses to improve UNSSOR and address its frequency permutation problem. On the SMS-WSJ dataset, in the over-determined six-microphone, two-speaker separation setup, VM-UNSSOR reaches 17.1 dB SI-SDR, while UNSSOR only obtains 14.7 dB; and in the determined two-microphone, two-speaker case, UNSSOR collapses to -2.7 dB SI-SDR, while VM-UNSSOR achieves 10.7 dB.
Primary: Southern University of Science and Technology
All Institutions: Southern University of Science and Technology
The main contribution of this paper is the introduction of VM-UNSSOR, an innovative unsupervised speech separation algorithm that utilizes higher-SNR virtual microphones to enhance separation performance. This work significantly advances the field of audio signal processing by addressing key challenges in unsupervised learning and demonstrating effective solutions through rigorous experimentation.
The proposed VM-UNSSOR method introduces a novel approach to unsupervised speech separation by leveraging virtual microphones derived from linear spatial demixers. This method enhances the mixture consistency loss (MC loss) by augmenting the training data with higher-SNR virtual signals, which is a significant improvement over the original UNSSOR framework. The methodology is well-structured, clearly explaining the process of creating virtual microphones and how they contribute to the training of the deep neural networks. The re-weighting of the MC loss to balance contributions from physical and virtual microphones is a thoughtful addition that addresses potential biases in the training process.
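A heavily simplified sketch of the two ingredients highlighted above: virtual microphones obtained as linear projections of the observed mixtures, and a mixture-consistency style term computed per channel and re-weighted between physical and virtual microphones. The demixing matrices and weights are assumptions for illustration; the actual UNSSOR loss involves estimated relative transfer functions and is more involved.

```python
# Simplified sketch: virtual-microphone augmentation and a re-weighted
# mixture-consistency (MC) style loss in the STFT domain.
import numpy as np

def add_virtual_mics(mix_stft: np.ndarray, demix: np.ndarray) -> np.ndarray:
    # mix_stft: (M, F, T) observed microphones; demix: (F, V, M) per-frequency
    # linear demixer (e.g., from IVA or spatial clustering on the training mixtures).
    virtual = np.einsum("fvm,mft->vft", demix, mix_stft)   # (V, F, T)
    return np.concatenate([mix_stft, virtual], axis=0)     # (M + V, F, T)

def weighted_mc_loss(mic_stft: np.ndarray, est_imgs: np.ndarray, weights: np.ndarray) -> float:
    # mic_stft: (M+V, F, T); est_imgs: (S, M+V, F, T) per-source images at each channel;
    # weights: (M+V,) balancing physical vs. virtual channels.
    residual = mic_stft - est_imgs.sum(axis=0)              # each channel should be explained
    per_channel = np.mean(np.abs(residual) ** 2, axis=(1, 2))
    return float(np.sum(weights * per_channel))
```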
The experiments conducted on the SMS-WSJ dataset provide strong empirical support for the proposed method. The results demonstrate significant improvements in SI-SDR scores, particularly in challenging scenarios with fewer microphones. The paper effectively compares VM-UNSSOR against various baselines, including the original UNSSOR, and highlights the advantages of using virtual microphones. The use of both over-determined and determined setups showcases the versatility of the proposed approach.
The paper provides sufficient details regarding the experimental setup, including the datasets used, training configurations, and evaluation metrics. However, the absence of a publicly available implementation or code repository limits reproducibility. Including a link to a project page or code repository would enhance the ability of other researchers to replicate the findings.
One limitation of the study is the reliance on linear spatial demixers, which may not always perform optimally in all acoustic environments. The paper also does not address the potential computational overhead introduced by the additional virtual microphones, which could be a concern in real-time applications. Furthermore, the performance gains are primarily demonstrated on a specific dataset, and further validation on diverse datasets would strengthen the claims.
The VM-UNSSOR method has significant implications for real-world applications such as smart speakers, hearing aids, and other audio processing systems where robust speech separation is crucial. By enabling effective unsupervised learning without the need for labeled data or additional hardware, this approach can facilitate advancements in various speech processing technologies, making them more accessible and adaptable to diverse environments.
Voice Conversion (VC) aims to modify a speaker's timbre while preserving linguistic content. While recent VC models achieve strong performance, most struggle in real-time streaming scenarios due to high latency, dependence on ASR modules, or complex speaker disentanglement, which often results in timbre leakage or degraded naturalness. We present SynthVC, a streaming end-to-end VC framework that directly learns speaker timbre transformation from synthetic parallel data generated by a pre-trained zero-shot VC model. This design eliminates the need for explicit content-speaker separation or recognition modules. Built upon a neural audio codec architecture, SynthVC supports low-latency streaming inference with high output fidelity. Experimental results show that SynthVC outperforms baseline streaming VC systems in both naturalness and speaker similarity, achieving an end-to-end latency of just 77.1 ms.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University
SynthVC represents a notable advancement in the field of voice conversion by effectively addressing the challenges of real-time processing and speaker timbre transformation through innovative use of synthetic data and neural audio codecs. The comprehensive methodology and robust experimental validation highlight its potential impact on both academic research and practical applications in audio processing.
The methodology presented in SynthVC is innovative in its approach to voice conversion by leveraging synthetic data generated from a pre-trained zero-shot VC model. This circumvents the need for traditional ASR models and disentanglement strategies, which are often prone to latency issues and timbre leakage. The architecture is built on a neural audio codec, allowing for low-latency streaming while maintaining high fidelity. The introduction of a dedicated speaker transformation module in the latent space is a significant improvement over previous methods, enhancing the model's ability to capture speaker-specific characteristics without compromising audio quality.
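A sketch of the synthetic-parallel-data idea described above, assuming a pretrained zero-shot VC model is available as a callable that converts a source waveform to a target speaker's timbre; the function names are placeholders, not SynthVC's API.

```python
# Illustrative construction of synthetic parallel training pairs for a streaming
# VC student, using a pretrained zero-shot VC model as the teacher.
def build_parallel_pairs(source_utterances, target_reference, zero_shot_vc):
    """source_utterances: iterable of waveforms; target_reference: a waveform of the
    target speaker; zero_shot_vc(src, ref) -> converted waveform in the target timbre."""
    pairs = []
    for src in source_utterances:
        converted = zero_shot_vc(src, target_reference)  # teacher output
        pairs.append((src, converted))                   # (input, training target) for the student
    return pairs
```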
The experimental evaluation is thorough, utilizing both subjective and objective metrics to assess the performance of SynthVC against established baselines. The results demonstrate that SynthVC outperforms other models in terms of naturalness and speaker similarity, achieving a competitive end-to-end latency of 77.1 ms. The use of a diverse dataset and the implementation of a two-stage training strategy further bolster the reliability of the findings. However, the paper could benefit from a more extensive discussion on the statistical significance of the results.
The paper provides sufficient details about the training configurations, datasets, and evaluation metrics, which are crucial for reproducibility. The use of open-source models like Seed-VC as a data generator is a positive aspect, as it allows other researchers to replicate the synthetic data generation process. However, the specific hyperparameters and training settings could be more explicitly detailed to facilitate exact replication.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world voice conversion scenarios. Additionally, while the model shows promise in terms of latency and quality, the subjective evaluation scores, particularly for the smaller models, suggest that there may still be trade-offs in performance that need to be addressed. The paper does not explore the potential impact of different languages or dialects on the model's performance, which could be an important consideration for broader applications.
The implications of SynthVC are significant, particularly in real-time applications such as live broadcasting, video conferencing, and interactive voice response systems. The ability to convert voices with low latency while maintaining high fidelity opens up new possibilities in entertainment, accessibility, and privacy. Moreover, the approach could inspire further research into the use of synthetic data in other areas of machine learning, potentially leading to advancements in various domains.
Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
Primary: China Mobile Communications Corporation
All Institutions: China Mobile Communications Corporation
The paper presents DiTSinger, a novel approach to Singing Voice Synthesis that effectively scales model and data while improving alignment robustness, marking a significant contribution to the field of audio synthesis. The innovative methodology and strong experimental validation position this work as a valuable resource for future research and applications in music technology.
The paper introduces a two-stage data construction pipeline that effectively addresses the challenges of data scarcity and model scalability in Singing Voice Synthesis (SVS). By leveraging a compact seed set of human-sung recordings paired with LLM-generated lyrics, the authors create a large-scale dataset that enhances phonetic coverage and melodic alignment. The proposed Diffusion Transformer (DiTSinger) incorporates novel architectural elements like rotary positional encoding (RoPE) and qk-norm, which are systematically scaled for improved fidelity. Additionally, the implicit alignment mechanism is a significant innovation, allowing the model to operate without phoneme-level duration labels, thus enhancing robustness against timing variability. This methodology is well-structured and demonstrates a clear understanding of the challenges in the field.
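A small sketch of the implicit-alignment idea of constraining phoneme-to-acoustic attention to character-level spans, assuming each acoustic frame has a known character index and each phoneme has a known parent character; the mask construction is illustrative rather than the paper's implementation.

```python
# Illustrative attention mask: each acoustic frame may only attend to phonemes
# belonging to its own character-level span.
import torch

def span_attention_mask(frame_to_char: torch.Tensor, phone_to_char: torch.Tensor) -> torch.Tensor:
    # frame_to_char: (T_frames,) character index of each acoustic frame
    # phone_to_char: (N_phones,) character index of each phoneme
    # Returns a boolean mask of shape (T_frames, N_phones); True = attention allowed.
    return frame_to_char.unsqueeze(1) == phone_to_char.unsqueeze(0)

# Example: 4 frames covering characters [0, 0, 1, 1] and 3 phonemes from characters [0, 1, 1].
mask = span_attention_mask(torch.tensor([0, 0, 1, 1]), torch.tensor([0, 1, 1]))
```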
The experiments are extensive and well-designed, utilizing a dataset of over 500 hours of singing data from professional vocalists. The evaluation metrics, including MCD, FFE, and F0RMSE, are appropriate for assessing the quality of the synthesized singing. The comparisons with state-of-the-art methods, such as DiffSinger and StyleSinger, show that DiTSinger achieves superior performance, particularly in subjective measures like MOS. However, the paper could benefit from more detailed ablation studies to further validate the contributions of individual components.
The paper provides sufficient implementation details, including training configurations, dataset sizes, and evaluation protocols, which are crucial for reproducibility. However, the lack of a publicly accessible code repository or demo URL limits the practical reproducibility of the results. Future work should consider releasing the model and code to facilitate community engagement and validation.
The primary limitation of this work is its focus on Chinese singing data, which may restrict the generalizability of the findings to other languages or singing styles. Additionally, the model does not account for various singing techniques, which could impact the quality of synthesized voices in diverse musical contexts. The authors acknowledge these limitations and suggest future work to expand the dataset and incorporate additional conditions.
The advancements in SVS presented in this paper have significant implications for the music industry, particularly in areas such as music production, entertainment, and education. The ability to generate high-fidelity singing voices from text opens new avenues for creative expression and accessibility in music creation. Furthermore, the methodologies developed could be adapted for other audio synthesis tasks, broadening the impact of this research beyond singing voice synthesis.
We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision, without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder and contextually grounded in the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., "a distant thunder") that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
Primary: University of Catania
All Institutions: University of Catania
The main contribution of this paper is the introduction of a novel framework for audio-to-image generation that effectively aligns audio, language, and vision without requiring paired data. This work represents a significant advancement in the field of multimodal machine learning, showcasing a unique approach that has the potential to influence future research and applications in audio-visual generation.
The methodology presented in SeeingSounds is innovative, leveraging a dual alignment approach that integrates audio, language, and vision without the need for paired audio-visual data. The use of frozen language encoders and vision-language models to project audio into a semantic space is particularly noteworthy, as it reflects a sophisticated understanding of cross-modal associations. The lightweight adapters for fine-tuning the model enhance its scalability and efficiency, making it suitable for practical applications.
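A sketch of the procedural prompt-generation idea, assuming simple audio transformations are mapped to descriptive text modifiers appended to a base caption; the specific thresholds and phrasing are assumptions, echoing the paper's "a distant thunder" example.

```python
# Illustrative mapping from audio transformations to descriptive prompt modifiers.
def prompt_from_audio_edit(base_caption: str, volume_gain_db: float = 0.0,
                           pitch_shift_semitones: float = 0.0) -> str:
    modifiers = []
    if volume_gain_db <= -6.0:
        modifiers.append("heard from far away")
    elif volume_gain_db >= 6.0:
        modifiers.append("very close and loud")
    if pitch_shift_semitones <= -3.0:
        modifiers.append("deep and rumbling")
    elif pitch_shift_semitones >= 3.0:
        modifiers.append("high-pitched")
    return base_caption if not modifiers else base_caption + ", " + ", ".join(modifiers)

# Example: attenuating a thunder clip yields a prompt like "thunder, heard from far away".
prompt = prompt_from_audio_edit("thunder", volume_gain_db=-9.0)
```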
The experiments conducted across standard benchmarks are extensive and demonstrate the effectiveness of the proposed method. The authors provide a thorough comparison with existing methods, showing that SeeingSounds achieves state-of-the-art performance in both zero-shot and supervised settings. The results are well-documented, and the metrics used for evaluation are appropriate for the task.
While the paper outlines the methodology and presents results, it lacks sufficient details regarding the implementation, such as specific hyperparameters, training protocols, and dataset descriptions. This could hinder reproducibility, as other researchers may struggle to replicate the results without access to the code or detailed guidelines.
One limitation of the study is the reliance on frozen models, which may restrict the adaptability of the framework to new audio-visual tasks that require more dynamic learning. Additionally, the paper does not address potential biases in the datasets used for evaluation, which could impact the generalizability of the findings.
The potential applications of SeeingSounds are significant, particularly in fields such as multimedia content creation, accessibility for the hearing impaired, and interactive media. The ability to generate visual content from audio inputs could revolutionize how we interact with multimedia, making it more inclusive and engaging.
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
Primary: University of California
All Institutions: University of California, Liaoning University, The University of Queensland, vivo Mobile Communication Co
This paper meaningfully contributes to machine learning research by defining and addressing Insertion Hallucination in Video-to-Audio generation, proposing a novel evaluation framework and mitigation strategy that enhances the fidelity of generated audio in multimedia contexts.
The paper introduces a novel concept of Insertion Hallucination (IH) in Video-to-Audio generation, which is a significant contribution to the field. The methodology includes a systematic evaluation framework that utilizes a majority-voting ensemble of audio event detectors, along with the introduction of two new metrics (IH@vid and IH@dur) to quantify hallucinations. The proposed Posterior Feature Correction (PFC) method is innovative and effectively addresses the identified problem without requiring model retraining, showcasing a practical approach to mitigating hallucinations during inference.
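A small sketch of how the two proposed prevalence and severity metrics could be computed from per-video lists of detected hallucinated segments, following their definitions in the abstract (fraction of videos containing any hallucination, and fraction of total duration that is hallucinated); the data layout is an assumption.

```python
# Illustrative computation of IH@vid and IH@dur from detected hallucinated segments.
def ih_metrics(videos):
    """videos: list of dicts like
    {"duration": 10.0, "hallucinated_segments": [(1.0, 2.5), (7.0, 7.8)]}."""
    num_with_ih = sum(1 for v in videos if v["hallucinated_segments"])
    total_dur = sum(v["duration"] for v in videos)
    halluc_dur = sum(end - start
                     for v in videos
                     for start, end in v["hallucinated_segments"])
    ih_at_vid = num_with_ih / max(len(videos), 1)
    ih_at_dur = halluc_dur / max(total_dur, 1e-9)
    return ih_at_vid, ih_at_dur
```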
The experiments are comprehensive, validating the IH detection pipeline against human annotations and assessing the effectiveness of the PFC method across multiple benchmarks. The results demonstrate a significant reduction in hallucination prevalence and duration, while maintaining or improving conventional audio quality metrics. The use of diverse datasets strengthens the findings, although the paper could benefit from more extensive ablation studies and comparisons with a wider range of existing methods.
The paper provides detailed descriptions of datasets, models, and evaluation metrics, which enhances reproducibility. However, the lack of a publicly available code repository limits the ability for others to fully replicate the experiments. The authors mention that they will release code and configuration files, which is a positive step towards reproducibility.
One limitation is the potential subjectivity in human annotations for hallucination detection, which may introduce variability in the evaluation. Additionally, while PFC shows promise, its effectiveness may vary across different types of videos and audio events, suggesting that further research is needed to generalize the approach. The paper could also expand on the implications of hallucinations in practical applications of V2A systems.
The work has significant implications for improving the reliability of automatic sound generation systems, which are increasingly used in various multimedia applications. By addressing a critical failure mode, the research paves the way for more trustworthy and immersive audio-visual experiences, potentially influencing future developments in the field.
Zero-shot voice conversion (VC) aims to transfer timbre from a source speaker to any unseen target speaker while preserving linguistic content. Growing application scenarios demand models with streaming inference capabilities. This has created a pressing need for models that are simultaneously fast, lightweight, and high-fidelity. However, existing streaming methods typically rely on either autoregressive (AR) or non-autoregressive (NAR) frameworks, which either require large parameter sizes to achieve strong performance or struggle to generalize to unseen speakers. In this study, we propose MeanVC, a lightweight and streaming zero-shot VC approach. MeanVC introduces a diffusion transformer with a chunk-wise autoregressive denoising strategy, combining the strengths of both AR and NAR paradigms for efficient streaming processing. By introducing mean flows, MeanVC regresses the average velocity field during training, enabling zero-shot VC with superior speech quality and speaker similarity in a single sampling step by directly mapping from the start to the endpoint of the flow trajectory. Additionally, we incorporate diffusion adversarial post-training to mitigate over-smoothing and further enhance speech quality. Experimental results demonstrate that MeanVC significantly outperforms existing zero-shot streaming VC systems, achieving superior conversion quality with higher efficiency and significantly fewer parameters. Audio demos and code are publicly available at https://aslp-lab.github.io/MeanVC.
Primary: Geely Automobile Research Institute (Ningbo) Company Ltd
All Institutions: Geely Automobile Research Institute (Ningbo) Company Ltd, School of Computer Science
The main contribution of this paper is the introduction of MeanVC, a lightweight and efficient framework for streaming zero-shot voice conversion that significantly improves audio quality and computational efficiency. The work is a meaningful addition to the field, addressing critical challenges in voice conversion technology while paving the way for practical applications in real-time scenarios.
The proposed MeanVC framework effectively combines autoregressive and non-autoregressive paradigms through a chunk-wise autoregressive denoising strategy and mean flows for efficient spectrogram synthesis. The introduction of diffusion adversarial post-training is a notable enhancement aimed at addressing over-smoothing artifacts, which is a common issue in generative models. The methodology is well-structured, leveraging existing architectures while innovatively addressing their limitations, particularly in terms of efficiency and quality in zero-shot voice conversion.
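To make the single-step idea concrete, the sketch below shows one-step sampling with a network trained to regress the average velocity over an interval; the `mean_velocity` interface, tensor shapes, and conditioning scheme are assumptions for illustration, not MeanVC's actual API.

```python
# Hypothetical one-step mean-flow sampling sketch. Standard flow matching
# integrates dx/dt = v(x, t) over many steps; if the network instead regresses
# the *average* velocity over [t0, t1], the endpoint is reached in one update.
import torch

def one_step_mean_flow_sample(mean_velocity, cond, shape, device="cpu"):
    """shape: (batch, mel_bins, frames) for a spectrogram chunk (assumed)."""
    x0 = torch.randn(shape, device=device)        # start of the flow trajectory (noise)
    t0 = torch.zeros(shape[0], device=device)     # interval start
    t1 = torch.ones(shape[0], device=device)      # interval end
    u = mean_velocity(x0, t0, t1, cond)           # predicted average velocity over [0, 1]
    return x0 + (t1 - t0).view(-1, 1, 1) * u      # jump directly to the endpoint

# Chunk-wise streaming use: call this per spectrogram chunk, feeding previously
# generated chunks back in through `cond` (autoregressive over chunks).
```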
The experiments are comprehensive, utilizing a substantial dataset (10,000 hours of Mandarin data) and a well-defined evaluation setup that includes both subjective and objective metrics. The results demonstrate that MeanVC outperforms existing models in terms of audio quality, efficiency, and parameter count, which is critical for real-time applications. The comparisons with baseline models are thorough, providing a clear picture of MeanVC's advantages.
The paper provides sufficient detail regarding the architecture, training procedures, and evaluation metrics, which supports reproducibility. However, the absence of a public code repository limits the ease of reproduction, despite the demo URL being available.
While the paper presents significant advancements, it acknowledges that MeanVC's performance in DNSMOS is lower than that of Seed-VC, which could be attributed to its smaller parameter size. Additionally, the reliance on chunk sizes may introduce challenges in maintaining contextual integrity, particularly in real-time applications.
The advancements in zero-shot voice conversion have implications for various applications, including personalized voice assistants, dubbing in media, and privacy-preserving technologies. The lightweight nature of MeanVC makes it suitable for real-time applications, potentially broadening its adoption in commercial products.

Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction (PFC), a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
Primary: Liaoning University
All Institutions: Liaoning University, The University of Queensland, University of California, vivo Mobile Communication Co
The paper effectively defines and addresses Insertion Hallucination in video-to-audio generation, proposing a systematic evaluation framework and a novel correction method that significantly enhances the reliability of V2A models. This work is poised to influence future research directions and practical applications in the field of audio generation.
The paper introduces a novel concept of Insertion Hallucination (IH) in Video-to-Audio (V2A) generation, which is a significant advancement in addressing a previously unrecognized failure mode in audio generation models. The methodology is robust, employing a systematic evaluation framework that combines multiple audio event detectors through a majority-voting ensemble approach. The introduction of two new metrics (IH@vid and IH@dur) to quantify hallucination prevalence and severity is innovative and adds depth to the evaluation of V2A models. The proposed Posterior Feature Correction (PFC) method is particularly noteworthy as it operates without retraining and effectively reduces hallucinations by masking unreliable visual features, demonstrating a thoughtful approach to addressing the identified problem.
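A minimal sketch of the two-pass correction loop described above is given below; the model interface, detector output format, and frames-per-second bookkeeping are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch of Posterior Feature Correction: generate once, locate
# hallucinated spans in the output audio, zero out (mask) the video features
# aligned with those spans, then regenerate without the misleading frames.
def posterior_feature_correction(v2a_model, detect_hallucinations, video_feats, fps):
    # Pass 1: unconstrained generation, then detect hallucinated segments.
    audio = v2a_model.generate(video_feats)
    segments = detect_hallucinations(audio)        # list of (start_s, end_s)
    if not segments:
        return audio                               # nothing to correct

    # Pass 2: mask the corresponding video features and regenerate.
    masked = video_feats.clone()                   # video_feats: (frames, dim) tensor
    for start_s, end_s in segments:
        lo, hi = int(start_s * fps), int(end_s * fps)
        masked[lo:hi] = 0.0
    return v2a_model.generate(masked)
```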
The experiments are comprehensive, validating the IH detection pipeline against human annotations and applying it to multiple state-of-the-art V2A models. The results clearly show that existing models suffer from significant hallucination issues, and the PFC method is effective in mitigating these issues while maintaining conventional performance metrics. The use of various benchmarks (Kling-Audio-Eval, VGGSound, AVE) strengthens the findings, and the ablation studies provide insight into the effectiveness of the proposed methods compared to alternative strategies.
The paper provides a detailed account of the datasets, models, and evaluation metrics used, which supports reproducibility. However, the lack of URLs for code or data repositories limits the ease of access for other researchers who may wish to replicate the study. The mention of a human-annotated validation set adds credibility but also raises questions about the availability of this resource for further validation by the community.
One limitation is the reliance on human annotations for validating the detection pipeline, which may introduce subjectivity and variability. Additionally, while the PFC method shows promise, it may not be universally applicable across all types of audio events, particularly those that are not speech or music. The paper also does not address potential ethical implications of generative audio, which could be a concern in real-world applications.
The findings of this research have significant implications for the development of more reliable and realistic V2A models, which could enhance the quality of multimedia content in various fields, including film, gaming, and virtual reality. By addressing hallucination issues, the work contributes to the broader goal of creating trustworthy AI systems that align closely with human expectations and experiences.
Flow-based generative models have greatly improved text-to-speech (TTS) synthesis quality, but inference speed remains limited by the iterative sampling process and multiple function evaluations (NFE). The recent MeanFlow model accelerates generation by modeling average velocity instead of instantaneous velocity. However, its direct application to TTS encounters challenges, including GPU memory overhead from Jacobian-vector products (JVP) and training instability due to self-bootstrap processes. To address these issues, we introduce IntMeanFlow, a framework for few-step speech generation with integral velocity distillation. By approximating average velocity with the teacher's instantaneous velocity over a temporal interval, IntMeanFlow eliminates the need for JVPs and self-bootstrap, improving stability and reducing GPU memory usage. We also propose the Optimal Step Sampling Search (O3S) algorithm, which identifies the model-specific optimal sampling steps, improving speech synthesis without additional inference overhead. Experiments show that IntMeanFlow achieves 1-NFE inference for token-to-spectrogram and 3-NFE for text-to-spectrogram tasks while maintaining high-quality synthesis. Demo samples are available at https://vvwangvv.github.io/intmeanflow.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of IntMeanFlow, a framework that improves few-step speech generation through integral velocity distillation, achieving significant efficiency gains in TTS synthesis. This work represents a meaningful advancement in the field, addressing critical challenges in generative modeling while maintaining high-quality output.
The paper introduces IntMeanFlow, a novel framework that leverages integral velocity distillation to improve the efficiency of text-to-speech (TTS) generation. The methodology effectively addresses the limitations of the MeanFlow model, particularly in terms of GPU memory usage and training stability. By approximating average velocity over a temporal interval rather than relying on instantaneous velocity, the authors enhance the training process and model performance. The introduction of the Optimal Step Sampling Search (O3S) algorithm is a significant methodological advancement, allowing for model-specific optimization of sampling steps, which is a crucial aspect of generative modeling. Overall, the methodology is well-structured, innovative, and provides a clear improvement over existing approaches.
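The core of integral velocity distillation can be sketched as regressing the student's average velocity onto a quadrature estimate of the teacher's instantaneous-velocity integral, which avoids Jacobian-vector products and self-bootstrapping. The interpolation path, sampling of intervals, and function names below are assumptions for illustration.

```python
# Hypothetical sketch of integral velocity distillation: the student's average
# velocity over [t0, t1] is matched to a midpoint-rule estimate of the
# teacher's instantaneous velocity integral along a linear interpolation path.
import torch

def distillation_loss(student_u, teacher_v, x1, cond, n_quad=4):
    b = x1.shape[0]
    x0 = torch.randn_like(x1)                                     # noise endpoint
    t0 = torch.rand(b, device=x1.device)
    t1 = t0 + torch.rand(b, device=x1.device) * (1.0 - t0)        # t0 <= t1 <= 1

    # Quadrature estimate of (1/(t1-t0)) * integral of v_teacher over [t0, t1].
    avg_v = 0.0
    for k in range(n_quad):
        tau = t0 + (k + 0.5) / n_quad * (t1 - t0)
        x_tau = (1 - tau).view(-1, 1) * x0 + tau.view(-1, 1) * x1
        with torch.no_grad():
            avg_v = avg_v + teacher_v(x_tau, tau, cond) / n_quad

    x_t0 = (1 - t0).view(-1, 1) * x0 + t0.view(-1, 1) * x1
    pred = student_u(x_t0, t0, t1, cond)                          # student's average velocity
    return torch.mean((pred - avg_v) ** 2)
```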
The experiments conducted are thorough and demonstrate the effectiveness of the proposed IntMeanFlow framework across two widely used TTS models (F5-TTS and CosyVoice2). The results show a significant reduction in the number of function evaluations (NFE) while maintaining high-quality synthesis, which is a critical metric in TTS systems. The use of multiple evaluation metrics, including Word Error Rate (WER) and speaker similarity, adds robustness to the findings. However, the paper lacks comparative results against the original MeanFlow model for the text2mel task, which could have provided additional context regarding the improvements made.
The paper provides a clear description of the experimental setup, including datasets and evaluation metrics, which aids in reproducibility. However, the absence of specific implementation details or a code repository limits the ability for others to fully replicate the results. The authors mention demo samples available online, which is a positive aspect, but a more comprehensive project URL would enhance reproducibility.
One limitation of the work is the reliance on a teacher model for the distillation process, which may not always be available or feasible in practical applications. Additionally, while the paper addresses memory overhead and training instability, it does not explore the trade-offs between model complexity and performance in depth. The lack of comparisons with other state-of-the-art methods in the TTS domain also limits the contextual understanding of the contributions.
The advancements presented in this paper have the potential to significantly enhance the efficiency of TTS systems, making them more accessible for real-time applications. The reduction in inference time while maintaining synthesis quality could lead to broader adoption of TTS technologies in various fields, including virtual assistants, audiobooks, and accessibility tools. The methodologies developed could also inspire further research in generative modeling and distillation techniques across other domains.
As advances in synthetic voice generation accelerate, an increasing variety of fake voice generators have emerged, producing audio that is often indistinguishable from real human speech. This evolution poses new and serious threats across sectors where audio recordings serve as critical evidence. Although fake voice detectors are also advancing, the arms race between fake voice generation and detection has become more intense and complex. In this work, we present the first large-scale, cross-domain evaluation of fake voice detectors, benchmarking 8 state-of-the-art models against datasets synthesized by 20 different fake voice generation systems. To the best of our knowledge, this is the most comprehensive cross-domain assessment conducted to date. Our study reveals substantial security vulnerabilities in current fake voice detection systems, underscoring critical gaps in their real-world robustness. To advance the field, we propose a unified and effective metric that consolidates the diverse and often inconsistent evaluation criteria previously used across different studies. This metric enables standardized, straightforward comparisons of the robustness of fake voice detectors. We conclude by offering actionable recommendations for building more resilient fake voice detection technologies, with the broader goal of reinforcing the foundations of AI security and trustworthiness.
Primary: Vanderbilt University
All Institutions: Vanderbilt University
This paper presents a comprehensive benchmarking study of fake voice detection systems, revealing critical vulnerabilities and proposing a unified evaluation metric to enhance the robustness of detection technologies. The methodology is innovative and addresses a pressing issue in AI security, making it a valuable contribution to the field.
The methodology presented in this paper is robust, featuring a comprehensive cross-domain evaluation framework that benchmarks eight state-of-the-art fake voice detectors against datasets synthesized by twenty different fake voice generation systems. The introduction of a unified metric for evaluating detector performance is a significant advancement, as it addresses inconsistencies in previous evaluation criteria. The one-to-one evaluation protocol allows for a nuanced understanding of the interactions between generators and detectors, revealing unique vulnerabilities and performance variations. The integration of explainability analysis further enhances the methodology, providing insights into the reasons behind detection performance discrepancies.
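The sketch below illustrates the shape of such a one-to-one protocol: score every (generator, detector) pair and summarize each detector's cross-domain robustness. The EER helper and the aggregation into mean and worst-case numbers are illustrative assumptions; the paper's unified metric may be defined differently.

```python
# Hypothetical one-to-one cross-domain evaluation sketch. `detectors` maps a
# name to a scoring function (higher score = more likely real); `generators`
# maps a name to a batch of fake audio; `real_scores` holds each detector's
# scores on bona fide audio.
import numpy as np

def eer(scores_real, scores_fake):
    # Rough equal-error-rate estimate: sweep thresholds, take the point where
    # false-accept and false-reject rates are closest.
    best = 1.0
    for th in np.sort(np.concatenate([scores_real, scores_fake])):
        far = np.mean(scores_fake >= th)   # fake accepted as real
        frr = np.mean(scores_real < th)    # real rejected as fake
        best = min(best, max(far, frr))
    return best

def cross_domain_report(detectors, generators, real_scores):
    report = {}
    for det_name, score_fn in detectors.items():
        per_gen = {g: eer(real_scores[det_name], score_fn(audio))
                   for g, audio in generators.items()}
        report[det_name] = {
            "per_generator_eer": per_gen,
            "mean_eer": float(np.mean(list(per_gen.values()))),
            "worst_case_eer": float(max(per_gen.values())),  # assumed robustness summary
        }
    return report
```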
The experimental design is thorough, utilizing a diverse set of fake voice generators and detectors. The paper evaluates the performance of detectors across various generator types, which is crucial for understanding the robustness of detection systems in real-world scenarios. The use of established datasets, such as ASVspoof, enhances the credibility of the results. However, the paper could benefit from more detailed statistical analysis of the results to quantify the significance of the findings.
While the paper outlines the experimental setup and methodology, it lacks specific implementation details that would facilitate reproducibility. Providing access to code or datasets would significantly enhance the ability of other researchers to replicate the study and validate the findings.
One limitation is the potential bias introduced by the selection of datasets and models, which may not fully represent the diversity of fake voice generation techniques. Additionally, the paper does not address the computational resources required for the experiments, which could be a barrier for some researchers. The focus on performance metrics may overlook other important factors such as user experience and ethical considerations in deploying detection systems.
The findings of this study have significant implications for sectors where audio recordings serve as critical evidence, such as law enforcement and financial services. By identifying vulnerabilities in current detection systems, the paper highlights the urgent need for more robust solutions to counteract the threats posed by advanced synthetic voice generation technologies. The proposed recommendations for improving detector resilience could inform future research and development in AI security and trustworthiness.
Modern generative and multimodal models increasingly rely on compact latent representations that balance semantic richness with high-fidelity reconstruction. We introduce SALAD-VAE, a continuous and highly compact semantic Audio Variational Autoencoder, which operates in the frequency domain and achieves state-of-the-art compression at a very low latent frame rate (7.8 Hz) while surfacing semantic structure and producing high audio quality. We enhance the standard VAE with semantic losses and augmentation, specifically contrastive learning and CLAP-based embedding distillation, enabling it to generalize across diverse audio domains. With a significantly less computationally complex architecture than comparable state-of-the-art VAEs, SALAD-VAE matches their reconstruction quality while consistently outperforming them on a wide range of classification benchmarks. Furthermore, the proposed additional loss function yields a trained CLAP projection layer, which can be used for zero-shot audio captioning and classification by matching pretrained CLAP audio-text embeddings.
Primary: Microsoft Research
All Institutions: Microsoft Research
The main contribution of this paper is the introduction of SALAD-VAE, a novel audio VAE that achieves high-quality audio compression while enabling semantic audio processing capabilities through innovative training techniques. This work significantly advances the state of the art in audio representation, providing a practical solution for integrating audio and language models.
The methodology presented in SALAD-VAE is innovative, leveraging a continuous frequency-domain VAE architecture that incorporates advanced techniques such as contrastive learning and CLAP-based embedding distillation. The use of polyphonic data augmentation and a denoising autoencoder principle enhances generalization across diverse audio domains. The proposed contrastive loss and CLAP loss contribute significantly to semantic representation, enabling zero-shot classification and caption generation, which are notable advancements in the field of audio processing.
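A minimal sketch of how such a composite objective could be assembled is shown below: VAE reconstruction and KL terms, an InfoNCE-style contrastive term over two augmented views, and a distillation term pulling a projected latent toward frozen CLAP embeddings. The loss weights, projection details, and temperature are assumptions, not the paper's reported settings.

```python
# Hypothetical composite loss sketch for a semantic audio VAE.
import torch
import torch.nn.functional as F

def semantic_vae_loss(recon, target, mu, logvar, z_proj, z_aug_proj, clap_emb,
                      w_kl=1e-2, w_con=0.1, w_clap=0.1, temp=0.07):
    recon_loss = F.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    # Contrastive term: projections of two augmented views of the same clip
    # should match each other and not other clips in the batch (InfoNCE).
    a = F.normalize(z_proj, dim=-1)
    b = F.normalize(z_aug_proj, dim=-1)
    logits = a @ b.t() / temp
    labels = torch.arange(a.shape[0], device=a.device)
    contrastive = F.cross_entropy(logits, labels)

    # CLAP distillation: cosine-align the projected latent with the frozen
    # CLAP audio embedding of the same clip.
    clap_loss = 1.0 - F.cosine_similarity(a, F.normalize(clap_emb, dim=-1)).mean()

    return recon_loss + w_kl * kl + w_con * contrastive + w_clap * clap_loss
```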
The experimental evaluation is robust, utilizing a comprehensive set of metrics for both reconstruction quality and latent space representation. The authors compare their model against strong baselines, demonstrating superior performance in latent space probing and competitive reconstruction quality. The use of diverse datasets like AudioSet and thorough evaluation of zero-shot capabilities adds to the credibility of the results. However, the paper could benefit from more extensive ablation studies to quantify the impact of each proposed loss function more clearly.
The paper provides detailed implementation details, including architecture specifications, training data, and loss functions, which facilitate reproducibility. However, the absence of a publicly accessible code repository limits the ability for independent verification of results. Future work should consider releasing the model and code to enhance reproducibility.
One limitation is the potential trade-off between reconstruction quality and latent space representation when combining multiple loss functions, as indicated in the results. Additionally, while the model performs well across various audio types, its performance on more complex audio tasks or real-world applications remains to be thoroughly validated. The reliance on specific datasets may also limit the generalizability of the findings.
The advancements made by SALAD-VAE have significant implications for audio processing applications, particularly in areas requiring efficient audio representation, such as speech recognition, music generation, and audio classification. The ability to perform zero-shot classification and generate captions opens new avenues for multimodal applications, enhancing accessibility and usability in various domains.
Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that integrating KAN into the XLSR-Conformer model improves Equal Error Rate (EER) by 60.55% relative on the LA and DF sets, achieving 0.70% EER on the 21LA set. Moreover, the proposed replacement is robust across various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.
Primary: Hanoi University of Science and Technology
All Institutions: Hanoi University of Science and Technology, Institute for Infocomm Research (I2R), A*STAR
The main contribution of this work is the introduction of the XLSR-Kanformer model, which effectively integrates Kolmogorov-Arnold Networks into the XLSR-Conformer architecture, resulting in substantial improvements in synthetic speech detection performance. This innovative approach not only enhances the technical capabilities of existing models but also addresses critical challenges in the field of automatic speaker verification, paving the way for future research in robust speech processing systems.
The paper introduces a novel architecture, XLSR-Kanformer, which replaces traditional MLPs with KANs in the XLSR-Conformer model. This approach leverages the Kolmogorov-Arnold representation theorem to enhance feature learning in synthetic speech detection. The methodology is well-structured, detailing the theoretical foundations of KANs and their integration into existing architectures. The modifications made to the Conformer architecture are clearly articulated, and the proposed Kanformer block is innovative in its use of learnable univariate activation functions, which potentially improves the model's ability to handle high-dimensional data.
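To illustrate where such a module would slot in, the sketch below gives a heavily simplified KAN-style layer that could replace the position-wise MLP in a Conformer feed-forward block. Real KAN implementations typically use learnable B-spline bases per edge; the Gaussian-basis version here is only an assumed stand-in meant to convey the structure, not the paper's architecture.

```python
# Hypothetical, simplified KAN-style layer: each (input, output) edge learns a
# mixture over fixed Gaussian basis functions of the scalar input, which
# approximates a learnable univariate activation per edge.
import torch
import torch.nn as nn

class SimplifiedKANLayer(nn.Module):
    def __init__(self, d_in, d_out, n_basis=8, x_min=-2.0, x_max=2.0):
        super().__init__()
        self.register_buffer("centers", torch.linspace(x_min, x_max, n_basis))
        self.width = (x_max - x_min) / n_basis
        # One mixture weight per (input dim, output dim, basis function).
        self.coef = nn.Parameter(torch.randn(d_in, d_out, n_basis) * 0.1)
        self.base = nn.Linear(d_in, d_out)          # residual "base" path

    def forward(self, x):                           # x: (..., d_in)
        phi = torch.exp(-((x.unsqueeze(-1) - self.centers) / self.width) ** 2)
        spline = torch.einsum("...ib,iob->...o", phi, self.coef)
        return spline + self.base(torch.nn.functional.silu(x))

# Illustrative drop-in where a Conformer feed-forward module normally sits:
# ffn = nn.Sequential(SimplifiedKANLayer(256, 1024), SimplifiedKANLayer(1024, 256))
```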
The authors conduct extensive experiments on the ASVspoof2021 dataset, demonstrating significant improvements in performance metrics such as Equal Error Rate (EER). The results show a relative improvement of 60.55% in EER on specific evaluation sets, establishing the XLSR-Kanformer as a state-of-the-art model. The experiments are thorough, including ablation studies that assess the impact of KAN integration across various SSL architectures, which adds robustness to the findings.
The paper provides sufficient detail on the experimental setup, including data preprocessing, model training configurations, and evaluation metrics. However, the absence of a publicly available code repository limits the reproducibility of the results. Future work could benefit from sharing the implementation to facilitate validation by the research community.
While the proposed model shows promising results, the paper does not address potential limitations such as the computational complexity introduced by KANs compared to traditional MLPs. Additionally, the generalizability of the findings across different domains of synthetic speech detection could be further explored.
The advancements in synthetic speech detection have significant implications for security in automatic speaker verification systems. By enhancing the robustness of these systems against sophisticated spoofing attacks, the research contributes to improving security measures in various applications, including financial transactions and access control.
Processing long-form audio is a major challenge for Large Audio Language models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, which correspond to encoded sequences of 2,250 to 7,500 audio tokens, respectively, full domain coverage across speech, sound, and music, and complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University
The main contribution of this paper is the introduction of AudioMarathon, a benchmark designed to evaluate long-context audio understanding and efficiency in LALMs, addressing a critical gap in current audio processing research. The comprehensive analysis of the technical contributions, methodology, and significance to the field highlights the potential for driving advancements in audio understanding models.
The paper introduces AudioMarathon, a benchmark that addresses the limitations of existing audio benchmarks by focusing on long-form audio processing. The methodology is well-structured, emphasizing the need for long-context inputs and complex reasoning. The authors provide a clear framework for evaluating LALMs, which includes diverse tasks and a comprehensive approach to assessing both understanding and efficiency. The exploration of acceleration techniques such as token pruning and KV cache eviction adds depth to the methodology, demonstrating a thoughtful approach to optimizing model performance.
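As a concrete illustration of the kind of acceleration policy analyzed, the sketch below trims a KV cache to a fixed budget by keeping a small prefix of "sink" tokens plus the most recent audio tokens. This is a generic eviction heuristic under stated assumptions, not AudioMarathon's specific method; the 25 tokens-per-second figure follows from the abstract (2,250 tokens for 90 s, 7,500 for 300 s).

```python
# Hypothetical KV-cache eviction sketch: keep sink prefix + most recent tokens.
import torch

def evict_kv_cache(keys, values, budget, n_sink=4):
    """keys/values: (n_heads, seq_len, head_dim). Returns a trimmed cache."""
    seq_len = keys.shape[1]
    if seq_len <= budget:
        return keys, values
    n_recent = budget - n_sink
    keep = torch.cat([
        torch.arange(0, n_sink),                    # attention-sink prefix
        torch.arange(seq_len - n_recent, seq_len),  # most recent tokens
    ])
    return keys[:, keep], values[:, keep]

# Example: a 7,500-token cache (about 300 s of audio at 25 tokens/s) trimmed to 2,048.
k = torch.randn(16, 7500, 64); v = torch.randn(16, 7500, 64)
k2, v2 = evict_kv_cache(k, v, budget=2048)
print(k2.shape)  # torch.Size([16, 2048, 64])
```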
The experiments are robust, involving state-of-the-art LALMs and a variety of tasks that reflect real-world audio processing challenges. The results clearly indicate performance drops with increasing audio length, which is a significant finding that underscores the current limitations of LALMs. The analysis of trade-offs in acceleration techniques provides valuable insights into the practical implications of model efficiency, though further quantitative details on the performance metrics would enhance the evaluation.
The paper lacks specific implementation details and code availability, which are critical for reproducibility in machine learning research. While the methodology is sound, the absence of a publicly accessible implementation or dataset limits the ability of other researchers to replicate the findings and build upon this work.
One limitation is the lack of a comprehensive comparison with existing benchmarks, which could provide a clearer context for the performance of LALMs on AudioMarathon. Additionally, the paper does not address potential biases in the dataset or the implications of model performance across different audio domains, which could affect generalizability.
The introduction of AudioMarathon has the potential to significantly influence the audio and multimodal research communities by providing a standardized benchmark for long-context audio understanding. This could lead to advancements in model architectures and techniques that improve audio processing capabilities, ultimately benefiting applications in various fields such as speech recognition, music analysis, and sound event detection.
The speech of people with Parkinson's Disease (PD) has been shown to hold important clues about the presence and progression of the disease. We investigate the factors based on which human experts make judgments of the presence of disease in speech samples over five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians' accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper's ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.
Primary: Concordia University
All Institutions: Concordia University, McGill University, Nouvelle Voix, CRBLM, Mila Quebec AI Institute, Montreal Neurological Institute
The main contribution of this paper is the comparative analysis of human expert and machine learning performance in detecting Parkinson's Disease from speech samples, demonstrating that the Whisper model can match or exceed human accuracy in specific demographic groups. This work is significant as it bridges the gap between clinical expertise and machine learning capabilities, highlighting the potential for AI to enhance diagnostic processes in healthcare.
The methodology is well-structured, combining human expert evaluations with machine learning experiments using a frozen Whisper model. The authors effectively designed listening tests to gather qualitative insights from experienced clinicians, which adds depth to the analysis. The use of a minimal configuration on the Whisper model to preserve pretraining effects is a thoughtful approach, although the paper could benefit from a more detailed description of the training process and hyperparameter tuning. The inclusion of data augmentation techniques is commendable, as it helps mitigate overfitting and enhances model robustness.
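A minimal sketch of this kind of setup is shown below: a frozen Whisper encoder with a small classification head for PD-vs-control prediction. The model size, mean-pooling, and head shape are assumptions for illustration, not the paper's exact configuration.

```python
# Hypothetical frozen-Whisper classifier sketch using Hugging Face transformers.
import torch
import torch.nn as nn
from transformers import WhisperFeatureExtractor, WhisperModel

class FrozenWhisperPDClassifier(nn.Module):
    def __init__(self, model_name="openai/whisper-base"):
        super().__init__()
        self.encoder = WhisperModel.from_pretrained(model_name).encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                 # preserve pretraining, train head only
        self.head = nn.Linear(self.encoder.config.d_model, 2)

    def forward(self, input_features):              # (batch, n_mels, frames)
        hidden = self.encoder(input_features).last_hidden_state
        return self.head(hidden.mean(dim=1))        # mean-pool over time

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")
model = FrozenWhisperPDClassifier()
feats = extractor(torch.randn(16000 * 5).numpy(), sampling_rate=16000,
                  return_tensors="pt").input_features
print(model(feats).shape)  # torch.Size([1, 2])
```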
The experiments are comprehensive, utilizing a well-defined dataset from the Quebec Parkinson Network. The performance comparisons across various tasks and demographic groups provide valuable insights into the strengths and weaknesses of both human experts and the Whisper model. However, the paper lacks detailed statistical analysis or significance testing for the reported results, which would strengthen the claims made regarding performance differences.
While the paper outlines the experimental setup and model architecture, it lacks sufficient detail for complete reproducibility. Key aspects such as the exact training procedure, parameter settings, and data preprocessing steps are not fully elaborated. Providing a supplementary material or a GitHub repository with code and data would enhance reproducibility.
The study has several limitations, including the small sample size and potential biases in the dataset. The reliance on audio alone for diagnosis may not fully capture the complexities of Parkinson's Disease, as clinicians typically integrate multimodal information. Additionally, the model's "black box" nature raises concerns about interpretability and accountability in clinical settings.
This research has significant implications for the early detection and monitoring of Parkinson's Disease, potentially improving access to diagnostic care. The findings suggest that machine learning models like Whisper can complement human expertise, particularly in challenging cases. However, the integration of such models into clinical practice will require careful consideration of ethical and interpretative challenges.
Automatic speech recognition (ASR) systems often struggle with domain-specific terminology, especially in specialized settings such as academic lectures. To address this, we define the SlideASR task, which leverages the rich visual information from presentation slides to improve transcription accuracy. Existing pipeline methods for this task tend to be complex and underperform. Although omni-modal large language models (OLLMs) provide a promising end-to-end framework, they frequently fail in practice by degenerating into simple optical character recognition (OCR) systems. To overcome this, we propose Visually-Anchored Policy Optimization (VAPO), a novel post-training method designed to control the model's reasoning process. Drawing on the Chain-of-Thought reasoning paradigm, VAPO enforces a structured "Look before Transcription" procedure using a
Primary: Unisound
All Institutions: Unisound
The main contribution of this paper is the introduction of a novel end-to-end ASR framework that leverages visual context to improve transcription accuracy in domain-specific settings. The combination of visual anchoring and reinforcement learning represents a significant advancement in the field of automatic speech recognition, particularly for academic environments.
The proposed method, Visually-Anchored Policy Optimization (VAPO), introduces a structured approach to ASR by integrating visual context from presentation slides. The use of a
The experiments are comprehensive, utilizing both synthetic and real-world datasets, which is commendable. The establishment of the SlideASR-Bench benchmark is a significant contribution that could facilitate future research in this domain. The results demonstrate a clear improvement in recognizing domain-specific terms, which is a critical aspect of ASR in specialized settings. However, the paper could enhance its credibility by including more comparative analyses against state-of-the-art methods and providing ablation studies to dissect the contributions of each component of the VAPO method.
The paper lacks detailed implementation specifics, such as hyperparameters, model architecture, and training procedures, which are essential for reproducibility. While it mentions extensive experiments, without clear guidelines or code availability, it may be challenging for other researchers to replicate the results.
One limitation is the reliance on the quality of the OCR component, which can vary based on the slide content and presentation style. Additionally, the method may not generalize well to ASR tasks outside of the defined SlideASR context. The paper does not address potential biases in the datasets used, which could affect the model's performance in real-world applications.
The integration of visual information into ASR systems has the potential to significantly enhance the accuracy of transcriptions in academic and professional settings, where domain-specific terminology is prevalent. This work could pave the way for more robust ASR systems that are better suited for specialized tasks, ultimately improving accessibility and information dissemination.
Speech emotion recognition (SER) is pivotal for enhancing human-machine interactions. This paper introduces "EmoHRNet", a novel adaptation of High-Resolution Networks (HRNet) tailored for SER. The HRNet structure is designed to maintain high-resolution representations from the initial to the final layers. By transforming audio samples into spectrograms, EmoHRNet leverages the HRNet architecture to extract high-level features. EmoHRNet's unique architecture maintains high-resolution representations throughout, capturing both granular and overarching emotional cues from speech signals. The model outperforms leading models, achieving accuracies of 92.45% on RAVDESS, 80.06% on IEMOCAP, and 92.77% on EMOVO. Thus, we show that EmoHRNet sets a new benchmark in the SER domain.
Primary: Stony Brook University
All Institutions: Stony Brook University
The main contribution of this paper is the introduction of EmoHRNet, a novel high-resolution neural network architecture for speech emotion recognition that achieves state-of-the-art performance across multiple datasets. This work significantly advances the field of SER by effectively capturing emotional nuances through its innovative architecture and methodological rigor.
The methodology presented in EmoHRNet is robust, leveraging the HRNet architecture to maintain high-resolution representations throughout the network. The transformation of audio signals into Mel-spectrograms is a well-established approach in SER, but the adaptation of HRNet for this specific task is innovative. The use of data augmentation techniques, such as frequency and time masking, is appropriate and enhances the model's ability to generalize across different emotional expressions. The architecture's design, which includes high-resolution input modules and multi-resolution stages, is well thought out and addresses the challenges of capturing emotional nuances in speech.
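The front end and augmentation described above can be sketched with standard torchaudio transforms, as below; the exact mel settings and mask widths are assumptions rather than EmoHRNet's reported values.

```python
# Minimal sketch of a mel-spectrogram front end with SpecAugment-style
# frequency and time masking, using torchaudio.
import torch
import torchaudio

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
to_db = torchaudio.transforms.AmplitudeToDB()
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

waveform = torch.randn(1, 16000 * 3)             # 3 s of dummy speech
spec = to_db(mel(waveform))                       # (1, 128, frames)
augmented = time_mask(freq_mask(spec))            # randomly masked bands and frames
print(spec.shape, augmented.shape)
```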
The experimental evaluation is thorough, utilizing three benchmark datasets (RAVDESS, IEMOCAP, and EMOVO) to validate the model's performance. The reported accuracies are impressive, particularly the 92.45% on RAVDESS, which suggests that EmoHRNet significantly outperforms existing models. The comparison with state-of-the-art techniques is comprehensive, providing a clear context for the model's performance. However, the paper could benefit from additional details on the experimental setup, such as the specific training and validation splits used.
The paper provides a reasonable level of detail regarding the training process, including the optimizer settings and loss function. However, the absence of a publicly available code repository or demo limits reproducibility. Future iterations should consider sharing the implementation to facilitate further research and validation.
While EmoHRNet demonstrates strong performance, the paper does not address potential limitations such as the model's computational efficiency or real-time applicability in practical scenarios. Additionally, the reliance on specific datasets may limit the generalizability of the findings to other contexts or languages.
The implications of EmoHRNet are significant for applications in human-machine interaction, particularly in enhancing the emotional intelligence of AI systems. Improved SER capabilities can lead to more empathetic and effective communication in various domains, including customer service, mental health support, and interactive entertainment. The research sets a new benchmark in SER, paving the way for future advancements in the field.
Recent LLM-based TTS systems achieve strong quality and zero-shot ability, but lack fine-grained emotional control due to their reliance on discrete speech tokens. Existing approaches either limit emotions to categorical labels or cannot generalize to LLM-based architectures. We propose EMORL-TTS (Fine-grained Emotion-controllable TTS with Reinforcement Learning), a framework that unifies global intensity control in the VAD space with local emphasis regulation. Our method combines supervised fine-tuning with reinforcement learning guided by task-specific rewards for emotion category, intensity, and emphasis. Moreover, we further investigate how emphasis placement modulates fine-grained emotion intensity. Experiments show that EMORL-TTS improves emotion accuracy, intensity differentiation, and emphasis clarity, while preserving synthesis quality comparable to strong LLM-based baselines.
Primary: Hangzhou Institute for Advanced Study
All Institutions: Hangzhou Institute for Advanced Study, National Natural Science Foundation of China, Zhejiang Provincial Natural Science Foundation of China
The paper presents a novel approach to fine-grained emotion control in LLM-based TTS systems, leveraging reinforcement learning to enhance emotional expressiveness while maintaining synthesis quality. The combination of global and local prosody control mechanisms represents a significant advancement in the field, with promising implications for future research and applications.
The proposed EMORL-TTS framework effectively integrates supervised fine-tuning with reinforcement learning to achieve fine-grained emotional control in LLM-based TTS systems. The unification of global intensity control in the VAD space with local emphasis regulation is a significant methodological advancement. The use of task-specific rewards tailored for emotion category, intensity, and emphasis enhances the model's ability to synthesize emotionally expressive speech. The methodology is well-structured, with clear stages of SFT and GRPO, although the reliance on discrete speech tokens presents inherent challenges that the authors address through innovative reinforcement learning strategies.
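A rough sketch of how the three reward terms could be combined is given below; the scorer interfaces, weights, and squashing of the VAD distance are illustrative assumptions, not the paper's reward design.

```python
# Hypothetical combined RL reward sketch with terms for emotion category,
# intensity in valence-arousal-dominance (VAD) space, and emphasis placement.
import numpy as np

def emotion_reward(synth_audio, target, scorers, w=(1.0, 0.5, 0.5)):
    # Category term: classifier probability assigned to the target emotion.
    r_cat = scorers["emotion_probs"](synth_audio)[target["emotion_id"]]

    # Intensity term: distance to the target point in VAD space, squashed to (0, 1].
    vad_pred = np.asarray(scorers["vad"](synth_audio))
    r_int = 1.0 / (1.0 + np.linalg.norm(vad_pred - np.asarray(target["vad"])))

    # Emphasis term: overlap between detected and requested emphasized words.
    detected = set(scorers["emphasis_words"](synth_audio))
    requested = set(target["emphasis_words"])
    r_emp = len(detected & requested) / max(len(requested), 1)

    return w[0] * r_cat + w[1] * r_int + w[2] * r_emp
```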
The experimental setup is robust, utilizing both objective and subjective evaluation metrics to assess the performance of EMORL-TTS. The use of multiple emotional corpora and the design of comprehensive evaluation tasks, such as Emotion Accuracy Test and Emphasis Accuracy Test, provide a thorough assessment of the model's capabilities. Results indicate significant improvements in emotional accuracy and emphasis clarity compared to baseline models, demonstrating the effectiveness of the proposed method. However, the lack of detailed statistical analysis of the results may limit the depth of the findings.
The paper provides a reasonable level of detail regarding the experimental setup, including training epochs, batch sizes, and learning rates. However, the absence of a publicly available code repository or detailed implementation instructions may hinder full reproducibility. The authors mention that synthesized samples are available online, which is a positive aspect for validation but does not fully address reproducibility concerns.
One limitation of the study is the potential challenge in generalizing the findings across different languages and cultural contexts, as the experiments are conducted solely in English. Additionally, while the model shows improvements in emotional expressiveness, the reliance on discrete token representations may still restrict the model's ability to capture the full spectrum of emotional nuances. The paper also does not address the computational complexity of the proposed method, which could be a concern for practical applications.
The advancements in fine-grained emotional control in TTS systems have significant implications for various applications, including virtual assistants, audiobooks, and interactive gaming. By enhancing the expressiveness of synthesized speech, EMORL-TTS can lead to more engaging and human-like interactions in technology. The potential for cross-lingual extensions and multimodal integration further broadens the scope of its impact, making it a valuable contribution to the field of machine learning and audio synthesis.
Although audio generation has been widely studied over recent years, video-aligned audio generation still remains a relatively unexplored frontier. To address this gap, we introduce StereoSync, a novel and efficient model designed to generate audio that is both temporally synchronized with a reference video and spatially aligned with its visual context. Moreover, StereoSync also achieves efficiency by leveraging pretrained foundation models, reducing the need for extensive training while maintaining high-quality synthesis. Unlike existing methods that primarily focus on temporal synchronization, StereoSync introduces a significant advancement by incorporating spatial awareness into video-aligned audio generation. Indeed, given an input video, our approach extracts spatial cues from depth maps and bounding boxes, using them as cross-attention conditioning in a diffusion-based audio generation model. Such an approach allows StereoSync to go beyond simple synchronization, producing stereo audio that dynamically adapts to the spatial structure and movement of a video scene. We evaluate StereoSync on Walking The Maps, a curated dataset comprising videos from video games that feature animated characters walking through diverse environments. Experimental results demonstrate the ability of StereoSync to achieve both temporal and spatial alignment, advancing the state of the art in video-to-audio generation and resulting in a significantly more immersive and realistic audio experience.
Primary: Sapienza University of Rome
All Institutions: Sapienza University of Rome, Sony AI, Sony Group Corporation
The main contribution of this paper is the introduction of StereoSync, a novel framework for generating spatially-aware stereo audio from video, which significantly enhances the quality and immersion of audio-visual experiences. The technical contributions, particularly the integration of depth and bounding box information into the audio generation process, represent a meaningful advancement in the field of machine learning and audio synthesis.
The methodology presented in StereoSync is innovative, leveraging pretrained foundation models for efficient audio generation that is spatially aware and temporally synchronized with video content. The integration of depth maps and bounding boxes as cross-attention conditioning signals in a diffusion-based audio generation model is a notable advancement. The authors effectively combine various modalities to enhance the audio generation process, ensuring that the generated audio reflects the spatial dynamics of the video scene. However, the paper could benefit from a more detailed explanation of the conditioning mechanisms and the specific architecture of the diffusion model used.
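The sketch below shows one way per-frame spatial cues (a depth map and a set of bounding boxes) could be embedded and stacked into a cross-attention context for a diffusion audio generator; the shapes, encoders, and fusion scheme are assumptions, not StereoSync's exact architecture.

```python
# Hypothetical spatial conditioner sketch: depth maps and bounding boxes are
# projected to token embeddings that a diffusion model can attend to.
import torch
import torch.nn as nn

class SpatialConditioner(nn.Module):
    def __init__(self, d_model=512, max_boxes=8):
        super().__init__()
        self.depth_proj = nn.Sequential(              # coarse depth-map encoder
            nn.AdaptiveAvgPool2d((8, 8)), nn.Flatten(), nn.Linear(64, d_model))
        self.box_proj = nn.Linear(4, d_model)         # (x1, y1, x2, y2) per box
        self.max_boxes = max_boxes

    def forward(self, depth, boxes):
        # depth: (T, 1, H, W); boxes: (T, max_boxes, 4) normalized coordinates.
        d_tok = self.depth_proj(depth)                # (T, d_model)
        b_tok = self.box_proj(boxes)                  # (T, max_boxes, d_model)
        context = torch.cat([d_tok.unsqueeze(1), b_tok], dim=1)
        return context.flatten(0, 1)                  # cross-attention tokens

cond = SpatialConditioner()
ctx = cond(torch.rand(30, 1, 224, 224), torch.rand(30, 8, 4))
print(ctx.shape)  # torch.Size([270, 512]), passed as the cross-attention context
```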
The experimental evaluation is robust, utilizing a well-defined dataset (Walking The Maps) that is appropriate for the task of video-to-audio generation. The metrics employed, including FAD, FAVD, and Spatial AV-Align, provide a comprehensive assessment of audio quality, semantic alignment, and spatial coherence. The results demonstrate that StereoSync achieves significant improvements over a baseline model without spatial conditioning, indicating the effectiveness of the proposed approach. However, the paper lacks a comparative analysis with existing state-of-the-art methods, which would strengthen the claims of advancement.
The paper provides sufficient details about the training process, including the use of specific models and parameters, which aids in reproducibility. However, the lack of publicly available code or a demo URL limits the ability of other researchers to replicate the results directly. Providing access to the trained models or a code repository would enhance reproducibility.
One limitation noted is the reliance on a relatively small dataset, which may affect the generalization of the model. Additionally, while the Spatial AV-Align metric is useful, it may not fully capture the nuances of spatial audio generation, as acknowledged by the authors. Future work should address these limitations by exploring larger datasets and refining evaluation metrics.
The implications of this work are significant for fields such as film production, video game design, and virtual reality, where immersive audio experiences are crucial. By advancing the state of video-to-audio generation, StereoSync could enhance the quality of sound design in multimedia applications, leading to more engaging and realistic experiences for users.
The machine speech chain, which simulates the human perception-production loop, has proven effective at jointly improving ASR and TTS. We propose TokenChain, a fully discrete speech chain coupling semantic-token ASR with a two-stage TTS: an autoregressive text-to-semantic model co-trained with ASR and a masked-generative semantic-to-acoustic model used for synthesis only. End-to-end feedback across the text interface is enabled with straight-through argmax/Gumbel-Softmax and balanced against supervised ASR via dynamic weight averaging. Ablations examine optimal temperature schedules for in- and cross-domain transfer. Evaluation shows that TokenChain surpasses baseline accuracy 2-6 epochs earlier and yields 5-13% lower equal-epoch error with stable T2S on LibriSpeech, and achieves relative reductions of 56% in ASR WER and 31% in T2S WER on TED-LIUM with minimal forgetting, demonstrating that chain learning remains effective with token interfaces and models.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
TokenChain introduces a novel discrete speech chain framework that effectively integrates semantic-token ASR with a two-stage TTS system, demonstrating significant improvements in performance and convergence. This work represents a meaningful advancement in the field of speech processing, with potential applications across various domains.
The methodology presented in TokenChain is innovative, leveraging a fully discrete speech chain that integrates semantic-token ASR with a two-stage TTS system. The authors employ advanced techniques such as straight-through estimators and Gumbel-Softmax to facilitate end-to-end feedback, which is a significant improvement over traditional continuous intermediate approaches. The dynamic weight averaging for balancing the ASR and TTS components is a noteworthy addition that enhances the training process.
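As a rough illustration of the straight-through feedback path described above, the sketch below passes discrete text tokens from an ASR head into a TTS embedding using Gumbel-Softmax with hard sampling, so gradients still reach the ASR branch across the discrete interface. The function `straight_through_text` and the toy shapes are assumptions for illustration, not TokenChain's actual interface.

```python
import torch
import torch.nn.functional as F

def straight_through_text(asr_logits, tts_embedding, tau=1.0):
    """Illustrative straight-through pass of discrete text tokens from ASR to TTS.

    asr_logits: (B, L, vocab) logits over text tokens from the ASR branch.
    tts_embedding: nn.Embedding of the TTS branch (vocab, dim).
    Returns embeddings usable by TTS while letting gradients reach the ASR branch.
    """
    # hard=True gives one-hot samples in the forward pass, soft gradients in backward.
    one_hot = F.gumbel_softmax(asr_logits, tau=tau, hard=True, dim=-1)
    return one_hot @ tts_embedding.weight          # (B, L, dim)

# Toy check that gradients flow back through the discrete text interface.
vocab, dim = 50, 32
emb = torch.nn.Embedding(vocab, dim)
logits = torch.randn(2, 7, vocab, requires_grad=True)
out = straight_through_text(logits, emb)
out.sum().backward()
print(logits.grad is not None)  # True
```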
The experimental evaluation is rigorous, utilizing well-established datasets such as LibriSpeech and TED-LIUM. The results demonstrate that TokenChain surpasses baseline models in terms of accuracy and convergence speed, achieving improvements in word error rates (WER) and character error rates (CER). The ablation studies on temperature schedules for in- and cross-domain transfer further strengthen the findings, showcasing a comprehensive approach to model evaluation.
The paper provides sufficient details regarding the model architecture, training procedures, and evaluation metrics, which would allow other researchers to replicate the experiments. However, the absence of a public code repository or demo URL limits the ease of reproducibility.
One limitation is the reliance on specific datasets, which may not generalize across all speech recognition and synthesis tasks. Additionally, the paper does not address potential computational overheads associated with the two-stage TTS system, which could affect real-time applications.
The implications of this work are significant for the fields of automatic speech recognition and text-to-speech synthesis, particularly in enhancing the efficiency and effectiveness of machine speech systems. The approach could lead to more robust applications in voice assistants, accessibility tools, and language learning technologies.
Diffusion models have demonstrated remarkable performance in speech synthesis, but typically require multi-step sampling, resulting in low inference efficiency. Recent studies address this issue by distilling diffusion models into consistency models, enabling efficient one-step generation. However, these approaches introduce additional training costs and rely heavily on the performance of pre-trained teacher models. In this paper, we propose ECTSpeech, a simple and effective one-step speech synthesis framework that, for the first time, incorporates the Easy Consistency Tuning (ECT) strategy into speech synthesis. By progressively tightening consistency constraints on a pre-trained diffusion model, ECTSpeech achieves high-quality one-step generation while significantly reducing training complexity. In addition, we design a multi-scale gate module (MSGate) to enhance the denoiser's ability to fuse features at different scales. Experimental results on the LJSpeech dataset demonstrate that ECTSpeech achieves audio quality comparable to state-of-the-art methods under single-step sampling, while substantially reducing the model's training cost and complexity.
Primary: Xinjiang University
All Institutions: Xinjiang University, Tsinghua University, Tianjin University of Technology
The main contribution of this paper is the introduction of ECTSpeech, a novel framework that leverages Easy Consistency Tuning to achieve efficient one-step speech synthesis while maintaining high audio quality. This work significantly advances the field by addressing the limitations of existing diffusion models and consistency models, thereby enhancing the practical applicability of speech synthesis technologies.
The methodology presented in ECTSpeech is innovative as it introduces the Easy Consistency Tuning (ECT) strategy to the domain of speech synthesis for the first time. This approach allows for high-quality one-step generation without the need for a separate student model, significantly streamlining the training process. The incorporation of the multi-scale gate module (MSGate) enhances the model's ability to fuse features at different scales, which is crucial for capturing the nuances of speech signals. The two-stage training process, consisting of diffusion pretraining followed by consistency tuning, is well-structured and effectively addresses the challenges of inference efficiency and training complexity.
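The paper's exact MSGate design is not spelled out here, so the following is only one plausible reading under stated assumptions: parallel convolutions with different receptive fields fused through a learned sigmoid gate and a residual connection. The class name `MSGate`, the kernel sizes, and the channel counts are hypothetical.

```python
import torch
import torch.nn as nn

class MSGate(nn.Module):
    """Hypothetical multi-scale gate: fuse parallel conv branches of different
    receptive fields with a learned gate (the paper's exact design may differ)."""
    def __init__(self, channels=64, kernels=(3, 7, 15)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernels]
        )
        self.gate = nn.Sequential(
            nn.Conv1d(channels * len(kernels), channels * len(kernels), 1),
            nn.Sigmoid(),
        )
        self.out = nn.Conv1d(channels * len(kernels), channels, 1)

    def forward(self, x):                      # x: (B, C, T) denoiser features
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        gated = feats * self.gate(feats)       # element-wise gating per scale
        return x + self.out(gated)             # residual fusion back to C channels

x = torch.randn(2, 64, 200)
print(MSGate()(x).shape)  # torch.Size([2, 64, 200])
```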
The experimental evaluation is robust, utilizing the LJSpeech dataset to benchmark the proposed model against several state-of-the-art methods. The results indicate that ECTSpeech achieves comparable or superior audio quality with significantly reduced training costs and inference times. The use of both subjective (Mean Opinion Score) and objective (Fréchet Distance, Fréchet Audio Distance) metrics provides a comprehensive assessment of the model's performance. The ablation studies further validate the contributions of the MSGate and consistency tuning, demonstrating their importance in enhancing synthesis quality.
The paper provides sufficient details regarding the model architecture, training protocols, and evaluation metrics, which would allow for reproducibility of the results. However, the absence of a publicly available code repository or demo URL limits the ease with which other researchers can replicate the findings.
One limitation of the study is the reliance on a single dataset (LJSpeech) for evaluation, which may not fully represent the diversity of speech synthesis tasks. Additionally, while the model shows promising results in one-step generation, the paper does not extensively discuss its performance in more complex scenarios, such as multi-speaker or emotional speech synthesis.
The advancements made in efficient speech synthesis through ECTSpeech have significant implications for applications in voice assistants, content creation, and accessibility technologies. By reducing training complexity and improving inference efficiency, this research could facilitate broader adoption of high-quality speech synthesis in real-time applications.
Over the past two decades, speech emotion recognition (SER) has received growing attention. To train SER systems, researchers collect emotional speech databases annotated by crowdsourced or in-house raters who select emotions from predefined categories. However, disagreements among raters are common. Conventional methods treat these disagreements as noise, aggregating labels into a single consensus target. While this simplifies SER as a single-label task, it ignores the inherent subjectivity of human emotion perception. This dissertation challenges such assumptions and asks: (1) Should minority emotional ratings be discarded? (2) Should SER systems learn from only a few individuals' perceptions? (3) Should SER systems predict only one emotion per sample? Psychological studies show that emotion perception is subjective and ambiguous, with overlapping emotional boundaries. We propose new modeling and evaluation perspectives: (1) Retain all emotional ratings and represent them with soft-label distributions. Models trained on individual annotator ratings and jointly optimized with standard SER systems improve performance on consensus-labeled tests. (2) Redefine SER evaluation by including all emotional data and allowing co-occurring emotions (e.g., sad and angry). We propose an "all-inclusive rule" that aggregates all ratings to maximize diversity in label representation. Experiments on four English emotion databases show superior performance over majority and plurality labeling. (3) Construct a penalization matrix to discourage unlikely emotion combinations during training. Integrating it into loss functions further improves performance. Overall, embracing minority ratings, multiple annotators, and multi-emotion predictions yields more robust and human-aligned SER systems.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of innovative modeling and evaluation approaches in Speech Emotion Recognition that account for the subjectivity of annotators and the ambiguity of emotions, significantly enhancing the performance and applicability of SER systems.
The paper proposes a novel approach to Speech Emotion Recognition (SER) by addressing the subjectivity of emotion perception and the ambiguity of emotional boundaries. It introduces three main methodologies: (1) retaining all emotional ratings and using soft-label distributions for training, (2) redefining evaluation methods to include co-occurring emotions through an "all-inclusive rule," and (3) employing a penalization matrix to discourage unlikely emotion combinations during training. This multifaceted approach is well-justified by psychological findings and shows a clear departure from traditional single-label methods, making it a significant contribution to the field.
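A minimal sketch of the two label-handling ideas, under assumed details: every annotator vote is kept and turned into a soft-label distribution, and a made-up penalization matrix adds a cost on predicted co-occurrences of implausible emotion pairs. The emotion set, the matrix values, and the exact loss form are illustrative, not the dissertation's formulation.

```python
import torch
import torch.nn.functional as F

EMOTIONS = ["angry", "happy", "neutral", "sad"]   # illustrative 4-class setup

def all_ratings_to_soft_label(ratings):
    """Keep every annotator's vote (no majority filtering) as a distribution."""
    counts = torch.zeros(len(EMOTIONS))
    for r in ratings:
        counts[EMOTIONS.index(r)] += 1
    return counts / counts.sum()

def penalized_soft_label_loss(logits, soft_target, penalty):
    """Soft-label cross-entropy plus a penalty on unlikely emotion co-occurrences.

    penalty: (C, C) matrix with large entries for implausible pairs
    (values here are invented for illustration).
    """
    log_p = F.log_softmax(logits, dim=-1)
    ce = -(soft_target * log_p).sum(-1).mean()
    p = log_p.exp()
    co_occ = torch.einsum("bi,bj->ij", p, p) / p.shape[0]   # predicted pair frequencies
    return ce + (penalty * co_occ).sum()

logits = torch.randn(8, 4, requires_grad=True)
target = all_ratings_to_soft_label(["sad", "sad", "angry"]).expand(8, 4)
# Made-up costs: penalize happy co-occurring with angry or sad; leave sad+angry allowed.
penalty = torch.tensor([[0, 1, 0, 0],
                        [1, 0, 0, 1],
                        [0, 0, 0, 0],
                        [0, 1, 0, 0]], dtype=torch.float)
loss = penalized_soft_label_loss(logits, target, 0.1 * penalty)
loss.backward()
```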
The experiments conducted on four English emotion databases demonstrate the effectiveness of the proposed methodologies. The results indicate that the new methods outperform conventional majority and plurality labeling approaches, showcasing improvements in SER system performance across various test conditions. The use of multiple datasets strengthens the validity of the findings, although the paper could benefit from more extensive comparative analysis with other state-of-the-art methods.
The paper provides a detailed account of the methodologies, datasets, and experimental setups, which aids reproducibility. However, it lacks explicit URLs or links to code repositories or demo pages, which would enhance the ability of other researchers to replicate the work. Clear documentation of the datasets used and the specific configurations for experiments would further support reproducibility.
One limitation is the reliance on subjective annotations, which can introduce variability and noise in the data. While the paper addresses this by proposing methods to incorporate all ratings, the inherent subjectivity of emotion perception remains a challenge. Additionally, the paper does not explore the potential impact of demographic factors on emotion perception, which could be an avenue for future research.
The findings have significant implications for the development of more robust and human-aligned SER systems, which can be applied in various domains such as customer service, mental health monitoring, and human-computer interaction. By embracing the complexity of human emotions, the proposed methodologies could lead to advancements in emotional AI technologies that better understand and respond to human emotional states.
Audio Question Answering (AQA) is a key task for evaluating Audio-Language Models (ALMs), yet assessing open-ended responses remains challenging. Existing metrics used for AQA, such as BLEU, METEOR, and BERTScore, are mostly adapted from NLP and audio captioning; they rely on surface similarity and fail to account for question context, reasoning, and partial correctness. To address this gap in the literature, we make three contributions in this work. First, we introduce AQEval to enable systematic benchmarking of AQA metrics. It is the first benchmark of its kind, consisting of 10k model responses annotated by multiple humans for their correctness and relevance. Second, we conduct a comprehensive analysis of existing AQA metrics on AQEval, highlighting weak correlation with human judgment, especially for longer answers. Third, we propose a new metric, the AURA score, to better evaluate open-ended model responses. On AQEval, AURA achieves state-of-the-art correlation with human ratings, significantly outperforming all baselines. Through this work, we aim to highlight the limitations of current AQA evaluation methods and motivate better metrics. We release both the AQEval benchmark and the AURA metric to support future research in holistic AQA evaluation.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University
The main contribution of this paper is the introduction of AQEval and the AURA score, which together provide a comprehensive framework for evaluating open-ended responses in audio question answering. This work addresses critical shortcomings in existing evaluation metrics and sets a new standard for future research in the field.
The paper introduces AQEval, a novel benchmark for Audio Question Answering (AQA) metrics, which is a significant advancement in evaluating open-ended responses in audio contexts. The methodology employs a combination of human annotations and a new metric, AURA, which integrates reasoning capabilities of large language models (LLMs) with an audio entailment component. This dual approach allows for a more nuanced evaluation of responses, addressing the limitations of existing metrics that primarily focus on surface-level similarity.
The experimental setup is robust, utilizing a dataset of 10k annotated responses that allows for systematic benchmarking of AQA metrics. The authors provide a comprehensive analysis of existing metrics, demonstrating their weak correlation with human judgments, particularly for longer answers. AURA is shown to outperform traditional metrics significantly, achieving state-of-the-art correlation with human ratings. The ablation studies further validate the effectiveness of the proposed methodology.
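The kind of correlation analysis described above can be reproduced in miniature as below, with synthetic scores standing in for AQEval's human annotations; only the use of correlation against human ratings is taken from the text, and the data, coefficients, and noise model are stand-ins.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Toy stand-ins: per-response scores from an automatic metric and human ratings.
rng = np.random.default_rng(0)
human = rng.uniform(0, 1, size=200)                     # e.g. averaged correctness ratings
metric = 0.6 * human + 0.4 * rng.uniform(0, 1, 200)     # a metric partially tracking humans

r, _ = pearsonr(metric, human)
rho, _ = spearmanr(metric, human)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```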
The paper includes detailed descriptions of the dataset construction, annotation process, and experimental setup, which enhances reproducibility. However, the reliance on specific LLMs for scoring may limit the generalizability of the results to other models or contexts.
While the paper addresses significant gaps in AQA evaluation, it does not explore the potential biases in human annotations or the limitations of the LLMs used. Additionally, the performance of AURA in real-world applications remains to be fully validated.
The introduction of AQEval and AURA has the potential to significantly influence future research in audio-language models and their evaluation. By providing a more accurate assessment of model responses, this work can lead to improvements in the development of ALMs and their applications in various domains, including accessibility, education, and content creation.
Sound field reconstruction involves estimating sound fields from a limited number of spatially distributed observations. This work introduces a differentiable physics approach for sound field reconstruction, where the initial conditions of the wave equation are approximated with a neural network, and the differential operator is computed with a differentiable numerical solver. The use of a numerical solver enables a stable network training while enforcing the physics as a strong constraint, in contrast to conventional physics-informed neural networks, which include the physics as a constraint in the loss function. We introduce an additional sparsity-promoting constraint to achieve meaningful solutions even under severe undersampling conditions. Experiments demonstrate that the proposed approach can reconstruct sound fields under extreme data scarcity, achieving higher accuracy and better convergence compared to physics-informed neural networks.
Primary: Technical University of Denmark (DTU)
All Institutions: Technical University of Denmark (DTU), Universidad Politécnica de Madrid (UPM)
This work introduces a differentiable physics approach for sound field reconstruction, significantly enhancing accuracy and convergence under data scarcity. The methodology combines neural networks with numerical PDE solvers, showcasing a promising direction for future research in acoustics and machine learning.
The paper presents a novel differentiable physics approach that integrates a neural network with a numerical PDE solver for sound field reconstruction. This method improves stability and convergence compared to traditional physics-informed neural networks (PINNs) by directly incorporating physical constraints through the numerical solver rather than as a penalty in the loss function. The introduction of a sparsity-promoting constraint is particularly innovative, allowing the model to perform well under extreme data scarcity. The use of automatic differentiation (AD) to compute gradients through the numerical solver is a significant methodological advancement, streamlining the training process.
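A minimal 1D analogue of the approach, under assumptions: a small network predicts the initial pressure field, a differentiable leapfrog solver propagates the wave equation, and the training loss combines sparse-sensor data fidelity with an L1 sparsity term. The grid sizes, network, and sensor layout are invented for illustration; the paper's solver and full sound-field setting are more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Grid and time step chosen so the CFL number c*dt/dx stays below 1 (stability).
nx, nt, c, dx, dt = 128, 80, 1.0, 1.0, 0.5
C2 = (c * dt / dx) ** 2

# Small network predicting the initial pressure field u(x, t=0) on the grid.
init_net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                         nn.Linear(64, 64), nn.Tanh(),
                         nn.Linear(64, 1))
x = torch.linspace(-1.0, 1.0, nx).unsqueeze(1)

def simulate(u0):
    """Differentiable leapfrog solver for the 1D wave equation (zero initial velocity)."""
    u_prev, u, frames = u0, u0, [u0]
    for _ in range(nt - 1):
        interior = u[2:] - 2 * u[1:-1] + u[:-2]       # discrete Laplacian
        lap = F.pad(interior, (1, 1))                 # fixed (zero) boundaries
        u_next = 2 * u - u_prev + C2 * lap
        u_prev, u = u, u_next
        frames.append(u)
    return torch.stack(frames)                        # (nt, nx) space-time field

# Sparse "microphone" observations of a ground-truth Gaussian pulse.
true_u0 = torch.exp(-((torch.linspace(-1.0, 1.0, nx) - 0.2) ** 2) / 0.01)
obs_idx = torch.randperm(nx)[:10]                     # only 10 sensor positions
with torch.no_grad():
    obs = simulate(true_u0)[:, obs_idx]

opt = torch.optim.Adam(init_net.parameters(), lr=1e-3)
for step in range(200):                               # short run for illustration
    u0 = init_net(x).squeeze(1)
    pred = simulate(u0)
    data_term = ((pred[:, obs_idx] - obs) ** 2).mean()
    loss = data_term + 1e-4 * u0.abs().mean()         # data fidelity + sparsity prior
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final loss: {loss.item():.4f}")
```

The key point the sketch preserves is that the physics lives in the solver itself, so gradients from the data term flow through the numerical scheme rather than through a PDE-residual penalty.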
The experiments conducted are rigorous and demonstrate the effectiveness of the proposed method across various scenarios, including single Gaussian pulses and complex source distributions. The results indicate that the differentiable physics approach significantly outperforms PINNs in terms of accuracy and convergence speed, particularly in highly undersampled conditions. The use of normalized mean squared error (NMSE) as a performance metric is appropriate, and the experiments are well-structured to showcase the strengths of the proposed method.
The paper provides sufficient detail regarding the implementation, including the architecture of the neural networks, the training process, and the numerical methods used. The availability of the code repository enhances reproducibility, allowing other researchers to replicate the experiments and build upon the work.
While the proposed method shows promising results, the paper does not extensively discuss potential limitations, such as the sensitivity of the model to the choice of hyperparameters or the specific numerical methods employed. Additionally, the method may face challenges in more complex acoustic environments that were not tested in the experiments.
The proposed differentiable physics approach has significant implications for sound field reconstruction and could be applied to various fields, including acoustics, audio engineering, and environmental monitoring. The ability to reconstruct sound fields from limited data could enhance applications in virtual reality, architectural acoustics, and audio signal processing. The integration of physics with machine learning also opens avenues for addressing other inverse problems in different domains.
While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LMs training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
Primary: The Hong Kong Polytechnic University
All Institutions: The Hong Kong Polytechnic University, Tsinghua University
The main contribution of this paper is the introduction of the Siren framework, which effectively bridges the gap between language models and diffusion models in text-to-audio generation. The comprehensive analysis of the methodology and experimental results highlights its potential to reshape the landscape of audio synthesis, making it a notable advancement in the field.
The proposed methodology introduces a novel framework, Siren, which utilizes multiple isolated transformers with causal conditioning and anti-causal alignment. This approach effectively addresses the limitations of existing RVQ tokenizers in T2A generation by mitigating gradient conflicts and enhancing audio reconstruction fidelity. The use of reinforcement learning for alignment is innovative, although the complexity of the architecture may pose challenges for implementation and scalability.
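The RVQ dilemma the paper starts from can be seen in a toy residual vector quantizer: each additional codebook layer quantizes a smaller residual, so reconstruction fidelity improves while the information carried per layer shrinks. The codebook sizes and the per-stage scaling below (mimicking how trained codebooks adapt to the shrinking residual) are illustrative, not Siren's tokenizer.

```python
import torch

def rvq_encode(x, codebooks):
    """Toy residual vector quantization: each stage quantizes what earlier stages missed."""
    residual, recon, indices = x, torch.zeros_like(x), []
    for cb in codebooks:
        idx = torch.cdist(residual, cb).argmin(dim=1)   # nearest codeword per vector
        q = cb[idx]
        indices.append(idx)
        recon = recon + q
        residual = residual - q                         # deeper stages see smaller residuals
    return indices, recon

torch.manual_seed(0)
x = torch.randn(1000, 16)
# Per-stage scaling stands in for how learned codebooks adapt to the residual magnitude.
codebooks = [torch.randn(256, 16) * s for s in (1.0, 0.5, 0.25, 0.12)]
for n in range(1, len(codebooks) + 1):
    _, recon = rvq_encode(x, codebooks[:n])
    print(f"{n} RVQ layer(s): reconstruction MSE = {((x - recon) ** 2).mean():.3f}")
```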
The experiments are extensive and demonstrate that Siren outperforms both existing LM-based and diffusion-based systems, achieving state-of-the-art results. However, the paper mentions the use of a curated dataset smaller than those in prior work, which raises questions about the generalizability of the results. The evaluation metrics, particularly in terms of fidelity, are well-defined, but further comparisons with a broader range of benchmarks would strengthen the findings.
The paper provides a GitHub repository link for the implementation, which is crucial for reproducibility. However, details on the training process, hyperparameters, and specific datasets used are somewhat limited, which could hinder replication efforts by other researchers.
The authors acknowledge several limitations, including training efficiency due to the sequential training of transformer modules, the trade-off between model size and semantic richness, and the need for larger, more diverse datasets. Addressing these limitations in future work will be essential for advancing the field.
The work has significant implications for multi-modal generation frameworks, potentially enabling more cohesive integration of audio and text. By repositioning LMs as competitive in T2A tasks, it opens pathways for applications in content creation, gaming, and accessibility technologies.
We introduce MAVE (Mamba with Cross-Attention for Voice Editing and Synthesis), a novel autoregressive architecture for text-conditioned voice editing and high-fidelity text-to-speech (TTS) synthesis, built on a cross-attentive Mamba backbone. MAVE achieves state-of-the-art performance in speech editing and very competitive results in zero-shot TTS, while not being explicitly trained on the latter task, outperforming leading autoregressive and diffusion models on diverse, real-world audio. By integrating Mamba for efficient audio sequence modeling with cross-attention for precise text-acoustic alignment, MAVE enables context-aware voice editing with exceptional naturalness and speaker consistency. In pairwise human evaluations on a random 40-sample subset of the RealEdit benchmark (400 judgments), 57.2% of listeners rated MAVE-edited speech as perceptually equal to the original, while 24.8% preferred the original and 18.0% preferred the MAVE edit, demonstrating that in the majority of cases edits are indistinguishable from the source. MAVE compares favorably with VoiceCraft and FluentSpeech in both pairwise comparisons and standalone mean opinion score (MOS) evaluations. For zero-shot TTS, MAVE exceeds VoiceCraft in both speaker similarity and naturalness, without requiring multiple inference runs or post-processing. Remarkably, these quality gains come with a significantly lower memory cost and approximately the same latency: MAVE requires ~6x less memory than VoiceCraft during inference on utterances from the RealEdit database (mean duration: 6.21s, A100, FP16, batch size 1). Our results demonstrate that MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the synergistic integration of structured state-space modeling and cross-modal attention.
Primary: unknown
All Institutions: unknown
MAVE establishes a new standard for flexible, high-fidelity voice editing and synthesis through the integration of structured state-space modeling and cross-modal attention. The paper presents a compelling advancement in the field of audio processing, demonstrating significant improvements in both efficiency and quality, although it would benefit from enhanced reproducibility measures and a discussion of ethical implications.
The paper presents MAVE, an autoregressive architecture that integrates a cross-attentive mechanism with a Mamba backbone for voice editing and TTS synthesis. The methodology is well-structured, leveraging state-space modeling and cross-modal attention to achieve high fidelity in voice editing and synthesis. The use of cross-attention for text-acoustic alignment is particularly innovative, allowing for context-aware modifications to audio. The autoregressive nature of the model also suggests a thoughtful approach to sequence generation, although details on the training regimen and hyperparameter tuning could enhance the understanding of the model's performance.
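A schematic of the block structure implied above, with substitutions labeled plainly: a GRU stands in for the Mamba state-space mixer (whose selective-scan kernels are beyond this sketch), followed by cross-attention from audio tokens into text embeddings. The class `CrossAttentiveBlock` and all dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CrossAttentiveBlock(nn.Module):
    """Sketch of one decoder block: a left-to-right sequence mixer over audio tokens
    followed by cross-attention into text embeddings. A GRU stands in for Mamba here."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.mixer = nn.GRU(dim, dim, batch_first=True)        # placeholder for the SSM
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, audio_tokens, text_emb):
        mixed, _ = self.mixer(self.norm1(audio_tokens))        # causal audio mixing
        h = audio_tokens + mixed
        attended, _ = self.cross(self.norm2(h), text_emb, text_emb)
        return h + attended                                    # text-acoustic alignment

block = CrossAttentiveBlock()
audio = torch.randn(2, 120, 256)   # acoustic token embeddings
text = torch.randn(2, 30, 256)     # text embeddings
print(block(audio, text).shape)    # torch.Size([2, 120, 256])
```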
The experiments are robust, utilizing a variety of benchmarks, including the RealEdit dataset, to evaluate the model's performance. The human evaluation metrics, including pairwise comparisons and MOS scores, provide a comprehensive view of the model's effectiveness. The results indicate that MAVE not only matches but often exceeds existing models like VoiceCraft and FluentSpeech, particularly in terms of memory efficiency and naturalness. However, the paper could benefit from a more detailed analysis of the datasets used and the statistical significance of the results.
The paper lacks sufficient detail regarding the implementation specifics, such as the training process, data preprocessing, and evaluation metrics, which are crucial for reproducibility. While the architecture is described, the absence of code or a demo URL limits the ability for other researchers to replicate the findings. Including a link to a GitHub repository or supplementary materials would significantly enhance reproducibility.
One limitation is the reliance on subjective human evaluations, which can introduce variability and bias. Additionally, while the model shows promise in zero-shot TTS, the performance on diverse speaker characteristics and accents remains unexplored. The paper does not address potential ethical concerns related to voice synthesis technology, such as misuse in deepfakes or privacy violations.
The implications of MAVE are significant, particularly in applications like voice dubbing, personalized voice assistants, and content creation. The ability to edit voice recordings seamlessly has the potential to revolutionize industries reliant on audio content. However, the technology also raises ethical questions regarding consent and the potential for misuse, necessitating careful consideration in its deployment.
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate these two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework through continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism, which switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method can achieve or exceed current single-task modeling methods in both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation.
Primary: unknown
All Institutions: unknown
The paper presents UniVoice, a unified transformer framework that integrates autoregressive speech recognition with flow-matching-based speech synthesis. This work is significant as it explores a novel approach to joint modeling in speech processing, addressing critical limitations in current methodologies and demonstrating robust performance across multiple tasks.
The methodology presented in this paper is innovative, as it proposes a unified framework that integrates autoregressive speech recognition with flow-matching-based synthesis. The dual attention mechanism and text-prefix-guided speech infilling method are significant contributions that address the limitations of existing models that treat ASR and TTS as separate tasks. The continuous representation approach is a notable departure from traditional discrete tokenization methods, which often suffer from information loss. The paper also provides a clear description of the model architecture, training objectives, and attention mask design, which enhances the understanding of the proposed methods.
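The dual attention mechanism can be illustrated with a simple mask builder: a causal (upper-triangular) mask for the autoregressive ASR pass and a fully bidirectional mask for the flow-matching TTS pass. The actual UniVoice mask combines a causal text prefix with bidirectional speech spans inside one sequence, so this per-task version is a simplification.

```python
import torch

def build_attention_mask(seq_len, mode):
    """Illustrative dual attention mask: causal for recognition, bidirectional for synthesis.

    Returns a boolean mask where True marks positions that must NOT be attended
    (the convention used by torch.nn.MultiheadAttention's attn_mask).
    """
    if mode == "asr":        # autoregressive recognition: attend only to the past
        return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    if mode == "tts":        # flow-matching synthesis: full bidirectional attention
        return torch.zeros(seq_len, seq_len, dtype=torch.bool)
    raise ValueError(mode)

# The upper triangle is masked (True) for the causal ASR pass; nothing is masked for TTS.
print(build_attention_mask(4, "asr"))
print(build_attention_mask(4, "tts"))
```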
The experimental evaluation is thorough, utilizing the LibriHeavy dataset for both ASR and TTS tasks. The results demonstrate that UniVoice achieves competitive performance compared to state-of-the-art models in both domains, with specific metrics provided for robustness, similarity, and quality. The ablation studies effectively showcase the advantages of the proposed methods over baseline models, although the paper acknowledges trade-offs in performance when compared to specialized models.
The paper provides sufficient implementation details, including model architecture, training procedures, and evaluation metrics, which supports reproducibility. The authors also mention plans to open-source the code and checkpoints, which is a positive step towards enabling other researchers to replicate and build upon their work.
The paper identifies several limitations, including the focus on only ASR and TTS tasks, the relatively small dataset and model size, and the underutilization of the conversational capabilities of LLMs. These limitations suggest that while the work is a significant step forward, there is potential for further development and exploration in future research.
The unified framework proposed in this paper has the potential to advance the field of speech processing by enabling more efficient and effective models that can handle both recognition and synthesis tasks. This could lead to improvements in applications such as virtual assistants, automated transcription services, and voice cloning technologies, ultimately enhancing user experience and accessibility in various domains.
Large language models (LLMs) have recently shown strong potential in audio-visual speech recognition (AVSR), but their high computational demands and sensitivity to token granularity limit their practicality in resource-constrained settings. Token compression methods can reduce inference cost, but they require fixing a compression rate in advance and produce a single fixed-length output, offering no flexibility to balance information density and efficiency at inference time. Matryoshka representation learning (MRL) addresses this by enabling a single model to operate across multiple token granularities, allowing compression rates to be adjusted dynamically. However, current MRL-based methods treat each scale independently during training, limiting cross-scale generalization, robustness at high compression, and interpretability. To overcome these limitations, we propose MoME (Mixture of Matryoshka Experts), a novel framework that integrates sparse Mixture-of-Experts (MoE) into MRL-based LLMs for AVSR. MoME augments a frozen LLM with top-k routed and shared experts, allowing dynamic capacity allocation across scales and modalities. A shared router promotes consistent expert activation across granularities, enabling compressed sequences to benefit from representations learned at lower compression. Experiments on LRS2 and LRS3 demonstrate that MoME achieves state-of-the-art performance across AVSR, ASR, and VSR tasks, while requiring significantly fewer parameters and maintaining robustness under noise. MoME unifies the adaptability of MRL with the efficiency of MoE, offering a scalable and interpretable solution for resource-aware speech recognition.
Primary: Imperial College London
All Institutions: Imperial College London, NatWest AI Research
The main contribution of this paper is the introduction of MoME, a novel framework that effectively combines Mixture-of-Experts with Matryoshka representation learning to enhance audio-visual speech recognition. This work is significant as it addresses key challenges in the field, such as computational efficiency and robustness, while providing a scalable solution that can be adapted to various multimodal tasks. The methodology is well-founded, and the experimental results support the claims made, although the lack of reproducibility and statistical significance reporting could be improved.
The proposed MoME framework integrates a Mixture-of-Experts (MoE) architecture with Matryoshka representation learning (MRL) to enhance audio-visual speech recognition (AVSR). This approach is innovative as it allows for dynamic capacity allocation across different token granularities, addressing the limitations of existing MRL methods that treat scales independently. The incorporation of a shared router for expert activation across scales is a notable design choice that promotes cross-scale generalization and robustness, which is well-justified in the paper. The methodology is sound and shows a clear understanding of the challenges in AVSR.
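A compact sketch of a top-k routed MoE layer with an always-on shared expert and a router reused across token granularities, in the spirit of the shared-router design described above. The expert sizes, the value of k, and the dense dispatch loop (real MoE layers route tokens sparsely for efficiency) are simplifications and assumptions, not MoME's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Minimal sparse MoE layer with a router shared across token granularities
    (the same router scores tokens however strongly the sequence is compressed)."""
    def __init__(self, dim=256, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )
        self.shared_expert = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.k = k

    def forward(self, x):                        # x: (B, T, dim) at any compression rate
        scores = self.router(x)                  # shared router across all scales
        topv, topi = scores.topk(self.k, dim=-1)
        gates = F.softmax(topv, dim=-1)
        out = self.shared_expert(x)              # always-on shared expert
        for slot in range(self.k):
            idx, g = topi[..., slot], gates[..., slot: slot + 1]
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)  # dense dispatch, for clarity only
                if mask.any():
                    out = out + mask.float() * g * expert(x)
        return out

# The same layer serves sequences compressed to different token counts.
layer = TopKMoE()
for tokens in (192, 96, 48):                     # e.g. Matryoshka granularities
    print(tokens, layer(torch.randn(2, tokens, 256)).shape)
```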
The experiments conducted on LRS2 and LRS3 datasets demonstrate the effectiveness of MoME in achieving state-of-the-art performance across various tasks (AVSR, ASR, VSR) while maintaining fewer parameters. The paper provides a comprehensive analysis of different configurations and ablation studies, which strengthens the validity of the results. However, the lack of statistical significance reporting for the experimental results is a missed opportunity for deeper insight into the robustness of the findings.
The paper includes sufficient details on training and testing protocols, including data preprocessing, training hyperparameters, and model configurations. However, the absence of publicly available code or a clear path for reproducibility limits the practical application of the results. The authors mention that the code will be available upon acceptance, which is a positive step.
The paper acknowledges that while MoE models reduce inference-time computation, they still require all experts to reside in memory, which can lead to increased memory usage. Additionally, the inference cost is noted to be slightly higher than some baselines, which could be a concern for real-world applications. The focus on audio-visual models also limits the generalizability of the findings to other domains.
The paper discusses potential societal impacts, particularly the risks associated with deploying LLMs that may produce biased or inaccurate outputs. This acknowledgment is crucial, as it highlights the need for careful evaluation before practical applications. The authors suggest conducting thorough safety and fairness evaluations, which is a responsible approach to mitigate potential negative consequences.