We present MGAudio, a novel flow-based framework for open-domain video-to-audio generation, which introduces model-guided dual-role alignment as a central design principle. Unlike prior approaches that rely on classifier-based or classifier-free guidance, MGAudio enables the generative model to guide itself through a dedicated training objective designed for video-conditioned audio generation. The framework integrates three main components: (1) a scalable flow-based Transformer model, (2) a dual-role alignment mechanism where the audio-visual encoder serves both as a conditioning module and as a feature aligner to improve generation quality, and (3) a model-guided objective that enhances cross-modal coherence and audio realism. MGAudio achieves state-of-the-art performance on VGGSound, reducing FAD to 0.40, substantially surpassing the best classifier-free guidance baselines, and consistently outperforms existing methods across FD, IS, and alignment metrics. It also generalizes well to the challenging UnAV-100 benchmark. These results highlight model-guided dual-role alignment as a powerful and scalable paradigm for conditional video-to-audio generation. Code is available at: https://github.com/pantheon5100/mgaudio
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology, Northwestern Polytechnical University
MGAudio represents a significant advancement in the field of audio generation, introducing a novel model-guided framework that achieves state-of-the-art performance while addressing key challenges in video-to-audio synthesis. The methodology is robust, and the experimental results validate its effectiveness, making it a valuable contribution to the machine learning community.
The paper introduces MGAudio, a novel framework for video-to-audio generation that leverages model-guided dual-role alignment. The methodology is well-structured, incorporating a scalable flow-based Transformer model, a dual-role audio-visual encoder, and a model-guided objective. This approach effectively addresses the limitations of existing classifier-free guidance methods by providing direct model-based supervision, enhancing both training efficiency and audio generation quality. The dual-role alignment mechanism is particularly innovative, allowing the audio-visual encoder to serve dual functions that improve feature integration and generation coherence.
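To make the dual-role idea concrete, here is a minimal PyTorch sketch of a flow-matching training step in which the same precomputed audio-visual encoder features both condition the generator and serve as an alignment target for its intermediate features; the module names, dimensions, and loss weighting are illustrative assumptions, not MGAudio's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualRoleFlowStep(nn.Module):
    """Sketch: one flow-matching training step where precomputed audio-visual
    encoder features both condition the generator and supply an alignment target
    for its intermediate features (assumed design, not the paper's code)."""
    def __init__(self, latent_dim=64, cond_dim=128, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.velocity_head = nn.Linear(hidden, latent_dim)
        self.align_proj = nn.Linear(hidden, cond_dim)  # projects features for alignment

    def forward(self, x0, x1, cond):
        # x0: noise latents, x1: target audio latents, cond: AV-encoder features
        t = torch.rand(x0.size(0), 1)
        xt = (1 - t) * x0 + t * x1                     # linear interpolation path
        target_v = x1 - x0                             # flow-matching velocity target
        h = self.backbone(torch.cat([xt, cond, t], dim=-1))
        flow_loss = F.mse_loss(self.velocity_head(h), target_v)
        # dual role: the conditioning features also act as an alignment target
        align_loss = 1 - F.cosine_similarity(self.align_proj(h), cond, dim=-1).mean()
        return flow_loss + 0.5 * align_loss

step = DualRoleFlowStep()
loss = step(torch.randn(8, 64), torch.randn(8, 64), torch.randn(8, 128))
loss.backward()
```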
The experiments are comprehensive, demonstrating the effectiveness of MGAudio on the VGGSound and UnAV-100 benchmarks. The reported results, including a Fréchet Audio Distance (FAD) of 0.40, showcase significant improvements over previous state-of-the-art methods. The paper includes ablation studies that validate the contributions of each component, providing a thorough analysis of the model's performance across various metrics. The generalization capabilities of the model are also highlighted, indicating robustness across different datasets.
The paper provides sufficient implementation details, including training configurations, model architecture, and evaluation metrics, which facilitate reproducibility. The authors have made the code publicly available, further supporting the reproducibility of their results. However, more detailed descriptions of hyperparameter tuning and specific experimental setups could enhance reproducibility.
The paper acknowledges limitations in generating complex audio such as human vocalizations, indicating that the model may struggle with subtle visual cues. Additionally, the reliance on iterative sampling processes could hinder inference speed, which is a common challenge in generative models. The authors suggest future work could focus on improving efficiency and addressing these limitations.
The proposed framework has significant implications for various applications, including film production, video editing, and content creation, where realistic audio generation is crucial for enhancing viewer experience. The model's ability to generate coherent audio from video inputs could revolutionize workflows in multimedia production, making it more efficient and accessible.
Everyday speech conveys far more than words: it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. The dataset is available at: https://huggingface.co/JHU-SmileLab
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, University of Texas at Dallas, Carnegie Mellon University, ARRAY Innovation
The main contribution of this paper is the introduction of NaturalVoices, a large-scale, spontaneous podcast dataset designed for emotion-aware voice conversion, which addresses critical gaps in existing datasets and provides a valuable resource for advancing research in voice conversion and emotional speech synthesis. This work significantly enhances the field by offering a comprehensive dataset that captures the richness of real-world speech, along with a robust methodology for its utilization in voice conversion tasks.
The methodology presented in the paper is robust, focusing on the development of the NaturalVoices dataset, which is a significant advancement in the field of voice conversion. The authors provide a comprehensive automated data-sourcing pipeline that includes multi-level annotations for emotion, speaker identity, and speech quality. This systematic approach allows for flexible filtering and customization of the dataset, making it suitable for various voice conversion tasks. The use of podcasts as a source of spontaneous speech is particularly innovative, as it captures the natural emotional dynamics that are often absent in traditional datasets.
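As an illustration of the kind of flexible filtering the pipeline enables, the sketch below builds a custom subset from a hypothetical metadata table; the column names and thresholds are assumptions for illustration, not the dataset's actual schema.

```python
import pandas as pd

# Hypothetical metadata schema; the actual NaturalVoices annotation fields may differ.
meta = pd.DataFrame({
    "clip_id": ["a", "b", "c"],
    "emotion": ["happy", "neutral", "angry"],
    "arousal": [0.8, 0.4, 0.9],
    "snr_db": [22.0, 9.5, 18.0],
    "mos_quality": [4.1, 2.8, 3.9],
    "duration_s": [6.2, 3.1, 8.4],
})

def build_subset(df, emotions=("happy", "angry"), min_snr=15.0, min_mos=3.5):
    """Select expressive, clean clips for an emotion-aware VC training subset."""
    keep = (
        df["emotion"].isin(emotions)
        & (df["snr_db"] >= min_snr)
        & (df["mos_quality"] >= min_mos)
    )
    return df.loc[keep].reset_index(drop=True)

subset = build_subset(meta)
print(subset[["clip_id", "emotion", "snr_db"]])
```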
The experimental evaluation is thorough, demonstrating the dataset's utility through various voice conversion models. The authors conduct extensive experiments to assess the performance of state-of-the-art models trained on NaturalVoices, revealing both the strengths and limitations of current architectures. The results indicate that NaturalVoices not only supports high-quality voice conversion but also serves as a challenging benchmark, highlighting the need for more advanced models capable of leveraging the dataset's scale and variability.
The paper provides adequate details regarding the dataset construction and the automated processing pipeline, which enhances reproducibility. The authors have made the dataset publicly available along with the necessary tools for annotation and filtering, which facilitates further research. However, specific implementation details regarding the models used in experiments could be expanded to improve reproducibility.
One limitation identified in the paper is the variability in audio quality due to the nature of podcast recordings, which may affect the performance of voice conversion models. Additionally, while the dataset is extensive, it may still lack certain emotional nuances that could be captured through more diverse sources or additional annotations. The authors also acknowledge that current architectures struggle to fully exploit the dataset's complexity, indicating a gap in model capabilities.
The potential applications of NaturalVoices are significant, ranging from conversational speech synthesis to affective computing and deepfake detection. By providing a rich, expressive dataset, this work paves the way for advancements in voice conversion technologies that can better capture the nuances of human communication. The implications extend to various fields, including entertainment, accessibility, and human-computer interaction.
Automated monitoring of marine mammals in the St. Lawrence Estuary faces extreme challenges: calls range from low-frequency moans to ultrasonic clicks, often overlap, and are embedded in variable anthropogenic and environmental noise. We introduce a multi-step, attention-guided framework that first segments spectrograms to generate soft masks of biologically relevant energy and then fuses these masks with the raw inputs for multi-band, denoised classification. Image and mask embeddings are integrated via mid-level fusion, enabling the model to focus on salient spectrogram regions while preserving global context. Using real-world recordings from the Saguenay–St. Lawrence Marine Park Research Station in Canada, we demonstrate that segmentation-driven attention and mid-level fusion improve signal discrimination, reduce false positive detections, and produce reliable representations for operational marine mammal monitoring across diverse environmental conditions and signal-to-noise ratios. Beyond in-distribution evaluation, we further assess the generalization of Mask-Guided Classification (MGC) under distributional shifts by testing on spectrograms generated with alternative acoustic transformations. While high-capacity baseline models lose accuracy in this out-of-distribution (OOD) setting, MGC maintains stable performance, with even simple fusion mechanisms (gated, concat) achieving comparable results across distributions. This robustness highlights the capacity of MGC to learn transferable representations rather than overfitting to a specific transformation, thereby reinforcing its suitability for large-scale, real-world biodiversity monitoring. We show that in all experimental settings, the MGC framework consistently outperforms baseline architectures, yielding substantial gains in accuracy on both in-distribution and OOD data.
Primary: MILA – Quebec AI Institute
All Institutions: MILA – Quebec AI Institute, Université de Montréal, Université du Québec à Rimouski, Institut Polytechnique de Paris
This paper presents a significant advancement in underwater bioacoustic monitoring through the introduction of a robust, attention-guided framework that enhances the classification of marine mammal vocalizations in challenging acoustic environments. The methodology and results underscore its potential impact on ecological research and conservation efforts.
The paper introduces a multi-step, attention-guided framework that effectively segments spectrograms and generates soft masks for denoising and classification of underwater bioacoustic signals. The methodology leverages a novel Mask-Guided Classification (MGC) approach, integrating segmentation-driven attention and mid-level fusion, which enhances the model's ability to focus on biologically relevant signals while suppressing noise. The use of soft masks to encode spatial proximity is a significant innovation that improves the model's performance in noisy environments.
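A minimal sketch of what mid-level fusion of spectrogram and soft-mask embeddings could look like in PyTorch; the branch architecture and the gated fusion rule are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MidLevelFusionClassifier(nn.Module):
    """Sketch (assumed architecture): separate encoders for the spectrogram and
    its soft mask, fused mid-network with a gated combination."""
    def __init__(self, n_classes=4, ch=16):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.spec_enc = branch()
        self.mask_enc = branch()
        self.gate = nn.Sequential(nn.Linear(2 * ch, ch), nn.Sigmoid())
        self.head = nn.Linear(ch, n_classes)

    def forward(self, spec, soft_mask):
        zs, zm = self.spec_enc(spec), self.mask_enc(soft_mask)
        g = self.gate(torch.cat([zs, zm], dim=-1))   # gated fusion of the two embeddings
        fused = g * zs + (1 - g) * zm
        return self.head(fused)

model = MidLevelFusionClassifier()
logits = model(torch.randn(2, 1, 128, 256), torch.rand(2, 1, 128, 256))
print(logits.shape)  # torch.Size([2, 4])
```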
The experiments are robust, utilizing a real-world dataset from the Saguenay–St. Lawrence Marine Park, which provides a challenging benchmark for underwater bioacoustic classification. The authors demonstrate the effectiveness of their approach through comprehensive evaluations, including in-distribution and out-of-distribution scenarios. The results show substantial improvements in accuracy over baseline models, highlighting the framework's effectiveness in real-world applications.
The authors commit to reproducibility by providing detailed descriptions of their methodology, including the segmentation model training and fusion strategies. However, the dataset is not yet publicly available, which limits immediate reproducibility. They express intentions to release the code and data in the future, which is a positive step for the community.
The paper acknowledges limitations related to the STFT's resolution trade-offs and potential information loss, which may affect the model's ability to capture complex vocalizations fully. Additionally, the lack of explicit uncertainty quantification is noted as a critical area for improvement in future work.
The proposed framework has significant implications for biodiversity monitoring and conservation efforts, particularly in the context of increasing anthropogenic noise in marine environments. By improving the reliability of automated detection systems for marine mammals, the research contributes to ecological monitoring and supports conservation strategies.
The analysis of speech production based on physical models of the vocal folds and vocal tract is essential for studies on vocal-fold behavior and linguistic research. This paper proposes a speech production analysis method using physics-informed neural networks (PINNs). The networks are trained directly on the governing equations of vocal-fold vibration and vocal-tract acoustics. Vocal-fold collisions introduce nondifferentiability and vanishing gradients, challenging phenomena for PINNs. We demonstrate, however, that introducing a differentiable approximation function enables the analysis of vocal-fold vibrations within the PINN framework. The period of self-excited vocal-fold vibration is generally unknown. We show that by treating the period as a learnable network parameter, a periodic solution can be obtained. Furthermore, by implementing the coupling between glottal flow and vocal-tract acoustics as a hard constraint, glottis-tract interaction is achieved without additional loss terms. We confirmed the method's validity through forward and inverse analyses, demonstrating that the glottal flow rate, vocal-fold vibratory state, and subglottal pressure can be simultaneously estimated from speech signals. Notably, the same network architecture can be applied to both forward and inverse analyses, highlighting the versatility of this approach. The proposed method inherits the advantages of PINNs, including mesh-free computation and the natural incorporation of nonlinearities, and thus holds promise for a wide range of applications.
Primary: Nagaoka University of Technology
All Institutions: Nagaoka University of Technology
This paper presents a pioneering application of physics-informed neural networks to analyze speech production dynamics, significantly advancing the intersection of machine learning and acoustic modeling. The innovative methodology and comprehensive experimental validation contribute valuable insights to the field, with potential applications in voice analysis and synthesis.
The paper introduces a novel approach using physics-informed neural networks (PINNs) for speech production analysis, effectively addressing the challenges of glottal closure and unknown oscillation periods through differentiable approximation functions and treating the oscillation period as a learnable parameter. The methodology is well-structured, leveraging the strengths of PINNs to incorporate physical constraints directly into the network's loss function, which enhances the model's ability to simulate complex vocal dynamics.
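The following toy sketch illustrates the two key ideas in PINN form: a smooth (softplus) stand-in for the non-differentiable collision force and the oscillation period treated as a learnable parameter. The single-mass oscillator, constants, and driving pressure are simplifications for exposition, not the paper's model.

```python
import torch
import torch.nn as nn

class PeriodicPINN(nn.Module):
    """Toy PINN for a one-mass oscillator with a smooth (softplus) collision term
    and a learnable oscillation period; constants are illustrative only."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(),
                                 nn.Linear(64, 64), nn.Tanh(),
                                 nn.Linear(64, 1))
        self.log_T = nn.Parameter(torch.tensor(0.0))  # learnable period (log-scale)

    def displacement(self, tau):
        # periodic input features enforce x(tau) = x(tau + 1) exactly
        feats = torch.cat([torch.sin(2 * torch.pi * tau),
                           torch.cos(2 * torch.pi * tau)], dim=-1)
        return self.net(feats)

    def residual(self, tau, m=1.0, d=0.1, k=1.0, p=0.5, kc=5.0):
        tau = tau.requires_grad_(True)
        x = self.displacement(tau)
        dx = torch.autograd.grad(x.sum(), tau, create_graph=True)[0]
        ddx = torch.autograd.grad(dx.sum(), tau, create_graph=True)[0]
        T = self.log_T.exp()
        # chain rule: d/dt = (1/T) d/dtau; softplus(-x) is a smooth contact-force stand-in
        collision = kc * torch.nn.functional.softplus(-x)
        return m * ddx / T**2 + d * dx / T + k * x + collision - p

model = PeriodicPINN()
tau = torch.rand(256, 1)
loss = model.residual(tau).pow(2).mean()
loss.backward()
```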
The experiments conducted demonstrate the effectiveness of the proposed method through both forward and inverse analyses. The results show that the PINN can accurately synthesize vowel sounds and estimate vocal-fold states from speech signals, with performance metrics indicating a high degree of accuracy compared to conventional methods. The use of a variety of parameters and conditions in the simulations adds robustness to the findings.
The paper provides detailed implementation information, including the architecture of the neural networks, loss function formulation, and training procedures. However, the absence of a publicly available code repository limits the ease of reproducibility for external researchers who may wish to validate or build upon this work.
While the proposed method shows promise, there are limitations regarding the computational cost associated with training and the complexity of the model, which may hinder real-time applications. Additionally, the focus on vowel sounds may restrict the applicability of the findings to a broader range of speech sounds, such as consonants or more complex vocalizations.
The implications of this research extend to various fields, including linguistics, speech therapy, and voice synthesis technologies. By providing a more accurate model of speech production, this work could lead to advancements in diagnosing voice disorders and improving speech synthesis systems, ultimately enhancing human-computer interaction and communication technologies.
Deploying accurate event detection on resource-constrained devices is challenged by the trade-off between performance and computational cost. While Early-Exit (EE) networks offer a solution through adaptive computation, they often fail to enforce a coherent hierarchical structure, limiting the reliability of their early predictions. To address this, we propose Hyperbolic Early-Exit networks (HypEE), a novel framework that learns EE representations in the hyperbolic space. Our core contribution is a hierarchical training objective with a novel entailment loss, which enforces a partial-ordering constraint to ensure that deeper network layers geometrically refine the representations of shallower ones. Experiments on multiple audio event detection tasks and backbone architectures show that HypEE significantly outperforms standard Euclidean EE baselines, especially at the earliest, most computationally-critical exits. The learned geometry also provides a principled measure of uncertainty, enabling a novel triggering mechanism that makes the overall system both more efficient and more accurate than conventional EE models and standard backbones without early exits.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Hyperbolic Early-Exit networks, which improve the efficiency and accuracy of event detection in resource-constrained environments through a novel hierarchical training objective. This work represents a meaningful advancement in the field of machine learning for audio applications, particularly in the context of adaptive computation and early-exit strategies.
The proposed Hyperbolic Early-Exit networks (HypEE) leverage hyperbolic geometry to enhance the representation learning of early-exit networks. The introduction of a hierarchical training objective with a novel entailment loss is a significant methodological advancement, as it enforces a partial-ordering constraint that ensures deeper layers refine the representations of shallower layers. This approach is innovative in the context of early-exit networks, which typically rely on Euclidean space representations. The methodology is well-structured, with a clear explanation of how the hyperbolic space is utilized to improve both efficiency and accuracy.
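To illustrate the flavor of such a partial-ordering constraint, the sketch below embeds exit features on the Poincaré ball and penalizes deeper-exit embeddings that fail a simple cone-like condition relative to shallower ones; this surrogate is an assumption for exposition, not the paper's exact entailment loss.

```python
import torch
import torch.nn.functional as F

def expmap0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball (curvature -c)."""
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(c ** 0.5 * norm) * v / (c ** 0.5 * norm)

def entailment_surrogate(shallow, deep, margin=0.05, angle_margin=0.5):
    """Crude surrogate for a partial-ordering constraint: the deeper exit's
    embedding should lie farther from the origin than the shallower one and
    point in a similar direction (a cone-like condition). Assumed simplification,
    not the paper's loss."""
    zs, zd = expmap0(shallow), expmap0(deep)
    radial = F.relu(zs.norm(dim=-1) - zd.norm(dim=-1) + margin)  # deep should be "more specific"
    cos = F.cosine_similarity(zs, zd, dim=-1)
    angular = F.relu(angle_margin - cos)                          # stay inside a cone around zs
    return (radial + angular).mean()

loss = entailment_surrogate(torch.randn(16, 32) * 0.1, torch.randn(16, 32) * 0.1)
print(float(loss))
```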
The experiments conducted on multiple audio event detection tasks demonstrate the effectiveness of the HypEE framework. The authors compare their method against standard Euclidean early-exit baselines, showcasing significant improvements, especially at the earliest exits where computational resources are most limited. The use of various backbone architectures strengthens the validity of the results. However, the paper could benefit from additional datasets to further validate the generalizability of the findings.
The paper lacks detailed implementation specifics, such as hyperparameter settings and the exact architecture configurations used in experiments, which are crucial for reproducibility. While the results are compelling, the absence of a publicly available code repository or supplementary materials limits the ability for others to replicate the study.
One limitation of the proposed method is its reliance on hyperbolic geometry, which may not be universally applicable across all types of data or tasks. Additionally, while the results are promising, the experiments are primarily focused on audio event detection, and further exploration in other domains would be beneficial to assess the broader applicability of the HypEE framework.
The implications of this research are significant, particularly for deploying machine learning models on resource-constrained devices, such as mobile phones and IoT devices. By improving the efficiency and accuracy of early-exit networks, the proposed framework could enhance real-time audio processing applications, leading to advancements in areas such as smart home technology, assistive devices, and environmental monitoring.
Recent advances in Audio-Language Models (ALMs) have significantly improved multimodal understanding capabilities. However, the introduction of the audio modality also brings new and unique vulnerability vectors. Previous studies have proposed jailbreak attacks that specifically target ALMs, revealing that defenses directly transferred from traditional audio adversarial attacks or text-based Large Language Model (LLM) jailbreaks are largely ineffective against these ALM-specific threats. To address this issue, we propose ALMGuard, the first defense framework tailored to ALMs. Based on the assumption that safety-aligned shortcuts naturally exist in ALMs, we design a method to identify universal Shortcut Activation Perturbations (SAPs) that serve as triggers that activate the safety shortcuts to safeguard ALMs at inference time. To better sift out effective triggers while preserving the model's utility on benign tasks, we further propose Mel-Gradient Sparse Mask (M-GSM), which restricts perturbations to Mel-frequency bins that are sensitive to jailbreaks but insensitive to speech understanding. Both theoretical analyses and empirical results demonstrate the robustness of our method against both seen and unseen attacks. Overall, ALMGuard reduces the average success rate of advanced ALM-specific jailbreak attacks to 4.6% across four models, while maintaining comparable utility on benign benchmarks, establishing it as the new state of the art. Our code and data are available at https://github.com/WeifeiJin/ALMGuard.
Primary: The University of Adelaide
All Institutions: The University of Adelaide, Beijing University of Posts and Telecommunications, CSIRO's Data61, Responsible AI Research (RAIR) Centre, Tsinghua University
The paper presents ALMGuard, a pioneering framework that activates inherent safety shortcuts in Audio-Language Models to mitigate jailbreak attacks while preserving model utility. This work is significant as it addresses a critical gap in the security of multimodal AI systems, proposing a targeted and effective defense mechanism that enhances the robustness of ALMs against emerging threats.
The paper introduces a novel framework, ALMGuard, which effectively identifies and activates safety shortcuts in Audio-Language Models (ALMs) using Shortcut Activation Perturbations (SAPs) guided by a Mel-Gradient Sparse Mask (M-GSM). This approach is innovative as it specifically addresses the unique vulnerabilities of ALMs, which have not been adequately covered by existing defenses. The methodology is well-structured, combining theoretical analysis with empirical validation, and presents a clear optimization strategy that balances defense effectiveness with model utility.
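A rough sketch of the mechanism as described: select Mel bins by comparing gradient magnitudes of a jailbreak-related loss against a utility loss, then optimize one universal perturbation restricted to those bins. The scoring rule, tensor shapes, and the stand-in safety objective are assumptions for illustration, not the paper's actual procedure.

```python
import torch

def mel_sparse_mask(grad_jail, grad_util, keep_ratio=0.2):
    """Pick Mel bins whose gradient magnitude is large for the jailbreak loss
    but small for the utility loss (an assumed scoring rule)."""
    score = grad_jail.abs().mean(0) - grad_util.abs().mean(0)   # (n_mels,)
    k = max(1, int(keep_ratio * score.numel()))
    mask = torch.zeros_like(score)
    mask[score.topk(k).indices] = 1.0
    return mask

def optimize_sap(mel_batch, safety_loss_fn, mask, steps=100, lr=1e-2, eps=0.5):
    """Optimize one universal additive perturbation over masked Mel bins so that
    a surrogate safety loss is minimized on a batch of Mel spectrograms."""
    delta = torch.zeros(mel_batch.shape[-2], 1, requires_grad=True)  # (n_mels, 1), shared over time
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        loss = safety_loss_fn(mel_batch + mask.unsqueeze(-1) * delta)
        opt.zero_grad(); loss.backward(); opt.step()
        delta.data.clamp_(-eps, eps)                                  # keep the trigger small
    return delta.detach()

# toy usage with a stand-in "safety" objective
mels = torch.randn(4, 80, 200)                                        # (batch, n_mels, frames)
mask = mel_sparse_mask(torch.randn(4, 80), torch.randn(4, 80))
sap = optimize_sap(mels, lambda m: m.pow(2).mean(), mask, steps=5)
print(sap.shape)
```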
The experiments are comprehensive, evaluating the proposed method against multiple state-of-the-art ALMs and a variety of jailbreak attacks. The results demonstrate a significant reduction in the success rates of these attacks while maintaining model performance on benign tasks, showcasing the robustness and generalizability of the approach. The use of diverse datasets and benchmarks strengthens the evaluation.
The paper provides sufficient implementation details, including the optimization process and hyperparameter settings. The availability of code on GitHub enhances reproducibility, allowing other researchers to validate the findings and potentially build upon the work.
While the method shows strong performance against acoustic-based attacks, it exhibits some limitations in defending against semantic-based attacks. The paper acknowledges this and suggests that future work could integrate semantic-level objectives to enhance robustness further. Additionally, the dependency on specific hyperparameters may require careful tuning in practical applications.
The framework has significant implications for enhancing the safety and reliability of ALMs in real-world applications, particularly in critical systems like robotics and virtual assistants. By addressing vulnerabilities unique to audio inputs, this research contributes to the broader goal of developing more secure AI systems that can be trusted in sensitive contexts.
The evaluation of intelligibility for TTS has reached a bottleneck, as existing assessments heavily rely on word-by-word accuracy metrics such as WER, which fail to capture the complexity of real-world speech or reflect human comprehension needs. To address this, we propose Spoken-Passage Multiple-Choice Question Answering, a novel subjective approach evaluating the accuracy of key information in synthesized speech, and release SP-MCQA-Eval, an 8.76-hour news-style benchmark dataset for SP-MCQA evaluation. Our experiments reveal that low WER does not necessarily guarantee high key-information accuracy, exposing a gap between traditional metrics and practical intelligibility. SP-MCQA shows that even state-of-the-art (SOTA) models still lack robust text normalization and phonetic accuracy. This work underscores the urgent need for high-level, more life-like evaluation criteria now that many systems already excel at WER yet may fall short on real-world intelligibility.
Primary: The Chinese University of Hong Kong
All Institutions: The Chinese University of Hong Kong
The paper introduces SP-MCQA, a novel framework for evaluating the intelligibility of TTS systems beyond traditional metrics. It significantly advances the field by addressing critical gaps in current evaluation methods and providing a comprehensive dataset for future research.
The proposed methodology, SP-MCQA, introduces a novel subjective evaluation framework that shifts the focus from traditional word-level accuracy metrics to assessing key information accuracy in synthesized speech. This approach is innovative as it addresses the limitations of existing metrics like WER and MOS, which do not adequately reflect human comprehension needs. The detailed construction of the SP-MCQA-Eval dataset, with its emphasis on challenging real-world speech scenarios, is commendable and provides a robust foundation for evaluating TTS systems.
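The gap between word-level and key-information accuracy can be made concrete with two simple scoring functions; the example transcript, questions, and answer keys below are invented for illustration and do not come from SP-MCQA-Eval.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via edit distance (no external dependencies)."""
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1): d[i][0] = i
    for j in range(len(h) + 1): d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i-1][j-1] + (r[i-1] != h[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(r)][len(h)] / max(1, len(r))

def key_info_accuracy(answers, keys):
    """Fraction of multiple-choice questions answered correctly by listeners."""
    return sum(a == k for a, k in zip(answers, keys)) / len(keys)

# Illustrative gap: a synthesis can score a low WER yet garble one key fact
# (e.g. a mis-normalized number), so MCQ answers about that fact fail.
print(wer("the deal was worth fifteen million dollars",
          "the deal was worth fifty million dollars"))
print(key_info_accuracy(["B", "C", "A"], ["B", "D", "A"]))
```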
The experiments conducted are thorough and well-structured, utilizing a diverse set of state-of-the-art TTS models. The evaluation metrics employed, including both subjective and objective assessments, provide a comprehensive view of model performance. The results reveal significant insights into the limitations of current TTS systems, particularly regarding phonetic accuracy and text normalization, which are critical for real-world applications.
The paper provides sufficient details regarding the experimental setup, including the selection of models and the evaluation process. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Future work should focus on making the methodology and datasets accessible to encourage further research in this area.
The primary limitation identified is the substantial manual effort required for human evaluation, which could hinder scalability. Additionally, the dataset may not encompass all possible speech variations, potentially limiting the generalizability of the findings. The reliance on human annotators also introduces variability that could affect the consistency of results.
This research has significant implications for the development of more intelligent TTS systems that prioritize human comprehension over mere word accuracy. By highlighting the importance of key information accuracy, it encourages the design of TTS systems that are better suited for real-world applications, potentially enhancing user experience in various domains, including education, media, and accessibility.
Audio-visual speech enhancement (AVSE) is a task that uses visual auxiliary information to extract a target speaker's speech from mixed audio. Real-world scenarios often involve complex acoustic environments with various interfering sounds and reverberation. Most previous methods struggle to cope with such complex conditions, resulting in poor perceptual quality of the extracted speech. In this paper, we propose an effective AVSE system that performs well in complex acoustic environments. Specifically, we design a "separation before dereverberation" pipeline that can be extended to other AVSE networks. The 4th COGMHEAR Audio-Visual Speech Enhancement Challenge (AVSEC) aims to explore new approaches to speech processing in multimodal complex environments. We validated the performance of our system in AVSEC-4: we achieved excellent results on the three objective metrics on the competition leaderboard, and ultimately secured first place in the human subjective listening test.
Primary: Wuhan University
All Institutions: Wuhan University, Duke Kunshan University, School of Artificial Intelligence, School of Computer Science, School of Cyber Science and Engineering, Suzhou Municipal Key Laboratory of Multimodal Intelligent Systems
This paper presents a significant advancement in audio-visual speech enhancement by introducing a novel separation and dereverberation pipeline, demonstrating strong performance in complex acoustic environments. The methodology and results contribute meaningfully to the field, although challenges related to reproducibility and practical application remain.
The proposed methodology introduces a novel "separation before dereverberation" approach, which effectively decouples the speech extraction and dereverberation processes. This two-stage method, combined with progressive loss training, enhances the model's ability to handle complex acoustic environments. The integration of visual features and the use of a pre-trained dereverberation model as a post-processing step are significant advancements in the AVSE domain. However, the reliance on specific architectures like TFGridNet and the choice of dereverberation models may limit the generalizability of the approach.
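A minimal sketch of how a progressive loss over a two-stage separation-then-dereverberation pipeline could be expressed, using negative SI-SDR at each stage; the stage targets and weights are assumptions, not the authors' exact training objective.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB for (batch, samples) tensors."""
    ref = ref - ref.mean(dim=-1, keepdim=True)
    est = est - est.mean(dim=-1, keepdim=True)
    proj = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum(-1) / (noise.pow(2).sum(-1) + eps) + eps)

def progressive_loss(stage_outputs, targets, weights=(0.3, 0.7)):
    """Weighted negative SI-SDR over intermediate and final stage outputs.
    The two-stage split (separated-but-reverberant target, then dry target)
    and the weights are assumptions for illustration."""
    losses = [-si_sdr(o, t).mean() for o, t in zip(stage_outputs, targets)]
    return sum(w * l for w, l in zip(weights, losses))

sep_out, derev_out = torch.randn(2, 16000), torch.randn(2, 16000)
reverb_clean, dry_clean = torch.randn(2, 16000), torch.randn(2, 16000)
print(float(progressive_loss([sep_out, derev_out], [reverb_clean, dry_clean])))
```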
The experiments are robust, utilizing a well-structured dataset from the AVSEC-4 challenge, which simulates complex real-world conditions. The paper reports competitive results across multiple objective metrics and achieves first place in subjective evaluations, indicating strong performance. The detailed breakdown of results and the inclusion of ablation studies provide valuable insights into the effectiveness of the proposed methods.
The paper includes implementation details such as the training strategy, loss functions, and dataset characteristics, which are essential for reproducibility. However, the lack of a publicly available code repository may hinder full reproducibility of the results.
One limitation is the potential over-optimization of noise removal, which can affect the natural timbre of the speech. Additionally, the long training times associated with certain strategies may limit practical applications. The paper does not address the scalability of the proposed methods to other AVSE tasks or environments.
The advancements in audio-visual speech enhancement have significant implications for applications in communication technologies, hearing aids, and assistive devices, particularly in noisy environments. The methodology could be extended to improve speech recognition systems and enhance user experiences in various multimodal applications.
We present PitchFlower, a flow-based neural audio codec with explicit pitch controllability. Our approach enforces disentanglement through a simple perturbation: during training, F0 contours are flattened and randomly shifted, while the true F0 is provided as conditioning. A vector-quantization bottleneck prevents pitch recovery, and a flow-based decoder generates high quality audio. Experiments show that PitchFlower achieves more accurate pitch control than WORLD at much higher audio quality, and outperforms SiFiGAN in controllability while maintaining comparable quality. Beyond pitch, this framework provides a simple and extensible path toward disentangling other speech attributes.
Primary: Sorbonne Université
All Institutions: Sorbonne Université
PitchFlower introduces a novel flow-based neural audio codec that achieves explicit pitch controllability through a unique perturbation strategy, marking a significant advancement in the field of audio processing and speech synthesis. The comprehensive evaluation of disentanglement strategies and the rigorous experimental validation underscore its potential impact on future research and applications in audio technology.
The methodology presented in PitchFlower is innovative, leveraging a flow-based architecture to achieve explicit pitch controllability through a perturbation strategy. The use of a vector-quantization bottleneck and conditioning on true F0 during training is a clever approach to enforce disentanglement. The authors systematically compare various disentanglement strategies, providing a comprehensive analysis of their trade-offs, which adds depth to the methodology. However, the reliance on WORLD for pitch extraction may introduce inherent biases and limitations.
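The flatten-and-shift perturbation can be sketched in a few lines; the semitone range and the treatment of unvoiced frames here are assumptions for illustration, not the paper's exact training recipe.

```python
import numpy as np

def flatten_and_shift_f0(f0_hz, max_shift_semitones=4.0, rng=None):
    """Flatten a voiced F0 contour to its (log-domain) mean and apply a random
    global shift in semitones; unvoiced frames (f0 == 0) are left untouched.
    A sketch of the perturbation idea, not the exact recipe."""
    rng = rng or np.random.default_rng()
    f0 = np.asarray(f0_hz, dtype=float).copy()
    voiced = f0 > 0
    if voiced.any():
        mean_log_f0 = np.log(f0[voiced]).mean()
        shift = rng.uniform(-max_shift_semitones, max_shift_semitones) / 12.0 * np.log(2.0)
        f0[voiced] = np.exp(mean_log_f0 + shift)      # flat contour at a shifted pitch
    return f0

f0 = np.array([0, 110.0, 115.0, 120.0, 0, 118.0])
print(flatten_and_shift_f0(f0, rng=np.random.default_rng(0)))
```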
The experiments are well-designed, utilizing the LibriTTS dataset and employing a range of objective metrics to evaluate pitch control, audio quality, and intelligibility. The results demonstrate that PitchFlower outperforms traditional DSP-based methods and competes effectively with state-of-the-art neural approaches. The comparative analysis with other methods such as SiFiGAN and WORLD is thorough, providing a clear understanding of PitchFlower's advantages and limitations.
The paper provides sufficient implementation details, including architecture specifications, training parameters, and evaluation metrics, which enhances reproducibility. The authors also share their code on GitHub, which is a positive aspect for the research community. However, the paper could benefit from additional details regarding the dataset preprocessing and specific hyperparameter tuning.
One of the main limitations noted is the model's performance being constrained by the F0 range observed during training, which may affect its applicability in broader contexts. Additionally, the introduction of artifacts from WORLD during the perturbation process could impact the quality of the output audio, particularly in terms of speaker similarity.
The implications of this work are significant for applications in speech synthesis, music production, and voice modulation technologies. The ability to control pitch with high fidelity can enhance user experiences in various domains, including entertainment and assistive technologies. Furthermore, the framework's potential for disentangling other speech attributes could lead to advancements in personalized audio applications.
Self-Supervised Learning (SSL) excels at learning generic representations of acoustic signals, yet prevailing methods remain domain-specific, tailored to either speech or general audio, hindering the development of a unified representation model with a comprehensive capability over both domains. To address this, we present SPEAR (SPEech and Audio Representations), the first SSL framework to successfully learn unified speech and audio representations from a mixture of speech and audio data. SPEAR proposes a unified pre-training objective based on masked prediction of fine-grained discrete tokens for both speech and general audio. These tokens are derived from continuous speech and audio representations using a Multi-codebook Vector Quantisation (MVQ) method, retaining rich acoustic detail essential for modelling both speech and complex audio events. SPEAR is applied to pre-train both single-domain and unified speech-and-audio SSL models. Our speech-domain model establishes a new state-of-the-art on the SUPERB benchmark, a speech processing benchmark for SSL models, matching or surpassing the highly competitive WavLM Large on 12 out of 15 tasks with the same pre-training corpora and a similar model size. Crucially, our unified model learns complementary features and demonstrates comprehensive capabilities across two major benchmarks, SUPERB and HEAR, for evaluating audio representations. By further scaling up the model size and pre-training data, we present a unified model with 600M parameters that excels in both domains, establishing it as one of the most powerful and versatile open-source SSL models for auditory understanding. The inference code and pre-trained models will be made publicly available.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai Artificial Intelligence Laboratory, Tsinghua University, University of Cambridge
The main contribution of this paper is the introduction of SPEAR, a unified self-supervised learning framework that effectively learns representations for both speech and audio, achieving state-of-the-art results and demonstrating comprehensive capabilities across multiple benchmarks. This work represents a significant advancement in the field of audio representation learning, addressing a critical need for models that can operate across different acoustic domains.
The methodology presented in SPEAR is robust and innovative, focusing on a unified self-supervised learning framework that integrates speech and audio representation learning. The use of masked prediction of fine-grained discrete tokens derived from continuous representations via Multi-codebook Vector Quantisation (MVQ) is particularly noteworthy, as it retains essential acoustic details that are crucial for both speech and audio processing. This approach addresses a significant gap in the existing literature where models are typically domain-specific. The framework's ability to pre-train models effectively across both domains is a substantial contribution to the field.
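A toy sketch of the two ingredients described above: a multi-codebook tokenizer that assigns each frame one token per codebook, and a masked-prediction cross-entropy computed only on masked positions. Codebook sizes, tensor shapes, and the masking rate are illustrative assumptions, not SPEAR's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVQTokenizer(nn.Module):
    """Toy multi-codebook quantizer: each codebook independently assigns the
    nearest codeword to every frame, yielding one token stream per codebook."""
    def __init__(self, dim=32, n_codebooks=4, codebook_size=256):
        super().__init__()
        self.codebooks = nn.Parameter(torch.randn(n_codebooks, codebook_size, dim))

    @torch.no_grad()
    def forward(self, feats):                        # feats: (batch, frames, dim)
        d = torch.cdist(feats.unsqueeze(1), self.codebooks.unsqueeze(0))  # (B, C, T, K)
        return d.argmin(-1)                          # token ids: (B, C, T)

def masked_prediction_loss(logits, tokens, mask):
    """Cross-entropy on masked frames only; logits: (B, C, T, K), tokens: (B, C, T)."""
    ce = F.cross_entropy(logits.flatten(0, 2), tokens.flatten(), reduction="none")
    return (ce * mask.flatten().float()).sum() / mask.sum().clamp_min(1)

tok = MVQTokenizer()
feats = torch.randn(2, 50, 32)
tokens = tok(feats)                                  # (2, 4, 50)
logits = torch.randn(2, 4, 50, 256, requires_grad=True)
mask = torch.rand(2, 4, 50) < 0.4
loss = masked_prediction_loss(logits, tokens, mask)
loss.backward()
```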
The experimental setup is comprehensive, with the authors validating their models on established benchmarks such as SUPERB and HEAR. The results indicate that the speech-domain model achieves state-of-the-art performance, surpassing competitive models like WavLM Large. The unified model also demonstrates complementary feature learning, which is a critical aspect of its design. However, the paper could benefit from a more detailed analysis of the datasets used and the specific metrics for evaluation.
While the authors mention that the inference code and pre-trained models will be made publicly available, the paper lacks detailed implementation specifics that would facilitate reproducibility. Clear documentation of the training process, hyperparameters, and data preprocessing steps would enhance the reproducibility of the results.
One limitation of the study is the potential overfitting to the benchmarks used for evaluation, as the results are primarily reported on a limited set of tasks. Additionally, the scalability of the model with respect to diverse audio datasets outside the tested benchmarks remains to be explored. The paper could also address the computational resources required for training such large models, which may limit accessibility for some researchers.
The SPEAR framework has significant implications for various applications in speech and audio processing, including voice recognition, audio event detection, and multimedia content analysis. By providing a unified model, it opens avenues for cross-domain applications and could lead to advancements in areas such as human-computer interaction and accessibility technologies.
State-of-the-art vocal separation models like Mel-Band-Roformer rely on full temporal self-attention mechanisms, where each temporal frame interacts with every other frame. This incurs a heavy computational cost that scales quadratically with input audio length, motivating chunking and windowing approaches. Through analysis of a pre-trained vocal separation model, we discovered that temporal attention patterns are highly localized. Building on this insight, we replaced full attention with windowed sink attention (WSA), which uses a small temporal attention window together with attention sinks. We show empirically that fine-tuning from the original checkpoint recovers 92% of the original SDR performance while reducing FLOPs by 44.5x. We release our code and checkpoints under the MIT license at https://github.com/smulelabs/windowed-roformer.
Primary: Smule Labs
All Institutions: Smule Labs
The main contribution of this paper is the introduction of Windowed Sink Attention (WSA) for vocal source separation, which effectively reduces computational complexity while preserving separation quality. This work significantly advances the field by demonstrating that understanding the structure of audio data can lead to more efficient model architectures, ultimately making sophisticated audio processing techniques more accessible.
The paper introduces a novel approach to vocal source separation by replacing full temporal self-attention with Windowed Sink Attention (WSA), which significantly reduces computational costs while maintaining performance. The methodology is well-structured, beginning with an analysis of existing attention patterns that reveal the locality of temporal attention. The proposed WSA mechanism effectively captures local dependencies while incorporating global context through sink tokens. The use of knowledge distillation to fine-tune the model is a sound strategy, ensuring that the model retains performance while benefiting from the reduced computational complexity.
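A single-head sketch of windowed attention with sink tokens: each frame attends to its local temporal neighborhood plus a few fixed sink positions. The window size and number of sinks below are illustrative, not the released model's configuration.

```python
import torch

def windowed_sink_attention(q, k, v, window=8, n_sinks=2):
    """Single-head attention where each frame attends only to frames within a
    local temporal window plus the first `n_sinks` frames acting as attention
    sinks. Shapes: (batch, time, dim)."""
    B, T, D = q.shape
    idx = torch.arange(T)
    local = (idx[:, None] - idx[None, :]).abs() <= window    # (T, T) local window
    sinks = torch.zeros(T, T, dtype=torch.bool)
    sinks[:, :n_sinks] = True                                 # every frame sees the sink frames
    allowed = local | sinks
    scores = q @ k.transpose(-1, -2) / D ** 0.5               # (B, T, T)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 64, 32)
out = windowed_sink_attention(x, x, x)
print(out.shape)  # torch.Size([2, 64, 32])
```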
The experiments are rigorous, utilizing a well-known dataset (MUSDB18HQ) and comparing the performance of the WSA model against the original Mel-Band-Roformer. The evaluation metrics, including SDR, Fullness, and Bleedless, provide a comprehensive assessment of the model's performance. The results demonstrate that the WSA model achieves a 44.5x reduction in FLOPs while retaining over 90% of the original model's performance, highlighting the effectiveness of the proposed approach.
The authors have made their code and checkpoints publicly available under the MIT license, which enhances reproducibility. The paper provides sufficient detail regarding the implementation of the WSA mechanism and the training setup, allowing other researchers to replicate the experiments. However, the lack of a demo URL limits immediate accessibility for practical applications.
While the WSA model shows significant improvements in computational efficiency, the paper notes a degradation in the Bleedless score, indicating increased interference from non-target sources. This suggests that while the model is efficient, it may require further refinement to improve separation quality in challenging audio segments. Additionally, the choice of window size and the number of sink tokens could benefit from further exploration to optimize performance.
The findings have potential implications for the deployment of vocal separation models in resource-constrained environments, such as mobile devices and edge computing. By reducing computational requirements, the proposed method could enable broader access to advanced audio processing technologies, facilitating applications in music production, accessibility tools, and content analysis. The insights gained regarding the locality of attention patterns may also inspire future research in other domains of audio processing and machine learning. The main contribution of this paper is the introduction of Windowed Sink Attention (WSA) for vocal source separation, which effectively reduces computational complexity while preserving separation quality. This work significantly advances the field by demonstrating that understanding the structure of audio data can lead to more efficient model architectures, ultimately making sophisticated audio processing techniques more accessible.
Recent synthetic speech detection models typically adapt a pre-trained SSL model via finetuning, which is computationally demanding. Parameter-Efficient Fine-Tuning (PEFT) offers an alternative. However, existing methods lack the specific inductive biases required to model the multi-scale temporal artifacts characteristic of spoofed audio. This paper introduces the Multi-Scale Convolutional Adapter (MultiConvAdapter), a parameter-efficient architecture designed to address this limitation. MultiConvAdapter integrates parallel convolutional modules within the SSL encoder, facilitating the simultaneous learning of discriminative features across multiple temporal resolutions, capturing both short-term artifacts and long-term distortions. With only $3.17$M trainable parameters ($1\%$ of the SSL backbone), MultiConvAdapter substantially reduces the computational burden of adaptation. Evaluations on five public datasets demonstrate that MultiConvAdapter achieves superior performance compared to full fine-tuning and established PEFT methods.
Primary: Technical University of Berlin
All Institutions: Technical University of Berlin
The main contribution of this paper is the introduction of the Multi-Scale Convolutional Adapter, a parameter-efficient architecture that significantly improves synthetic speech detection by capturing multi-scale temporal artifacts with a minimal number of trainable parameters. This work represents a meaningful advancement in the field of audio processing, particularly in combating the challenges posed by synthetic speech technologies.
The proposed Multi-Scale Convolutional Adapter (MultiConvAdapter) introduces a novel architecture that integrates parallel convolutional modules within a self-supervised learning (SSL) encoder, allowing for the simultaneous learning of features across multiple temporal resolutions. This approach effectively addresses the limitations of existing parameter-efficient fine-tuning (PEFT) methods by incorporating inductive biases tailored for the temporal characteristics of synthetic speech. The methodology is well-structured, leveraging depthwise convolutions to maintain a low parameter count while enhancing the model's ability to capture both short-term and long-term artifacts in spoofed audio.
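To make the adapter design concrete, here is a minimal sketch assuming a bottleneck projection wrapped around parallel depthwise 1-D convolutions with different kernel sizes; the dimensions, kernel sizes, and insertion points are illustrative and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvAdapter(nn.Module):
    """Bottleneck adapter with parallel depthwise convolutions at several temporal scales,
    added residually to the frozen SSL hidden states (illustrative hyperparameters)."""
    def __init__(self, dim: int = 768, bottleneck: int = 64, kernel_sizes=(3, 7, 15)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.branches = nn.ModuleList([
            nn.Conv1d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in kernel_sizes
        ])
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):                                   # x: (batch, time, dim)
        h = self.act(self.down(x)).transpose(1, 2)          # -> (batch, bottleneck, time)
        h = sum(branch(h) for branch in self.branches)      # fuse short- and long-range context
        return x + self.up(h.transpose(1, 2))               # residual refinement of frozen features
```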
The experiments conducted on five public datasets demonstrate the robustness and effectiveness of MultiConvAdapter compared to full fine-tuning and other PEFT methods. The results show a significant reduction in Equal Error Rate (EER) across various configurations, indicating that the proposed method not only achieves superior performance but also does so with a fraction of the parameters. The comprehensive evaluation across diverse datasets reflects a thorough understanding of the challenges in synthetic speech detection.
The paper provides sufficient implementation details, including the architecture of the SSL backbone, training parameters, and data augmentation techniques. However, the absence of a publicly available code repository or demo URL limits the reproducibility of the results. Future work should consider sharing the implementation to facilitate validation by the research community.
While the MultiConvAdapter shows promising results, the paper does not extensively discuss potential limitations, such as the generalizability of the model across different types of synthetic speech or its performance in real-world scenarios. Additionally, the reliance on specific kernel sizes may not universally apply to all datasets, indicating a need for further exploration of adaptive kernel selection.
The advancements in synthetic speech detection have significant implications for security and misinformation mitigation in the context of AI-generated audio. The proposed method could enhance the reliability of systems designed to distinguish between genuine and spoofed speech, thus contributing positively to the integrity of audio communications in various applications, including media, security, and online interactions. The main contribution of this paper is the introduction of the Multi-Scale Convolutional Adapter, a parameter-efficient architecture that significantly improves synthetic speech detection by capturing multi-scale temporal artifacts with a minimal number of trainable parameters. This work represents a meaningful advancement in the field of audio processing, particularly in combating the challenges posed by synthetic speech technologies.
Codec-based text-to-speech (TTS) models have recently gained traction for their efficiency and strong performance in voice cloning. However, codec-based TTS faces limitations due to the challenges of pretraining robust speech codecs and the quality degradation introduced by quantization errors. Emerging evidence suggests that continuous-valued generative models can alleviate these issues and serve as a promising alternative. Yet, effectively modelling diverse speech patterns and developing reliable sampling strategies for continuous-valued autoregressive (AR) TTS remain underexplored. In this work, we propose BELLE, Bayesian evidential learning with language modelling for TTS, a novel continuous-valued AR framework that directly predicts mel-spectrograms from textual input. BELLE treats each mel-spectrogram frame as a Gaussian distribution sampled from a learned hyper distribution, enabling principled uncertainty estimation, particularly in scenarios with parallel data (i.e., one text-audio prompt paired with multiple speech samples). To obtain such data, diverse speech samples are synthesized using multiple pre-trained TTS models given the same text-audio prompts, which are distilled into BELLE via Bayesian evidential learning. Experimental results indicate that BELLE demonstrates highly competitive performance compared with the current best open-source TTS models, even though BELLE is trained on a large amount of synthetic data and uses only approximately one-tenth of their training data. Audio samples generated by BELLE are available at https://belletts.github.io/Belle/. The code, checkpoints, and synthetic data will be released after the paper is accepted.
Primary: Tsinghua University
All Institutions: Shanghai Artificial Intelligence Laboratory, Tsinghua University
The main contribution of this paper is the introduction of BELLE, a continuous-valued autoregressive TTS model that leverages Bayesian evidential learning and multi-teacher knowledge distillation, achieving competitive performance with significantly less training data. This work represents a notable advancement in the field of text-to-speech synthesis, addressing key challenges in audio generation and uncertainty estimation.
The paper introduces BELLE, a novel continuous-valued autoregressive TTS model utilizing Bayesian evidential learning for uncertainty estimation in mel-spectrogram prediction. The methodology is well-structured, leveraging multi-teacher knowledge distillation to enhance training efficiency and output quality. The approach of treating each mel-spectrogram frame as a Gaussian distribution sampled from a hyper-distribution is innovative and addresses the limitations of existing codec-based TTS models. The integration of Bayesian methods into TTS is a significant advancement, providing a principled framework for uncertainty quantification and improving the robustness of the model.
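To make the uncertainty modelling concrete, below is a minimal sketch assuming a Normal-Inverse-Gamma evidential head in the spirit of deep evidential regression; BELLE's exact parameterization, loss terms, and multi-teacher distillation are not reproduced, so treat this as an illustration rather than the paper's method.

```python
import math
import torch
import torch.nn.functional as F

def evidential_head(h, proj):
    """h: (batch, frames, d) decoder states; proj: nn.Linear(d, 4 * n_mels) (hypothetical).
    Returns per-frame Normal-Inverse-Gamma parameters (gamma, nu, alpha, beta)."""
    gamma, log_nu, log_alpha, log_beta = proj(h).chunk(4, dim=-1)
    return gamma, F.softplus(log_nu), 1.0 + F.softplus(log_alpha), F.softplus(log_beta)

def evidential_nll(gamma, nu, alpha, beta, y):
    """Negative log-likelihood of mel targets y under the NIG evidential distribution
    (form follows deep evidential regression); all tensors are (batch, frames, n_mels)."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * torch.log(math.pi / nu)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(nu * (y - gamma) ** 2 + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5)).mean()
```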
The experimental setup is thorough, utilizing a substantial dataset derived from Librispeech and augmented with synthetic samples from multiple pre-trained TTS models. The results demonstrate that BELLE achieves competitive performance against state-of-the-art models while using significantly less training data. The evaluation metrics, including subjective (MOS, SMOS) and objective (WER, SIM) measures, provide a comprehensive assessment of the model's capabilities. The findings indicate that BELLE not only maintains high audio quality but also enhances speaker similarity and diversity in the generated speech.
The paper includes a reproducibility statement, indicating that code, checkpoints, and synthetic data will be made available upon acceptance. The detailed descriptions of model configurations, training parameters, and experimental setups enhance the reproducibility of the results. However, the reliance on multiple external TTS models for data augmentation may complicate replication efforts if those models are not readily accessible.
The study primarily focuses on English speech data, limiting its applicability to multilingual contexts. Additionally, the real-time factor (RTF) of BELLE is still higher compared to some recent models, indicating potential latency issues in real-time applications. The authors acknowledge these limitations and suggest future work to extend BELLE's capabilities to other languages and improve latency.
BELLE has significant potential applications in various domains, including assistive communication technologies, personalized education platforms, and human-machine interaction systems. However, the paper also highlights ethical concerns regarding the misuse of TTS technologies, such as impersonation and identity fraud, emphasizing the need for responsible use and regulatory oversight. The main contribution of this paper is the introduction of BELLE, a continuous-valued autoregressive TTS model that leverages Bayesian evidential learning and multi-teacher knowledge distillation, achieving competitive performance with significantly less training data. This work represents a notable advancement in the field of text-to-speech synthesis, addressing key challenges in audio generation and uncertainty estimation.
We present a neuromuscular speech interface that translates electromyographic (EMG) signals collected from orofacial muscles during speech articulation directly into audio. We show that self-supervised speech (SS) representations exhibit a strong linear relationship with the electrical power of muscle action potentials: SS features can be linearly mapped to EMG power with a correlation of $r = 0.85$. Moreover, EMG power vectors corresponding to different articulatory gestures form structured and separable clusters in feature space. This relationship, $\text{SS features} \xrightarrow{\text{linear mapping}} \text{EMG power} \xrightarrow{\text{gesture-specific clustering}} \text{articulatory movements}$, highlights that SS models implicitly encode articulatory mechanisms. Leveraging this property, we directly map EMG signals to SS feature space and synthesize speech, enabling end-to-end EMG-to-speech generation without explicit articulatory models or vocoder training.
Primary: University of California
All Institutions: University of California
This paper presents a novel approach to synthesizing speech from EMG signals using self-supervised models, contributing valuable insights into the relationship between muscle activity and speech production. The methodology and experimental design are robust, and the potential applications in assistive technology highlight the importance of this research in the field of machine learning and audio processing.
The methodology presented in this paper is innovative, leveraging self-supervised speech models to create a neuromuscular speech interface that translates EMG signals into audio. The authors establish a strong linear relationship between self-supervised features and EMG power, which is a significant insight into the underlying mechanics of speech articulation. The use of a time depth separable convolutional network for mapping EMG features to speech representations is a novel approach that enhances the interpretability of the model. The paper also emphasizes the importance of structured representations of EMG signals, which is a meaningful contribution to the field.
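As a toy illustration of the reported linear relationship, the sketch below fits a linear map from SS features to EMG power and measures the per-channel Pearson correlation on held-out frames; the arrays are random placeholders, and the paper's feature extraction, temporal alignment, and preprocessing are not reproduced.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
ss_feats = rng.standard_normal((5000, 768))    # (frames, SS feature dim), placeholder data
emg_power = rng.standard_normal((5000, 8))     # (frames, EMG channels), placeholder data

model = Ridge(alpha=1.0).fit(ss_feats[:4000], emg_power[:4000])
pred = model.predict(ss_feats[4000:])
r = np.mean([pearsonr(pred[:, c], emg_power[4000:, c])[0] for c in range(pred.shape[1])])
print(f"mean per-channel Pearson r on held-out frames: {r:.2f}")   # ~0.85 in the paper's setup, not with random data
```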
The experiments are well-structured, involving data collection from multiple participants performing a variety of orofacial movements. The authors provide a comprehensive dataset that includes a large vocabulary and natural speaking rates, which is a notable improvement over previous studies. The results demonstrate a high correlation between the self-supervised features and EMG power, validating the proposed method. However, the paper could benefit from more extensive evaluations against existing benchmarks to further substantiate the claims.
The paper mentions that data and code will be made publicly available, which is a positive aspect for reproducibility. However, the lack of specific URLs or repositories at this stage limits immediate access to the implementation details. The methodology is described in sufficient detail, allowing for potential replication, but the absence of a demo or project URL is a drawback.
The study's limitations include the potential variability in EMG signal acquisition due to factors like electrode placement and skin conditions, which could affect the generalizability of the results. Additionally, while the authors claim to address the alignment-free mapping challenge, the complexity of the underlying relationships may still pose challenges in real-world applications. The reliance on self-supervised models may also limit the approach's applicability to languages or dialects not represented in the training data.
The implications of this research are significant, particularly for individuals with speech impairments. By providing a non-invasive method for speech synthesis, the work could enhance communication for a broader range of individuals, including those with conditions like dysarthria or laryngectomy. The approach has the potential to influence future research in neuromuscular interfaces and assistive technologies, paving the way for more inclusive communication solutions. This paper presents a novel approach to synthesizing speech from EMG signals using self-supervised models, contributing valuable insights into the relationship between muscle activity and speech production. The methodology and experimental design are robust, and the potential applications in assistive technology highlight the importance of this research in the field of machine learning and audio processing.
We present a novel neural network architecture for the efficient prediction of sound fields in two and three dimensions. The network is designed to automatically satisfy the Helmholtz equation, ensuring that the outputs are physically valid. Therefore, the method can effectively learn solutions to boundary-value problems in various wave phenomena, such as acoustics, optics, and electromagnetism. Numerical experiments show that the proposed strategy can potentially outperform state-of-the-art methods in room acoustics simulation, in particular in the range of mid to high frequencies.
Primary: Technical University of Denmark
All Institutions: Technical University of Denmark
The main contribution of this paper is the development of HergNet, a novel neural network architecture that efficiently predicts sound fields while satisfying physical constraints, demonstrating significant improvements over traditional numerical methods in specific scenarios. This work represents a meaningful advancement in the intersection of machine learning and physics-informed modeling, with the potential to influence a variety of applications across different scientific domains.
The paper introduces HergNet, a neural network architecture that leverages the Herglotz representation to efficiently predict sound fields by ensuring compliance with the Helmholtz equation. This approach is innovative as it integrates physics-informed principles directly into the neural network design, allowing for the automatic satisfaction of physical constraints. The methodology is well-structured, with clear mathematical formulations and justifications for the choices made in the design of the neural network, including the use of complex-valued networks and the focus on boundary conditions to reduce computational costs.
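The central property admits a compact statement: a Herglotz-type superposition of unit-direction plane waves satisfies the Helmholtz equation term by term, so only the boundary data has to be learned. The notation below is illustrative and may differ from the paper's exact parameterization.

```latex
u_\theta(\mathbf{x}) \;=\; \int_{\mathbb{S}^{d-1}} g_\theta(\hat{\mathbf{d}})\,
  e^{\,i k\, \mathbf{x}\cdot\hat{\mathbf{d}}}\, \mathrm{d}\sigma(\hat{\mathbf{d}}),
\qquad
(\Delta + k^2)\, e^{\,i k\, \mathbf{x}\cdot\hat{\mathbf{d}}}
  \;=\; \big(-k^2\|\hat{\mathbf{d}}\|^2 + k^2\big)\, e^{\,i k\, \mathbf{x}\cdot\hat{\mathbf{d}}}
  \;=\; 0 \quad \text{for } \|\hat{\mathbf{d}}\| = 1 .
```

By linearity, $(\Delta + k^2)\,u_\theta = 0$ for any learned density $g_\theta$ (e.g., a complex-valued network over directions), which is why training can focus entirely on matching the boundary conditions.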
The experimental section is robust, featuring numerical experiments that demonstrate the effectiveness of HergNet in simulating sound fields in a shoebox room setup. The results are compared against analytic solutions, showcasing a high degree of accuracy, particularly in mid- to high-frequency ranges. The authors provide detailed descriptions of the experimental setup, including the parameters used for training and the computational resources required, which adds credibility to their findings.
The paper includes sufficient details about the implementation, including the use of Python and JAX, which aids in reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the experiments. Providing a link to the code would significantly enhance reproducibility.
While the method shows promise, it is noted that its effectiveness diminishes at lower frequencies, where traditional numerical methods may still outperform it. Additionally, the reliance on boundary conditions could limit its applicability in scenarios with complex internal geometries or varying material properties.
The implications of this work extend beyond acoustics, as the methodology could be adapted to other fields involving wave phenomena, such as optics and electromagnetism. The potential for faster simulations in complex environments could significantly benefit industries reliant on accurate wave modeling, including architectural acoustics, audio engineering, and even medical imaging. The main contribution of this paper is the development of HergNet, a novel neural network architecture that efficiently predicts sound fields while satisfying physical constraints, demonstrating significant improvements over traditional numerical methods in specific scenarios. This work represents a meaningful advancement in the intersection of machine learning and physics-informed modeling, with the potential to influence a variety of applications across different scientific domains.
The Tsetlin Machine (TM) has recently attracted attention as a low-power alternative to neural networks due to its simple and interpretable inference mechanisms. However, its performance on speech-related tasks remains limited. This paper proposes TsetlinKWS, the first algorithm-hardware co-design framework for the Convolutional Tsetlin Machine (CTM) on the 12-keyword spotting task. Firstly, we introduce a novel Mel-Frequency Spectral Coefficient and Spectral Flux (MFSC-SF) feature extraction scheme together with spectral convolution, enabling the CTM to reach its first-ever competitive accuracy of 87.35% on the 12-keyword spotting task. Secondly, we develop an Optimized Grouped Block-Compressed Sparse Row (OG-BCSR) algorithm that achieves a remarkable 9.84$\times$ reduction in model size, significantly improving the storage efficiency on CTMs. Finally, we propose a state-driven architecture tailored for the CTM, which simultaneously exploits data reuse and sparsity to achieve high energy efficiency. The full system is evaluated in 65 nm process technology, consuming 16.58 $\mu$W at 0.7 V with a compact 0.63 mm$^2$ core area. TsetlinKWS requires only 907k logic operations per inference, representing a 10$\times$ reduction compared to the state-of-the-art KWS accelerators, positioning the CTM as a highly-efficient candidate for ultra-low-power speech applications.
Primary: University of Southampton
All Institutions: University of Southampton, The Hong Kong University of Science and Technology (Guangzhou), Newcastle University, University College London
The main contribution of this paper is the introduction of TsetlinKWS, a novel algorithm-hardware co-design framework that significantly enhances the performance and efficiency of Tsetlin Machines for keyword spotting tasks. This work represents a meaningful advancement in the field of low-power machine learning accelerators, combining innovative methodologies with practical experimental validation.
The paper introduces a novel algorithm-hardware co-design framework for the Convolutional Tsetlin Machine (CTM) specifically tailored for keyword spotting tasks. The methodology is robust, combining a new Mel-Frequency Spectral Coefficient and Spectral Flux (MFSC-SF) feature extraction scheme with an optimized grouped block-compressed sparse row (OG-BCSR) algorithm for model size reduction. The state-driven architecture proposed is well thought out, leveraging data reuse and sparsity effectively to enhance energy efficiency. The integration of these components demonstrates a comprehensive approach to improving the performance of Tsetlin Machines in speech applications.
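As a rough illustration of the MFSC-SF front end, the sketch below computes log-mel spectral coefficients and a spectral-flux track with librosa on a synthetic signal; the paper's exact filter bank, frame rate, and the booleanization required by the Convolutional Tsetlin Machine are not reproduced.

```python
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr).astype(np.float32)                   # 1 s of placeholder audio
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40)
mfsc = librosa.power_to_db(mel)                               # log-mel spectral coefficients
flux = np.sqrt((np.diff(mel, axis=1).clip(min=0) ** 2).sum(axis=0))  # positive spectral change
features = np.concatenate([mfsc[:, 1:], flux[None, :]], axis=0)      # (n_mels + 1, frames - 1)
```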
The experimental results are compelling, showcasing a competitive accuracy of 87.35% on the 12-keyword spotting task, which is a significant achievement for Tsetlin Machines in this domain. The evaluation is performed using a 65nm process technology, with detailed measurements of power consumption (16.58 µW) and core area (0.63 mm²). The paper includes comparisons with state-of-the-art keyword spotting accelerators, highlighting the advantages of the proposed system in terms of logic operations per inference (907k) and overall efficiency.
While the paper provides a thorough description of the methodologies and experimental setups, it lacks specific implementation details that would facilitate reproducibility. The absence of a project URL or demo further limits the ability of other researchers to replicate the results or build upon this work.
One limitation is the focus on a specific task (12-keyword spotting), which may not generalize well to other speech recognition tasks. Additionally, while the energy efficiency is impressive, the trade-offs in terms of accuracy and model complexity compared to other methods could be further explored. The paper does not address potential scalability issues when applied to larger datasets or more complex tasks.
The proposed TsetlinKWS framework has significant implications for ultra-low-power speech applications, particularly in edge computing scenarios where energy efficiency is paramount. The advancements in model compression and hardware acceleration could pave the way for more widespread adoption of Tsetlin Machines in real-time applications, potentially impacting areas such as smart devices, IoT, and assistive technologies. The main contribution of this paper is the introduction of TsetlinKWS, a novel algorithm-hardware co-design framework that significantly enhances the performance and efficiency of Tsetlin Machines for keyword spotting tasks. This work represents a meaningful advancement in the field of low-power machine learning accelerators, combining innovative methodologies with practical experimental validation.
Purpose: Surgical scene understanding is key to advancing computer-aided and intelligent surgical systems. Current approaches predominantly rely on visual data or end-to-end learning, which limits fine-grained contextual modeling. This work aims to enhance surgical scene representations by integrating 3D acoustic information, enabling temporally and spatially aware multimodal understanding of surgical environments. Methods: We propose a novel framework for generating 4D audio-visual representations of surgical scenes by projecting acoustic localization information from a phased microphone array onto dynamic point clouds from an RGB-D camera. A transformer-based acoustic event detection module identifies relevant temporal segments containing tool-tissue interactions which are spatially localized in the audio-visual scene representation. The system was experimentally evaluated in a realistic operating room setup during simulated surgical procedures performed by experts. Results: The proposed method successfully localizes surgical acoustic events in 3D space and associates them with visual scene elements. Experimental evaluation demonstrates accurate spatial sound localization and robust fusion of multimodal data, providing a comprehensive, dynamic representation of surgical activity. Conclusion: This work introduces the first approach for spatial sound localization in dynamic surgical scenes, marking a significant advancement toward multimodal surgical scene representations. By integrating acoustic and visual data, the proposed framework enables richer contextual understanding and provides a foundation for future intelligent and autonomous surgical systems.
Primary: Balgrist University Hospital, University of Zurich
All Institutions: Balgrist University Hospital, University of Zurich, ETH Zurich
This paper presents a novel framework for integrating acoustic and visual data to enhance surgical scene understanding, marking a significant step toward multimodal representations in intelligent surgical systems. The methodology is innovative, and the experimental results are promising, although further validation and optimization are necessary for practical implementation in clinical settings.
The methodology presented in this paper is innovative, combining acoustic localization with visual data to create a 4D representation of surgical scenes. The use of a phased microphone array for sound localization and the integration of a transformer-based acoustic event detection module are noteworthy advancements. The approach effectively addresses the limitations of existing methods that primarily rely on visual data, thus enhancing the contextual understanding of surgical environments. However, the methodology could benefit from further elaboration on the integration process and potential optimizations for real-time applications.
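As a simplified geometric sketch of the projection step, the function below attaches an acoustic event to the point-cloud point closest to the estimated direction-of-arrival ray; calibration, time synchronization, and the actual beamforming/DOA estimation are omitted, and the names are hypothetical.

```python
import numpy as np

def localize_on_point_cloud(points, array_origin, doa_unit):
    """points: (N, 3) scene point cloud; array_origin: (3,) microphone-array position;
    doa_unit: (3,) unit direction of arrival, all in the same (calibrated) frame."""
    rel = points - array_origin                      # vectors from the array to each point
    along = rel @ doa_unit                           # signed distance along the DOA ray
    perp = np.linalg.norm(rel - np.outer(along, doa_unit), axis=1)
    perp[along < 0] = np.inf                         # ignore points behind the array
    return points[np.argmin(perp)]                   # scene point nearest to the acoustic ray
```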
The experimental evaluation is robust, utilizing realistic surgical simulations and expert input to validate the proposed framework. The results demonstrate high accuracy in sound localization and multimodal data fusion, with quantitative metrics provided for event detection and localization. However, the limited number of sequences and the specific nature of the simulated environment may restrict the generalizability of the findings. More extensive testing across diverse surgical scenarios would strengthen the evaluation.
The paper mentions that the code and data will be made publicly available, which is a positive aspect for reproducibility. However, the details regarding the experimental setup, calibration, and synchronization processes are somewhat complex and may pose challenges for other researchers attempting to replicate the study without additional guidance.
The primary limitations include the reliance on a single viewpoint for the dynamic 3D representation, which may affect the completeness of the scene, and the inherent challenges of acoustic analysis in sound-poor surgical environments. Additionally, the current implementation is not real-time capable, which limits its immediate applicability in clinical settings. The accuracy of localization is also constrained by the low spatial resolution of the acoustic camera.
This work has significant implications for the future of intelligent surgical systems, potentially enhancing surgical training, workflow optimization, and real-time monitoring of surgical procedures. The integration of multimodal data could lead to improved patient outcomes and greater automation in surgical environments, paving the way for more advanced robotic surgical systems. This paper presents a novel framework for integrating acoustic and visual data to enhance surgical scene understanding, marking a significant step toward multimodal representations in intelligent surgical systems. The methodology is innovative, and the experimental results are promising, although further validation and optimization are necessary for practical implementation in clinical settings.
Despite rapid progress in Multi-modal Large Language Models and Large Audio-Language Models, existing audio benchmarks largely test semantics that can be recovered from text captions, masking deficits in fine-grained perceptual reasoning. We formalize audio 4D intelligence, defined as reasoning over sound dynamics in time and 3D space, and introduce STAR-Bench to measure it. STAR-Bench combines a Foundational Acoustic Perception setting (six attributes under absolute and relative regimes) with a Holistic Spatio-Temporal Reasoning setting that includes segment reordering for continuous and discrete processes and spatial tasks spanning static localization, multi-source relations, and dynamic trajectories. Our data curation pipeline uses two methods to ensure high-quality samples. For foundational tasks, we use procedurally synthesized and physics-simulated audio. For holistic data, we follow a four-stage process that includes human annotation and final selection based on human performance. Unlike prior benchmarks where caption-only answering reduces accuracy only slightly, STAR-Bench induces far larger drops (-31.5\% temporal, -35.2\% spatial), evidencing its focus on linguistically hard-to-describe cues. Evaluating 19 models reveals substantial gaps compared with humans and a capability hierarchy: closed-source models are bottlenecked by fine-grained perception, while open-source models lag across perception, knowledge, and reasoning. Our STAR-Bench provides critical insights and a clear path forward for developing future models with a more robust understanding of the physical world.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of STAR-Bench, a novel benchmark for evaluating deep spatio-temporal reasoning in audio intelligence, which addresses critical gaps in existing audio benchmarks. The comprehensive methodology and experimental evaluation provide valuable insights into the current capabilities of audio models and highlight areas for future development in the field.
The methodology presented in STAR-Bench is innovative, as it introduces a new framework for evaluating audio 4D intelligence, which combines foundational acoustic perception with holistic spatio-temporal reasoning. The use of both procedurally synthesized and physics-simulated audio for foundational tasks, along with a rigorous four-stage human annotation process for holistic data, demonstrates a thorough approach to data curation. However, the paper could benefit from more detailed descriptions of the specific algorithms or models used to process the audio data and perform the reasoning tasks, as well as the rationale behind their selection.
The experimental evaluation is robust, involving 19 models and highlighting significant performance gaps compared to human benchmarks. The reported drops in accuracy when relying on linguistically hard-to-describe cues provide compelling evidence of the benchmark's effectiveness in probing fine-grained perceptual reasoning. However, the paper lacks detailed metrics and visualizations that could enhance the understanding of model performance across different tasks, which would be beneficial for readers seeking to replicate or build upon this work.
The authors have made a commendable effort to ensure reproducibility by providing a detailed description of the dataset construction and evaluation pipeline. The commitment to releasing the benchmark dataset and evaluation code is a positive step towards facilitating further research in this area. However, the paper could improve by including more explicit instructions or examples for potential users to follow when using the dataset and code.
One limitation of the study is the reliance on human annotation, which may introduce biases based on the annotators' interpretations of audio cues. Additionally, while the benchmark addresses spatio-temporal reasoning, it may not encompass all aspects of audio perception, such as emotional or contextual understanding, which could limit its applicability in broader audio intelligence tasks.
The introduction of STAR-Bench has the potential to significantly impact the field of audio processing and multimodal learning by providing a new standard for evaluating models' understanding of complex audio environments. This could lead to advancements in applications such as robotics, autonomous systems, and interactive AI that require nuanced audio perception and reasoning capabilities. The main contribution of this paper is the introduction of STAR-Bench, a novel benchmark for evaluating deep spatio-temporal reasoning in audio intelligence, which addresses critical gaps in existing audio benchmarks. The comprehensive methodology and experimental evaluation provide valuable insights into the current capabilities of audio models and highlight areas for future development in the field.
Unified speech recognition aims to perform auditory, visual, and audiovisual speech recognition within a single model framework. While speech foundation models (SFMs) have demonstrated remarkable performance in auditory tasks, their adaptation to multimodal scenarios remains underexplored. This paper presents UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and AVSR tasks by leveraging large language models (LLMs) as text decoders. Our approach introduces visual representations into multiple SFM layers through visual injection modules, enabling multimodal input processing and unified hidden representations. The augmented SFMs connect with decoder-only LLMs via a feed-forward adaptor, where concatenated representations and instruction prompts guide speech transcription. We implement a two-stage training strategy: visual injection pretraining followed by speech recognition finetuning. SFM parameters remain frozen throughout training, with only visual injection modules optimized initially, and LLMs finetuned using LoRA parameters subsequently. Experimental results demonstrate superior performance over state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and noisy conditions. Ablation studies confirm generalization across various SFMs and LLMs, validating the proposed training strategy.
Primary: Shaanxi Normal University
All Institutions: Shaanxi Normal University, University of Science and Technology of China, iFLYTEK Research
The main contribution of this paper is the introduction of UASR-LLM, a unified framework that effectively integrates visual and auditory modalities for improved speech recognition across multiple tasks. This work advances the field by demonstrating how existing models can be adapted to enhance performance in challenging conditions, thus paving the way for future research in multimodal speech recognition systems.
The proposed UASR-LLM framework presents a novel approach to unified speech recognition by integrating visual representations into speech foundation models (SFMs) through visual injection modules. The two-stage training strategy, which includes visual injection pretraining followed by speech recognition finetuning, is well-structured and innovative. The use of large language models (LLMs) as decoders is a significant methodological advancement, allowing for improved transcription across multiple modalities. The incorporation of cross-modal knowledge distillation and the careful design of the visual injection modules demonstrate a thoughtful approach to leveraging existing models while enhancing their capabilities.
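A minimal sketch of what a visual injection step could look like is shown below, assuming a simple gated additive fusion into a frozen SFM layer's hidden states; the paper's actual module design and temporal-alignment strategy may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualInjection(nn.Module):
    """Project visual features to the SFM width, resample them to the audio frame rate,
    and add them through a zero-initialized gate (illustrative dimensions)."""
    def __init__(self, audio_dim: int = 1024, visual_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(visual_dim, audio_dim)
        self.gate = nn.Parameter(torch.zeros(1))     # starts as identity: no visual influence

    def forward(self, audio_hidden, visual_feats):
        # audio_hidden: (B, T_audio, audio_dim); visual_feats: (B, T_video, visual_dim)
        v = self.proj(visual_feats).transpose(1, 2)                   # (B, audio_dim, T_video)
        v = F.interpolate(v, size=audio_hidden.size(1)).transpose(1, 2)
        return audio_hidden + torch.tanh(self.gate) * v
```

In the first training stage only modules of this kind would be optimized while the SFM stays frozen; the LoRA parameters of the LLM decoder are tuned in the second stage.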
The experimental results are robust, showcasing superior performance across VSR, ASR, and AVSR tasks compared to state-of-the-art baselines. The paper provides comprehensive evaluations under both clean and noisy conditions, which is crucial for real-world applicability. The ablation studies further validate the effectiveness of the proposed training strategy and the generalization capabilities of the framework across different SFMs and LLMs. However, the paper could benefit from more detailed comparisons with additional baselines to strengthen its claims.
The paper includes a thorough description of the datasets, model configurations, and training strategies, which aids in reproducibility. However, the absence of publicly available code or a project URL limits the ability for other researchers to replicate the experiments fully. Including a link to a code repository would significantly enhance reproducibility.
The paper acknowledges several limitations, including the computational burden associated with LLM integration and the reliance on large-scale audio-visual datasets for optimal performance. The authors also note that while their method improves recognition robustness, it still lags behind previous lipreading studies that utilize larger datasets, indicating a need for further exploration of visual encoder capabilities.
The proposed framework has significant implications for practical applications in various domains, such as intelligent home systems, online conferencing, and service robotics, where robust speech recognition is essential. The integration of visual information enhances the system's performance in noisy environments, making it particularly valuable for real-world scenarios. The main contribution of this paper is the introduction of UASR-LLM, a unified framework that effectively integrates visual and auditory modalities for improved speech recognition across multiple tasks. This work advances the field by demonstrating how existing models can be adapted to enhance performance in challenging conditions, thus paving the way for future research in multimodal speech recognition systems.
Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, MoE Key Lab of Artificial Intelligence, Jiangsu Key Lab of Language Computing, School of Computer Science, X-LANCE Lab
The main contribution of this paper is the introduction of ISA-Bench, a comprehensive benchmark for evaluating instruction sensitivity in large audio language models, which reveals significant challenges and trade-offs in model performance. The paper's technical contributions, innovative methodology, and rigorous experimental evaluation position it as a valuable resource for advancing research in audio language processing.
The paper introduces ISA-Bench, a novel benchmark designed to evaluate instruction sensitivity in large audio language models (LALMs) across three dimensions: instruction description, output format, and task composition. This multidimensional approach is a significant improvement over existing benchmarks, which typically focus on only one aspect of instruction sensitivity. The methodology is well-structured, employing a systematic way to generate instruction variants and assess model performance through compliance-aware metrics. The use of diverse instruction variants generated by LLMs adds depth to the evaluation process, ensuring that the benchmark reflects real-world scenarios.
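For illustration, a toy version of compliance-aware scoring is sketched below; the benchmark's real parsers and per-task metrics are more involved, and the record fields used here are hypothetical.

```python
def evaluate(records):
    """records: iterable of dicts with keys 'pred', 'target', and 'format_ok' (bool),
    where 'format_ok' marks whether the response followed the requested output format."""
    records = list(records)
    compliance = sum(r["format_ok"] for r in records) / len(records)
    accuracy = sum(r["format_ok"] and r["pred"] == r["target"] for r in records) / len(records)
    return {"compliance_rate": compliance, "task_accuracy": accuracy}

print(evaluate([{"pred": "dog bark", "target": "dog bark", "format_ok": True},
                {"pred": "dog bark", "target": "dog bark", "format_ok": False}]))
```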
The experimental section is robust, featuring a comprehensive evaluation of several state-of-the-art LALMs across five atomic tasks. The results highlight the significant challenges posed by instruction sensitivity, with clear evidence that even top-performing models struggle under varied instruction forms. The fine-tuning experiments conducted on Qwen2-Audio provide valuable insights into the trade-offs between improving instruction-following ability and the risk of catastrophic forgetting, demonstrating a thorough understanding of the complexities involved in training LALMs.
The paper provides a clear description of the experimental setup, including the models tested and the datasets used. However, specific implementation details, such as hyperparameters for fine-tuning and the exact nature of the instruction variants, are somewhat lacking. While the GitHub repository offers a project URL, which may contain additional resources, the reproducibility of results could be enhanced with more detailed methodological transparency.
One notable limitation is the potential for catastrophic forgetting observed during fine-tuning, which could hinder the practical deployment of LALMs in real-world applications. Additionally, the benchmark may not fully capture all nuances of instruction sensitivity, particularly in more complex or less common audio tasks. The reliance on specific datasets may also limit the generalizability of the findings.
The development of ISA-Bench has significant implications for the field of audio language processing, as it addresses a critical gap in the evaluation of LALMs. By highlighting the importance of instruction robustness, the benchmark encourages future research to focus on improving model performance in real-world scenarios, ultimately enhancing the reliability of human-computer interaction through audio understanding. The main contribution of this paper is the introduction of ISA-Bench, a comprehensive benchmark for evaluating instruction sensitivity in large audio language models, which reveals significant challenges and trade-offs in model performance. The paper's technical contributions, innovative methodology, and rigorous experimental evaluation position it as a valuable resource for advancing research in audio language processing.
Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.
Primary: ASLP Lab
All Institutions: ASLP Lab, MiLM Plus
The main contribution of this paper is the introduction of DiffRhythm 2, a novel semi-autoregressive framework for high-fidelity song generation that effectively addresses lyric alignment and multi-preference optimization challenges. The comprehensive analysis of its technical contributions, methodology, and experimental results underscores its significance in advancing the state of the art in music generation.
The paper introduces DiffRhythm 2, a semi-autoregressive framework that leverages block flow matching for song generation. This approach addresses significant challenges in lyric alignment and multi-preference optimization, which have been inadequately handled in prior works. The integration of a music variational autoencoder (VAE) with a low frame rate enhances the efficiency of long-sequence generation. The stochastic block REPA loss is a novel contribution that improves musicality and structural coherence, while cross-pair preference optimization effectively mitigates performance degradation associated with multi-preference alignment. Overall, the methodology is well-structured and presents a significant advancement in the field of music generation.
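For readers unfamiliar with block flow matching, a stripped-down training objective under standard flow-matching assumptions might look like the sketch below; the model signature and the conditioning on preceding blocks are hypothetical, and DiffRhythm 2's full objective additionally includes the stochastic block representation alignment loss and preference optimization.

```python
import torch

def block_flow_matching_loss(model, latents, block_size: int):
    """latents: (batch, time, dim) VAE latents. A random block is noised along the linear
    flow-matching path, and the model predicts its velocity given the preceding clean blocks."""
    B, T, D = latents.shape
    b = torch.randint(T // block_size, (1,)).item()
    target = latents[:, b * block_size:(b + 1) * block_size]
    context = latents[:, :b * block_size]                    # earlier blocks act as AR context
    noise = torch.randn_like(target)
    t = torch.rand(B, 1, 1, device=latents.device)
    x_t = (1.0 - t) * noise + t * target                     # linear interpolation path
    v_pred = model(x_t, t.view(B), context)                  # hypothetical model signature
    return ((v_pred - (target - noise)) ** 2).mean()
```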
The experimental setup is robust, utilizing a large-scale dataset of 1.4 million songs across various genres. The evaluation metrics are comprehensive, encompassing both subjective and objective assessments. The results demonstrate that DiffRhythm 2 outperforms existing models in several key areas, including lyric accuracy and music quality. The ablation studies further validate the importance of the proposed methods, showing significant performance drops when key components are removed. However, comparisons with commercial systems reveal that while DiffRhythm 2 excels among open-source models, it still lags behind in overall performance.
The authors commit to reproducibility by releasing inference code and model checkpoints, which is crucial for the community to validate and build upon their work. However, the paper could benefit from more detailed descriptions of the training process and hyperparameter settings to facilitate easier replication of results.
The paper acknowledges limitations, particularly regarding the fidelity of audio reconstruction due to the low frame rate of the VAE. Additionally, the challenge of improving vocal modeling without sacrificing creativity remains a significant hurdle. The authors also note that open-source models generally do not match the performance of commercial systems, indicating a need for further advancements.
The implications of this work are substantial, as it addresses the growing demand for high-quality, automated music generation tools. The potential for misuse in creating disinformation or deepfake audio is a concern that the authors recognize, and they express a commitment to responsible advancement in the field. This research could pave the way for innovative applications in music production, entertainment, and personalized content creation. The main contribution of this paper is the introduction of DiffRhythm 2, a novel semi-autoregressive framework for high-fidelity song generation that effectively addresses lyric alignment and multi-preference optimization challenges. The comprehensive analysis of its technical contributions, methodology, and experimental results underscores its significance in advancing the state of the art in music generation.
Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocal. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing stochastic block representation alignment loss.
Primary: ASLP Lab
All Institutions: ASLP Lab, MiLM Plus
This paper presents a significant advancement in the field of music generation, effectively addressing long-standing challenges in lyric alignment and multi-preference optimization while demonstrating strong experimental results. The innovative methodology and thoughtful consideration of broader impacts position DiffRhythm 2 as a noteworthy contribution to the audio machine learning domain.
The paper introduces DiffRhythm 2, a semi-autoregressive framework that employs block flow matching to enhance lyric alignment in song generation. The methodology is innovative, combining a music variational autoencoder (VAE) with a diffusion transformer to achieve high-quality song generation while maintaining efficiency. The introduction of stochastic block REPA loss and cross-pair preference optimization strategies addresses significant challenges in multi-preference optimization, showcasing a thoughtful approach to balancing various musical dimensions. The semi-autoregressive architecture is particularly noteworthy for its ability to maintain coherence over long sequences without relying on external constraints.
The experimental setup is robust, utilizing a large-scale dataset of 1.4 million songs for training and a well-defined evaluation protocol that combines subjective and objective assessments. The results demonstrate that DiffRhythm 2 outperforms existing models in several key metrics, including lyric accuracy and music quality, indicating the effectiveness of the proposed methods. However, the paper could benefit from more detailed comparisons with a broader range of models, particularly in subjective evaluations.
The authors have committed to releasing the inference code and model checkpoints, which is a positive step towards reproducibility. However, the paper lacks detailed implementation specifics that could aid other researchers in replicating the results, such as hyperparameter settings and training configurations.
The paper acknowledges limitations, particularly regarding the fidelity of audio reconstruction due to the low frame rate of the VAE. Additionally, the challenge of improving vocal modeling without compromising creativity is noted, indicating areas for future research. The reliance on a large dataset may also raise concerns about the generalizability of the model across different musical styles.
The potential applications of DiffRhythm 2 extend beyond music generation, as it could influence areas such as automated content creation, music therapy, and interactive entertainment. However, the authors also recognize the ethical implications of generating realistic audio content, emphasizing the need for responsible use and guidelines to prevent misuse.
Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.
Primary: Soul AI Lab
All Institutions: Soul AI Lab
The main contribution of this work is the introduction of SoulX-Podcast, a sophisticated framework for generating realistic, long-form, multi-speaker podcasts that incorporates dialectal and paralinguistic diversity. This paper significantly advances the field of speech synthesis by addressing the complexities of conversational speech generation, providing a versatile tool for future applications in audio content creation and interactive systems.
The methodology presented in SoulX-Podcast is comprehensive, addressing the challenges of multi-speaker and multi-turn dialogue synthesis. The authors effectively integrate paralinguistic controls and dialectal variations, which are often overlooked in existing TTS systems. The processing workflow for dialogue data is well-structured, employing techniques such as speaker diarization, quality filtering, and dual-ASR transcription to ensure high-quality training data. The two-stage generative framework, utilizing a large language model (LLM) for semantic token prediction followed by acoustic feature generation, is a novel approach that enhances the system's ability to maintain coherence and expressiveness in long-form dialogues.
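As a concrete illustration of the dual-ASR transcription step mentioned above, the sketch below keeps only those diarized segments whose two independent transcripts agree within a character-error-rate threshold. The segment format, the ASR callables, and the 10% threshold are assumptions for illustration, not the report's exact pipeline.

```python
def char_error_rate(ref: str, hyp: str) -> float:
    """Levenshtein distance over characters, normalised by the reference length."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # substitution / match
            prev = cur
    return dp[n] / max(m, 1)

def dual_asr_filter(segments, asr_a, asr_b, max_cer=0.10):
    """Keep diarized segments whose two independent ASR transcripts agree."""
    kept = []
    for seg in segments:                      # seg: dict with an "audio" field (assumed)
        text_a, text_b = asr_a(seg["audio"]), asr_b(seg["audio"])
        if char_error_rate(text_a, text_b) <= max_cer:
            kept.append({**seg, "text": text_a})
    return kept
```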
The experimental evaluation is robust, with extensive comparisons against state-of-the-art models in both monologue and multi-turn dialogue synthesis. The results demonstrate significant improvements in intelligibility, speaker similarity, and coherence across various metrics. The inclusion of multiple dialects and paralinguistic features adds depth to the evaluation, showcasing the model's versatility. However, the paper could benefit from more detailed statistical analyses and discussions on the significance of the results.
The paper provides sufficient details regarding the data processing and model training procedures, including the specific ASR models used and the dataset sizes. However, the lack of specific hyperparameters and training configurations may hinder full reproducibility. The availability of the source code and demo links is a positive aspect, facilitating further exploration by the research community.
One limitation is the reliance on the quality of the underlying ASR systems, which may affect the overall performance, particularly in dialectal contexts. Additionally, while the model shows promise in generating paralinguistic features, the evaluation of these features could be more comprehensive, especially for subtle cues that may not be easily distinguishable. The paper also does not address potential ethical concerns related to the misuse of synthesized speech.
The advancements made in SoulX-Podcast have significant implications for various applications, including podcasting, virtual assistants, and language learning tools. By enabling realistic multi-speaker dialogues with dialectal diversity, the system can enhance user engagement and accessibility in speech interfaces. However, ethical considerations regarding the potential for misuse, such as voice cloning and misinformation, should be carefully managed.
SoulX-Podcast represents a substantial advancement in multi-speaker, multi-turn dialogue synthesis, integrating dialectal and paralinguistic diversity into TTS systems. The comprehensive methodology and strong experimental results underscore its potential impact on the field of audio machine learning.
The methodology presented in SoulX-Podcast is robust and innovative, particularly in its approach to multi-speaker and multi-turn dialogue synthesis. The integration of paralinguistic features and dialectal controls is a significant advancement over traditional TTS systems, which typically focus on single-speaker outputs. The authors effectively utilize a two-stage generative framework that combines large language models with acoustic feature generation, which is well-documented and logically structured. The data processing techniques, including speaker diarization and paralinguistic annotation, are thorough and demonstrate a strong understanding of the challenges in dialogue synthesis.
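To visualize the two-stage design described above, a minimal sketch is given below: a language model first predicts semantic speech tokens for each turn, conditioned on the dialogue history and a per-speaker reference prompt, and an acoustic decoder then renders each token stream to a waveform. All interfaces here are hypothetical placeholders rather than SoulX-Podcast's actual API.

```python
def synthesize_dialogue(llm, acoustic_decoder, turns, speaker_prompts):
    """Two-stage sketch: semantic-token prediction followed by acoustic rendering."""
    history, waveforms = [], []
    for spk, text in turns:                              # e.g. [("S1", "Hi!"), ("S2", "Hello!")]
        tokens = llm.generate_semantic_tokens(
            text=text,
            speaker=spk,
            history=history,                             # dialogue context -> adaptive prosody
            reference=speaker_prompts[spk])              # keeps timbre stable across turns
        wav = acoustic_decoder(tokens, reference=speaker_prompts[spk])
        history.append((spk, tokens))
        waveforms.append(wav)
    return waveforms
```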
The experimental evaluation is comprehensive, with a clear focus on both monologue and dialogue generation tasks. The results indicate that SoulX-Podcast outperforms existing state-of-the-art models across various metrics, including intelligibility and speaker similarity. The use of diverse datasets, including dialectal speech data, enhances the credibility of the findings. However, the paper could benefit from more detailed comparisons with a broader range of models to contextualize its performance further.
The paper provides sufficient details about the implementation and training processes, including the data collection and preprocessing steps. The availability of the source code and demo page enhances reproducibility, allowing other researchers to validate and build upon the work. However, additional information on hyperparameters and training configurations would further facilitate replication.
One limitation is the reliance on specific ASR systems for transcription, which may introduce biases or errors that affect the overall performance. Additionally, while the model supports multiple dialects, the performance metrics for dialectal generation could be improved, as indicated by the relatively high CER values for some dialects. The paper also acknowledges potential ethical concerns related to misuse of the technology, which is an important consideration but lacks a detailed discussion on mitigation strategies.
The development of SoulX-Podcast has significant implications for the field of speech synthesis, particularly in applications requiring natural and expressive dialogue generation, such as virtual assistants, audiobooks, and entertainment. The ability to generate speech in multiple dialects enhances accessibility and personalization, making it a valuable tool for diverse user groups. The ethical considerations raised in the paper highlight the need for responsible deployment of such technologies.
Accurate far-field speech datasets are critical for tasks such as automatic speech recognition (ASR), dereverberation, speech enhancement, and source separation. However, current datasets are limited by the trade-off between acoustic realism and scalability. Measured corpora provide faithful physics but are expensive, low-coverage, and rarely include paired clean and reverberant data. In contrast, most simulation-based datasets rely on simplified geometrical acoustics, thus failing to reproduce key physical phenomena like diffraction, scattering, and interference that govern sound propagation in complex environments. We introduce Treble10, a large-scale, physically accurate room-acoustic dataset. Treble10 contains over 3000 broadband room impulse responses (RIRs) simulated in 10 fully furnished real-world rooms, using a hybrid simulation paradigm implemented in the Treble SDK that combines a wave-based and geometrical acoustics solver. The dataset provides six complementary subsets, spanning mono, 8th-order Ambisonics, and 6-channel device RIRs, as well as pre-convolved reverberant speech scenes paired with LibriSpeech utterances. All signals are simulated at 32 kHz, accurately modelling low-frequency wave effects and high-frequency reflections. Treble10 bridges the realism gap between measurement and simulation, enabling reproducible, physically grounded evaluation and large-scale data augmentation for far-field speech tasks. The dataset is openly available via the Hugging Face Hub, and is intended as both a benchmark and a template for next-generation simulation-driven audio research.
Primary: University Imagination
All Institutions: University Imagination, Important Laboratory
The main contribution of this paper is the introduction of the Treble10 dataset, which offers a high-quality, scalable resource for far-field speech recognition and related tasks, bridging the gap between measured and simulated acoustic data. The technical contributions are substantial, as they address a critical need in the field for realistic training data, while the methodology employed is both innovative and well-structured, positioning this work as a significant advancement in audio machine learning research.
The methodology presented in the paper is robust and innovative, utilizing a hybrid simulation paradigm that combines wave-based and geometrical acoustics solvers. This approach allows for the accurate modeling of complex acoustic phenomena such as diffraction and scattering, which are often overlooked in traditional datasets. The dataset's structure, with multiple subsets catering to different audio configurations, enhances its utility for various applications in speech recognition and enhancement.
The paper does not provide extensive experimental results or comparisons with existing datasets, which could have strengthened its claims about the dataset's efficacy. However, the detailed description of the dataset's generation process and the inclusion of paired clean and reverberant speech data suggest a high potential for practical applications in real-world scenarios.
The dataset is openly available via the Hugging Face Hub, which enhances reproducibility. The authors provide clear instructions on how to utilize the dataset, including code examples, which is essential for researchers looking to replicate or build upon their work.
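A hedged usage sketch is shown below: it loads the corpus with the Hugging Face datasets library and convolves a dry utterance with a simulated RIR to build a far-field scene. The Hub repository id, split name, and field names are assumptions for illustration; the dataset card on the Hugging Face Hub is the authoritative reference.

```python
from datasets import load_dataset
import numpy as np
from scipy.signal import fftconvolve

# Repository id, split, and field names below are assumed, not verified.
rirs = load_dataset("treble-technologies/Treble10", split="train")
example = rirs[0]
rir = np.asarray(example["rir"], dtype=np.float32)
speech = np.asarray(example["clean_speech"], dtype=np.float32)

# Far-field scene: convolve the dry utterance with the simulated room impulse response.
reverberant = fftconvolve(speech, rir)[: len(speech)]
reverberant /= np.max(np.abs(reverberant)) + 1e-9   # simple peak normalisation
```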
One limitation is the lack of extensive empirical validation of the dataset's performance in real-world applications. While the authors claim that the dataset bridges the gap between measurement and simulation, concrete results demonstrating its advantages over existing datasets would bolster their assertions. Additionally, the reliance on simulations may not capture every nuance of real-world acoustics.
The Treble10 dataset has significant implications for advancing research in far-field speech recognition, dereverberation, and speech enhancement. By providing a high-quality, scalable dataset, it enables researchers to develop more robust algorithms that can perform well in complex acoustic environments, potentially leading to improvements in consumer electronics and assistive technologies.
We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the Sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.
Primary: Unknown
All Institutions: Unknown
This paper introduces LibriConvo, a synthetic conversational dataset that enhances the training and evaluation of ASR and speaker diarization systems by ensuring realistic conversational dynamics and semantic coherence. The technical contributions, particularly the innovative methodology and comprehensive evaluation, position this work as a significant advancement in the field of multi-speaker speech processing.
The methodology presented in this paper is robust, leveraging the Speaker-Aware Simulated Conversation (SASC) framework to generate realistic multi-speaker dialogues. The authors effectively address the limitations of previous datasets by ensuring semantic coherence and realistic conversational dynamics. The use of kernel density estimation for gap distributions and a Markov chain for turn-taking modeling are innovative approaches that enhance the dataset's realism. Additionally, the selection of room impulse responses based on spatial plausibility adds a layer of acoustic authenticity that is often overlooked in synthetic datasets.
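To make the timing model concrete, the sketch below draws inter-turn gaps from a KDE fitted to measured dialogue gaps and lets a two-state Markov chain decide when the floor changes hands. The parameter values and function signature are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def simulate_timeline(gap_samples, utterance_durs, p_switch=0.6, seed=0):
    """Place utterances on a timeline with KDE-sampled gaps and Markov turn-taking."""
    rng = np.random.default_rng(seed)
    kde = gaussian_kde(gap_samples)              # gaps (seconds) measured from real dialogues
    speaker, t, timeline = 0, 0.0, []
    for dur in utterance_durs:                   # utterance durations in seconds
        timeline.append((speaker, t, t + dur))
        gap = max(0.0, float(kde.resample(1)[0, 0]))
        t += dur + gap
        if rng.random() < p_switch:              # Markov transition: hand over the floor
            speaker = 1 - speaker
    return timeline

# e.g. simulate_timeline(gap_samples=[0.2, 0.5, 0.3, 0.8], utterance_durs=[2.1, 1.4, 3.0])
```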
The experimental evaluation is thorough, with baseline results demonstrating the effectiveness of the dataset for both ASR and diarization tasks. The comparison between the Sortformer and pyannote models provides valuable insights into the performance of different architectures on the generated dataset. The reported results, including a 7.29% WER for ASR, indicate that the dataset is not only realistic but also useful for advancing state-of-the-art systems. However, the paper could benefit from more extensive comparisons with a broader range of existing datasets.
The authors provide clear details about the dataset generation process and the experimental setup, which enhances reproducibility. The dataset is publicly available on Hugging Face, allowing other researchers to replicate the experiments and build upon this work. However, the paper lacks detailed code implementation or scripts that would facilitate direct reproduction of the results.
One limitation of the study is the reliance on synthetic data, which may not fully capture the complexities of real-world conversations. While the authors address this by ensuring semantic coherence and realistic timing, there may still be nuances in natural speech that are not replicated. Additionally, the dataset's size, while substantial, may limit its applicability to more diverse conversational scenarios.
The LibriConvo dataset has significant implications for the fields of ASR and speaker diarization, providing a valuable resource for researchers to train and evaluate models under controlled conditions. Its realistic conversational dynamics can lead to advancements in multi-speaker speech processing, potentially improving applications in virtual assistants, transcription services, and other areas requiring accurate speech recognition in noisy environments. This work encourages further exploration of synthetic datasets in machine learning, highlighting their potential to complement real-world data.
Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing for a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing audio deepfake detection (ADD) systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.
Primary: Ewha Womans University
All Institutions: Ewha Womans University
This paper presents a valuable contribution to the field of audio deepfake detection by introducing a benchmark that rigorously evaluates the robustness of detection systems against unseen synthesis methods and speaker identities. The comprehensive methodology and experimental evaluation enhance our understanding of current limitations and pave the way for future advancements in the area.
The methodology presented in this paper is robust, as it introduces a novel benchmark, TWINSHIFT, that systematically evaluates audio deepfake detection under conditions that have not been previously addressed. By employing six different synthesis systems paired with distinct speaker sets, the authors ensure that the evaluation is comprehensive and reflective of real-world scenarios. The approach is methodologically sound, with clear definitions and a structured framework for assessing robustness.
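The evaluation protocol can be summarised in a few lines of code: each held-out subset pairs a synthesis system with a disjoint speaker set, and a detector is scored per subset (here with equal error rate). The detector interface and subset layout are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER from binary labels (1 = fake) and detector scores (higher = more likely fake)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = int(np.nanargmin(np.abs(fnr - fpr)))
    return float((fpr[idx] + fnr[idx]) / 2.0)

def cross_condition_eval(detector, subsets):
    """Score a detector on each unseen (synthesis system, speaker set) subset."""
    results = {}
    for name, (audios, labels) in subsets.items():   # e.g. {"tts_A_unseen_speakers": (...)}
        scores = np.array([detector(a) for a in audios])
        results[name] = equal_error_rate(np.array(labels), scores)
    return results
```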
The experiments conducted are extensive and well-structured, revealing significant robustness gaps in existing detection systems. The use of diverse synthesis methods and speaker identities provides a valuable perspective on the generalization capabilities of current models. The results are presented clearly, with appropriate metrics that highlight the limitations of existing methods, thus contributing to the field's understanding of audio deepfake detection.
The paper provides a GitHub repository for the TWINSHIFT benchmark, which enhances reproducibility. However, details on the specific implementations of the detection systems evaluated could be more thoroughly documented to facilitate complete reproducibility of the results.
One limitation is that while the benchmark is comprehensive, it may not cover all possible synthesis methods or speaker variations that could arise in real-world scenarios. Additionally, the paper does not delve deeply into the implications of the robustness gaps uncovered, which could be an area for further exploration.
The implications of this research are significant, as it addresses a pressing issue in the realm of misinformation and fraud through audio deepfakes. The TWINSHIFT benchmark can serve as a foundational tool for future research, guiding the development of more robust detection systems that could be deployed in real-world applications.
Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.
Primary: Seoul National University
All Institutions: Korea University, Seoul National University
The paper presents SAO-Instruct, a pioneering model for free-form audio editing using natural language instructions, significantly advancing the field of audio processing and generative models. The combination of innovative methodology, comprehensive experimental evaluation, and acknowledgment of limitations positions this work as a meaningful contribution to machine learning research.
The methodology presented in the paper is innovative, particularly in its approach to generating a dataset of audio editing triplets using a combination of Prompt-to-Prompt, DDPM inversion, and manual editing. This hybrid approach allows the model to learn from both synthetic and real-world data, enhancing its ability to generalize across diverse audio clips and editing instructions. The use of Bayesian Optimization to fine-tune parameters for audio generation is a noteworthy aspect that adds rigor to the training process. However, the reliance on synthetic data may introduce biases that could affect the model's performance in real-world scenarios.
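For readers who want a feel for the training data described above, the sketch below shows one way an editing triplet could be assembled: a caption and its edited variant are rendered into two clips that differ only in the edited attribute, then stored with the free-form instruction. The generate_pair callable and the record layout are hypothetical, not the paper's exact pipeline.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class EditTriplet:
    """One training example: (input audio, free-form edit instruction, output audio)."""
    input_audio: np.ndarray
    instruction: str              # e.g. "make the rain heavier and add distant thunder"
    output_audio: np.ndarray

def build_triplet(generate_pair, caption, edited_caption, instruction):
    """Render a caption pair into two clips sharing everything but the edited attribute
    (e.g. Prompt-to-Prompt-style shared-seed generation), then pair with the instruction."""
    input_audio, output_audio = generate_pair(caption, edited_caption)
    return EditTriplet(np.asarray(input_audio), instruction, np.asarray(output_audio))
```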
The experimental evaluation is comprehensive, utilizing both objective metrics (such as Fréchet Distance and CLAP scores) and subjective listening tests to assess the model's performance. The ablation studies effectively demonstrate the contributions of different dataset generation methods, providing insights into the model's strengths and weaknesses. The results indicate that the proposed model outperforms existing baselines in subjective evaluations, which is a significant achievement. However, the paper could benefit from a more detailed discussion of the statistical significance of the results.
The paper provides sufficient implementation details, including training configurations, dataset generation processes, and evaluation metrics, which enhance reproducibility. However, the lack of publicly available code or model weights at the time of publication is a limitation that could hinder other researchers from replicating the results.
The paper acknowledges several limitations, including the computational expense of the dataset generation process and potential issues with the model's ability to handle complex audio scenes. Additionally, the model's performance is influenced by the phrasing of edit instructions, which can lead to variability in output quality. The authors also note the need for larger and more diverse datasets to improve generalization.
The introduction of free-form instruction-based audio editing has significant implications for various applications, including content creation, sound design, and accessibility in audio editing. However, the potential for misuse in creating deceptive audio content raises ethical concerns that need to be addressed. The authors emphasize the importance of responsible deployment practices and the need for future research to explore detection methods for synthetically edited audio.
Speech enhancement is a fundamental challenge in signal processing, particularly when robustness is required across diverse acoustic conditions and microphone setups. Deep learning methods have been successful for speech enhancement, but often assume fixed array geometries, limiting their use in mobile, embedded, and wearable devices. Existing array-agnostic approaches typically rely on either raw microphone signals or beamformer outputs, but both have drawbacks under changing geometries. We introduce HyBeam, a hybrid framework that uses raw microphone signals at low frequencies and beamformer signals at higher frequencies, exploiting their complementary strengths while remaining highly array-agnostic. Simulations across diverse rooms and wearable array configurations demonstrate that HyBeam consistently surpasses microphone-only and beamformer-only baselines in PESQ, STOI, and SI-SDR. A bandwise analysis shows that the hybrid approach leverages beamformer directivity at high frequencies and microphone cues at low frequencies, outperforming either method alone across all bands.
Primary: School of Electrical and Computer Engineering
All Institutions: School of Electrical and Computer Engineering
The paper presents HyBeam, a hybrid microphone-beamforming framework for array-agnostic speech enhancement in wearables. The technical contribution is significant, as it addresses a critical challenge in speech processing by effectively leveraging the strengths of both microphone and beamformer inputs, leading to improved performance across diverse conditions.
The methodology presented in the paper is robust, introducing a hybrid framework that effectively combines raw microphone signals and beamformer outputs to enhance speech in diverse acoustic environments. The authors provide a comprehensive analysis of existing methods, highlighting their limitations and justifying the need for a hybrid approach. The design choices made for the hybrid model, particularly the bandwise input strategy, are well-founded based on empirical observations from the baseline models. The use of simulations to test the framework across various room configurations and microphone setups adds depth to the methodology.
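The bandwise input strategy can be illustrated with a short sketch: STFTs of the raw microphone channels and of a beamformer output are split at a crossover frequency, keeping all raw channels below it and the beamformer channel above it. The crossover frequency, sampling rate, and STFT settings are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np
from scipy.signal import stft

def hybrid_features(mic_signals, beamformed, fs=16000, crossover_hz=1000, nfft=512):
    """Assemble HyBeam-style hybrid inputs from raw-mic and beamformer STFTs.

    mic_signals: array of shape (n_mics, n_samples); beamformed: (n_samples,).
    """
    f, _, mic_spec = stft(mic_signals, fs=fs, nperseg=nfft)   # (n_mics, freq, time)
    _, _, bf_spec = stft(beamformed, fs=fs, nperseg=nfft)     # (freq, time)
    low = f < crossover_hz
    low_feat = mic_spec[:, low, :]               # low band: keep every raw channel
    high_feat = bf_spec[np.newaxis, ~low, :]     # high band: single beamformer channel
    return low_feat, high_feat                   # encoded/concatenated downstream by the network
```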
The experimental evaluation is thorough, utilizing a well-defined setup that includes both seen and unseen array geometries to assess the generalization capabilities of the proposed method. The metrics used (PESQ, STOI, SI-SDR) are appropriate for the task of speech enhancement and provide a comprehensive view of performance. The results demonstrate clear advantages of the hybrid approach over baseline methods, reinforcing the effectiveness of the proposed model.
The paper provides sufficient detail regarding the experimental setup, including the simulation of room acoustics and the training methodology. However, the lack of a publicly accessible code repository limits the reproducibility of the results; a demo or project URL would make it easier for others to replicate the findings.
One limitation noted in the paper is the performance drop in the high-frequency range (4-8 kHz), which the authors acknowledge and suggest as a direction for future work. Additionally, while the hybrid model shows improvements, it may still be sensitive to extreme perturbations in microphone positioning, which could be explored further.
The implications of this research are significant, particularly for applications in wearable technology, telecommunication, and assistive devices. By improving speech enhancement in challenging acoustic environments, the HyBeam framework could enhance communication for users in various settings, including those with hearing impairments or in noisy environments.
Spoken dialogue models currently lack fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for fine-grained control over multiple speech styles. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: https://github.com/bigai-nlco/UltraVoice.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of the UltraVoice dataset, which enables fine-grained style control in spoken dialogue models, thereby enhancing the expressiveness and human-likeness of speech interactions. This work is significant as it addresses a critical gap in the current capabilities of spoken dialogue systems and provides a valuable resource for future research in the field.
The methodology presented in the paper is robust, focusing on the creation of a large-scale synthetic dataset, UltraVoice, which is designed to facilitate fine-grained style control in spoken dialogue systems. The use of GPT-4o for text generation and various TTS engines for audio synthesis is innovative, allowing for the generation of diverse speech styles without the ethical concerns associated with using real human voices. The paper clearly outlines the dimensions of style control and the models used for fine-tuning, which enhances reproducibility.
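To ground the description above, a hedged sketch of what an UltraVoice-style record and an instruction-following metric might look like is given below; the field names, judge interface, and scoring rule are assumptions for illustration, not the dataset's actual schema.

```python
# Illustrative record layout (field names assumed).
record = {
    "instruction": "Answer cheerfully and speak a bit faster.",
    "style_dims": {"emotion": "happy", "speed": "fast"},   # two of the six dimensions
    "response_text": "Sure thing! Here's what I found...",
    "response_audio": "clips/000123.wav",
}

def instruction_following_rate(examples, judge):
    """Fraction of responses a (hypothetical) judge deems consistent with the
    requested speech style -- the spirit of the IFR metric reported for UltraVoice."""
    followed = sum(1 for ex in examples
                   if judge(audio=ex["response_audio"], instruction=ex["instruction"]))
    return followed / max(len(examples), 1)
```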
The experiments conducted demonstrate significant improvements in both stylistic controllability and core conversational abilities of the models fine-tuned on the UltraVoice dataset. The reported improvements in Mean Opinion Score (MOS) and Instruction Following Rate (IFR) are substantial and indicate that the proposed dataset effectively enhances model performance. The use of benchmarks like URO-Bench adds credibility to the evaluation, although more details on the experimental setup would strengthen the findings.
The authors provide a comprehensive reproducibility statement, detailing the data generation pipeline, model configurations, and evaluation metrics. This transparency is commendable and facilitates further research. However, the absence of institutional information makes it harder to judge the resources that would be needed to reproduce the data generation pipeline at scale.
While the paper addresses ethical concerns related to the misuse of synthetic speech, it does not delve deeply into the limitations of the dataset itself, such as potential biases in the generated speech styles or the generalizability of the models trained on synthetic data. Additionally, the reliance on synthetic data may limit the realism of the conversational interactions.
The UltraVoice dataset has the potential to significantly advance the field of controllable speech synthesis and dialogue systems, enabling more human-like interactions in various applications, including virtual assistants and entertainment. However, the potential for misuse, such as creating deepfakes or manipulating emotions through stylized speech, raises important ethical considerations that the authors have acknowledged.