Spoken Language Models (SLMs) are increasingly central to modern speech-driven applications, but performance degrades under acoustic shift: real-world noise, reverberation, and microphone variation. Prior solutions rely on offline domain adaptation, which is post-hoc, data-intensive, and slow. We introduce the first test-time adaptation (TTA) framework for generative SLMs that process interleaved audio-text prompts. Our method updates a small, targeted subset of parameters during inference using only the incoming utterance, requiring no source data or labels. This stabilizes token distributions and improves robustness to acoustic variability without degrading core task accuracy. Evaluated on automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench, our approach yields consistent gains under diverse corruptions. Because adaptation touches only a small fraction of weights, it is both compute- and memory-efficient, supporting deployment on resource-constrained platforms. This work enhances the robustness and adaptability of generative SLMs for real-world speech-driven applications.
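The abstract does not spell out the adaptation objective or which parameters are updated; a common test-time adaptation recipe consistent with the description is entropy minimization over the output token distribution while updating only normalization parameters. The sketch below follows that assumption; the `slm` forward interface is hypothetical.

```python
import torch

def test_time_adapt(slm, audio_text_prompt, steps=3, lr=1e-4):
    """Single-utterance TTA sketch: minimize output-token entropy while
    updating only LayerNorm parameters (an assumption; the paper's actual
    parameter subset and objective may differ)."""
    # Freeze everything, then unfreeze a small, targeted subset.
    for p in slm.parameters():
        p.requires_grad_(False)
    adapt_params = []
    for module in slm.modules():
        if isinstance(module, torch.nn.LayerNorm):
            for p in module.parameters():
                p.requires_grad_(True)
                adapt_params.append(p)

    optimizer = torch.optim.AdamW(adapt_params, lr=lr)
    for _ in range(steps):
        logits = slm(audio_text_prompt).logits        # [T, vocab]; forward API is hypothetical
        probs = torch.softmax(logits, dim=-1)
        entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1).mean()
        optimizer.zero_grad()
        entropy.backward()
        optimizer.step()
    return slm
```

In deployment, the adapted parameters would typically be reset after each utterance so that adaptation to one recording does not carry over to the next.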
Primary: Graduate Institute of Communication Engineering, National Taiwan University
All Institutions: Graduate Institute of Communication Engineering, National Taiwan University, Meta Reality Labs Research
The main contribution of this paper is the introduction of the SLM-TTA framework, which enables real-time adaptation of generative spoken language models to varying acoustic conditions without the need for additional data or labels. This work significantly enhances the robustness and adaptability of SLMs, making them more applicable to real-world scenarios where acoustic variability is prevalent.
The proposed SLM-TTA framework introduces a novel approach to test-time adaptation for generative spoken language models by updating a small subset of parameters during inference. This methodology addresses the critical issue of performance degradation in SLMs under various acoustic conditions without requiring additional data or labels. The interleaved audio-text prompts are a clever way to leverage incoming utterances for adaptation, which enhances the model's robustness while maintaining core task accuracy. The focus on efficiency in terms of compute and memory usage is particularly relevant for deployment in resource-constrained environments, making the approach practical for real-world applications.
The experimental setup is robust, evaluating the framework across multiple tasks, including automatic speech recognition, speech translation, and 19 audio understanding tasks from AIR-Bench. The consistent performance gains reported under diverse corruptions provide strong evidence of the method's effectiveness. However, the paper could benefit from more detailed comparisons with existing adaptation techniques to better contextualize its contributions.
The paper lacks specific implementation details that would facilitate reproducibility, such as hyperparameter settings, model architectures, and training procedures. Providing a clear description of the experimental setup and access to code or datasets would significantly enhance reproducibility.
One limitation is the reliance on a targeted subset of parameters for adaptation, which may not capture all necessary variations in acoustic conditions. Additionally, the framework's performance in highly variable or extreme conditions is not fully explored, which could be a potential area for further research.
The framework has significant implications for the deployment of generative spoken language models in real-world applications, particularly in environments with varying acoustic conditions. Its focus on efficiency and adaptability could lead to improvements in speech-driven applications across various domains, including virtual assistants, automated transcription services, and accessibility tools.
Although Large Audio-Language Models (LALMs) deliver state-of-the-art (SOTA) performance, they frequently suffer from hallucinations, e.g., generating text not grounded in the audio input. We analyze these grounding failures and identify a distinct taxonomy: Event Omission, False Event Identity, Temporal Relation Error, and Quantitative Temporal Error. To address this, we introduce the AHA (Audio Hallucination Alignment) framework. By leveraging counterfactual hard negative mining, our pipeline constructs a high-quality preference dataset that forces models to distinguish strict acoustic evidence from linguistically plausible fabrications. Additionally, we establish AHA-Eval, a diagnostic benchmark designed to rigorously test these fine-grained temporal reasoning capabilities. We apply this data to align Qwen2.5-Omni. The resulting model, Qwen-Audio-AHA, achieves a 13.7% improvement on AHA-Eval. Crucially, this benefit generalizes beyond our diagnostic set. Our model shows substantial gains on public benchmarks, including 1.3% on MMAU-Test and 1.6% on MMAR, outperforming the latest SOTA methods. The model and dataset are open-sourced at https://github.com/LLM-VLM-GSL/AHA.
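The abstract describes alignment on a preference dataset but does not name the objective; Direct Preference Optimization (DPO) is one standard choice for this kind of chosen-vs-rejected data. A minimal sketch of that loss, under the assumption that each example pairs an audio-grounded caption (chosen) with a counterfactual hallucinated one (rejected):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style preference loss on audio-grounded (chosen) vs.
    counterfactual hallucinated (rejected) responses; the paper's actual
    alignment objective may differ. Inputs are summed token
    log-probabilities for each response under the policy and a frozen
    reference model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```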
Primary: Arizona State University
All Institutions: Arizona State University
The main contribution of this paper is the introduction of the AHA framework, which effectively reduces hallucinations in LALMs by employing a novel taxonomy of errors and counterfactual hard negative mining. This work not only enhances the performance of audio-language models but also sets a precedent for future research in the domain, emphasizing the importance of rigorous evaluation and alignment strategies.
The paper introduces the AHA framework, which innovatively addresses the issue of hallucinations in Large Audio-Language Models (LALMs) through a systematic taxonomy of errors and the use of counterfactual hard negative mining. The methodology is well-structured, combining dataset construction with a diagnostic benchmark (AHA-Eval) to rigorously evaluate the model's performance. The approach of leveraging negative examples to enhance model alignment is particularly noteworthy, as it challenges the traditional reliance on positive reinforcement alone.
The experimental results demonstrate clear gains for the Qwen-Audio-AHA model, including a 13.7% improvement on the AHA-Eval benchmark and consistent performance gains on public datasets. The comprehensive evaluation across multiple benchmarks and the clear presentation of results enhance the credibility of the findings. However, the paper could benefit from more detailed comparisons with other state-of-the-art models to contextualize the improvements.
The implementation details are adequately provided, including hyperparameters and training configurations. The open-sourcing of the model and dataset is a strong point for reproducibility, allowing other researchers to validate and build upon the work. However, more explicit instructions on the setup and execution of experiments would further enhance reproducibility.
The paper acknowledges limitations related to the evaluation protocols, particularly the rigid sensitivity of current models in judging semantic equivalence. This suggests that the reported hallucination rates might be inflated due to false positives. Additionally, the focus on fine-grained reasoning might overlook broader contextual understanding, which could limit the applicability of the findings in more complex scenarios.
The proposed AHA framework has the potential to significantly advance the field of audio-language understanding by providing a robust mechanism to mitigate hallucinations, thus improving the reliability of LALMs in real-world applications such as multimedia forensics and acoustic monitoring. The emphasis on fine-grained reasoning could also inspire further research into more nuanced evaluation metrics for multimodal models.
Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.
Primary: HKUST(GZ)
All Institutions: HKUST(GZ), Tencent, Shanghai Jiao Tong University, Technical University of Munich
The main contribution of this paper is the introduction of PhyAVBench, a novel benchmark designed to evaluate the audio physics grounding capabilities of T2AV models through a systematic and controlled approach. This work significantly advances the field by addressing a critical gap in existing evaluation frameworks and promoting the development of more realistic audio-visual generation systems.
The methodology presented in PhyAVBench is robust and innovative, employing a systematic approach to benchmark T2AV models against a comprehensive set of audio-physical dimensions. The use of controlled variables to assess models' sensitivity to physical changes is a significant advancement over existing benchmarks, which often overlook the underlying physical principles. The integration of human refinement and quality control in prompt generation and video collection enhances the reliability of the benchmark. The introduction of the Audio-Physics Sensitivity Test (APST) and the Contrastive Physical Response Score (CPRS) as evaluation metrics is particularly noteworthy, as it allows for a nuanced understanding of models' performance in relation to physical realism.
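The exact definition of the Contrastive Physical Response Score is not given in this summary; one plausible instantiation scores the fraction of paired prompts for which the generated audio shifts in the physically expected direction. The sketch below assumes hypothetical `generate` and `score_audio` callables:

```python
def contrastive_physical_response_score(pairs, generate, score_audio):
    """Hypothetical CPRS-style metric: for each pair of prompts that differ
    in one controlled physical variable, check whether the generated audio
    changes in the physically expected direction.

    pairs: list of (prompt_a, prompt_b, expected_sign), where expected_sign
           is +1 if the physical change should raise the measured acoustic
           quantity and -1 if it should lower it.
    generate: text prompt -> audio waveform (the T2AV model under test).
    score_audio: audio -> scalar acoustic measurement (e.g. level or decay
                 time), chosen per test point.
    """
    hits = 0
    for prompt_a, prompt_b, expected_sign in pairs:
        delta = score_audio(generate(prompt_b)) - score_audio(generate(prompt_a))
        if delta * expected_sign > 0:
            hits += 1
    return hits / max(len(pairs), 1)
```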
While the paper primarily focuses on the benchmark design and data curation, the proposed framework is well-structured to facilitate future experimental evaluations of various T2AV models. The comprehensive coverage of audio-physical dimensions and scenarios ensures that the benchmark can be applied to a wide range of models, from commercial to academic. However, the paper lacks detailed results from model evaluations, which would have provided concrete evidence of the benchmark's effectiveness and the models' capabilities.
The paper outlines a clear data curation pipeline and methodology for creating the benchmark, which enhances reproducibility. However, the absence of specific implementation details or code availability limits the ability to fully reproduce the results. Future work should consider providing access to the benchmark dataset and evaluation scripts to foster reproducibility within the research community.
One limitation of the study is the focus on the benchmark's design without presenting empirical results from model evaluations. This leaves questions about the practical applicability of the benchmark unanswered. Additionally, while the benchmark covers a wide range of physical phenomena, it may not encompass all possible scenarios encountered in real-world audio-visual generation, which could limit its generalizability.
PhyAVBench has the potential to significantly influence the development of T2AV models by providing a rigorous framework for evaluating their understanding of audio-physical principles. This can lead to advancements in applications such as virtual reality, gaming, and filmmaking, where realistic audio-visual content is crucial. By emphasizing the importance of physical realism in audio generation, the benchmark encourages researchers to develop models that are not only perceptually convincing but also physically grounded.
The rapid advancements in artificial intelligence have significantly accelerated the adoption of speech recognition technology, leading to its widespread integration across various applications. However, this surge in usage also highlights a critical issue: audio data is highly vulnerable to unauthorized exposure and analysis, posing significant privacy risks for businesses and individuals. This paper introduces an Information-Obfuscation Reversible Adversarial Example (IO-RAE) framework, the first method designed to safeguard audio privacy using reversible adversarial examples. IO-RAE leverages large language models to generate misleading yet contextually coherent content, effectively preventing unauthorized eavesdropping by humans and Automatic Speech Recognition (ASR) systems. Additionally, we propose the Cumulative Signal Attack technique, which mitigates high-frequency noise and enhances attack efficacy by targeting low-frequency signals. Our approach protects audio data without degrading its quality or recoverability. Experimental evaluations demonstrate the superiority of our method, achieving a targeted misguidance rate of 96.5% and a 100% untargeted misguidance rate in obfuscating target keywords across multiple ASR models, including a commercial black-box system from Google. Furthermore, the quality of the recovered audio, measured by the Perceptual Evaluation of Speech Quality score, reached 4.45, comparable to high-quality original recordings. Notably, the recovered audio processed by ASR systems exhibited an error rate of 0%, indicating nearly lossless recovery. These results highlight the practical applicability and effectiveness of our IO-RAE framework in protecting sensitive audio privacy.
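The abstract describes the Cumulative Signal Attack only at a high level; one reading is an iterative gradient attack whose accumulated perturbation is constrained to low frequencies so it avoids audible high-frequency noise. A sketch under that assumption (the `asr_loss_fn` interface is hypothetical, and the paper's exact update rule may differ):

```python
import torch

def cumulative_low_freq_attack(wave, asr_loss_fn, steps=100, step_size=1e-3,
                               sample_rate=16000, cutoff_hz=1000):
    """Sketch of a cumulative, low-frequency-constrained adversarial
    perturbation in the spirit of the described Cumulative Signal Attack.

    wave: [T] clean waveform.
    asr_loss_fn: waveform -> scalar loss that is low when the ASR output
                 matches the attacker's target transcription."""
    delta = torch.zeros_like(wave, requires_grad=True)
    freqs = torch.fft.rfftfreq(wave.numel(), d=1.0 / sample_rate)
    low_pass = (freqs <= cutoff_hz).float()

    for _ in range(steps):
        loss = asr_loss_fn(wave + delta)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            # Accumulate the update, then keep only low-frequency content
            # so the perturbation avoids audible high-frequency noise.
            delta -= step_size * grad.sign()
            spec = torch.fft.rfft(delta)
            delta.copy_(torch.fft.irfft(spec * low_pass, n=wave.numel()))
    return (wave + delta).detach()
```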
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of the IO-RAE framework, which effectively combines reversible adversarial examples with advanced techniques to protect audio privacy. This work represents a significant step forward in addressing the critical issue of audio data vulnerability, showcasing innovative methodologies and promising experimental results that could influence future research and applications in the field.
The proposed IO-RAE framework introduces a novel approach to audio privacy protection by combining reversible adversarial examples with large language models to generate misleading content. The methodology is well-structured, employing the Cumulative Signal Attack (CSA) to enhance the effectiveness of adversarial noise while maintaining audio quality. The use of alignment techniques to locate sensitive timestamps and the integration of LLMs for target generation are innovative aspects that contribute to the overall robustness of the method. However, the paper could benefit from more detailed explanations of the algorithms and their computational complexities.
The experiments are thorough, utilizing multiple datasets (Mozilla Common Voice, TIMIT, and LibriSpeech) and a variety of ASR models, including a commercial black-box system from Google. The reported metrics, such as targeted misguidance rates and audio quality scores, demonstrate the effectiveness of the IO-RAE framework. The results indicate a strong performance in both attack success and audio recovery, with high scores in perceptual quality. However, the evaluation could be strengthened by including more comparative analyses with existing methods to highlight the advantages of IO-RAE.
The paper provides a reasonable level of detail regarding the experimental setup, including datasets, model architectures, and evaluation metrics. However, the absence of a publicly available code repository or demo limits the reproducibility of the results. Including a link to a GitHub repository or similar would enhance the ability of other researchers to validate and build upon this work.
One limitation is the reliance on specific datasets, which may not fully represent real-world scenarios. Additionally, while the method shows promise against various ASR systems, the robustness of the approach against more sophisticated adversarial defenses remains to be fully assessed. The paper does not address potential ethical implications of using adversarial techniques in audio privacy, which is an important consideration in this domain.
The IO-RAE framework has significant implications for audio privacy protection in various applications, including personal assistants, voice-activated devices, and sensitive communications. By effectively obfuscating audio data while preserving its quality, this research could help mitigate privacy risks associated with the widespread use of ASR technologies. The approach could also inspire further research into adversarial techniques in other domains, promoting a broader understanding of privacy protection in machine learning.
End-to-end automatic speech recognition has become the dominant paradigm in both academia and industry. To enhance recognition performance, the Weighted Finite-State Transducer (WFST) is widely adopted to integrate acoustic and language models through static graph composition, providing robust decoding and effective error correction. However, WFST decoding relies on a frame-by-frame autoregressive search over CTC posterior probabilities, which severely limits inference efficiency. Motivated by the goal of establishing a more principled compatibility between WFST decoding and CTC modeling, we systematically study the two fundamental components of CTC outputs, namely blank and non-blank frames, and identify a key insight: blank frames primarily encode positional information, while non-blank frames carry semantic content. Building on this observation, we introduce Keep-Only-One and Insert-Only-One, two decoding algorithms that explicitly exploit the structural roles of blank and non-blank frames to achieve significantly faster WFST-based inference without compromising recognition accuracy. Experiments on large-scale in-house, AISHELL-1, and LibriSpeech datasets demonstrate state-of-the-art recognition accuracy with substantially reduced decoding latency, enabling truly efficient and high-performance WFST decoding in modern speech recognition systems.
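The abstract does not detail the Keep-Only-One and Insert-Only-One procedures; in the spirit of the stated insight that blank frames mainly carry positional information, a frame-reduction step might collapse each run of blank-dominated frames to a single frame before the frame-synchronous WFST search. A sketch of that idea (the threshold and reduction rule are assumptions, not the paper's algorithm):

```python
import numpy as np

def keep_only_one_blanks(ctc_log_probs, blank_id=0, blank_thresh=0.95):
    """Collapse every run of consecutive blank-dominated frames to a single
    representative frame before frame-synchronous WFST decoding, so the
    search visits far fewer frames. (The paper's actual KOO/IOO rules may
    differ from this sketch.)

    ctc_log_probs: [T, V] per-frame CTC log posteriors."""
    probs = np.exp(ctc_log_probs)
    is_blank = probs[:, blank_id] >= blank_thresh
    kept = []
    in_blank_run = False
    for t in range(len(probs)):
        if is_blank[t]:
            if not in_blank_run:          # keep only the first blank of the run
                kept.append(t)
                in_blank_run = True
        else:
            kept.append(t)
            in_blank_run = False
    return ctc_log_probs[kept]
```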
Primary: unknown
All Institutions: unknown
The paper presents a significant contribution to the field of automatic speech recognition by introducing innovative algorithms that enhance WFST decoding efficiency without sacrificing accuracy. The methodology is well-grounded in theoretical insights, and the experimental results support the claims of improved performance, although further details on reproducibility and broader applicability would strengthen the work.
The paper introduces two novel algorithms, Keep-Only-One (KOO) and Insert-Only-One (IOO), which leverage the structural roles of blank and non-blank frames in CTC outputs to enhance WFST decoding. This approach is methodologically sound, as it builds on a clear understanding of the underlying mechanics of CTC and WFST. The algorithms are well-defined, and the authors provide a systematic study of the components involved, which adds rigor to their claims. However, the paper could benefit from a more detailed explanation of the algorithmic complexity and potential trade-offs involved in their approach.
The authors conduct extensive experiments on large-scale datasets, including in-house data, AISHELL-1, and LibriSpeech, demonstrating state-of-the-art recognition accuracy while significantly reducing decoding latency. The experimental design appears robust, with appropriate metrics for evaluating performance. However, the paper would benefit from a clearer presentation of the results, including comparative analysis with existing state-of-the-art methods, which would strengthen the claims of superiority.
The paper lacks sufficient implementation details, such as the specific configurations used during experiments and any available code or resources. This limits the reproducibility of the results, as other researchers may struggle to replicate the findings without access to the exact methodologies employed.
One limitation is the reliance on the specific characteristics of the datasets used for evaluation, which may not generalize to all speech recognition tasks. Additionally, while the proposed algorithms improve efficiency, the paper does not address potential scenarios where the performance might degrade, such as in highly noisy environments or with diverse accents.
The advancements in decoding efficiency and accuracy have significant implications for real-time automatic speech recognition applications, particularly in areas such as virtual assistants, transcription services, and accessibility technologies. The methodologies proposed could lead to more responsive systems that can operate effectively in a variety of settings.
Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves approximately 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of the Unweighted Accuracy of a full-scale baseline. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than by specific emotion categories: happiness predictions systematically bleed into anger predictions, and sadness predictions bleed into neutral predictions, due to acoustic saturation when actors prioritize clarity over subtlety. Despite this theatricality effect reducing overall RAVDESS accuracy to 46.64%, the model maintains robust arousal detection with 99% recall for anger, 55% recall for neutral, and 27% recall for sadness. These findings demonstrate a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
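The summary reports an 8-bit quantized DistilHuBERT with a roughly 23 MB footprint but does not specify the quantization recipe; post-training dynamic INT8 quantization of the linear layers in PyTorch is one plausible route. A sketch under that assumption, using the public `ntu-spml/distilhubert` checkpoint:

```python
import torch
from transformers import AutoModel

# Load a DistilHuBERT backbone (the emotion classification head is omitted).
model = AutoModel.from_pretrained("ntu-spml/distilhubert").eval()

# Post-training dynamic INT8 quantization of the linear layers -- one
# plausible route to the reported ~23 MB footprint (the paper's exact
# quantization recipe may differ).
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "distilhubert_int8.pt")  # inspect on-disk size
```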
Primary: University of Science and Technology, Zewail City
All Institutions: University of Science and Technology, Zewail City
This paper demonstrates a significant advancement in mobile Speech Emotion Recognition by leveraging a distilled transformer architecture, achieving a balance between model efficiency and accuracy. The methodology is rigorous, and the findings contribute valuable insights into the challenges of cross-corpus generalization in emotion recognition tasks.
The paper presents a well-structured methodology that emphasizes mobile deployment and cross-corpus robustness. The use of DistilHuBERT for model compression is innovative, and the 5-fold Leave-One-Session-Out (LOSO) cross-validation approach is commendable for ensuring speaker independence. The integration of CREMA-D for cross-corpus training is a strategic move to enhance generalization, although the authors acknowledge the challenges posed by the differing nature of the datasets. The acoustic processing and data augmentation techniques are thorough, addressing potential biases and enhancing model robustness.
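The 5-fold Leave-One-Session-Out protocol on IEMOCAP is straightforward to reproduce: each of the five recorded sessions is held out once, so no speaker appears in both the training and test split of a fold. A sketch assuming each utterance record carries a `session` field:

```python
from collections import defaultdict

def loso_folds(utterances):
    """Leave-One-Session-Out folds for IEMOCAP: hold out each of the 5
    sessions in turn, keeping speakers disjoint across splits.
    `utterances` is a list of dicts with at least a 'session' key (1-5);
    other fields (features, labels) are task-specific assumptions."""
    by_session = defaultdict(list)
    for utt in utterances:
        by_session[utt["session"]].append(utt)

    folds = []
    for held_out in sorted(by_session):
        test = by_session[held_out]
        train = [u for s, us in by_session.items() if s != held_out for u in us]
        folds.append((train, test))
    return folds
```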
The experimental framework is robust, with clear metrics for evaluation. The results indicate a significant improvement in performance metrics with cross-corpus training, and the detailed analysis of class-wise performance provides valuable insights into the model's strengths and weaknesses. However, the overall accuracy on the RAVDESS dataset highlights the challenges of generalization across different emotional expression styles.
The paper mentions that all experimental code and trained models are publicly available, which is crucial for reproducibility. However, specific URLs for accessing these resources are not provided, which could hinder the ease of replication by other researchers.
The paper acknowledges limitations such as the reduced accuracy on the RAVDESS dataset due to the theatrical nature of the emotions expressed, which may not align with the model's training data. Additionally, while the proposed deployment architecture is promising, it requires empirical validation to assess its effectiveness in real-world applications.
The work has significant implications for mobile applications in mental health monitoring and adaptive human-computer interaction. By addressing the computational demands of SER systems, this research paves the way for more accessible and privacy-preserving emotion recognition technologies.
Speech Emotion Recognition (SER) has significant potential for mobile applications, yet deployment remains constrained by the computational demands of state-of-the-art transformer architectures. This paper presents a mobile-efficient SER system based on DistilHuBERT, a distilled and 8-bit quantized transformer that achieves 92% parameter reduction compared to full-scale Wav2Vec 2.0 models while maintaining competitive accuracy. We conduct a rigorous 5-fold Leave-One-Session-Out (LOSO) cross-validation on the IEMOCAP dataset to ensure speaker independence, augmented with cross-corpus training on CREMA-D to enhance generalization. Cross-corpus training with CREMA-D yields a 1.2% improvement in Weighted Accuracy, a 1.4% gain in Macro F1-score, and a 32% reduction in cross-fold variance, with the Neutral class showing the most substantial benefit at 5.4% F1-score improvement. Our approach achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of full-scale baseline performance. Cross-corpus evaluation on RAVDESS reveals that the theatrical nature of acted emotions causes predictions to cluster by arousal level rather than valence: happiness is systematically confused with anger due to acoustic saturation in high-energy expressions. Despite this theatricality effect reducing overall RAVDESS accuracy to 43.29%, the model maintains robust arousal detection with 97% recall for anger and 64% for sadness. These findings establish a Pareto-optimal tradeoff between model size and accuracy, enabling practical affect recognition on resource-constrained mobile devices.
Primary: University of Science and Technology, Zewail City
All Institutions: University of Science and Technology, Zewail City
This paper establishes a mobile-efficient SER system using DistilHuBERT, demonstrating a significant contribution to the field by addressing both computational efficiency and generalization challenges in emotion recognition. The rigorous methodology and experimental validation enhance its relevance and potential impact in real-world applications.
The methodology presented in this paper is robust, emphasizing a mobile-efficient architecture through the use of DistilHuBERT, which is a significant advancement in the field of Speech Emotion Recognition (SER). The authors adopt a strict 5-fold Leave-One-Session-Out (LOSO) cross-validation protocol to mitigate speaker leakage, enhancing the reliability of their results. The integration of the CREMA-D dataset for cross-corpus training is a novel approach that addresses generalization issues, although it introduces complexities due to differences in emotional expressiveness. The use of Adaptive Focal Loss to handle class imbalance is a thoughtful addition that demonstrates an understanding of the challenges in SER.
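The review mentions Adaptive Focal Loss without defining its adaptive component; the sketch below is the standard class-weighted focal loss, which the paper's variant presumably extends (how alpha or gamma adapt during training is not specified here):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha, gamma=2.0):
    """Class-weighted focal loss: FL = -alpha_c * (1 - p_c)^gamma * log(p_c).
    This is the standard formulation; the paper's 'Adaptive Focal Loss'
    may adjust alpha or gamma during training.

    logits: [N, C] class scores, targets: [N] class indices,
    alpha: [C] per-class weights."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-alpha[targets] * (1.0 - pt) ** gamma * log_pt).mean()
```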
The experimental framework is well-structured, utilizing the IEMOCAP dataset as the primary benchmark while ensuring speaker independence through rigorous validation protocols. The results indicate a 1.2% improvement in Weighted Accuracy and a significant reduction in cross-fold variance, showcasing the effectiveness of cross-corpus training. However, the performance drop on the RAVDESS dataset highlights the challenges of domain adaptation, particularly in distinguishing emotions in acted versus naturalistic speech. The findings are presented clearly, with appropriate metrics and comparisons to existing models.
The authors have made efforts to ensure reproducibility by providing access to their experimental code and trained models, although specific URLs are not mentioned in the paper. This commitment to transparency is commendable and facilitates further research in the field.
One notable limitation is the performance drop observed on the RAVDESS dataset, attributed to the theatricality of the emotions expressed, which may not generalize well to real-world applications. Additionally, while the paper discusses the potential for future work, it does not explore the implications of the identified theatricality gap in depth. The reliance on a distilled transformer architecture may also limit the model's ability to capture subtle emotional nuances.
The implications of this research are significant, as it paves the way for practical applications of SER in mobile devices, enhancing user interaction in various domains such as mental health monitoring and customer service. The focus on privacy-preserving, low-latency emotion recognition aligns well with current trends in AI ethics and user-centered design.
Audio-language models combine audio encoders with large language models to enable multimodal reasoning, but they also introduce new security vulnerabilities. We propose a universal targeted latent space attack, an encoder-level adversarial attack that manipulates audio latent representations to induce attacker-specified outputs in downstream language generation. Unlike prior waveform-level or input-specific attacks, our approach learns a universal perturbation that generalizes across inputs and speakers and does not require access to the language model. Experiments on Qwen2-Audio-7B-Instruct demonstrate consistently high attack success rates with minimal perceptual distortion, revealing a critical and previously underexplored attack surface at the encoder level of multimodal systems.
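The abstract specifies a universal perturbation learned against the audio encoder alone; a straightforward formulation optimizes a single additive delta across many utterances so that the encoder's latents move toward a fixed target latent. The encoder interface, distance metric, and perturbation budget below are assumptions:

```python
import torch

def learn_universal_perturbation(encoder, waveforms, target_latent,
                                 epochs=10, lr=1e-3, eps=0.002):
    """Sketch of a universal targeted latent-space attack: one additive
    perturbation, shared across inputs and speakers, is optimized so that
    encoder(x + delta) approaches a fixed target latent. No access to the
    downstream language model is required.

    waveforms: list of [T] tensors padded or cropped to a common length."""
    delta = torch.zeros_like(waveforms[0], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for x in waveforms:
            latent = encoder(x + delta)                  # [L, D] latent sequence
            loss = torch.nn.functional.mse_loss(latent, target_latent)
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():                        # keep the perturbation small
                delta.clamp_(-eps, eps)
    return delta.detach()
```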
Primary: Moshe Sipper
All Institutions: Moshe Sipper, Raz Lapid
The main contribution of this paper is the introduction of a novel encoder-level adversarial attack on audio-language models, which reveals critical vulnerabilities and opens avenues for further research in security within multimodal AI systems. The paper's innovative approach and experimental results provide valuable insights into the robustness of audio-language models, although it requires improvements in reproducibility and a broader exploration of its implications.
The proposed methodology introduces a universal targeted latent space attack that operates at the encoder level of audio-language models. This is a significant departure from traditional waveform-level or input-specific attacks, showcasing an innovative approach to adversarial attacks in multimodal systems. The authors effectively leverage latent representations to induce specific outputs, which is a novel contribution to the field. However, the paper could benefit from a more detailed explanation of the learning process for the universal perturbation and its generalization capabilities across different inputs and speakers.
The experiments conducted on the Qwen2-Audio-7B-Instruct model demonstrate high attack success rates with minimal perceptual distortion, which is commendable. The results are quantitatively presented, showing the effectiveness of the proposed attack method. However, the paper lacks a comprehensive comparison with existing attack methods, which would strengthen the claims of superiority and robustness of the proposed approach.
The paper does not provide sufficient details regarding the implementation of the attack or the datasets used for training and evaluation, which raises concerns about reproducibility. Clearer guidelines or links to code repositories would enhance the ability of other researchers to replicate the results.
One notable limitation is the lack of a thorough exploration of the attack's effectiveness across a broader range of audio-language models beyond the one tested. Additionally, the paper does not address potential countermeasures that could be employed against such attacks, which is crucial for understanding the practical implications of the research.
The implications of this research are significant, as it highlights a previously underexplored vulnerability in multimodal systems that combine audio and language processing. The findings could inform the development of more secure audio-language models and raise awareness about the potential risks associated with deploying such systems in real-world applications.
Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence and potential mode collapse during training, while diffusion methods require multi-step inference that introduces considerable computational overhead. In this work, we introduce Flow2GAN, a two-stage framework that combines Flow Matching training for learning generative capabilities with GAN fine-tuning for efficient few-step inference. Specifically, given audio's unique properties, we first improve Flow Matching for audio modeling through: 1) reformulating the objective as endpoint estimation, avoiding velocity estimation difficulties in empty regions; 2) applying spectral energy-based loss scaling to emphasize perceptually salient quieter regions. Building on these Flow Matching adaptations, we demonstrate that a further stage of lightweight GAN fine-tuning enables us to obtain a one-step generator that produces high-quality audio. In addition, we develop a multi-branch network architecture that processes Fourier coefficients at different time-frequency resolutions, which improves the modeling capabilities compared to prior single-resolution designs. Experimental results indicate that our Flow2GAN delivers high-fidelity audio generation from Mel-spectrograms or discrete audio tokens, achieving better quality-efficiency trade-offs than existing state-of-the-art GAN-based and Flow Matching-based methods. Online demo samples are available at https://flow2gan.github.io, and the source code is released at https://github.com/k2-fsa/Flow2GAN.
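The endpoint reformulation can be sketched concretely: instead of regressing the velocity x1 - x0 at the interpolant x_t = (1 - t) * x0 + t * x1, the network predicts the clean endpoint x1 directly, which stays well-defined even when x1 contains empty (near-silent) regions. The spectral-energy weighting below is only a stand-in for the paper's loss scaling, and the model signature is assumed:

```python
import torch

def endpoint_fm_loss(model, x1, weight_fn=None):
    """Endpoint-estimation flow-matching loss sketch: the network predicts
    the clean endpoint x1 from the interpolant x_t rather than the velocity.

    x1: [B, F, T] target spectrogram-domain audio features.
    model(xt, t): hypothetical signature returning a prediction of x1."""
    x0 = torch.randn_like(x1)                            # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)   # per-example time
    xt = (1.0 - t) * x0 + t * x1
    pred_x1 = model(xt, t.view(-1))

    err = (pred_x1 - x1) ** 2
    if weight_fn is None:
        # Example weighting: down-weight high-energy bins so quieter,
        # perceptually salient regions are emphasized (an assumption about
        # how the spectral energy-based scaling might look).
        weight = 1.0 / (1.0 + x1.abs())
    else:
        weight = weight_fn(x1)
    return (weight * err).mean()
```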
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Flow2GAN, a hybrid framework that effectively combines Flow Matching and GANs for high-fidelity audio generation. This work significantly advances the state of the art in audio synthesis by addressing key limitations of existing methods and providing a robust experimental evaluation of its performance.
The methodology presented in Flow2GAN is innovative, combining Flow Matching and GANs to address the limitations of both approaches in audio generation. The reformulation of the Flow Matching objective to endpoint estimation is a significant improvement, as it mitigates challenges related to velocity estimation in sparse regions. The introduction of spectral energy-based loss scaling is also a thoughtful enhancement that prioritizes perceptually important audio features. The multi-branch network architecture for processing Fourier coefficients at varying resolutions is a novel approach that likely enhances the model's ability to capture complex audio patterns. Overall, the methodology is well-structured and demonstrates a clear progression from theory to practical application.
The experimental section provides a solid evaluation of the proposed method against existing state-of-the-art techniques. The authors report quantitative metrics that indicate improved quality-efficiency trade-offs, which is crucial for practical audio generation applications. However, the paper would benefit from a more detailed description of the datasets used, including their size and diversity, as well as the specific metrics employed for evaluation. Additionally, comparisons with more recent advancements in the field could strengthen the claims of superiority over existing methods.
The authors have included a reproducibility statement and provided links to the demo and source code, which is a positive aspect for the community. However, the paper lacks detailed implementation specifics, such as hyperparameter settings and training protocols, which are essential for other researchers to replicate the results accurately. More thorough documentation in the code repository would enhance reproducibility.
One limitation of the proposed method is the reliance on the two-stage framework, which may introduce additional complexity in training and inference. While the authors claim efficiency in few-step inference, the computational overhead of the initial Flow Matching stage may still be a concern in real-time applications. Additionally, the paper does not address the scalability of the approach to larger datasets or more complex audio generation tasks.
The potential applications of Flow2GAN are significant, particularly in fields such as music generation, sound design, and audio synthesis for virtual environments. By improving the fidelity and efficiency of audio generation, this work could contribute to advancements in interactive media and entertainment. Furthermore, the techniques developed could be adapted for other generative tasks beyond audio, influencing a broader range of machine learning applications.