Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The main contribution of this paper is the introduction of Resp-Agent, a multimodal framework that effectively synthesizes respiratory sounds and integrates clinical context for improved disease diagnosis. This work represents a significant advancement in the application of machine learning to healthcare, particularly in addressing the challenges of data scarcity and class imbalance in respiratory sound analysis.
The paper presents a novel agent-based framework, Resp-Agent, which integrates multimodal data (audio and EHR) for respiratory sound generation and diagnosis. The methodology is innovative, utilizing an Active Adversarial Curriculum Agent (Thinker-A$^2$CA) to dynamically identify weaknesses in diagnostics and schedule targeted synthesis. The Modality-Weaving Diagnoser and Flow Matching Generator are well-conceived to address the representation and data gaps in respiratory sound analysis. The use of large language models (LLMs) for generating clinical narratives and the careful design of the dataset (Resp-229k) enhance the robustness of the approach. However, while the methodology is sound, it heavily relies on the quality of the underlying data and the effectiveness of the LLMs used for synthesis.
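The closed-loop scheduling idea behind Thinker-A$^2$CA can be illustrated in a few lines. The function below is a hypothetical stand-in, not the paper's algorithm: it simply allocates a synthesis budget in proportion to per-class weakness, which is one plausible instantiation of "identify diagnostic weaknesses and schedule targeted synthesis."

```python
import numpy as np

def curriculum_step(per_class_f1, budget=100):
    """One round of a hypothetical active-curriculum scheduler.

    per_class_f1: dict mapping disease label -> current diagnoser F1.
    Allocates the synthesis budget in proportion to each class's weakness,
    a toy stand-in for Thinker-A2CA's closed-loop policy.
    """
    labels = list(per_class_f1)
    weakness = np.array([1.0 - per_class_f1[c] for c in labels])
    if weakness.sum() == 0:
        return {c: 0 for c in labels}  # nothing to fix; synthesize nothing
    alloc = np.round(budget * weakness / weakness.sum()).astype(int)
    return dict(zip(labels, alloc.tolist()))

# A diagnoser weak on "crackle" receives most of the synthesis budget.
schedule = curriculum_step({"wheeze": 0.9, "crackle": 0.5}, budget=60)
```

In a full loop, the generator would produce the scheduled samples, the diagnoser would be retrained, and the next round's F1 scores would drive the next allocation.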
The experiments are comprehensive, evaluating the proposed system against multiple benchmarks and comparing it with existing methods. The results demonstrate significant improvements in diagnostic accuracy and robustness, particularly in handling class imbalance and data scarcity. The use of a strict cross-domain evaluation protocol adds rigor to the assessment of generalization capabilities. The paper also includes detailed ablation studies that validate the contributions of various components of the system, further strengthening the findings.
The authors have made significant efforts to ensure reproducibility by providing access to the code and dataset. The detailed descriptions of the architecture, training procedures, and evaluation metrics contribute to the transparency of the research. However, the reliance on LLMs and the complexity of the system may pose challenges for complete replication without adequate computational resources.
One limitation is the dependency on the quality and diversity of the Resp-229k dataset, which may affect the generalizability of the findings. Additionally, while the paper addresses class imbalance, the performance on extremely rare conditions may still be limited. The complexity of the system could also hinder its practical deployment in clinical settings without further validation.
The proposed framework has the potential to significantly advance the field of respiratory sound analysis and diagnosis, offering a robust tool for clinicians to improve diagnostic accuracy and support medical education. The integration of generative modeling with diagnostic capabilities could lead to more effective training datasets and enhance the understanding of respiratory diseases. However, ethical considerations regarding the use of AI in clinical decision-making must be addressed.
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuning based models like Pengi with the global semantic representations of contrastive pretraining methods like CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding additional training. Extensive experiments across 11 diverse audio datasets demonstrate that MUKA achieves state-of-the-art performance among training-free methods and even surpasses training-based adapters in several scenarios, offering a compelling balance between adaptability and efficiency.
Primary: IMT Atlantique
All Institutions: IMT Atlantique, Polytechnique Montréal, Inria, University Rennes, IRISA, CNRS, Université de Montpellier
The paper presents MUKA, a novel multi-kernel adaptation framework for audio-language models that enhances few-shot learning efficiency and performance. This work significantly contributes to the field by addressing the challenges of adapting large models to audio tasks, demonstrating both theoretical and practical advancements in multimodal learning.
The methodology proposed in MUKA is innovative as it introduces a multi-kernel product approach that effectively combines the strengths of different audio-language models, specifically Pengi and CLAP. This combination allows for a more nuanced representation of audio data, capturing both fine-grained details and broader semantic contexts. The theoretical grounding in kernel methods adds robustness to the approach, and the avoidance of additional training enhances its practicality in few-shot scenarios. However, the paper could benefit from a more detailed explanation of the kernel design choices and how they were empirically validated.
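The product-kernel construction described above can be sketched as follows. The kernel choice (shifted cosine similarity) and the nearest-class-mean decision rule are illustrative assumptions, not MUKA's exact formulation; the key property, that an elementwise product of two positive-semidefinite kernels is itself a valid kernel, does carry over.

```python
import numpy as np

def cosine_kernel(X, Y):
    """Cosine-similarity kernel between rows of X and Y, shifted into [0, 1]."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return (Xn @ Yn.T + 1.0) / 2.0

def product_kernel_predict(train_local, train_global, labels,
                           test_local, test_global):
    """Training-free few-shot prediction with a product of two kernels.

    train_local/test_local: embeddings from a fine-grained model (Pengi-style).
    train_global/test_global: embeddings from a contrastive model (CLAP-style).
    """
    K = (cosine_kernel(test_local, train_local)
         * cosine_kernel(test_global, train_global))
    # Nearest-class-mean in kernel space: average similarity per class, argmax.
    classes = np.unique(labels)
    scores = np.stack([K[:, labels == c].mean(axis=1) for c in classes], axis=1)
    return classes[np.argmax(scores, axis=1)]
```

Because the product is small only when either kernel is small, a test sample must agree with a class in both the local and the global space to score highly, which is the intuition behind combining the two representations.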
The experiments are extensive, covering 11 diverse audio datasets, which demonstrates the versatility of the proposed method. The results indicate that MUKA achieves state-of-the-art performance among training-free methods and competes well with training-based methods. The use of cross-validation and clear reporting of accuracy metrics strengthens the experimental rigor. However, the paper lacks a discussion on the statistical significance of the results, which would provide a clearer picture of the performance improvements.
The paper outlines the experimental setup and methodology sufficiently to allow for reproducibility. It mentions the use of specific datasets and the pre-trained models employed, along with the computational resources used for experiments. However, the absence of a public code repository or demo limits the ease of reproducibility for other researchers.
One limitation is the reliance on existing models (Pengi and CLAP) without exploring the potential for developing new models tailored specifically for audio-language tasks. Additionally, while the paper claims efficiency, it does not provide a detailed computational complexity analysis of MUKA compared to other methods. The scope of datasets, while diverse, may not cover all potential audio-language applications, which could limit the generalizability of the findings.
The implications of this work are significant for the field of audio processing and multimodal learning. By improving few-shot adaptation in audio-language models, MUKA could facilitate advancements in applications such as audio classification, emotion recognition, and sound event detection. The proposed methodology could also inspire further research into kernel methods and their applications in other domains, potentially leading to more efficient machine learning models.
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show that agent systems currently lead in reasoning quality, leveraging iterative tool orchestration and cross-modal analysis, while single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Alibaba Group, Carnegie Mellon University, Microsoft Corporation, Queen Mary University of London, Shanghai Jiao Tong University
The paper introduces the Audio Reasoning Challenge and the MMAR-Rubrics, marking a pivotal advancement in evaluating audio reasoning models by emphasizing the quality of reasoning processes. This comprehensive analysis highlights the innovative methodology, robust experimental design, and significant implications for the field of explainable audio intelligence.
The paper presents a well-structured methodology for evaluating audio reasoning models through the introduction of the MMAR-Rubrics, which emphasizes the quality of reasoning chains rather than just final answers. This is a significant shift in evaluation paradigms, addressing the limitations of existing benchmarks that focus primarily on accuracy. The dual-track design allows for a comprehensive exploration of both end-to-end models and agent-based systems, providing insights into different architectural approaches. The use of instance-level evaluation criteria enhances the reliability and stability of the assessment process.
The experimental setup is robust, with a large number of participants (156 teams from 18 countries) demonstrating significant interest and engagement in the challenge. The results indicate a clear performance differentiation between agent systems and single models, with detailed analyses of top-performing systems providing valuable insights into effective strategies. The use of rigorous evaluation metrics, including reliability and human alignment studies, strengthens the credibility of the findings.
The paper provides sufficient details regarding the evaluation protocols and the challenge design, including the release of the MMAR-Rubrics benchmark data and evaluation scripts. However, the reproducibility of the models themselves may be limited due to the proprietary nature of some systems and the lack of detailed descriptions of their architectures and training processes.
One limitation is the potential variability in the quality of the reasoning paths generated by different models, which may not be fully captured by the evaluation metrics. Additionally, the reliance on LLMs for scoring may introduce biases or inconsistencies, although the authors have taken steps to mitigate this through their instance-level rubric approach. The challenge also does not address the scalability of the proposed evaluation methods to more complex real-world scenarios.
The findings from this research have significant implications for the development of explainable AI in audio processing, particularly in applications requiring robust reasoning capabilities, such as automated transcription services, audio analysis for accessibility, and interactive audio agents. By focusing on the reasoning process, this work contributes to enhancing the transparency and trustworthiness of AI systems in critical domains.
In sequence-to-sequence Transformer ASR, autoregressive (AR) models achieve strong accuracy but suffer from slow decoding, while non-autoregressive (NAR) models enable parallel decoding at the cost of degraded performance. We propose a principled NAR ASR framework based on Masked Diffusion Models to reduce this gap. A pre-trained speech encoder is coupled with a Transformer diffusion decoder conditioned on acoustic features and partially masked transcripts for parallel token prediction. To mitigate the training-inference mismatch, we introduce Iterative Self-Correction Training that exposes the model to its own intermediate predictions. We also design a Position-Biased Entropy-Bounded Confidence-based sampler to further boost results. Experiments across multiple benchmarks demonstrate consistent gains over prior NAR models and competitive performance with strong AR baselines, while retaining parallel decoding efficiency.
Primary: Georgia Institute of Technology
All Institutions: Georgia Institute of Technology, Università degli Studi di Palermo
The paper presents MDM-ASR, a novel approach that leverages masked diffusion models for efficient and accurate non-autoregressive automatic speech recognition. The integration of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and speech processing.
The proposed MDM-ASR framework innovatively integrates masked diffusion models into the ASR domain, addressing the limitations of autoregressive and traditional non-autoregressive models. The use of Iterative Self-Correction Training (ISCT) to align training with inference is a significant methodological advancement, as it allows the model to learn from its own predictions, thereby enhancing robustness. The introduction of Position-Biased Entropy-Bounded Confidence-based samplers further refines the decoding process, showcasing a well-thought-out approach to improving efficiency and accuracy.
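The iterative parallel decoding that such samplers refine can be sketched generically. The code below shows plain confidence-based unmasking with a toy left-to-right positional bias; it is not the paper's exact Position-Biased Entropy-Bounded sampler, and `logits_fn`, `MASK`, and the bias schedule are assumptions for illustration.

```python
import numpy as np

MASK = -1  # sentinel for masked positions

def iterative_unmask(logits_fn, length, steps=4, pos_bias=0.1):
    """Generic confidence-based parallel decoding sketch.

    logits_fn(tokens) -> (length, vocab) array of logits for every position.
    Each round, the highest-confidence masked positions are committed, with a
    small positional bias favoring earlier positions.
    """
    tokens = np.full(length, MASK, dtype=int)
    per_step = max(1, length // steps)
    while (tokens == MASK).any():
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        conf = probs.max(axis=1) - pos_bias * np.arange(length) / length
        masked = np.flatnonzero(tokens == MASK)
        commit = masked[np.argsort(conf[masked])[::-1][:per_step]]
        tokens[commit] = probs[commit].argmax(axis=1)
    return tokens
```

Because several positions are committed per round, decoding finishes in roughly `steps` forward passes instead of one pass per token, which is the efficiency NAR models trade against AR accuracy.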
The experiments are comprehensive, covering multiple English and multilingual datasets, and the results demonstrate that MDM-ASR outperforms existing models in both accuracy and decoding efficiency. The ablation studies provide valuable insights into the contributions of various components, reinforcing the robustness of the findings. However, the reliance on specific datasets may limit the generalizability of the results.
The paper provides sufficient details regarding the experimental setup, including model architecture and training procedures, which enhances reproducibility. However, the absence of publicly available code or a demo limits the practical reproducibility of the results.
The paper acknowledges limitations in terms of dataset diversity and the need for further exploration of alternative model configurations. Additionally, the evaluation is primarily based on benchmark datasets, which may not fully capture real-world performance across varied conditions.
The advancements in ASR technology presented in this paper have significant implications for real-time applications, such as virtual assistants and transcription services, where efficiency and accuracy are paramount. The proposed methods could pave the way for more scalable and effective ASR systems across different languages and domains.
This paper highlights the critical importance of multi-channel speech enhancement (MCSE) for speech emotion recognition (ER) in cocktail party scenarios. A multi-channel speech dereverberation and separation front-end integrating DNN-WPE and mask-based MVDR is used to extract the target speaker's speech from the mixture speech, before being fed into the downstream ER back-end using HuBERT- and ViT-based speech and visual features. Experiments on mixture speech constructed using the IEMOCAP and MSP-FACE datasets suggest the MCSE output consistently outperforms domain fine-tuned single-channel speech representations produced by: a) Conformer-based metric GANs; and b) WavLM SSL features with optional SE-ER dual task fine-tuning. Statistically significant increases in weighted, unweighted accuracy and F1 measures by up to 9.5%, 8.5% and 9.1% absolute (17.1%, 14.7% and 16.0% relative) are obtained over the above single-channel baselines. The generalization of IEMOCAP-trained MCSE front-ends is also demonstrated via zero-shot application to out-of-domain MSP-FACE data.
Primary: Institute of Software, Chinese Academy of Sciences
All Institutions: Institute of Software, Chinese Academy of Sciences, National Research Council Canada, The Chinese University of Hong Kong
This paper makes a significant contribution by demonstrating the effectiveness of multi-channel speech enhancement techniques for improving emotion recognition in challenging acoustic environments. The innovative methodology and strong experimental results highlight its potential impact on future research and applications in the field.
The paper presents a robust methodology that integrates multi-channel speech enhancement (MCSE) with emotion recognition (ER) in cocktail party scenarios. The use of a DNN-WPE based dereverberation and mask-based MVDR separation front-end is innovative, particularly in its application to ER, which has traditionally relied on single-channel inputs. The integration of HuBERT and ViT for feature extraction further enhances the approach, making it suitable for both audio-only and audio-visual ER systems. The detailed ablation studies provide insights into the contributions of each component, showcasing a comprehensive understanding of the problem space.
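For reference, mask-based MVDR front-ends of this kind commonly use the Souden formulation (the paper's exact variant may differ): spatial covariance matrices are estimated from time-frequency masks and the beamformer weights follow in closed form,

```latex
\Phi_{SS}(f) = \frac{\sum_t m_S(t,f)\,\mathbf{y}(t,f)\,\mathbf{y}(t,f)^{\mathsf{H}}}{\sum_t m_S(t,f)},
\qquad
\mathbf{w}(f) = \frac{\Phi_{NN}(f)^{-1}\,\Phi_{SS}(f)}{\operatorname{tr}\!\left(\Phi_{NN}(f)^{-1}\,\Phi_{SS}(f)\right)}\,\mathbf{u},
```

where $\mathbf{y}(t,f)$ is the multi-channel STFT, $m_S$ is the estimated speech mask ($\Phi_{NN}$ is computed analogously from a noise mask), and $\mathbf{u}$ is a one-hot vector selecting the reference microphone.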
The experiments are well-structured, utilizing two established datasets (IEMOCAP and MSP-FACE) to evaluate the proposed system's performance. The results demonstrate statistically significant improvements in accuracy and F1 scores compared to single-channel baselines, indicating the effectiveness of the MCSE approach. The zero-shot application of the MCSE front-end to out-of-domain data is particularly noteworthy, suggesting good generalization capabilities.
The paper provides sufficient details regarding the experimental setup, including model configurations and training strategies, which enhances reproducibility. However, the absence of a publicly available code repository may hinder full reproducibility for other researchers.
While the paper addresses a significant gap in the literature, it does not explore the potential computational costs and real-time applicability of the proposed MCSE front-end in practical scenarios. Additionally, the reliance on simulated data for training may limit the model's performance in real-world applications.
The findings of this research have the potential to significantly advance the field of emotion recognition in noisy environments, particularly in applications such as human-computer interaction, assistive technologies, and surveillance systems. The integration of multi-channel processing could lead to more robust systems capable of understanding human emotions in complex auditory scenes.
Self-supervised speech models (S3Ms) are known to encode rich phonetic information, yet how this information is structured remains underexplored. We conduct a comprehensive study across 96 languages to analyze the underlying structure of S3M representations, with particular attention to phonological vectors. We first show that there exist linear directions within the model's representation space that correspond to phonological features. We further demonstrate that the scale of these phonological vectors correlates with the degree of acoustic realization of their corresponding phonological features in a continuous manner. For example, the difference between [d] and [t] yields a voicing vector: adding this vector to [p] produces [b], while scaling it results in a continuum of voicing. Together, these findings indicate that S3Ms encode speech using phonologically interpretable and compositional vectors, demonstrating phonological vector arithmetic. All code and interactive demos are available at https://github.com/juice500ml/phonetic-arithmetic.
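The [d]−[t] voicing-vector example can be reproduced on toy embeddings. The vectors below are illustrative stand-ins for mean-pooled S3M phone representations, chosen so that the arithmetic works out exactly; real embeddings would come from a model such as HuBERT.

```python
import numpy as np

# Toy phone embeddings standing in for mean-pooled S3M frame features.
phones = {
    "t": np.array([1.0, 0.0]),  # voiceless alveolar stop
    "d": np.array([1.0, 1.0]),  # voiced alveolar stop
    "p": np.array([0.0, 0.0]),  # voiceless bilabial stop
    "b": np.array([0.0, 1.0]),  # voiced bilabial stop
}

def nearest_phone(vec, inventory):
    """Return the phone whose embedding has highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    return max(inventory, key=lambda ph: cos(vec, inventory[ph]))

# A "voicing vector": difference between a voiced/voiceless minimal pair.
voicing = phones["d"] - phones["t"]

# Adding it to [p] lands near [b]; scaling it traces a voicing continuum.
print(nearest_phone(phones["p"] + voicing, phones))  # prints "b"
half_voiced = phones["p"] + 0.5 * voicing  # partway along the continuum
```

The paper's claim is that such directions actually emerge in learned S3M spaces and that their scale tracks acoustic voicing continuously, which the toy construction only mimics.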
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the demonstration that self-supervised speech models encode phonologically interpretable and compositional vectors, revealing a structured representation of phonological features. This work significantly advances the understanding of S3M representations and their potential applications in speech technology and linguistics.
The paper presents a novel approach to understanding self-supervised speech models (S3Ms) by investigating the linear structure of phonological features within their representation space. The methodology involves analyzing phonological vectors across 96 languages, establishing a framework for phonological vector arithmetic. The use of cosine similarity to evaluate phonological analogies and the introduction of a vocoder to assess the scaling of phonological vectors are innovative aspects that enhance the understanding of S3M representations.
The experiments are well-structured, utilizing two datasets (TIMIT and VoxAngeles) to validate the hypotheses regarding phonological vector arithmetic and scaling. The results demonstrate a strong correlation between the scale of phonological vectors and acoustic measurements, providing empirical support for the proposed theories. The analysis of phonological features across different languages adds to the robustness of the findings, although the paper could benefit from a broader range of S3Ms to validate the generalizability of the results.
The authors have made their code and interactive demos publicly available, which is a positive aspect for reproducibility. However, the paper could improve by providing more detailed implementation specifics, particularly regarding the training of the vocoder and the exact configurations used for the S3Ms.
The study is limited by its focus on a specific set of phonological features as defined by PanPhon, which may not capture the full complexity of phonological systems across all languages. Additionally, the results are influenced by the choice of vocoder, and the authors acknowledge that different vocoders may yield varying synthesis results. The paper also notes that it does not explore all possible S3Ms, which could limit the generalizability of the findings.
The findings have significant implications for both speech processing and linguistic theory. By demonstrating that S3Ms can learn interpretable phonological structures, the research opens avenues for more intuitive speech synthesis and understanding of phonological features as continuous rather than binary. This could enhance applications in speech recognition, synthesis, and language learning technologies.
Voice-based digital biomarkers can enable scalable, non-invasive screening and monitoring of Parkinson's disease (PD) and Amyotrophic Lateral Sclerosis (ALS). However, models trained on one cohort or device often fail on new acquisition settings due to cross-device and cross-cohort domain shift. This challenge is amplified in real-world scenarios with partial-label mismatch, where datasets may contain different disease labels and only partially overlap in class space. In addition, voice-based models may exploit demographic cues, raising concerns about gender-related unfairness, particularly when deployed across heterogeneous cohorts. To tackle these challenges, we propose a hybrid framework for unified three-class (healthy/PD/ALS) cross-domain voice classification from partially overlapping cohorts. The method combines style-based domain generalization with conditional adversarial alignment tailored to partial-label settings, reducing negative transfer. An additional adversarial gender branch promotes gender-invariant representations. We conduct a comprehensive evaluation across four heterogeneous sustained-vowel datasets, spanning distinct acquisition settings and devices, under both domain generalization and unsupervised domain adaptation protocols. The proposed approach is compared against twelve state-of-the-art machine learning and deep learning methods, and further evaluated through three targeted ablations, providing the first cross-cohort benchmark and end-to-end domain-adaptive framework for unified healthy/PD/ALS voice classification under partial-label mismatch and fairness constraints. Across all experimental settings, our method consistently achieves the best external generalization over the considered evaluation metrics, while maintaining reduced gender disparities. Notably, no competing method shows statistically significant gains in external performance.
Primary: Ecole Polytechnique Federale de Lausanne
All Institutions: Ecole Polytechnique Federale de Lausanne, Università Campus Bio-Medico di Roma, Eustema S.p.A., Umeå University, UniCamillus-Saint Camillus International University of Health Sciences
The paper presents a novel framework for voice classification of Parkinson's and ALS that effectively addresses challenges of domain adaptation and fairness. The comprehensive evaluation and innovative methodology contribute significantly to the fields of machine learning and medical AI.
The proposed FairPDA framework integrates multiple advanced techniques such as style-based domain generalization, conditional adversarial alignment, and adversarial gender debiasing. This hybrid approach is well-structured and addresses the complex problem of partial-label domain adaptation while considering fairness in voice classification tasks. The methodology is innovative in its combination of techniques and is well-justified through the literature review, although it could benefit from clearer explanations of the specific contributions of each component.
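Adversarial branches like the gender-debiasing head described here conventionally rely on gradient reversal: the adversary is trained to predict the sensitive attribute, while its flipped gradient pushes the encoder toward attribute-invariant features. A minimal from-scratch sketch of the mechanism (not the authors' implementation, which would live inside an autograd framework):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lam in the backward pass.

    This is the standard mechanism behind adversarial invariance branches:
    the adversary learns to predict the attribute, but the reversed gradient
    teaches the upstream encoder to discard the cue the adversary exploits.
    """
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_adversary):
        # Flip and scale the adversary's gradient before it reaches the encoder.
        return -self.lam * grad_from_adversary
```

The coefficient `lam` trades off invariance against task accuracy and is typically annealed during training.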
The experiments are comprehensive, utilizing four heterogeneous datasets and comparing the proposed method against twelve state-of-the-art approaches. The evaluation metrics are appropriate for the task, including Balanced Accuracy, Matthews Correlation Coefficient, and fairness metrics. The results demonstrate that FairPDA consistently outperforms competing methods, although the absolute performance levels are moderate, indicating the difficulty of the task.
The paper provides sufficient details regarding the methodology, datasets, and evaluation metrics, which facilitates reproducibility. However, the absence of a publicly available code repository or demo limits the ease with which others can replicate the findings.
The study is limited by its focus on binary gender labels for fairness analysis, which restricts the scope of the fairness evaluation. Additionally, the performance metrics indicate that while FairPDA outperforms competitors, the overall accuracy remains moderate, suggesting that further improvements are needed for practical deployment.
The research has significant implications for the field of medical AI, particularly in the development of voice-based diagnostic tools for neurodegenerative diseases. By addressing fairness and domain adaptation, this work contributes to the ethical deployment of AI in healthcare, potentially leading to better patient outcomes across diverse populations.
In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.
Primary: NTT Corporation
All Institutions: NTT Corporation
The paper presents MeanVoiceFlow, a novel one-step nonparallel voice conversion model that significantly enhances conversion speed and efficiency. The technical contributions, particularly in addressing training stability and maintaining consistency between training and inference, are well-founded and have the potential to influence future work in voice conversion and related audio applications.
The proposed MeanVoiceFlow model introduces a novel approach to voice conversion by utilizing mean flows instead of traditional instantaneous velocities, which significantly enhances the speed and efficiency of the conversion process. The introduction of a structural margin reconstruction loss addresses training instability, while the conditional diffused-input training method effectively bridges the gap between training and inference, ensuring consistency in performance. The methodology is well-structured, with clear theoretical foundations and practical implementations that are rigorously justified.
The experimental validation is thorough, employing a variety of datasets and metrics to assess the model's performance. The results demonstrate that MeanVoiceFlow achieves performance on par with existing multi-step and distillation-based models, showcasing its effectiveness even when trained from scratch. The use of both objective and subjective evaluation metrics strengthens the credibility of the findings, although further details on the statistical significance of the results would enhance the robustness of the claims.
The paper provides sufficient implementation details, including the architecture of the neural networks and the training procedures, which should facilitate reproducibility. However, the absence of code availability or a public repository could hinder independent verification of the results. Including a clear description of the experimental setup and hyperparameters is beneficial, yet a shared codebase would greatly enhance reproducibility.
One limitation of the study is the reliance on specific datasets, which may affect the generalizability of the results to other voice conversion tasks or languages. Additionally, while the model performs well in zero-shot scenarios, its performance in more complex voice conversion tasks involving diverse accents or languages remains to be evaluated. The potential for over-smoothing in outputs due to the structural margin reconstruction loss also warrants further investigation.
The advancements presented in this paper have significant implications for real-time voice conversion applications, such as in virtual assistants, gaming, and entertainment. The ability to convert voices quickly and effectively without extensive pretraining could democratize access to high-quality voice synthesis technologies. Furthermore, the methodologies introduced may inspire future research in related fields, such as speech synthesis and audio processing.
Flow matching and diffusion bridge models have emerged as leading paradigms in generative speech enhancement, modeling stochastic processes between paired noisy and clean speech signals based on principles such as flow matching, score matching, and Schrödinger bridge. In this paper, we present a framework that unifies existing flow and diffusion bridge models by interpreting them as constructions of Gaussian probability paths with varying means and variances between paired data. Furthermore, we investigate the underlying consistency between the training/inference procedures of these generative models and conventional predictive models. Our analysis reveals that each sampling step of a well-trained flow or diffusion bridge model optimized with a data prediction loss is theoretically analogous to executing predictive speech enhancement. Motivated by this insight, we introduce an enhanced bridge model that integrates an effective probability path design with key elements from predictive paradigms, including improved network architecture, tailored loss functions, and optimized training strategies. Experiments on denoising and dereverberation tasks demonstrate that the proposed method outperforms existing flow and diffusion baselines with fewer parameters and reduced computational complexity. The results also highlight that the inherently predictive nature of this generative framework imposes limitations on its achievable upper-bound performance.
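As a sketch of the unifying view described in the abstract (notation ours; the paper's exact schedules are not reproduced here): both model families can be read as defining a Gaussian path between the clean signal $x_0$ and the noisy signal $x_1$,

```latex
p_t(x_t \mid x_0, x_1) = \mathcal{N}\!\left(x_t;\; \alpha_t x_0 + \beta_t x_1,\; \sigma_t^2 I\right),
\qquad
(\alpha_0, \beta_0, \sigma_0) = (1, 0, 0), \quad (\alpha_1, \beta_1, \sigma_1) = (0, 1, 0),
```

so that particular choices of the schedules $(\alpha_t, \beta_t, \sigma_t)$ recover flow-matching, score-matching, and Schrödinger-bridge constructions as special cases.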
Primary: Nanjing University
All Institutions: Nanjing University
The main contribution of this paper is the introduction of a unified framework for flow and diffusion bridge models in speech enhancement, which enhances performance through innovative methodologies and insights. This work significantly advances the field by bridging generative and predictive modeling approaches, offering a comprehensive solution to challenges in speech enhancement.
The paper presents a unified framework that integrates flow matching and diffusion bridge models for speech enhancement, providing a novel interpretation of these models as Gaussian probability paths. The methodology is robust, combining theoretical insights with practical improvements in network architecture and training strategies. The introduction of a time embedding mechanism and an enhanced loss function demonstrates a thoughtful approach to optimizing performance while reducing complexity.
The experiments are well-structured, utilizing two datasets for denoising and dereverberation tasks. The results show a clear performance advantage over existing baselines, with comprehensive metrics that validate the effectiveness of the proposed model. The ablation studies further strengthen the findings by isolating the impact of various modifications.
The paper includes sufficient implementation details, including dataset descriptions, training configurations, and hyperparameter settings, which enhance reproducibility. The availability of code on GitHub supports this aspect, allowing other researchers to replicate the experiments.
While the proposed model shows significant improvements, the authors acknowledge that its inherently predictive nature may impose an upper limit on performance compared to purely predictive models. Additionally, the reliance on specific architectures may limit generalizability to other tasks or domains.
The research has potential applications in various speech processing tasks, including real-time communication systems, hearing aids, and assistive technologies for the hearing impaired. The integration of predictive paradigms into generative models could inspire further innovations in speech enhancement and related fields.
Audio-text retrieval is crucial for bridging acoustic signals and natural language. While contrastive dual-encoder architectures like CLAP have shown promise, they are fundamentally limited by the capacity of small-scale encoders. Specifically, the text encoders struggle to understand complex queries that require reasoning or world knowledge. In this paper, we propose AuroLA, a novel contrastive language-audio pre-training framework that re-purposes Multimodal Large Language Models (MLLMs) as a unified backbone for retrieval. Specifically, we make three contributions: (i) we construct a scalable data pipeline that curates diverse audio from multiple sources and generates multi-granular captions, ranging from long descriptions to structured tags, via automated annotation; (ii) we adapt an MLLM for retrieval by prompting it to summarize the audio/text input and using the hidden state of a special token as audio/text embeddings. For model training, we devise a novel Hybrid-NCE loss, which employs multi-granular supervision and hard-negative reweighting to robustly align audio with diverse textual supervision; and (iii) we design an MLLM-based bidirectional re-ranking module that refines retrieval candidates through deep cross-modal interaction. Extensive experiments demonstrate that AuroLA consistently outperforms state-of-the-art models, including the recent PE-AV, while utilizing only approximately 1% of PE-AV's training data. Lastly, we observe clear scaling trends regarding dataset size and model capacity, validating the effectiveness of MLLM as a unified backbone for audio-text retrieval. Code is available at https://github.com/Jazzcharles/AuroLA.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of AuroLA, a novel framework that effectively utilizes Multimodal Large Language Models for audio-text retrieval, demonstrating significant improvements over existing methods. The comprehensive analysis of the technical contributions, innovative methodology, and promising experimental results highlight its potential impact on the field of machine learning and audio processing.
The proposed AuroLA framework introduces a novel approach to audio-text retrieval by leveraging Multimodal Large Language Models (MLLMs) as a unified backbone. The methodology is well-structured, with a focus on creating a scalable data pipeline and a Hybrid-NCE loss that enhances the alignment of audio and text embeddings through multi-granular supervision. The adaptation of MLLMs for retrieval tasks is innovative, particularly the use of a special token's hidden state for embeddings. However, the paper could benefit from a more detailed explanation of the implementation of the Hybrid-NCE loss and its advantages over traditional contrastive losses.
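The paper's exact Hybrid-NCE formulation is not reproduced in this summary, but a generic symmetric InfoNCE with hard-negative reweighting, in the spirit described, might look like the following sketch. The function name, the exponential tilt, and the parameter `beta` are our illustrative choices, not the paper's:

```python
import numpy as np

def info_nce_hard_neg(audio_emb, text_emb, tau=0.07, beta=0.5):
    """Symmetric InfoNCE with hard-negative up-weighting (illustrative only,
    not the paper's exact Hybrid-NCE). Rows are embeddings; matched
    audio/text pairs share a row index."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau                     # (B, B) similarity matrix
    n = logits.shape[0]
    # Up-weight harder (higher-similarity) negatives via an exponential tilt;
    # the positive (diagonal) keeps unit weight.
    neg_mask = 1.0 - np.eye(n)
    weights = np.exp(beta * logits) * neg_mask + np.eye(n)

    def weighted_ce(lg, w):
        # Row-wise -log( exp(pos) / sum_j w_j * exp(lg_j) ), stabilized.
        m = lg.max(axis=1, keepdims=True)
        log_den = np.log((w * np.exp(lg - m)).sum(axis=1)) + m[:, 0]
        return (log_den - np.diag(lg)).mean()

    return 0.5 * (weighted_ce(logits, weights) + weighted_ce(logits.T, weights.T))
```

With matched pairs the loss is near zero, while shuffled pairings are penalized heavily, which is the behavior a retrieval objective needs.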
The experiments conducted are extensive, demonstrating the superiority of AuroLA over existing state-of-the-art models, including PE-AV, while using significantly less training data. The results are compelling, showcasing clear scaling trends that validate the proposed framework. However, the paper lacks a thorough comparison with a broader range of models and datasets, which could provide a more comprehensive understanding of AuroLA's performance across different scenarios.
The paper mentions that code is available on GitHub, which is a positive aspect for reproducibility. However, the paper does not provide sufficient implementation details or hyperparameter settings that would allow other researchers to easily replicate the experiments. A more detailed supplementary material or appendix could enhance reproducibility.
One limitation is the reliance on the quality and diversity of the audio data curated for training, which may affect the generalizability of the model. Additionally, while the use of MLLMs is innovative, the computational cost associated with training and deploying such models could be a barrier to practical applications. The paper also does not address potential biases in the data or the model's performance across different languages or dialects.
The implications of this research are significant, particularly in applications such as multimedia search engines, accessibility tools for the hearing impaired, and content-based audio retrieval systems. By improving audio-text retrieval capabilities, this work could enhance user experiences in various domains, including education, entertainment, and information retrieval.
Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can contain multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Adobe Research, OpenAI
This paper introduces AudioChat, a pioneering framework for multi-source audio storytelling, editing, and understanding, which utilizes innovative methodologies to advance the field of audio processing in machine learning. The comprehensive evaluation of its technical contributions, methodology, and implications for future research underscores its significance in the domain.
The paper presents a novel framework, AudioChat, which integrates audio generation, editing, and understanding through a unified model. The methodology leverages a tool-calling agent, AudioCopilot, to synthesize training data through simulated user interactions, which is innovative in addressing the data scarcity issue in complex audio scene processing. The introduction of the Audio Transfusion Forcing objective is a significant advancement, allowing the model to perform structured reasoning and multi-turn interactions effectively. The architecture employs a continuous audio tokenizer and a multi-modal language model, which are well-justified and contribute to the model's performance.
The experiments are comprehensive, evaluating AudioChat against various baselines across multiple tasks including storytelling, editing, and understanding. The use of novel evaluation metrics like multiFLAM and editFLAM provides a more nuanced assessment of the model's capabilities compared to traditional metrics. The results indicate that AudioChat outperforms existing models, demonstrating its effectiveness in handling complex audio tasks. However, the paper could benefit from more detailed comparisons with a broader range of existing methods.
The authors provide ample details regarding the training data, hyperparameters, and methodology, which supports reproducibility. However, the proprietary nature of some training data may limit full replication of the results. The paper does a commendable job of outlining the architecture and training process, allowing for potential implementation by other researchers.
One limitation is the reliance on synthetic data generated by AudioCopilot, which may not capture the full diversity of real-world audio scenarios. Additionally, while the model shows promise, its performance in edge cases or highly nuanced audio tasks remains to be thoroughly evaluated. The potential ethical implications of audio generation technologies, such as misuse for impersonation, are acknowledged but not deeply explored.
The development of AudioChat has significant implications for various applications in multimedia, including film, gaming, and virtual reality, where immersive audio storytelling is crucial. The ability to generate and edit complex audio scenes could enhance user experiences in these domains. However, the potential for misuse in creating deceptive audio content raises ethical concerns that need to be addressed by the research community.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and a text-to-speech system in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimal look-ahead size for each input token, the proposed model can consider future context for each token, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
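The chunk-wise processing with a guaranteed look-ahead window can be illustrated with a minimal sketch. This only shows how input tokens are partitioned for streaming; the paper's Conformer-CTC encoder and label decoding are not reproduced, and the function name and parameters are ours:

```python
def stream_chunks(tokens, chunk_size=4, look_ahead=2):
    """Chunk-wise streaming with a guaranteed look-ahead window (illustrative
    sketch, not the paper's implementation). Each chunk is paired with the
    future context available to its tokens: the last token of a chunk sees
    at least `look_ahead` future tokens; earlier tokens see more."""
    out = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        future = tokens[start + chunk_size:start + chunk_size + look_ahead]
        # A real model would run the encoder on chunk + future and emit
        # PnP labels only for the frames aligned with `chunk`.
        out.append((chunk, future))
    return out
```

The final chunk simply has less (or no) look-ahead, which is where stability near utterance boundaries matters.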
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
Primary: Stanford University
All Institutions: Stanford University, SCB 10X, OpenAthena, University of Southern California, University of Cambridge
The main contribution of this paper is the introduction of SODA, a scalable audio foundation model that effectively integrates semantic, acoustic, and text tokens, providing a comprehensive framework for advancing audio modeling. This work significantly enhances the understanding of scaling laws in audio models and sets a foundation for future innovations in the field.
The methodology presented in the paper is robust and systematic, focusing on the design choices that influence the performance of audio foundation models. The authors thoroughly investigate various aspects, including data sources, text mixture ratios, and token composition, which are critical for optimizing model performance. The introduction of the SODA model, which integrates semantic, acoustic, and text tokens, represents a significant advancement in audio modeling. The use of next-token prediction at scale is a novel approach that extends the capabilities of existing models.
The paper includes a comprehensive empirical evaluation, particularly through the IsoFLOP analysis that examines scaling laws for discrete audio models. The authors provide extensive experimentation across 64 models, which is a commendable effort to validate their findings. The results indicate that optimal data grows faster than model size, which is a valuable insight for future research in this area. However, the paper could benefit from more detailed comparisons with existing models beyond the scaling predictions.
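The "data grows 1.6x faster than model size" finding can be made concrete under a standard power-law ansatz. The sketch below assumes compute-optimal fits of the form $N_{\text{opt}} \propto C^a$ and $D_{\text{opt}} \propto C^b$ with $a + b = 1$ (which follows if compute scales as $C \propto N D$) and $b = 1.6\,a$; the constants and function name are our placeholders, and only the 1.6x exponent ratio comes from the paper:

```python
def compute_optimal(C, ratio=1.6, k_n=1.0, k_d=1.0):
    """Split a FLOP budget C between model size N and data D under the
    power-law ansatz N_opt = k_n * C**a, D_opt = k_d * C**b with a + b = 1
    and b = ratio * a. Illustrative only: k_n, k_d are placeholders, and
    only the 1.6x exponent ratio is taken from the paper's IsoFLOP fit."""
    a = 1.0 / (1.0 + ratio)
    b = ratio / (1.0 + ratio)
    return k_n * C ** a, k_d * C ** b
```

Doubling compute under this fit therefore buys proportionally more data than parameters, which matches the paper's qualitative conclusion.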
While the authors mention establishing a validated training recipe, the paper lacks specific implementation details that would facilitate reproducibility. Providing access to code or detailed hyperparameter settings would enhance the paper's contribution to the community and allow for independent verification of results.
One limitation is the reliance on a specific architecture for the SODA model, which may not generalize well to all audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large-scale models in real-world applications. The scaling law findings, while insightful, may also be context-dependent and require further validation across diverse datasets.
The implications of this research are significant, as it opens up new avenues for audio generation and cross-modal tasks, such as speech-to-speech translation. The ability to model semantic content alongside acoustic details can enhance applications in various domains, including entertainment, accessibility, and communication technologies. The findings could influence future research directions and encourage the development of more sophisticated audio models.
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
Primary: Ghent University
All Institutions: Ghent University, Fraunhofer IEE, University of Kassel
The paper presents a significant advancement in audio self-supervised learning through the introduction of Convex Gated Probing and the Better Audio Transformer, addressing critical gaps in evaluation methodologies and model performance. The comprehensive experimental validation and emphasis on reproducibility enhance its contributions to the field.
The paper introduces Convex Gated Probing (CGP), a novel probing method that leverages a gating mechanism to efficiently utilize all frozen layers of audio SSL models. This approach addresses the limitations of existing probing techniques, which often fail to capture the full potential of audio embeddings. The methodology is well-structured, presenting a clear rationale for the design choices and improvements made to the SSL pipeline, leading to the development of the Better Audio Transformer (BAT). The integration of CGP into the SSL framework is innovative and shows promise in enhancing model evaluation and performance.
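The "convex gated" aspect can be sketched in isolation. The snippet below shows only a softmax gate producing a convex combination of frozen per-layer embeddings, which is what makes the gate weights interpretable as layer relevance; the paper's prototype-based head is not reproduced, and the function name and shapes are our assumptions:

```python
import numpy as np

def convex_gate(layer_feats, gate_logits):
    """Convex combination of frozen per-layer embeddings (illustrative
    sketch of the gating idea only; not the paper's full CGP head).
    layer_feats: (L, D) embeddings, one row per frozen layer, for one clip.
    gate_logits: (L,) learnable scores. The softmax yields non-negative
    weights summing to one, so the pooled feature is a convex combination
    and the weights expose where task-relevant information lives."""
    w = np.exp(gate_logits - gate_logits.max())
    w = w / w.sum()
    return w @ layer_feats, w
```

A linear or prototype-based classifier would then be trained on the pooled feature while the backbone stays frozen.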
The experiments are comprehensive, demonstrating the effectiveness of BAT across various audio benchmarks. The authors provide detailed comparisons against state-of-the-art models, showcasing significant performance improvements in both frozen-feature probing and fine-tuning scenarios. The results are well-documented, with sufficient statistical rigor to support the claims made regarding the superiority of BAT over existing models.
The authors emphasize the importance of reproducibility and provide a new PyTorch implementation to facilitate this. However, the paper mentions challenges in replicating results from existing models, which raises questions about the reliability of previous benchmarks. The authors' efforts to standardize methodologies and hyperparameters contribute positively to the reproducibility aspect, although the lack of a public code repository limits accessibility.
One limitation noted is the reliance on the specific architecture of the Better Audio Transformer, which may not generalize across different audio tasks or datasets. Additionally, while the CGP method shows promise, its effectiveness in more complex audio scenarios or with other model architectures remains to be validated. The paper also acknowledges the challenges of hyperparameter sensitivity in fine-tuning, which could affect the generalizability of results.
The advancements presented in this work have the potential to significantly impact the field of self-supervised audio representation learning. By improving the evaluation methods and model architectures, the research could lead to more efficient and accessible audio models, reducing computational overhead and fostering innovation in audio-related applications. The focus on reproducibility and transparency also aligns with broader efforts to enhance the reliability of machine learning research.
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach for generating high-quality audio extensions and morphs using Diffusion Transformers and a variant of classifier-free guidance. The technical contributions are significant, addressing real-world challenges faced by sound designers and demonstrating promising results through rigorous evaluation.
The methodology presented in this paper is robust and innovative, leveraging Diffusion Transformers and a novel Audio Prompt Guidance technique to effectively extend and morph audio. The authors provide a clear description of their approach, including the masking function and the fine-tuning strategy using the Noise Floor Dataset to mitigate hallucinations. However, while the methodology is well-structured, it could benefit from a more detailed exploration of the limitations of the masking function and guidance techniques in varying audio contexts.
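The masked-guidance mechanic can be illustrated with a minimal sketch: standard classifier-free guidance applied only to the latent positions that must be generated, with the reference region left under the conditional prediction. This is a generic illustration of guidance on masked latents; the paper's Audio Prompt Guidance variant differs in detail, and the function name and `scale` default are ours:

```python
import numpy as np

def guided_velocity(v_cond, v_uncond, mask, scale=3.0):
    """Classifier-free guidance restricted to masked latent positions
    (illustrative; not the paper's exact Audio Prompt Guidance). `mask` is
    1 where audio must be generated (extension/morph region) and 0 where
    the reference latents are kept; guidance is applied only to the
    generated region, leaving the reference region conditionally predicted."""
    guided = v_uncond + scale * (v_cond - v_uncond)
    return mask * guided + (1.0 - mask) * v_cond
```

With `scale=1.0` the update reduces to the plain conditional prediction everywhere, so the guidance scale trades prompt fidelity against sample diversity only inside the masked region.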
The experimental evaluation is comprehensive, employing both objective metrics (Fréchet Audio Distance) and subjective listener tests to validate the effectiveness of the proposed model. The use of a large dataset for training and the careful selection of evaluation clips from sound design professionals enhances the credibility of the results. However, the paper could improve by including more diverse audio samples and comparing against a broader range of existing methods.
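For reference, FAD is the Fréchet distance between Gaussians fitted to embedding sets of real and generated audio; the embedding extraction step (e.g. VGGish features in the standard FAD setup) is assumed to have happened upstream of this function:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: (n_samples, dim) arrays of audio embeddings.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    diff = mu_a - mu_b
    # Matrix square root of the covariance product; small imaginary
    # components from numerical error are discarded.
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))
```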
The paper provides sufficient detail on the architecture, training process, and evaluation metrics, which aids in reproducibility. However, the absence of specific code or model weights limits the ease with which other researchers can replicate the results. Including a GitHub repository or similar resource would significantly enhance reproducibility.
The paper acknowledges the potential for hallucinations in generated audio, particularly with stationary sounds, and discusses the trade-off between reducing hallucinations and maintaining fidelity to the original prompts. However, it does not thoroughly address how the model performs with non-stationary sounds or in complex soundscapes, which could be a significant limitation for practical applications.
The proposed model has the potential to significantly impact the field of sound design by automating tedious tasks and enhancing the creative process for sound designers. The ability to generate high-quality audio extensions and morphs could streamline workflows in various industries, including film, gaming, and virtual reality. Furthermore, the methodology could inspire future research in generative audio models and their applications in other domains.
This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with fewer channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of FOA ambisonics data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings conditioned on the FOA data. Experimental results showed that SIRUP achieved a significant improvement over FOA systems for steering vector upmixing, source localization, and speech denoising.
Primary: unknown
All Institutions: JSPS KAKENHI, JST FOREST, ANR Project SAROUMANE
The main contribution of this paper is the introduction of SIRUP, a novel diffusion-based approach for enhancing spatial audio representation from FOA to HOA, which addresses critical limitations in existing methods and demonstrates significant improvements in sound source localization and speech denoising. The methodology is innovative, and the experimental results are promising, indicating a strong potential impact on the field of audio processing and machine listening.
The proposed SIRUP method innovatively integrates a variational autoencoder (VAE) with a latent diffusion model to enhance steering vector upmixing from first-order ambisonics (FOA) to higher-order ambisonics (HOA). This approach addresses the limitations of traditional methods by directly learning a latent representation of HOA data, conditioned on FOA inputs, which is a significant departure from the conventional cascaded analysis-rendering pipeline. The use of a composite loss function that combines cosine similarity with MSE is a thoughtful addition that likely contributes to the stability and performance of the model.
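A minimal sketch of such a composite loss, assuming an additive combination of MSE with a cosine-distance term and a weight `alpha` (the review only states that the two terms are combined; the exact weighting is an assumption):

```python
import numpy as np

def composite_loss(pred, target, alpha=0.5, eps=1e-8):
    """MSE plus weighted cosine distance between flattened embeddings.

    alpha: relative weight of the cosine term (illustrative default).
    """
    mse = np.mean((pred - target) ** 2)
    cos = np.sum(pred * target) / (
        np.linalg.norm(pred) * np.linalg.norm(target) + eps
    )
    return mse + alpha * (1.0 - cos)
```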
The experimental setup is robust, utilizing simulated room impulse responses to evaluate the performance of SIRUP across various conditions, including different signal-to-noise ratios and reverberation times. The metrics chosen for evaluation, such as beamwidth and directivity index, are appropriate for assessing the quality of the upmixed steering vectors. The results indicate that SIRUP significantly outperforms FOA systems, demonstrating its effectiveness in sound source localization and speech denoising.
While the paper provides a detailed description of the methodology, including model architecture and training procedures, it lacks explicit links to code repositories or supplementary materials that would facilitate reproducibility. The absence of a publicly available implementation may hinder other researchers from validating the findings.
One limitation is the reliance on simulated data, which may not fully capture the complexities of real-world acoustic environments. Additionally, the paper does not address the scalability of the method to larger microphone arrays or the potential computational costs associated with training the diffusion model.
The implications of this research are significant for machine listening applications, particularly in augmented reality, robotics, and autonomous systems, where accurate spatial audio representation is crucial. By improving the spatial resolution of sound source localization and enhancing speech denoising, SIRUP could lead to advancements in user experience and system performance in these domains.
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline that generates semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning benchmarks, respectively.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park, Adobe Research, OpenAI
The main contribution of this paper is the development of TAC, a model that produces temporally grounded audio captions with low hallucination rates, significantly advancing the state of audio understanding. This work addresses critical shortcomings in existing models and presents a robust framework for future research in audio and audio-visual reasoning.
The paper introduces the Timestamped Audio Captioner (TAC) and its extension TAC-V, which leverage a synthetic data pipeline to create temporally grounded audio descriptions. The methodology is innovative, utilizing a dynamic acoustic mixer to generate complex audio mixtures with precise temporal annotations, addressing the limitations of traditional audio captioning methods that often rely on sparse annotations. The approach of separating the audio captioning task from reasoning tasks through a cascade with a text-only LLM is particularly noteworthy, allowing for independent scaling and improved performance.
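The dynamic-mixer idea can be sketched as follows; the function name, the random-placement policy, and the flat additive mixing are illustrative simplifications, not the paper's actual pipeline:

```python
import numpy as np

def mix_events(sources, total_len, sr=16_000, seed=0):
    """Toy dynamic acoustic mixer: place each labeled source clip at a
    random offset on a silent canvas and record ground-truth
    (label, onset_s, offset_s) tuples for timestamped supervision.

    sources: list of (label, clip) pairs, clip shorter than total_len.
    """
    rng = np.random.default_rng(seed)
    canvas = np.zeros(total_len, dtype=np.float32)
    annotations = []
    for label, clip in sources:
        start = int(rng.integers(0, total_len - len(clip)))
        canvas[start:start + len(clip)] += clip
        annotations.append((label, start / sr, (start + len(clip)) / sr))
    return canvas, annotations
```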
The experiments are comprehensive, comparing TAC against state-of-the-art models on multiple benchmarks, including MMAU-Pro, MMSU, and others. The results demonstrate significant improvements in temporal grounding and reduced hallucination rates, validating the effectiveness of the proposed methods. The ablation studies provide insights into the importance of various components of the model, further strengthening the findings.
The paper provides sufficient detail regarding the implementation, including the use of specific architectures (Qwen2-Audio) and training procedures (LoRA). However, the reliance on synthetic data may introduce challenges in replicating results in real-world scenarios, which could limit reproducibility.
The authors acknowledge limitations related to the synthetic data approach, including potential biases and a sim-to-real gap. Additionally, the model may struggle with fine-grained musical precision, which could affect its applicability in certain contexts.
The work has significant implications for improving the reliability of audio understanding systems, particularly in safety-critical applications and accessibility tools for the hearing impaired. However, the potential for misuse in surveillance contexts raises ethical considerations that must be addressed.
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to global variations of the input signal level: such variations strongly influence the embedding vectors at the encoder output and their quantization. This joint encoding is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal -- before the NAC encoder -- into a gain and a normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments on speech signals show that this general methodology, easily applicable to any NAC, enables a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
Primary: Inria at Univ. Grenoble Alpes
All Institutions: Inria at Univ. Grenoble Alpes, CNRS, LJK, Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab
The main contribution of this paper is the introduction of The Equalizer, a novel methodology that applies shape-gain decomposition to enhance the performance of neural audio codecs. This work bridges classical signal processing techniques with modern machine learning approaches, providing a significant advancement in the efficiency and robustness of audio coding systems.
The proposed methodology, The Equalizer, introduces a novel shape-gain decomposition approach to neural audio codecs (NACs), which is a significant departure from traditional methods that encode gain and shape jointly. The paper effectively integrates classical signal processing concepts into modern NAC frameworks, demonstrating a clear understanding of both domains. The methodology is well-structured, involving the decomposition of input signals into gain and shape vectors before encoding, and the subsequent reconstruction of the output signal. This approach not only enhances bitrate-distortion performance but also reduces complexity, making it a valuable contribution to the field.
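The core decomposition is simple to state in code. A minimal sketch, with the log-domain quantization step size as an illustrative assumption rather than the paper's actual setting:

```python
import numpy as np

def shape_gain_decompose(frames: np.ndarray, eps: float = 1e-8):
    """Split each short-term frame into a scalar gain (RMS) and a
    unit-energy shape vector, as in classical shape-gain coding.

    frames: (n_frames, frame_len) array of windowed signal frames.
    """
    gain = np.sqrt(np.mean(frames ** 2, axis=1, keepdims=True)) + eps
    shape = frames / gain          # the NAC encodes these shape vectors
    return gain.squeeze(-1), shape

def quantize_gain_db(gain, step_db=1.5):
    """Uniform scalar quantization of the log-gain; the gain index is
    what gets transmitted alongside the NAC bitstream."""
    g_db = 20.0 * np.log10(gain)
    return 10.0 ** (np.round(g_db / step_db) * step_db / 20.0)
```

At the decoder, the output is rebuilt as the (re-normalized) NAC output times the dequantized gain, which is exactly the inverse of the decomposition above.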
The experiments are robust, utilizing a substantial dataset (LibriSpeech) and comparing the proposed method against several state-of-the-art NACs. The evaluation metrics—STOI, PESQ, and SI-SDR—are appropriate for assessing audio quality and intelligibility. The results clearly demonstrate the advantages of the proposed method over traditional NACs, particularly in terms of robustness to gain variations and overall performance across different bitrates. The paper provides comprehensive experimental results that substantiate the claims made about the effectiveness of The Equalizer.
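For reference, SI-SDR (one of the reported metrics) projects the estimate onto the reference before computing the energy ratio, which makes it invariant to global rescaling of the estimate:

```python
import numpy as np

def si_sdr(est: np.ndarray, ref: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant SDR in dB: compare the energy of the projection
    of the estimate onto the reference against the residual energy."""
    ref = ref - ref.mean()
    est = est - est.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(noise, noise) + eps)
    )
```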
The paper includes detailed implementation details, including the training setup, evaluation metrics, and specific configurations used for the NACs. However, the lack of a publicly available project URL or demo limits the reproducibility of the results. Future work could benefit from making the code and models available to the community to facilitate further exploration and validation of the proposed methodology.
One limitation of the study is the focus on speech signals, which may not generalize to other audio types. Additionally, while the paper discusses the potential for future work, it does not explore the implications of the normalization on the embedding vectors in detail, which could be crucial for understanding the full impact of the proposed method.
The proposed methodology has significant implications for audio coding and compression, particularly in applications where efficient transmission and storage of audio data are critical, such as in telecommunications and streaming services. By improving the robustness and efficiency of NACs, this work could lead to better audio quality in various consumer and professional audio applications.
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
Primary: unknown
All Institutions: unknown
The paper presents a novel generative-first neural audio autoencoder that significantly improves encoding speed and compression efficiency while maintaining high reconstruction quality. This work is a meaningful contribution to the field of audio processing, addressing key limitations of existing models and opening avenues for practical applications in generative audio tasks.
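The "60-second mono signal compresses to 788 tokens" figure from the abstract can be checked with quick arithmetic, assuming 44.1 kHz audio (the sample rate is not stated in the abstract):

```python
import math

SAMPLE_RATE = 44_100  # assumption: CD-rate mono audio
DOWNSAMPLE = 3360     # temporal downsampling factor from the abstract
DURATION_S = 60

# 44_100 * 60 / 3360 = 787.5 samples per token slot, rounded up
tokens = math.ceil(SAMPLE_RATE * DURATION_S / DOWNSAMPLE)
print(tokens)  # 788, matching the quoted figure
```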
The paper introduces a generative-first architecture for audio autoencoding, which is a significant departure from the traditional reconstruction-first approach. The methodology is well-structured, with clear architectural modifications aimed at improving efficiency and flexibility. The use of efficient activations, early downsampling, and the incorporation of mel-spectrograms to capture high-frequency information are notable innovations. The post-training adaptation to support both continuous and discrete latents without retraining is particularly clever and enhances the model's applicability.
The experimental setup is robust, with thorough evaluations of speed, quality, and generative utility. The benchmarks against state-of-the-art codecs demonstrate the effectiveness of GenAE in achieving better compression and reconstruction quality. The use of multiple metrics (SI-SDR, STFT loss, mel-spectrogram L1 distance) adds credibility to the results. However, the absence of a clear comparison with a wider range of existing models could limit the perceived impact.
The paper provides detailed implementation specifics, including architecture choices, training configurations, and evaluation metrics, which are essential for reproducibility. However, the lack of accessible code or a demo limits the practical reproducibility of the results.
The paper does not address potential limitations in terms of the generalizability of the model across different audio types beyond instrumental music. Additionally, the computational resources required for training (8 A100 GPUs for a week) may not be accessible to all researchers, which could hinder broader adoption.
The advancements in audio autoencoding presented in this paper have the potential to significantly impact various applications, including music generation, audio compression, and real-time audio processing. The ability to handle multiple audio formats with a single model streamlines workflows and could lead to more efficient use of computational resources in audio-related tasks.
The main contribution of this paper is the introduction of a generative-first neural audio autoencoder that optimizes encoding speed and compression while maintaining high reconstruction quality across various audio formats. This work represents a significant advancement in the field of audio processing, addressing key limitations of existing models and paving the way for more efficient generative audio applications.
The proposed methodology introduces a generative-first architecture that significantly optimizes the encoding process for audio autoencoders. By focusing on architectural modifications such as efficient activations, early downsampling, and the integration of mel-spectrograms, the authors effectively address the limitations of existing reconstruction-first models. The approach to unify continuous and discrete latent representations through a post-training process is particularly innovative, allowing for greater flexibility in generative modeling. However, the paper could benefit from a clearer explanation of the theoretical underpinnings of some architectural choices, particularly the use of SnakeLite activations and their impact on performance.
The experiments are well-structured, comparing the proposed GenAE model against several state-of-the-art codecs. The use of multiple metrics (e.g., SI-SDR, PESQ-WB) to evaluate reconstruction quality and the real-time factor for speed assessment provides a comprehensive view of the model's performance. However, the paper lacks detailed descriptions of the datasets used for training and evaluation, which may affect the reproducibility and generalizability of the results. Additionally, the absence of a direct comparison with other generative models limits the contextual understanding of GenAE's advantages.
The paper provides sufficient details on the architecture and training setup, including hyperparameters and loss functions, which aids in reproducibility. However, the lack of publicly available code or datasets limits the ability for other researchers to replicate the results fully. The authors should consider releasing their model and training data to enhance reproducibility.
One limitation is the reliance on specific audio datasets, which may not fully represent the diversity of audio signals encountered in real-world applications. Additionally, while the model achieves impressive speed and compression rates, the trade-off between these factors and reconstruction quality in extreme cases is not thoroughly explored. The potential for overfitting due to the complexity of the model, especially with the extensive use of attention mechanisms, is another concern.
The advancements presented in this paper could significantly impact various applications in audio processing, including music generation, audio compression for streaming, and real-time audio manipulation. By enabling faster and more efficient audio encoding, the GenAE model could facilitate broader adoption of generative audio technologies in both commercial and research settings. The ability to handle multiple audio formats in a single model also simplifies deployment for developers.
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
Primary: Carleton University
All Institutions: Carleton University, Zendesk, Durham University, Salute Devices, MIRAI, Stanford University, Aarhus University, Indian Institute of Technology, Kharagpur, Harvard University, Capital One
The paper introduces the Massive Audio Embedding Benchmark (MAEB), a significant contribution to the field of audio machine learning that provides a comprehensive evaluation framework across diverse tasks and languages. The methodology and experimental results offer valuable insights into model performance, although further statistical analysis and detailed reproducibility guidelines would enhance its impact.
The methodology presented in the paper is robust, introducing a comprehensive benchmark (MAEB) that spans multiple audio tasks and languages. The authors provide a clear rationale for the selection of tasks and models, and the integration into the MTEB ecosystem is a significant step towards unified evaluation across modalities. However, the paper could benefit from a more detailed description of the benchmarking process and the specific metrics used for evaluation.
The experiments are extensive, evaluating over 50 models across 30 tasks. The results highlight the performance discrepancies between models trained for different audio tasks, which is a valuable insight for future research. However, the paper lacks a thorough statistical analysis of the results, which would strengthen the claims made regarding model performance.
The authors have committed to releasing code and a leaderboard, which is commendable and supports reproducibility. However, the paper should include more detailed instructions on how to replicate the experiments, including specific configurations and hyperparameters used for each model.
One limitation noted is the performance of models on clustering tasks, where even the best-performing model achieves only modest results. Additionally, the paper acknowledges the trade-offs between acoustic understanding and linguistic tasks, which may limit the applicability of certain models across all tasks.
The MAEB benchmark has the potential to significantly impact the field of audio machine learning by providing a standardized evaluation framework. This could lead to improved model development and encourage further research into multilingual and cross-modal audio tasks. The release of the benchmark also promotes collaboration and transparency in the research community.
Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The main contribution of this paper is the introduction of Resp-Agent, a multimodal framework that effectively synthesizes respiratory sounds and integrates clinical context for improved disease diagnosis. This work represents a significant advancement in the application of machine learning to healthcare, particularly in addressing the challenges of data scarcity and class imbalance in respiratory sound analysis.
The paper presents a novel agent-based framework, Resp-Agent, which integrates multimodal data (audio and EHR) for respiratory sound generation and diagnosis. The methodology is innovative, utilizing an Active Adversarial Curriculum Agent (Thinker-A$^2$CA) to dynamically identify weaknesses in diagnostics and schedule targeted synthesis. The Modality-Weaving Diagnoser and Flow Matching Generator are well-conceived to address the representation and data gaps in respiratory sound analysis. The use of large language models (LLMs) for generating clinical narratives and the careful design of the dataset (Resp-229k) enhance the robustness of the approach. However, while the methodology is sound, it heavily relies on the quality of the underlying data and the effectiveness of the LLMs used for synthesis.
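For orientation, the generic flow-matching recipe behind such a generator pairs a linearly interpolated training point with a constant-velocity regression target; the paper's conditioning on pathological content and acoustic style is not modeled in this sketch:

```python
import numpy as np

def flow_matching_target(x0, x1, t):
    """Conditional flow matching with a linear interpolant.

    x0: noise sample, x1: data sample, t in [0, 1].
    Returns the network input x_t and the velocity target it
    regresses toward (x1 - x0 is constant along the path).
    """
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target
```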
The experiments are comprehensive, evaluating the proposed system against multiple benchmarks and comparing it with existing methods. The results demonstrate significant improvements in diagnostic accuracy and robustness, particularly in handling class imbalance and data scarcity. The use of a strict cross-domain evaluation protocol adds rigor to the assessment of generalization capabilities. The paper also includes detailed ablation studies that validate the contributions of various components of the system, further strengthening the findings.
The authors have made significant efforts to ensure reproducibility by providing access to the code and dataset. The detailed descriptions of the architecture, training procedures, and evaluation metrics contribute to the transparency of the research. However, the reliance on LLMs and the complexity of the system may pose challenges for complete replication without adequate computational resources.
One limitation is the dependency on the quality and diversity of the Resp-229k dataset, which may affect the generalizability of the findings. Additionally, while the paper addresses class imbalance, the performance on extremely rare conditions may still be limited. The complexity of the system could also hinder its practical deployment in clinical settings without further validation.
The proposed framework has the potential to significantly advance the field of respiratory sound analysis and diagnosis, offering a robust tool for clinicians to improve diagnostic accuracy and support medical education. The integration of generative modeling with diagnostic capabilities could lead to more effective training datasets and enhance the understanding of respiratory diseases. However, ethical considerations regarding the use of AI in clinical decision-making must be addressed.
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
Primary: unknown
All Institutions: unknown
The paper presents LongAudio-RAG, a novel framework for event-grounded question answering over multi-hour audio, significantly advancing the capabilities of audio processing systems. The detailed methodology and experimental validation underscore its potential impact in the field of machine learning, particularly in audio-language integration and real-time analytics.
The methodology presented in the paper is robust and well-structured, introducing a hybrid framework that effectively combines audio grounding with large language models (LLMs) for long audio question answering. The use of SQL databases for structured event records and the detailed approach to temporal reference resolution and intent classification are commendable. The paper clearly outlines the steps taken to convert long audio into actionable data, which is a significant advancement in the field of audio processing and natural language understanding.
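The event-grounded retrieval step can be sketched in a few lines. This is a minimal illustration of the idea, not the paper's implementation: the table schema, event labels, and confidence threshold below are all illustrative assumptions.

```python
import sqlite3

# Timestamped acoustic event detections stored as structured records,
# standing in for the paper's Audio Grounding Model output (schema and
# labels are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (start_s REAL, end_s REAL, label TEXT, confidence REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (120.0, 123.5, "glass_break", 0.91),
        (4500.2, 4502.0, "dog_bark", 0.78),
        (7210.4, 7215.9, "siren", 0.85),
    ],
)

def retrieve_events(t0: float, t1: float, min_conf: float = 0.5):
    """Return detections overlapping [t0, t1], the constrained evidence
    that would be handed to the LLM instead of raw audio."""
    return conn.execute(
        "SELECT start_s, end_s, label, confidence FROM events "
        "WHERE end_s >= ? AND start_s <= ? AND confidence >= ? "
        "ORDER BY start_s",
        (t0, t1, min_conf),
    ).fetchall()

# "What happened during the second hour?" -> resolve to seconds, retrieve.
evidence = retrieve_events(3600.0, 7200.0)
print(evidence)  # [(4500.2, 4502.0, 'dog_bark', 0.78)]
```

Because only the rows overlapping the resolved time window reach the language model, the evidence stays small and verifiable regardless of the recording's length.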
The experimental evaluation is thorough, utilizing a synthetic long-audio benchmark that allows for controlled testing of the proposed system against various baselines, including RAG and text-to-SQL approaches. The results demonstrate a clear improvement in accuracy and response quality, validating the effectiveness of the proposed method. The use of both automated and human evaluations adds credibility to the findings.
The paper provides a detailed description of the implementation stack and methodologies used, which enhances reproducibility. However, the lack of a public repository or demo URL limits the ability for others to replicate the work fully. The modular service-oriented architecture described could facilitate reproducibility if made available.
The paper acknowledges limitations related to the accuracy of the Audio Grounding Model (AGM), which may affect downstream reasoning. Additionally, the synthetic nature of the benchmark may not fully capture the complexities of real-world audio environments, potentially limiting the generalizability of the results.
The proposed system has significant potential applications in various domains, including industrial monitoring, smart home technologies, and security systems. By enabling precise question answering over long audio recordings, it could enhance user interaction with audio data and improve operational efficiencies in many sectors.
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
Primary: unknown
All Institutions: unknown
The paper presents S-PRESSO, a diffusion autoencoder for ultra-low bitrate audio compression, achieving significant improvements in audio quality while maintaining high compression rates. This work highlights the potential of generative models to redefine audio compression standards, pushing the boundaries of what is achievable in the field.
The paper introduces S-PRESSO, a novel approach to audio compression utilizing a diffusion autoencoder framework. The methodology is well-structured, comprising a three-step training process that includes continuous diffusion autoencoder training, offline quantization, and diffusion decoder finetuning. This approach effectively leverages the generative capabilities of diffusion models to enhance audio quality at ultra-low bitrates. The use of pretrained models for both the latent encoder and the diffusion decoder is a strong point, as it allows for the incorporation of learned representations that can significantly improve the compression process. However, the paper could benefit from a more detailed explanation of the quantization process and its impact on the overall performance.
The experimental setup is robust, utilizing a diverse set of datasets that cover various audio types, which enhances the generalizability of the results. The authors provide a thorough comparison against both continuous and discrete baseline models, demonstrating significant improvements in audio quality metrics such as FAD, KAD, and Si-SDR. The subjective evaluation through MUSHRA tests adds credibility to the findings, although the paper does not discuss the statistical significance of the results in detail. Overall, the experiments convincingly support the claims made about the performance of S-PRESSO.
The paper includes sufficient implementation details, including training parameters and architecture specifications, which aids in reproducibility. However, the absence of publicly available code or models limits the ability of other researchers to replicate the results fully. The authors mention the use of specific datasets but do not provide access to these datasets, which could hinder reproducibility for others in the field.
One notable limitation is the focus on sound effects, which may restrict the applicability of the proposed method to other audio domains such as music or speech. Additionally, while the results are promising, the trade-off between compression rate and audio fidelity could be further explored, particularly at the lowest bitrates. The paper also acknowledges the need for improvements in inference speed, which is crucial for practical applications.
The advancements in ultra-low bitrate audio compression have significant implications for various applications, including gaming, virtual reality, and streaming services, where bandwidth is a critical concern. By shifting the focus from strict fidelity to acoustic similarity, this work opens new avenues for audio representation and synthesis, potentially enhancing user experiences in interactive media. The findings could also inspire further research into generative models for audio processing.
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuning based models like Pengi with the global semantic representations of contrastive pretraining methods like CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding additional training. Extensive experiments across 11 diverse audio datasets demonstrate that MUKA achieves state-of-the-art performance among training-free methods and even surpasses training-based adapters in several scenarios, offering a compelling balance between adaptability and efficiency.
Primary: IMT Atlantique
All Institutions: IMT Atlantique, Polytechnique Montréal, Inria, University Rennes, IRISA, CNRS, Université de Montpellier
The paper presents MUKA, a novel multi-kernel adaptation framework for audio-language models that enhances few-shot learning efficiency and performance. This work significantly contributes to the field by addressing the challenges of adapting large models to audio tasks, demonstrating both theoretical and practical advancements in multimodal learning.
The methodology proposed in MUKA is innovative as it introduces a multi-kernel product approach that effectively combines the strengths of different audio-language models, specifically Pengi and CLAP. This combination allows for a more nuanced representation of audio data, capturing both fine-grained details and broader semantic contexts. The theoretical grounding in kernel methods adds robustness to the approach, and the avoidance of additional training enhances its practicality in few-shot scenarios. However, the paper could benefit from a more detailed explanation of the kernel design choices and how they were empirically validated.
The experiments are extensive, covering 11 diverse audio datasets, which demonstrates the versatility of the proposed method. The results indicate that MUKA achieves state-of-the-art performance among training-free methods and competes well with training-based methods. The use of cross-validation and clear reporting of accuracy metrics strengthens the experimental rigor. However, the paper lacks a discussion on the statistical significance of the results, which would provide a clearer picture of the performance improvements.
The paper outlines the experimental setup and methodology sufficiently to allow for reproducibility. It mentions the use of specific datasets and the pre-trained models employed, along with the computational resources used for experiments. However, the absence of a public code repository or demo limits the ease of reproducibility for other researchers.
One limitation is the reliance on existing models (Pengi and CLAP) without exploring the potential for developing new models tailored specifically for audio-language tasks. Additionally, while the paper claims efficiency, it does not provide a detailed computational complexity analysis of MUKA compared to other methods. The scope of datasets, while diverse, may not cover all potential audio-language applications, which could limit the generalizability of the findings.
The implications of this work are significant for the field of audio processing and multimodal learning. By improving few-shot adaptation in audio-language models, MUKA could facilitate advancements in applications such as audio classification, emotion recognition, and sound event detection. The proposed methodology could also inspire further research into kernel methods and their applications in other domains, potentially leading to more efficient machine learning models.
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. Meanwhile, single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Alibaba Group, Carnegie Mellon University, Microsoft Corporation, Queen Mary University of London, Shanghai Jiao Tong University
The paper introduces the Audio Reasoning Challenge and the MMAR-Rubrics, marking a pivotal advancement in evaluating audio reasoning models by emphasizing the quality of reasoning processes. This comprehensive analysis highlights the innovative methodology, robust experimental design, and significant implications for the field of explainable audio intelligence.
The paper presents a well-structured methodology for evaluating audio reasoning models through the introduction of the MMAR-Rubrics, which emphasizes the quality of reasoning chains rather than just final answers. This is a significant shift in evaluation paradigms, addressing the limitations of existing benchmarks that focus primarily on accuracy. The dual-track design allows for a comprehensive exploration of both end-to-end models and agent-based systems, providing insights into different architectural approaches. The use of instance-level evaluation criteria enhances the reliability and stability of the assessment process.
The experimental setup is robust, with a large number of participants (156 teams from 18 countries) demonstrating significant interest and engagement in the challenge. The results indicate a clear performance differentiation between agent systems and single models, with detailed analyses of top-performing systems providing valuable insights into effective strategies. The use of rigorous evaluation metrics, including reliability and human alignment studies, strengthens the credibility of the findings.
The paper provides sufficient details regarding the evaluation protocols and the challenge design, including the release of the MMAR-Rubrics benchmark data and evaluation scripts. However, the reproducibility of the models themselves may be limited due to the proprietary nature of some systems and the lack of detailed descriptions of their architectures and training processes.
One limitation is the potential variability in the quality of the reasoning paths generated by different models, which may not be fully captured by the evaluation metrics. Additionally, the reliance on LLMs for scoring may introduce biases or inconsistencies, although the authors have taken steps to mitigate this through their instance-level rubric approach. The challenge also does not address the scalability of the proposed evaluation methods to more complex real-world scenarios.
The findings from this research have significant implications for the development of explainable AI in audio processing, particularly in applications requiring robust reasoning capabilities, such as automated transcription services, audio analysis for accessibility, and interactive audio agents. By focusing on the reasoning process, this work contributes to enhancing the transparency and trustworthiness of AI systems in critical domains.
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
Primary: unknown
All Institutions: unknown
The paper presents Bengali-Loop, a significant contribution to the field of speech technology for the Bengali language, providing essential benchmarks for long-form ASR and speaker diarization. The methodology is sound, and the technical contributions are likely to foster further advancements in this under-resourced area, although some limitations and areas for improvement remain.
The methodology presented in the paper is robust, focusing on the collection and verification of long-form ASR and speaker diarization datasets. The use of a human-in-the-loop approach for transcript verification enhances the quality of the data, addressing common pitfalls in automated transcription. The standardized evaluation protocols and formats provided are essential for reproducibility and future research. However, the paper could benefit from a more detailed discussion on the specific challenges encountered during data collection and annotation, as well as the rationale behind the chosen methodologies.
The experimental evaluation is thorough, with clear baselines established for both ASR and diarization tasks. The reported results, including WER and DER, provide a solid foundation for assessing the performance of the proposed benchmarks. However, the paper lacks a comparative analysis with existing benchmarks in other languages, which could further contextualize the results and demonstrate the significance of the contributions made.
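The WER metric underpinning the reported baselines is the standard word-level edit distance normalized by reference length; a minimal reference implementation (using plain dynamic programming, with no normalization or tokenization beyond whitespace splitting, which real evaluation protocols typically add) is:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / ref words,
    computed as edit distance over whitespace-split word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match / substitution
            )
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # one insertion over 3 words
```

CER follows the same recipe over characters instead of words, which is why standardized text normalization rules (as the paper provides) matter so much for comparable Bangla scores.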
The authors emphasize reproducibility by providing detailed descriptions of the data collection process, annotation guidelines, and evaluation protocols. They also plan to release scripts for standardizing audio and running baseline evaluations, which is commendable. However, the lack of a publicly available code repository limits the ease with which other researchers can reproduce the results.
The paper acknowledges several limitations, including the limited dialectal diversity of the datasets and the simplification of the diarization overlap policy. Additionally, the focus on specific types of media (e.g., Bangla drama) may not fully represent the diversity of spoken Bengali in other contexts. These limitations should be addressed in future work to enhance the applicability of the benchmarks.
The development of Bengali-Loop has significant implications for the advancement of speech technology in under-resourced languages. By providing high-quality datasets and standardized evaluation protocols, this work can facilitate further research and development in Bangla ASR and speaker diarization. The benchmarks can also serve as a foundation for community-driven efforts to improve speech technology for other low-resource languages, potentially leading to broader accessibility and inclusion in technology.
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka-Audio as a strong and practical baseline for lightweight audio understanding models.
Primary: Inner Mongolia University
All Institutions: Baidu Inc., College of Computer Science, Inner Mongolia University, Tsinghua Shenzhen International Graduate School, Tsinghua University
The main contribution of this paper is the introduction of Eureka-Audio, a compact audio language model that achieves competitive performance against much larger models while employing innovative techniques for audio understanding and data synthesis. This work represents a meaningful advancement in the field of audio processing, particularly in developing efficient models that maintain high performance.
The methodology presented in the paper is robust, featuring a unified end-to-end architecture that integrates a lightweight language backbone with a Whisper-based audio encoder and a Mixture-of-Experts (MoE) adapter. This approach effectively addresses audio heterogeneity and cross-modal optimization conflicts, which are common challenges in audio processing tasks. The introduction of the DataFlux pipeline for synthesizing and verifying audio instruction data is particularly innovative, as it enhances the model's ability to reason about paralinguistic features. The model's architecture is well-justified, and the combination of techniques appears to be a significant advancement in the field of audio language models.
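A sparsely activated MoE adapter of the kind described can be sketched as a top-k router over small expert networks. This is a generic illustration: the dimensions, expert count, k, and ReLU experts below are arbitrary choices, not Eureka-Audio's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 4, 2

# Router and expert parameters (random stand-ins for learned weights).
W_gate = rng.normal(size=(d_model, n_experts))
experts = [
    (rng.normal(size=(d_model, d_model)) * 0.02, np.zeros(d_model))
    for _ in range(n_experts)
]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_adapter(tokens):
    """tokens: (seq_len, d_model). Each token is routed to its top-k
    experts; outputs are mixed by the renormalized gate weights, so only
    k of n_experts run per token (the 'sparse activation')."""
    logits = tokens @ W_gate                              # (seq, n_experts)
    idx = np.argsort(logits, axis=-1)[:, -top_k:]         # top-k expert ids
    gates = softmax(np.take_along_axis(logits, idx, -1))  # renormalized
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        for e, g in zip(idx[t], gates[t]):
            W, b = experts[e]
            out[t] += g * np.maximum(token @ W + b, 0.0)  # ReLU expert MLP
    return out

y = moe_adapter(rng.normal(size=(5, d_model)))
print(y.shape)  # (5, 64)
```

The appeal under a tight parameter budget is that total capacity grows with the number of experts while per-token compute stays fixed at k expert evaluations, letting heterogeneous audio inputs (speech, music, environmental sound) route to specialized sub-networks.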
The experimental evaluation is comprehensive, covering a wide range of benchmarks including ASR, audio understanding, and dense audio captioning. The results demonstrate that Eureka-Audio outperforms or matches larger models, which is a significant achievement given its compact size of 1.7B parameters. The paper provides detailed comparisons with various baselines, and the metrics used for evaluation are appropriate and well-explained. However, the lack of real-world application scenarios in the experiments could limit the practical understanding of the model's performance.
The paper includes a project URL that suggests the availability of code and models, which is crucial for reproducibility. However, the paper does not provide extensive details on the training procedures, hyperparameters, or datasets used, which could hinder full reproducibility by other researchers. More transparency in these areas would enhance the paper's contribution to the community.
One limitation of the study is the potential overfitting to the benchmarks used for evaluation, as the model's performance is primarily reported on standard datasets. Additionally, the reliance on a closed-loop data synthesis approach may introduce biases or limitations in the quality of the generated data. The paper could also explore the model's performance in diverse real-world scenarios beyond the controlled benchmarks.
Eureka-Audio has the potential to significantly impact various applications in audio understanding, including accessibility technologies, voice-activated systems, and interactive AI agents. Its compact size makes it suitable for deployment in resource-constrained environments, which could broaden the accessibility of advanced audio processing capabilities. The advancements in paralinguistic reasoning could also lead to more nuanced interactions in human-computer communication.