Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present Resp-Agent, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A$^2$CA). Unlike static pipelines, Thinker-A$^2$CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a Modality-Weaving Diagnoser that weaves EHR data with audio tokens via Strategic Global Attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. To address the data gap, we design a Flow Matching Generator that adapts a text-only Large Language Model (LLM) via modality injection, decoupling pathological content from acoustic style to synthesize hard-to-diagnose samples. As a foundation for these efforts, we introduce Resp-229k, a benchmark corpus of 229k recordings paired with LLM-distilled clinical narratives. Extensive experiments demonstrate that Resp-Agent consistently outperforms prior approaches across diverse evaluation settings, improving diagnostic robustness under data scarcity and long-tailed class imbalance. Our code and data are available at https://github.com/zpforlove/Resp-Agent.
Primary: The Hong Kong University of Science and Technology (Guangzhou)
All Institutions: The Hong Kong University of Science and Technology (Guangzhou)
The main contribution of this paper is the introduction of Resp-Agent, a multimodal framework that effectively synthesizes respiratory sounds and integrates clinical context for improved disease diagnosis. This work represents a significant advancement in the application of machine learning to healthcare, particularly in addressing the challenges of data scarcity and class imbalance in respiratory sound analysis.
The paper presents a novel agent-based framework, Resp-Agent, which integrates multimodal data (audio and EHR) for respiratory sound generation and diagnosis. The methodology is innovative, utilizing an Active Adversarial Curriculum Agent (Thinker-A$^2$CA) to dynamically identify weaknesses in diagnostics and schedule targeted synthesis. The Modality-Weaving Diagnoser and Flow Matching Generator are well-conceived to address the representation and data gaps in respiratory sound analysis. The use of large language models (LLMs) for generating clinical narratives and the careful design of the dataset (Resp-229k) enhance the robustness of the approach. However, while the methodology is sound, it heavily relies on the quality of the underlying data and the effectiveness of the LLMs used for synthesis.
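To make the closed-loop scheduling concrete, the sketch below shows one way such a controller could allocate a synthesis budget toward under-performing classes; the function name and the inverse-recall heuristic are illustrative assumptions, not the authors' Thinker-A$^2$CA implementation.

```python
# Hypothetical sketch of a closed-loop curriculum controller in the spirit of
# Thinker-A^2CA: the heuristic and names are placeholders, not the authors' API.
import numpy as np

def curriculum_round(per_class_recall: dict, budget: int) -> dict:
    """Allocate a synthesis budget inversely proportional to class recall."""
    classes = list(per_class_recall)
    weakness = np.array([1.0 - per_class_recall[c] for c in classes])
    if weakness.sum() == 0:
        weakness = np.ones_like(weakness)          # nothing weak: sample uniformly
    probs = weakness / weakness.sum()
    counts = np.random.multinomial(budget, probs)  # targeted synthesis schedule
    return dict(zip(classes, counts.tolist()))

# Example: crackles are poorly recognized, so they receive most of the budget.
schedule = curriculum_round({"normal": 0.95, "wheeze": 0.80, "crackle": 0.55}, budget=200)
print(schedule)
```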
The experiments are comprehensive, evaluating the proposed system against multiple benchmarks and comparing it with existing methods. The results demonstrate significant improvements in diagnostic accuracy and robustness, particularly in handling class imbalance and data scarcity. The use of a strict cross-domain evaluation protocol adds rigor to the assessment of generalization capabilities. The paper also includes detailed ablation studies that validate the contributions of various components of the system, further strengthening the findings.
The authors have made significant efforts to ensure reproducibility by providing access to the code and dataset. The detailed descriptions of the architecture, training procedures, and evaluation metrics contribute to the transparency of the research. However, the reliance on LLMs and the complexity of the system may pose challenges for complete replication without adequate computational resources.
One limitation is the dependency on the quality and diversity of the Resp-229k dataset, which may affect the generalizability of the findings. Additionally, while the paper addresses class imbalance, the performance on extremely rare conditions may still be limited. The complexity of the system could also hinder its practical deployment in clinical settings without further validation.
The proposed framework has the potential to significantly advance the field of respiratory sound analysis and diagnosis, offering a robust tool for clinicians to improve diagnostic accuracy and support medical education. The integration of generative modeling with diagnostic capabilities could lead to more effective training datasets and enhance the understanding of respiratory diseases. However, ethical considerations regarding the use of AI in clinical decision-making must be addressed.
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuning based models like Pengi with the global semantic representations of contrastive pretraining methods like CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding additional training. Extensive experiments across 11 diverse audio datasets demonstrate that MUKA achieves state-of-the-art performance among training-free methods and even surpasses training-based adapters in several scenarios, offering a compelling balance between adaptability and efficiency.
Primary: IMT Atlantique
All Institutions: IMT Atlantique, Polytechnique Montréal, Inria, University Rennes, IRISA, CNRS, Université de Montpellier
The paper presents MUKA, a novel multi-kernel adaptation framework for audio-language models that enhances few-shot learning efficiency and performance. This work significantly contributes to the field by addressing the challenges of adapting large models to audio tasks, demonstrating both theoretical and practical advancements in multimodal learning.
The methodology proposed in MUKA is innovative as it introduces a multi-kernel product approach that effectively combines the strengths of different audio-language models, specifically Pengi and CLAP. This combination allows for a more nuanced representation of audio data, capturing both fine-grained details and broader semantic contexts. The theoretical grounding in kernel methods adds robustness to the approach, and the avoidance of additional training enhances its practicality in few-shot scenarios. However, the paper could benefit from a more detailed explanation of the kernel design choices and how they were empirically validated.
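As an illustration of the core idea, the following sketch combines two precomputed embedding spaces through an element-wise kernel product and scores queries against class supports; the cosine kernels and the mean-pooling rule are assumptions for exposition, not MUKA's exact estimator.

```python
# Minimal sketch of a multi-kernel product for training-free few-shot
# classification, assuming precomputed embeddings from two encoders
# (e.g., a Pengi-like local encoder and a CLAP-like global encoder).
import numpy as np

def cosine_kernel(X, Y):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def product_kernel_predict(q_a, s_a, q_b, s_b, support_labels, n_classes):
    """Element-wise product of two kernels, then class-wise score pooling."""
    K = cosine_kernel(q_a, s_a) * cosine_kernel(q_b, s_b)   # (n_query, n_support)
    scores = np.zeros((K.shape[0], n_classes))
    for c in range(n_classes):
        scores[:, c] = K[:, support_labels == c].mean(axis=1)
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
support_a, support_b = rng.normal(size=(8, 512)), rng.normal(size=(8, 1024))
query_a, query_b = rng.normal(size=(3, 512)), rng.normal(size=(3, 1024))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
print(product_kernel_predict(query_a, support_a, query_b, support_b, labels, n_classes=4))
```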
The experiments are extensive, covering 11 diverse audio datasets, which demonstrates the versatility of the proposed method. The results indicate that MUKA achieves state-of-the-art performance among training-free methods and competes well with training-based methods. The use of cross-validation and clear reporting of accuracy metrics strengthens the experimental rigor. However, the paper lacks a discussion on the statistical significance of the results, which would provide a clearer picture of the performance improvements.
The paper outlines the experimental setup and methodology sufficiently to allow for reproducibility. It mentions the use of specific datasets and the pre-trained models employed, along with the computational resources used for experiments. However, the absence of a public code repository or demo limits the ease of reproducibility for other researchers.
One limitation is the reliance on existing models (Pengi and CLAP) without exploring the potential for developing new models tailored specifically for audio-language tasks. Additionally, while the paper claims efficiency, it does not provide a detailed computational complexity analysis of MUKA compared to other methods. The scope of datasets, while diverse, may not cover all potential audio-language applications, which could limit the generalizability of the findings.
The implications of this work are significant for the field of audio processing and multimodal learning. By improving few-shot adaptation in audio-language models, MUKA could facilitate advancements in applications such as audio classification, emotion recognition, and sound event detection. The proposed methodology could also inspire further research into kernel methods and their applications in other domains, potentially leading to more efficient machine learning models.
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show that agent systems currently lead in reasoning quality, leveraging iterative tool orchestration and cross-modal analysis, while single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Alibaba Group, Carnegie Mellon University, Microsoft Corporation, Queen Mary University of London, Shanghai Jiao Tong University
The paper introduces the Audio Reasoning Challenge and the MMAR-Rubrics, marking a pivotal advancement in evaluating audio reasoning models by emphasizing the quality of reasoning processes. This comprehensive analysis highlights the innovative methodology, robust experimental design, and significant implications for the field of explainable audio intelligence.
The paper presents a well-structured methodology for evaluating audio reasoning models through the introduction of the MMAR-Rubrics, which emphasizes the quality of reasoning chains rather than just final answers. This is a significant shift in evaluation paradigms, addressing the limitations of existing benchmarks that focus primarily on accuracy. The dual-track design allows for a comprehensive exploration of both end-to-end models and agent-based systems, providing insights into different architectural approaches. The use of instance-level evaluation criteria enhances the reliability and stability of the assessment process.
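A toy example of instance-level rubric scoring is sketched below; the rubric fields and weighting scheme are hypothetical and do not reflect the released MMAR-Rubrics format.

```python
# Hypothetical illustration of instance-level rubric scoring: each test item
# carries its own checklist, and a judge (human or LLM) marks which criteria a
# reasoning chain satisfies. Field names and weights are assumptions.
def rubric_score(rubric: list, satisfied: set) -> float:
    """Weighted fraction of rubric criteria satisfied by a reasoning chain."""
    total = sum(item["weight"] for item in rubric)
    earned = sum(item["weight"] for item in rubric if item["id"] in satisfied)
    return earned / total if total else 0.0

rubric = [
    {"id": "identifies_two_speakers", "weight": 1.0},   # factuality criterion
    {"id": "orders_events_correctly", "weight": 1.0},   # temporal-logic criterion
    {"id": "conclusion_follows_evidence", "weight": 2.0},
]
print(rubric_score(rubric, satisfied={"identifies_two_speakers", "conclusion_follows_evidence"}))
```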
The experimental setup is robust, with a large number of participants (156 teams from 18 countries) demonstrating significant interest and engagement in the challenge. The results indicate a clear performance differentiation between agent systems and single models, with detailed analyses of top-performing systems providing valuable insights into effective strategies. The use of rigorous evaluation metrics, including reliability and human alignment studies, strengthens the credibility of the findings.
The paper provides sufficient details regarding the evaluation protocols and the challenge design, including the release of the MMAR-Rubrics benchmark data and evaluation scripts. However, the reproducibility of the models themselves may be limited due to the proprietary nature of some systems and the lack of detailed descriptions of their architectures and training processes.
One limitation is the potential variability in the quality of the reasoning paths generated by different models, which may not be fully captured by the evaluation metrics. Additionally, the reliance on LLMs for scoring may introduce biases or inconsistencies, although the authors have taken steps to mitigate this through their instance-level rubric approach. The challenge also does not address the scalability of the proposed evaluation methods to more complex real-world scenarios.
The findings from this research have significant implications for the development of explainable AI in audio processing, particularly in applications requiring robust reasoning capabilities, such as automated transcription services, audio analysis for accessibility, and interactive audio agents. By focusing on the reasoning process, this work contributes to enhancing the transparency and trustworthiness of AI systems in critical domains.
Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying upon distribution-based scoring. We highly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Adobe Research, OpenAI
This paper introduces AudioChat, a pioneering framework for multi-source audio storytelling, editing, and understanding, which utilizes innovative methodologies to advance the field of audio processing in machine learning. The comprehensive evaluation of its technical contributions, methodology, and implications for future research underscores its significance in the domain.
The paper presents a novel framework, AudioChat, which integrates audio generation, editing, and understanding through a unified model. The methodology leverages a tool-calling agent, AudioCopilot, to synthesize training data through simulated user interactions, which is innovative in addressing the data scarcity issue in complex audio scene processing. The introduction of the Audio Transfusion Forcing objective is a significant advancement, allowing the model to perform structured reasoning and multi-turn interactions effectively. The architecture employs a continuous audio tokenizer and a multi-modal language model, which are well-justified and contribute to the model's performance.
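The data-synthesis paradigm can be illustrated with a minimal sketch in which an agent emits multi-turn traces that are serialized as training records; the tool names and trace schema are placeholders rather than the AudioCopilot implementation.

```python
# Hypothetical sketch of the data-synthesis idea: a tool-calling agent plays
# both user and system, and the resulting multi-turn traces become training
# examples. Tool names and the trace format are illustrative only.
import json, random

TOOLS = ["tts_speaker", "sound_effect", "mix_tracks", "edit_segment"]

def simulate_dialogue(seed: int, n_turns: int = 3) -> dict:
    rng = random.Random(seed)
    turns = []
    for t in range(n_turns):
        instruction = f"user turn {t}: request involving {rng.choice(['rain', 'dialogue', 'footsteps'])}"
        calls = [{"tool": rng.choice(TOOLS), "args": {"turn": t}} for _ in range(rng.randint(1, 2))]
        turns.append({"instruction": instruction, "tool_calls": calls})
    return {"dialogue_id": seed, "turns": turns}

# Each simulated trace is serialized as one training record.
with open("simulated_dialogues.jsonl", "w") as f:
    for seed in range(5):
        f.write(json.dumps(simulate_dialogue(seed)) + "\n")
```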
The experiments are comprehensive, evaluating AudioChat against various baselines across multiple tasks including storytelling, editing, and understanding. The use of novel evaluation metrics like multiFLAM and editFLAM provides a more nuanced assessment of the model's capabilities compared to traditional metrics. The results indicate that AudioChat outperforms existing models, demonstrating its effectiveness in handling complex audio tasks. However, the paper could benefit from more detailed comparisons with a broader range of existing methods.
The authors provide ample details regarding the training data, hyperparameters, and methodology, which supports reproducibility. However, the proprietary nature of some training data may limit full replication of the results. The paper does a commendable job of outlining the architecture and training process, allowing for potential implementation by other researchers.
One limitation is the reliance on synthetic data generated by AudioCopilot, which may not capture the full diversity of real-world audio scenarios. Additionally, while the model shows promise, its performance in edge cases or highly nuanced audio tasks remains to be thoroughly evaluated. The potential ethical implications of audio generation technologies, such as misuse for impersonation, are acknowledged but not deeply explored.
The development of AudioChat has significant implications for various applications in multimedia, including film, gaming, and virtual reality, where immersive audio storytelling is crucial. The ability to generate and edit complex audio scenes could enhance user experiences in these domains. However, the potential for misuse in creating deceptive audio content raises ethical concerns that need to be addressed by the research community.
We propose CC-G2PnP, a streaming grapheme-to-phoneme and prosody (G2PnP) model that connects a large language model and text-to-speech in a streaming manner. CC-G2PnP is based on the Conformer-CTC architecture. Specifically, the input grapheme tokens are processed chunk by chunk, which enables streaming inference of phonemic and prosodic (PnP) labels. By guaranteeing a minimum look-ahead size for each input token, the proposed model can incorporate future context into every prediction, which leads to stable PnP label prediction. Unlike previous streaming methods that depend on explicit word boundaries, the CTC decoder in CC-G2PnP effectively learns the alignment between graphemes and phonemes during training, making it applicable to unsegmented languages. Experiments on a Japanese dataset, which has no explicit word boundaries, show that CC-G2PnP significantly outperforms the baseline streaming G2PnP model in the accuracy of PnP label prediction.
Primary: LY Corporation
All Institutions: LY Corporation
The paper presents CC-G2PnP, a novel streaming model for grapheme-to-phoneme and prosody conversion that addresses the challenges of unsegmented languages. Its innovative methodology and robust experimental results position it as a significant contribution to the field of audio processing and speech synthesis.
The proposed CC-G2PnP model employs a Conformer-CTC architecture that innovatively processes grapheme tokens in chunks, allowing for streaming inference of phonemic and prosodic labels. The introduction of minimum look-ahead (MLA) is a significant methodological advancement, as it addresses the limitations of previous streaming models that rely on explicit word boundaries. This approach is particularly beneficial for unsegmented languages like Japanese, where word boundaries are not clearly defined. The integration of self-conditioned CTC into the architecture further enhances the model's performance by allowing dynamic learning of alignments between graphemes and phonemes.
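The chunked-streaming idea with a guaranteed look-ahead can be sketched as follows; `predict_pnp` is a stand-in for the Conformer-CTC forward pass, and the chunk and look-ahead sizes are illustrative, not the paper's configuration.

```python
# Minimal sketch of chunk-wise streaming with a guaranteed look-ahead: a token
# is only emitted once `lookahead` future tokens have arrived, so every
# prediction sees at least that much right context.
def predict_pnp(context):                       # placeholder for the real model
    return [f"pnp({tok})" for tok in context]

def stream_g2pnp(grapheme_stream, chunk_size=4, lookahead=2):
    buffer, emitted = [], 0
    for tok in grapheme_stream:
        buffer.append(tok)
        # Emit a chunk once its last token has `lookahead` tokens of future context.
        while len(buffer) - emitted >= chunk_size + lookahead:
            context = buffer[: emitted + chunk_size + lookahead]
            yield predict_pnp(context)[emitted: emitted + chunk_size]
            emitted += chunk_size
    yield predict_pnp(buffer)[emitted:]          # flush the tail at end of stream

for labels in stream_g2pnp(list("こんにちは、世界のみなさん")):
    print(labels)
```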
The experiments conducted on a Japanese dataset demonstrate the effectiveness of CC-G2PnP, showing significant improvements in character error rate (CER) and sentence error rate (SER) compared to baseline models. The use of both objective metrics and subjective assessments of TTS naturalness provides a comprehensive evaluation of the model's performance. The dataset preparation and experimental conditions are well-documented, allowing for a clear understanding of the model's capabilities and limitations.
While the paper provides detailed descriptions of the model architecture and training procedures, the lack of a publicly available code repository or demo URL limits reproducibility. The absence of specific hyperparameters and training configurations in a readily accessible format could hinder other researchers from replicating the results.
One limitation noted is the reliance on a large amount of training data to achieve optimal performance, which may not be feasible for all applications. Additionally, while the model performs well in terms of accuracy, the subjective evaluation of TTS naturalness could vary based on the speaker used during testing, which may not generalize across different voices.
The CC-G2PnP model has the potential to significantly enhance text-to-speech systems, particularly for languages without explicit word boundaries. This could lead to more natural and efficient human-machine interactions in various applications, including virtual assistants, language learning tools, and accessibility technologies for the visually impaired. The advancements in streaming G2PnP could also inspire further research in related areas, such as real-time speech synthesis and multilingual processing.
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
Primary: Stanford University
All Institutions: Stanford University, SCB 10X, OpenAthena, University of Southern California, University of Cambridge
The main contribution of this paper is the introduction of SODA, a scalable audio foundation model that effectively integrates semantic, acoustic, and text tokens, providing a comprehensive framework for advancing audio modeling. This work significantly enhances the understanding of scaling laws in audio models and sets a foundation for future innovations in the field.
The methodology presented in the paper is robust and systematic, focusing on the design choices that influence the performance of audio foundation models. The authors thoroughly investigate various aspects, including data sources, text mixture ratios, and token composition, which are critical for optimizing model performance. The introduction of the SODA model, which integrates semantic, acoustic, and text tokens, represents a significant advancement in audio modeling. The use of next-token prediction at scale is a novel approach that extends the capabilities of existing models.
The paper includes a comprehensive empirical evaluation, particularly through the IsoFLOP analysis that examines scaling laws for discrete audio models. The authors provide extensive experimentation across 64 models, which is a commendable effort to validate their findings. The results indicate that optimal data grows faster than model size, which is a valuable insight for future research in this area. However, the paper could benefit from more detailed comparisons with existing models beyond the scaling predictions.
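The IsoFLOP procedure can be illustrated on synthetic optima: for each compute budget, take the loss-minimizing (N, D) pair under C ≈ 6ND and fit power laws to both; the numbers below are synthetic and only the roughly 1.6x exponent ratio mirrors the reported finding.

```python
# Illustrative IsoFLOP-style fit on synthetic optima (not the paper's data):
# fit power laws N_opt ∝ C^a and D_opt ∝ C^b across compute budgets C.
import numpy as np

budgets = np.logspace(18, 20, 5)                      # FLOPs budgets
a_true, b_true = 1 / 2.6, 1.6 / 2.6                   # exponents with a + b = 1
N_opt = 1e-4 * budgets ** a_true                      # synthetic per-budget optima
D_opt = budgets / (6 * N_opt)                         # enforce C = 6·N·D

a_fit = np.polyfit(np.log(budgets), np.log(N_opt), 1)[0]
b_fit = np.polyfit(np.log(budgets), np.log(D_opt), 1)[0]
print(f"model exponent ~ {a_fit:.2f}, data exponent ~ {b_fit:.2f}, ratio ~ {b_fit / a_fit:.1f}")
```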
While the authors mention establishing a validated training recipe, the paper lacks specific implementation details that would facilitate reproducibility. Providing access to code or detailed hyperparameter settings would enhance the paper's contribution to the community and allow for independent verification of results.
One limitation is the reliance on a specific architecture for the SODA model, which may not generalize well to all audio tasks. Additionally, the paper does not address potential biases in the training data or the implications of using large-scale models in real-world applications. The scaling law findings, while insightful, may also be context-dependent and require further validation across diverse datasets.
The implications of this research are significant, as it opens up new avenues for audio generation and cross-modal tasks, such as speech-to-speech translation. The ability to model semantic content alongside acoustic details can enhance applications in various domains, including entertainment, accessibility, and communication technologies. The findings could influence future research directions and encourage the development of more sophisticated audio models.
Probing is widely adopted in computer vision to faithfully evaluate self-supervised learning (SSL) embeddings, as fine-tuning may misrepresent their inherent quality. In contrast, audio SSL models still rely on fine-tuning because simple probing fails to unlock their full potential and alters their rankings when competing for SOTA on AudioSet. Hence, a robust and efficient probing mechanism is required to guide the trajectory of audio SSL towards reliable and reproducible methods. We introduce Convex Gated Probing (CGP), a prototype-based method that drastically closes the gap between fine-tuning and probing in audio. CGP efficiently utilizes all frozen layers via a gating mechanism and exposes the location of latent task-relevant information. Guided by CGP, we rework the entire SSL pipeline of current SOTA audio models that use legacy implementations of prior SSL methods. By refining data preprocessing, model architecture, and pre-training recipe, we introduce Better Audio Transformer (BAT), and establish new SOTA on audio benchmarks.
Primary: Ghent University
All Institutions: Ghent University, Fraunhofer IEE, University of Kassel
The paper presents a significant advancement in audio self-supervised learning through the introduction of Convex Gated Probing and the Better Audio Transformer, addressing critical gaps in evaluation methodologies and model performance. The comprehensive experimental validation and emphasis on reproducibility enhance its contributions to the field.
The paper introduces Convex Gated Probing (CGP), a novel probing method that leverages a gating mechanism to efficiently utilize all frozen layers of audio SSL models. This approach addresses the limitations of existing probing techniques, which often fail to capture the full potential of audio embeddings. The methodology is well-structured, presenting a clear rationale for the design choices and improvements made to the SSL pipeline, leading to the development of the Better Audio Transformer (BAT). The integration of CGP into the SSL framework is innovative and shows promise in enhancing model evaluation and performance.
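A simplified picture of gated probing is given below: a softmax gate forms a convex combination of frozen per-layer embeddings, and only the gate and a linear head are trained; this omits CGP's prototype formulation and is not the authors' implementation.

```python
# Simplified sketch of gated probing over frozen layers; the backbone is
# assumed frozen and only the gate and the linear head receive gradients.
import torch
import torch.nn as nn

class GatedProbe(nn.Module):
    def __init__(self, n_layers: int, dim: int, n_classes: int):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(n_layers))  # one weight per frozen layer
        self.head = nn.Linear(dim, n_classes)

    def forward(self, layer_embeddings: torch.Tensor) -> torch.Tensor:
        # layer_embeddings: (batch, n_layers, dim), detached from the frozen backbone
        weights = torch.softmax(self.gate_logits, dim=0)         # convex weights, sum to 1
        pooled = torch.einsum("l,bld->bd", weights, layer_embeddings)
        return self.head(pooled)

probe = GatedProbe(n_layers=12, dim=768, n_classes=527)
logits = probe(torch.randn(4, 12, 768))
print(logits.shape, torch.softmax(probe.gate_logits, dim=0).sum().item())
```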
The experiments are comprehensive, demonstrating the effectiveness of BAT across various audio benchmarks. The authors provide detailed comparisons against state-of-the-art models, showcasing significant performance improvements in both frozen-feature probing and fine-tuning scenarios. The results are well-documented, with sufficient statistical rigor to support the claims made regarding the superiority of BAT over existing models.
The authors emphasize the importance of reproducibility and provide a new PyTorch implementation to facilitate this. However, the paper mentions challenges in replicating results from existing models, which raises questions about the reliability of previous benchmarks. The authors' efforts to standardize methodologies and hyperparameters contribute positively to the reproducibility aspect, although the lack of a public code repository limits accessibility.
One limitation noted is the reliance on the specific architecture of the Better Audio Transformer, which may not generalize across different audio tasks or datasets. Additionally, while the CGP method shows promise, its effectiveness in more complex audio scenarios or with other model architectures remains to be validated. The paper also acknowledges the challenges of hyperparameter sensitivity in fine-tuning, which could affect the generalizability of results.
The advancements presented in this work have the potential to significantly impact the field of self-supervised audio representation learning. By improving the evaluation methods and model architectures, the research could lead to more efficient and accessible audio models, reducing computational overhead and fostering innovation in audio-related applications. The focus on reproducibility and transparency also aligns with broader efforts to enhance the reliability of machine learning research.
In audio-related creative tasks, sound designers often seek to extend and morph different sounds from their libraries. Generative audio models, capable of creating audio using examples as references, offer promising solutions. By masking the noisy latents of a DiT and applying a novel variant of classifier-free guidance on such masked latents, we demonstrate that: (i) given an audio reference, we can extend it both forward and backward for a specified duration, and (ii) given two audio references, we can morph them seamlessly for the desired duration. Furthermore, we show that by fine-tuning the model on different types of stationary audio data we mitigate potential hallucinations. The effectiveness of our method is supported by objective metrics, with the generated audio achieving Fréchet Audio Distances (FADs) comparable to those of real samples from the training data. Additionally, we validate our results through a subjective listener test, where subjects gave positive ratings to the proposed model generations. This technique paves the way for more controllable and expressive generative sound frameworks, enabling sound designers to focus less on tedious, repetitive tasks and more on their actual creative process.
Primary: unknown
All Institutions: unknown
The paper presents a novel approach for generating high-quality audio extensions and morphs using Diffusion Transformers and a variant of classifier-free guidance. The technical contributions are significant, addressing real-world challenges faced by sound designers and demonstrating promising results through rigorous evaluation.
The methodology presented in this paper is robust and innovative, leveraging Diffusion Transformers and a novel Audio Prompt Guidance technique to effectively extend and morph audio. The authors provide a clear description of their approach, including the masking function and the fine-tuning strategy using the Noise Floor Dataset to mitigate hallucinations. However, while the methodology is well-structured, it could benefit from a more detailed exploration of the limitations of the masking function and guidance techniques in varying audio contexts.
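The masked-guidance idea can be sketched as follows, assuming a standard classifier-free-guidance combination applied only over the region marked for generation; `denoise` is a placeholder for the DiT, and the update rule is a simplification rather than the paper's exact Audio Prompt Guidance variant.

```python
# Minimal sketch of masked-latent guidance for audio extension: the reference
# region of the latents is kept fixed while the masked region is updated with
# a classifier-free-guidance combination of conditional/unconditional outputs.
import numpy as np

def guided_step(latents, mask, denoise, guidance_scale=3.0):
    """mask == 1 where new audio is generated, 0 over the reference prompt."""
    v_cond = denoise(latents, conditioned=True)
    v_uncond = denoise(latents, conditioned=False)
    v = v_uncond + guidance_scale * (v_cond - v_uncond)   # classifier-free guidance
    return latents + mask * v                              # update only the masked region

rng = np.random.default_rng(0)
latents = rng.normal(size=(1, 64, 128))                    # (batch, time, channels)
mask = np.zeros_like(latents)
mask[:, 32:, :] = 1.0                                      # extend the second half
fake_denoise = lambda z, conditioned: -0.1 * z if conditioned else -0.05 * z
print(guided_step(latents, mask, fake_denoise).shape)
```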
The experimental evaluation is comprehensive, employing both objective metrics (Fréchet Audio Distance) and subjective listener tests to validate the effectiveness of the proposed model. The use of a large dataset for training and the careful selection of evaluation clips from sound design professionals enhances the credibility of the results. However, the paper could improve by including more diverse audio samples and comparing against a broader range of existing methods.
The paper provides sufficient detail on the architecture, training process, and evaluation metrics, which aids in reproducibility. However, the absence of specific code or model weights limits the ease with which other researchers can replicate the results. Including a GitHub repository or similar resource would significantly enhance reproducibility.
The paper acknowledges the potential for hallucinations in generated audio, particularly with stationary sounds, and discusses the trade-off between reducing hallucinations and maintaining fidelity to the original prompts. However, it does not thoroughly address how the model performs with non-stationary sounds or in complex soundscapes, which could be a significant limitation for practical applications.
The proposed model has the potential to significantly impact the field of sound design by automating tedious tasks and enhancing the creative process for sound designers. The ability to generate high-quality audio extensions and morphs could streamline workflows in various industries, including film, gaming, and virtual reality. Furthermore, the methodology could inspire future research in generative audio models and their applications in other domains.
Large Audio Language Models struggle to disentangle overlapping events in complex acoustic scenes, yielding temporally inconsistent captions and frequent hallucinations. We introduce Timestamped Audio Captioner (TAC), a model that produces temporally grounded audio descriptions at varying degrees of detail and resolution. TAC is trained with a synthetic data pipeline that constructs challenging and dynamic mixtures from real-world audio sources, enabling robust learning under realistic polyphonic conditions. Across event detection and dense captioning, TAC outperforms all competing methods, with a low hallucination rate and accurate temporal grounding. We also introduce TAC-V, an audio-visual pipeline to generate semantically rich audio-visual descriptions. We then show that TAC and TAC-V serve as a "semantic bridge" for a text-only reasoner: simple TAC$\rightarrow$LLM and TAC-V$\rightarrow$LLM cascades achieve state-of-the-art scores on benchmarks for audio (MMAU-Pro, MMSU, MMAR) and audio-visual (DailyOmni, VideoHolmes) understanding and reasoning, respectively.
Primary: University of Maryland, College Park
All Institutions: University of Maryland, College Park, Adobe Research, OpenAI
The main contribution of this paper is the development of TAC, a model that produces temporally grounded audio captions with low hallucination rates, significantly advancing the state of audio understanding. This work addresses critical shortcomings in existing models and presents a robust framework for future research in audio and audio-visual reasoning.
The paper introduces the Timestamped Audio Captioner (TAC) and its extension TAC-V, which leverage a synthetic data pipeline to create temporally grounded audio descriptions. The methodology is innovative, utilizing a dynamic acoustic mixer to generate complex audio mixtures with precise temporal annotations, addressing the limitations of traditional audio captioning methods that often rely on sparse annotations. The approach of separating the audio captioning task from reasoning tasks through a cascade with a text-only LLM is particularly noteworthy, allowing for independent scaling and improved performance.
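The dynamic mixing idea can be illustrated with a minimal sketch that places labeled clips at random onsets and emits the resulting timestamps as supervision; the sampling rate and annotation format are assumptions, not the TAC pipeline's configuration.

```python
# Illustrative dynamic mixer: labeled source clips are placed at random onsets
# inside a longer canvas, and the exact onsets/offsets become the timestamped
# annotations used for supervision.
import numpy as np

def mix_events(sources, canvas_sec=10.0, sr=16000, seed=0):
    rng = np.random.default_rng(seed)
    canvas = np.zeros(int(canvas_sec * sr), dtype=np.float32)
    annotations = []
    for label, clip in sources:
        onset = rng.uniform(0, canvas_sec - len(clip) / sr)
        start = int(onset * sr)
        canvas[start:start + len(clip)] += clip
        annotations.append({"label": label,
                            "start": round(onset, 2),
                            "end": round(onset + len(clip) / sr, 2)})
    return canvas, sorted(annotations, key=lambda a: a["start"])

dog = 0.1 * np.random.default_rng(1).normal(size=16000).astype(np.float32)   # 1 s stand-in clip
siren = 0.1 * np.sin(2 * np.pi * 440 * np.arange(32000) / 16000).astype(np.float32)
mixture, anns = mix_events([("dog bark", dog), ("siren", siren)])
print(anns)
```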
The experiments are comprehensive, comparing TAC against state-of-the-art models on multiple benchmarks, including MMAU-Pro, MMSU, and others. The results demonstrate significant improvements in temporal grounding and reduced hallucination rates, validating the effectiveness of the proposed methods. The ablation studies provide insights into the importance of various components of the model, further strengthening the findings.
The paper provides sufficient detail regarding the implementation, including the use of specific architectures (Qwen2-Audio) and training procedures (LoRA). However, the reliance on synthetic data may introduce challenges in replicating results in real-world scenarios, which could limit reproducibility.
The authors acknowledge limitations related to the synthetic data approach, including potential biases and a sim-to-real gap. Additionally, the model may struggle with fine-grained musical precision, which could affect its applicability in certain contexts.
The work has significant implications for improving the reliability of audio understanding systems, particularly in safety-critical applications and accessibility tools for the hearing impaired. However, the potential for misuse in surveillance contexts raises ethical considerations that must be addressed.
Neural audio codecs (NACs) typically encode the short-term energy (gain) and normalized structure (shape) of speech/audio signals jointly within the same latent space. As a result, they are not robust to global variations of the input signal level: such variations strongly influence the embedding vectors at the encoder output and their quantization. This methodology is inherently inefficient, leading to codebook redundancy and suboptimal bitrate-distortion performance. To address these limitations, we propose to introduce shape-gain decomposition, widely used in classical speech/audio coding, into the NAC framework. The principle of the proposed Equalizer methodology is to decompose the input signal -- before the NAC encoder -- into a gain and a normalized shape vector on a short-term basis. The shape vector is processed by the NAC, while the gain is quantized with scalar quantization and transmitted separately. The output (decoded) signal is reconstructed from the normalized output of the NAC and the quantized gain. Our experiments conducted on speech signals show that this general methodology, easily applicable to any NAC, enables a substantial gain in bitrate-distortion performance, as well as a massive reduction in complexity.
Primary: Inria at Univ. Grenoble Alpes
All Institutions: Inria at Univ. Grenoble Alpes, CNRS, LJK, Univ. Grenoble Alpes, Grenoble-INP, GIPSA-lab
The main contribution of this paper is the introduction of The Equalizer, a novel methodology that applies shape-gain decomposition to enhance the performance of neural audio codecs. This work bridges classical signal processing techniques with modern machine learning approaches, providing a significant advancement in the efficiency and robustness of audio coding systems.
The proposed methodology, The Equalizer, introduces a novel shape-gain decomposition approach to neural audio codecs (NACs), which is a significant departure from traditional methods that encode gain and shape jointly. The paper effectively integrates classical signal processing concepts into modern NAC frameworks, demonstrating a clear understanding of both domains. The methodology is well-structured, involving the decomposition of input signals into gain and shape vectors before encoding, and the subsequent reconstruction of the output signal. This approach not only enhances bitrate-distortion performance but also reduces complexity, making it a valuable contribution to the field.
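A minimal sketch of the decomposition is given below, assuming per-frame RMS gains quantized uniformly in the dB domain and an identity stand-in for the codec; the frame length and quantizer step are illustrative, not the paper's settings.

```python
# Minimal sketch of short-term shape-gain decomposition around a codec: the
# per-frame RMS gain is scalar-quantized (here uniformly in dB) and the
# normalized shape is what the neural codec would see.
import numpy as np

def decompose(x, frame=320, eps=1e-8):
    frames = x[: len(x) // frame * frame].reshape(-1, frame)
    gain = np.sqrt((frames ** 2).mean(axis=1, keepdims=True)) + eps   # per-frame RMS
    shape = frames / gain                                             # unit-energy shape
    return shape, gain

def quantize_gain_db(gain, step_db=1.5):
    g_db = 20 * np.log10(gain)
    return 10 ** (np.round(g_db / step_db) * step_db / 20)            # uniform scalar quantizer in dB

x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000) * np.linspace(0.01, 1.0, 16000)
shape, gain = decompose(x)
x_hat = (shape * quantize_gain_db(gain)).reshape(-1)                  # codec omitted: identity on shape
snr = 10 * np.log10((x[:len(x_hat)] ** 2).sum() / ((x[:len(x_hat)] - x_hat) ** 2).sum())
print(f"reconstruction SNR: {snr:.1f} dB")
```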
The experiments are robust, utilizing a substantial dataset (LibriSpeech) and comparing the proposed method against several state-of-the-art NACs. The evaluation metrics—STOI, PESQ, and SI-SDR—are appropriate for assessing audio quality and intelligibility. The results clearly demonstrate the advantages of the proposed method over traditional NACs, particularly in terms of robustness to gain variations and overall performance across different bitrates. The paper provides comprehensive experimental results that substantiate the claims made about the effectiveness of The Equalizer.
The paper includes detailed implementation details, including the training setup, evaluation metrics, and specific configurations used for the NACs. However, the lack of a publicly available project URL or demo limits the reproducibility of the results. Future work could benefit from making the code and models available to the community to facilitate further exploration and validation of the proposed methodology.
One limitation of the study is the focus on speech signals, which may not generalize to other audio types. Additionally, while the paper discusses the potential for future work, it does not explore the implications of the normalization on the embedding vectors in detail, which could be crucial for understanding the full impact of the proposed method.
The proposed methodology has significant implications for audio coding and compression, particularly in applications where efficient transmission and storage of audio data are critical, such as in telecommunications and streaming services. By improving the robustness and efficiency of NACs, this work could lead to better audio quality in various consumer and professional audio applications.
Neural autoencoders underpin generative models. Practical, large-scale use of neural autoencoders for generative modeling necessitates fast encoding, low latent rates, and a single model across representations. Existing approaches are reconstruction-first: they incur high latent rates, slow encoding, and separate architectures for discrete vs. continuous latents and for different audio channel formats, hindering workflows from preprocessing to inference conditioning. We introduce a generative-first architecture for audio autoencoding that increases temporal downsampling from 2048x to 3360x and supports continuous and discrete representations and common audio channel formats in one model. By balancing compression, quality, and speed, it delivers 10x faster encoding, 1.6x lower rates, and eliminates channel-format-specific variants while maintaining competitive reconstruction quality. This enables applications previously constrained by processing costs: a 60-second mono signal compresses to 788 tokens, making generative modeling more tractable.
Primary: unknown
All Institutions: unknown
The paper presents a novel generative-first neural audio autoencoder that significantly improves encoding speed and compression efficiency while maintaining high reconstruction quality. This work is a meaningful contribution to the field of audio processing, addressing key limitations of existing models and opening avenues for practical applications in generative audio tasks.
The paper introduces a generative-first architecture for audio autoencoding, which is a significant departure from the traditional reconstruction-first approach. The methodology is well-structured, with clear architectural modifications aimed at improving efficiency and flexibility. The use of efficient activations, early downsampling, and the incorporation of mel-spectrograms to capture high-frequency information are notable innovations. The post-training adaptation to support both continuous and discrete latents without retraining is particularly clever and enhances the model's applicability.
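The reported compression figure is easy to sanity-check: assuming a 44.1 kHz mono input (an assumption, though consistent with the abstract's numbers), 60 seconds of audio divided by the 3360x downsampling factor yields roughly 788 tokens.

```python
# Quick arithmetic check of the reported compression: at 3360x temporal
# downsampling, a 60-second mono signal maps to roughly 788 latent tokens,
# assuming a 44.1 kHz input rate (an assumption, not stated in the abstract).
sample_rate = 44_100
duration_s = 60
downsampling = 3360

n_samples = sample_rate * duration_s          # 2,646,000 samples
n_tokens = n_samples / downsampling           # 787.5 -> ~788 tokens after rounding
print(n_samples, round(n_tokens))
```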
The experimental setup is robust, with thorough evaluations of speed, quality, and generative utility. The benchmarks against state-of-the-art codecs demonstrate the effectiveness of GenAE in achieving better compression and reconstruction quality. The use of multiple metrics (SI-SDR, STFT loss, mel-spectrogram L1 distance) adds credibility to the results. However, the absence of a clear comparison with a wider range of existing models could limit the perceived impact.
The paper provides detailed implementation specifics, including architecture choices, training configurations, and evaluation metrics, which are essential for reproducibility. However, the lack of accessible code or a demo limits the practical reproducibility of the results.
The paper does not address potential limitations in terms of the generalizability of the model across different audio types beyond instrumental music. Additionally, the computational resources required for training (8 A100 GPUs for a week) may not be accessible to all researchers, which could hinder broader adoption.
The advancements in audio autoencoding presented in this paper have the potential to significantly impact various applications, including music generation, audio compression, and real-time audio processing. The ability to handle multiple audio formats with a single model streamlines workflows and could lead to more efficient use of computational resources in audio-related tasks.
We introduce the Massive Audio Embedding Benchmark (MAEB), a large-scale benchmark covering 30 tasks across speech, music, environmental sounds, and cross-modal audio-text reasoning in 100+ languages. We evaluate 50+ models and find that no single model dominates across all tasks: contrastive audio-text models excel at environmental sound classification (e.g., ESC50) but score near random on multilingual speech tasks (e.g., SIB-FLEURS), while speech-pretrained models show the opposite pattern. Clustering remains challenging for all models, with even the best-performing model achieving only modest results. We observe that models excelling on acoustic understanding often perform poorly on linguistic tasks, and vice versa. We also show that the performance of audio encoders on MAEB correlates highly with their performance when used in audio large language models. MAEB is derived from MAEB+, a collection of 98 tasks. MAEB is designed to maintain task diversity while reducing evaluation cost, and it integrates into the MTEB ecosystem for unified evaluation across text, image, and audio modalities. We release MAEB and all 98 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.
Primary: Carleton University
All Institutions: Carleton University, Zendesk, Durham University, Salute Devices, MIRAI, Stanford University, Aarhus University, Indian Institute of Technology, Kharagpur, Harvard University, Capital One
The paper introduces the Massive Audio Embedding Benchmark (MAEB), a significant contribution to the field of audio machine learning that provides a comprehensive evaluation framework across diverse tasks and languages. The methodology and experimental results offer valuable insights into model performance, although further statistical analysis and detailed reproducibility guidelines would enhance its impact.
The methodology presented in the paper is robust, introducing a comprehensive benchmark (MAEB) that spans multiple audio tasks and languages. The authors provide a clear rationale for the selection of tasks and models, and the integration into the MTEB ecosystem is a significant step towards unified evaluation across modalities. However, the paper could benefit from a more detailed description of the benchmarking process and the specific metrics used for evaluation.
The experiments are extensive, evaluating over 50 models across 30 tasks. The results highlight the performance discrepancies between models trained for different audio tasks, which is a valuable insight for future research. However, the paper lacks a thorough statistical analysis of the results, which would strengthen the claims made regarding model performance.
The authors have committed to releasing code and a leaderboard, which is commendable and supports reproducibility. However, the paper should include more detailed instructions on how to replicate the experiments, including specific configurations and hyperparameters used for each model.
One limitation noted is the performance of models on clustering tasks, where even the best-performing model achieves only modest results. Additionally, the paper acknowledges the trade-offs between acoustic understanding and linguistic tasks, which may limit the applicability of certain models across all tasks.
The MAEB benchmark has the potential to significantly impact the field of audio machine learning by providing a standardized evaluation framework. This could lead to improved model development and encourage further research into multilingual and cross-modal audio tasks. The release of the benchmark also promotes collaboration and transparency in the research community.
Long-duration audio is increasingly common in industrial and consumer settings, yet reviewing multi-hour recordings is impractical, motivating systems that answer natural-language queries with precise temporal grounding and minimal hallucination. Existing audio-language models show promise, but long-audio question answering remains difficult due to context-length limits. We introduce LongAudio-RAG (LA-RAG), a hybrid framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections rather than raw audio. Multi-hour streams are converted into structured event records stored in an SQL database, and at inference time the system resolves natural-language time references, classifies intent, retrieves only the relevant events, and generates answers using this constrained evidence. To evaluate performance, we construct a synthetic long-audio benchmark by concatenating recordings with preserved timestamps and generating template-based question-answer pairs for detection, counting, and summarization tasks. Finally, we demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment, where the audio grounding model runs on-device on IoT-class hardware while the LLM is hosted on a GPU-backed server. This architecture enables low-latency event extraction at the edge and high-quality language reasoning in the cloud. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla Retrieval-Augmented Generation (RAG) or text-to-SQL approaches.
Primary: unknown
All Institutions: unknown
The paper presents LongAudio-RAG, a novel framework for event-grounded question answering over multi-hour audio, significantly advancing the capabilities of audio processing systems. The detailed methodology and experimental validation underscore its potential impact in the field of machine learning, particularly in audio-language integration and real-time analytics.
The methodology presented in the paper is robust and well-structured, introducing a hybrid framework that effectively combines audio grounding with large language models (LLMs) for long audio question answering. The use of SQL databases for structured event records and the detailed approach to temporal reference resolution and intent classification are commendable. The paper clearly outlines the steps taken to convert long audio into actionable data, which is a significant advancement in the field of audio processing and natural language understanding.
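To make the event-grounded retrieval step concrete, the sketch below shows the general idea in miniature: timestamped detections are stored as rows in an SQL table, a natural-language time reference is resolved to seconds, and only the matching rows are retrieved as evidence for the LLM. The schema, event labels, and time-reference parser are illustrative assumptions rather than the paper's actual implementation.

```python
# Minimal sketch of event-grounded retrieval for long-audio QA, in the spirit of
# LA-RAG as described above. Table schema, labels, and the time parser are
# illustrative assumptions, not the paper's actual implementation.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE events (
    start_s REAL,   -- event onset, seconds from start of the recording
    end_s   REAL,   -- event offset
    label   TEXT,   -- class predicted by the audio grounding model
    conf    REAL    -- detection confidence
)""")

# Pretend these rows came from an acoustic event detector run over a multi-hour stream.
detections = [
    (3605.2, 3606.0, "dog_bark", 0.91),
    (3720.5, 3721.1, "dog_bark", 0.84),
    (5400.0, 5460.0, "siren",    0.77),
]
con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", detections)

def to_seconds(hms: str) -> float:
    """Resolve an absolute time reference like '01:02:03' to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

# Intent: counting. Question: "How many dog barks between 01:00:00 and 02:00:00?"
lo, hi = to_seconds("01:00:00"), to_seconds("02:00:00")
(count,) = con.execute(
    "SELECT COUNT(*) FROM events WHERE label = ? AND start_s BETWEEN ? AND ?",
    ("dog_bark", lo, hi),
).fetchone()

print(f"dog_bark events in window: {count}")   # -> 2
```

Constraining generation to the retrieved rows, rather than to raw or summarized audio, is what limits hallucination in this kind of pipeline.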
The experimental evaluation is thorough, utilizing a synthetic long-audio benchmark that allows for controlled testing of the proposed system against various baselines, including RAG and text-to-SQL approaches. The results demonstrate a clear improvement in accuracy and response quality, validating the effectiveness of the proposed method. The use of both automated and human evaluations adds credibility to the findings.
The paper provides a detailed description of the implementation stack and methodologies used, which enhances reproducibility. However, the lack of a public repository or demo URL limits the ability for others to replicate the work fully. The modular service-oriented architecture described could facilitate reproducibility if made available.
The paper acknowledges limitations related to the accuracy of the Audio Grounding Model (AGM), which may affect downstream reasoning. Additionally, the synthetic nature of the benchmark may not fully capture the complexities of real-world audio environments, potentially limiting the generalizability of the results.
The proposed system has significant potential applications in various domains, including industrial monitoring, smart home technologies, and security systems. By enabling precise question answering over long audio recordings, it could enhance user interaction with audio data and improve operational efficiencies in many sectors.
Neural audio compression models have recently achieved extreme compression rates, enabling efficient latent generative modeling. Conversely, latent generative models have been applied to compression, pushing the limits of continuous and discrete approaches. However, existing methods remain constrained to low-resolution audio and degrade substantially at very low bitrates, where audible artifacts are prominent. In this paper, we present S-PRESSO, a 48kHz sound effect compression model that produces both continuous and discrete embeddings at ultra-low bitrates, down to 0.096 kbps, via offline quantization. Our model relies on a pretrained latent diffusion model to decode compressed audio embeddings learned by a latent encoder. Leveraging the generative priors of the diffusion decoder, we achieve extremely low frame rates, down to 1Hz (750x compression rate), producing convincing and realistic reconstructions at the cost of exact fidelity. Despite operating at high compression rates, we demonstrate that S-PRESSO outperforms both continuous and discrete baselines in audio quality, acoustic similarity and reconstruction metrics.
Primary: unknown
All Institutions: unknown
The paper presents S-PRESSO, a diffusion autoencoder for ultra-low bitrate audio compression, achieving significant improvements in audio quality while maintaining high compression rates. This work highlights the potential of generative models to redefine audio compression standards, pushing the boundaries of what is achievable in the field.
The paper introduces S-PRESSO, a novel approach to audio compression utilizing a diffusion autoencoder framework. The methodology is well-structured, comprising a three-step training process that includes continuous diffusion autoencoder training, offline quantization, and diffusion decoder finetuning. This approach effectively leverages the generative capabilities of diffusion models to enhance audio quality at ultra-low bitrates. The use of pretrained models for both the latent encoder and the diffusion decoder is a strong point, as it allows for the incorporation of learned representations that can significantly improve the compression process. However, the paper could benefit from a more detailed explanation of the quantization process and its impact on the overall performance.
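Since the quantization step is described only briefly, the following sketch illustrates what offline quantization of pretrained continuous latents can look like in general, here as a two-stage residual k-means vector quantizer. The stage count, codebook sizes, and frame rate are illustrative assumptions and are not claimed to match S-PRESSO's configuration.

```python
# Illustrative sketch of *offline* quantization of precomputed continuous latents
# (residual k-means VQ). All hyperparameters below are toy values, not S-PRESSO's.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
latents = rng.normal(size=(2048, 64))     # precomputed encoder latents: (frames, dims)

codebooks, residual = [], latents.copy()
for _ in range(2):                        # two residual quantization stages
    km = KMeans(n_clusters=256, n_init=4, random_state=0).fit(residual)
    codebooks.append(km)
    residual = residual - km.cluster_centers_[km.predict(residual)]

def encode(z):
    """Map continuous latents to per-stage codebook indices."""
    codes, r = [], z.copy()
    for km in codebooks:
        idx = km.predict(r)
        codes.append(idx)
        r = r - km.cluster_centers_[idx]
    return codes

def decode(codes):
    return sum(km.cluster_centers_[idx] for km, idx in zip(codebooks, codes))

codes = encode(latents[:8])
recon = decode(codes)                     # approximate latents for the generative decoder

# Implied bitrate: frame_rate * stages * log2(codebook_size) bits per second.
frame_rate_hz = 1.0
bits_per_frame = len(codebooks) * np.log2(256)
print(f"{frame_rate_hz * bits_per_frame / 1000:.3f} kbps")   # 0.016 kbps for this toy setup
```

Because the codebooks are fit after the autoencoder is trained, the continuous latent space is left untouched; presumably the decoder finetuning stage then absorbs the quantization error.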
The experimental setup is robust, utilizing a diverse set of datasets that cover various audio types, which enhances the generalizability of the results. The authors provide a thorough comparison against both continuous and discrete baseline models, demonstrating significant improvements in audio quality metrics such as FAD, KAD, and Si-SDR. The subjective evaluation through MUSHRA tests adds credibility to the findings, although the paper does not discuss the statistical significance of the results in detail. Overall, the experiments convincingly support the claims made about the performance of S-PRESSO.
The paper provides sufficient implementation details, including training parameters and architecture specifications, which aids reproducibility. However, the absence of publicly available code or models limits the ability of other researchers to replicate the results fully. The authors mention the use of specific datasets but do not provide access to these datasets, which could hinder reproducibility for others in the field.
One notable limitation is the focus on sound effects, which may restrict the applicability of the proposed method to other audio domains such as music or speech. Additionally, while the results are promising, the trade-off between compression rate and audio fidelity could be further explored, particularly at the lowest bitrates. The paper also acknowledges the need for improvements in inference speed, which is crucial for practical applications.
The advancements in ultra-low bitrate audio compression have significant implications for various applications, including gaming, virtual reality, and streaming services, where bandwidth is a critical concern. By shifting the focus from strict fidelity to acoustic similarity, this work opens new avenues for audio representation and synthesis, potentially enhancing user experiences in interactive media. The findings could also inspire further research into generative models for audio processing.
Multimodal foundation models have demonstrated impressive generalization capabilities, yet efficiently adapting them to new tasks in a few-shot setting remains a critical challenge. In this work, we investigate the few-shot adaptation of Large Audio-Language Models (ALMs) through both training-based and training-free approaches. We introduce MUKA, a multi-kernel adaptation framework that combines the fine-grained, context-dependent representations of instruction-tuning based models like Pengi with the global semantic representations of contrastive pretraining methods like CLAP. By constructing a product kernel that aligns local similarity with global semantics, MUKA enhances representational power while preserving the theoretical guarantees of kernel methods and avoiding additional training. Extensive experiments across 11 diverse audio datasets demonstrate that MUKA achieves state-of-the-art performance among training-free methods and even surpasses training-based adapters in several scenarios, offering a compelling balance between adaptability and efficiency.
Primary: IMT Atlantique
All Institutions: IMT Atlantique, Polytechnique Montréal, Inria, University Rennes, IRISA, CNRS, Université de Montpellier
The paper presents MUKA, a novel multi-kernel adaptation framework for audio-language models that enhances few-shot learning efficiency and performance. This work significantly contributes to the field by addressing the challenges of adapting large models to audio tasks, demonstrating both theoretical and practical advancements in multimodal learning.
The methodology proposed in MUKA is innovative as it introduces a multi-kernel product approach that effectively combines the strengths of different audio-language models, specifically Pengi and CLAP. This combination allows for a more nuanced representation of audio data, capturing both fine-grained details and broader semantic contexts. The theoretical grounding in kernel methods adds robustness to the approach, and the avoidance of additional training enhances its practicality in few-shot scenarios. However, the paper could benefit from a more detailed explanation of the kernel design choices and how they were empirically validated.
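The core mechanism can be illustrated with a short, training-free sketch: two kernels are computed from two frozen embedding spaces and combined by elementwise multiplication, so a query must agree with a support example in both spaces to score highly. The feature extractors are stubbed with random arrays here, and the exact kernel construction in MUKA may differ.

```python
# Minimal sketch of a product-kernel, training-free few-shot classifier in the
# spirit of MUKA. Embeddings are random stand-ins for two frozen models
# (e.g., a Pengi-like and a CLAP-like encoder); details are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

n_way, k_shot, n_query, d1, d2 = 5, 4, 10, 512, 768
support_y = np.repeat(np.arange(n_way), k_shot)

sup_local  = l2norm(rng.normal(size=(n_way * k_shot, d1)))
qry_local  = l2norm(rng.normal(size=(n_query, d1)))
sup_global = l2norm(rng.normal(size=(n_way * k_shot, d2)))
qry_global = l2norm(rng.normal(size=(n_query, d2)))

# Cosine kernels between queries and support set, shifted to [0, 1] before multiplying.
K_local  = (qry_local  @ sup_local.T  + 1.0) / 2.0
K_global = (qry_global @ sup_global.T + 1.0) / 2.0
K = K_local * K_global        # product kernel: agreement required in *both* spaces

# Training-free prediction: average product-kernel similarity to each class's shots.
class_scores = np.stack([K[:, support_y == c].mean(axis=1) for c in range(n_way)], axis=1)
pred = class_scores.argmax(axis=1)
print(pred)
```

A convenient property of this construction, presumably part of the theoretical guarantees referred to above, is that the elementwise product of two positive semi-definite kernels is itself positive semi-definite (Schur product theorem), so standard kernel methods remain applicable without retraining.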
The experiments are extensive, covering 11 diverse audio datasets, which demonstrates the versatility of the proposed method. The results indicate that MUKA achieves state-of-the-art performance among training-free methods and competes well with training-based methods. The use of cross-validation and clear reporting of accuracy metrics strengthens the experimental rigor. However, the paper lacks a discussion on the statistical significance of the results, which would provide a clearer picture of the performance improvements.
The paper outlines the experimental setup and methodology sufficiently to allow for reproducibility. It mentions the use of specific datasets and the pre-trained models employed, along with the computational resources used for experiments. However, the absence of a public code repository or demo limits the ease of reproducibility for other researchers.
One limitation is the reliance on existing models (Pengi and CLAP) without exploring the potential for developing new models tailored specifically for audio-language tasks. Additionally, while the paper claims efficiency, it does not provide a detailed computational complexity analysis of MUKA compared to other methods. The scope of datasets, while diverse, may not cover all potential audio-language applications, which could limit the generalizability of the findings.
The implications of this work are significant for the field of audio processing and multimodal learning. By improving few-shot adaptation in audio-language models, MUKA could facilitate advancements in applications such as audio classification, emotion recognition, and sound event detection. The proposed methodology could also inspire further research into kernel methods and their applications in other domains, potentially leading to more efficient machine learning models.
Recent Large Audio Language Models (LALMs) excel in understanding but often lack transparent reasoning. To address this "black-box" limitation, we organized the Audio Reasoning Challenge at Interspeech 2026, the first shared task dedicated to evaluating Chain-of-Thought (CoT) quality in the audio domain. The challenge introduced MMAR-Rubrics, a novel instance-level protocol assessing the factuality and logic of reasoning chains. Featuring Single Model and Agent tracks, the competition attracted 156 teams from 18 countries and regions. Results show agent systems currently lead in reasoning quality, utilizing iterative tool orchestration and cross-modal analysis. In addition, single models are rapidly advancing via reinforcement learning and sophisticated data pipelines. We detail the challenge design, methodology, and a comprehensive analysis of state-of-the-art systems, providing new insights for explainable audio intelligence.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Alibaba Group, Carnegie Mellon University, Microsoft Corporation, Queen Mary University of London, Shanghai Jiao Tong University
The paper introduces the Audio Reasoning Challenge and the MMAR-Rubrics, marking a pivotal advancement in evaluating audio reasoning models by emphasizing the quality of reasoning processes. This comprehensive analysis highlights the innovative methodology, robust experimental design, and significant implications for the field of explainable audio intelligence.
The paper presents a well-structured methodology for evaluating audio reasoning models through the introduction of the MMAR-Rubrics, which emphasizes the quality of reasoning chains rather than just final answers. This is a significant shift in evaluation paradigms, addressing the limitations of existing benchmarks that focus primarily on accuracy. The dual-track design allows for a comprehensive exploration of both end-to-end models and agent-based systems, providing insights into different architectural approaches. The use of instance-level evaluation criteria enhances the reliability and stability of the assessment process.
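For readers unfamiliar with instance-level rubrics, the toy sketch below conveys the flavor of such a protocol: each test item carries its own checklist of factuality and logic criteria, a judge marks each criterion as satisfied or not, and the reasoning score is the fraction satisfied. The rubric format, judge, and aggregation here are simplified stand-ins, not the official MMAR-Rubrics scorer.

```python
# Toy illustration of instance-level rubric scoring for reasoning chains. The
# rubric format, judge, and aggregation are simplified assumptions only.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str   # e.g., "mentions two distinct sound sources"
    kind: str        # "factuality" or "logic"

def judge(reasoning_chain: str, item: RubricItem) -> bool:
    """Stub for an LLM judge; a naive keyword check stands in for it here."""
    return all(tok.lower() in reasoning_chain.lower() for tok in item.criterion.split()[:2])

def score_instance(reasoning_chain: str, rubric: list[RubricItem]) -> float:
    hits = sum(judge(reasoning_chain, item) for item in rubric)
    return hits / len(rubric)

rubric = [
    RubricItem("mentions two distinct sound sources", "factuality"),
    RubricItem("concludes only after comparing both sources", "logic"),
]
chain = "The clip mentions two overlapping sources; comparing both, the siren dominates."
print(f"instance score: {score_instance(chain, rubric):.2f}")   # 0.50 with this naive judge
```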
The experimental setup is robust, with a large number of participants (156 teams from 18 countries) demonstrating significant interest and engagement in the challenge. The results indicate a clear performance differentiation between agent systems and single models, with detailed analyses of top-performing systems providing valuable insights into effective strategies. The use of rigorous evaluation metrics, including reliability and human alignment studies, strengthens the credibility of the findings.
The paper provides sufficient details regarding the evaluation protocols and the challenge design, including the release of the MMAR-Rubrics benchmark data and evaluation scripts. However, the reproducibility of the models themselves may be limited due to the proprietary nature of some systems and the lack of detailed descriptions of their architectures and training processes.
One limitation is the potential variability in the quality of the reasoning paths generated by different models, which may not be fully captured by the evaluation metrics. Additionally, the reliance on LLMs for scoring may introduce biases or inconsistencies, although the authors have taken steps to mitigate this through their instance-level rubric approach. The challenge also does not address the scalability of the proposed evaluation methods to more complex real-world scenarios.
The findings from this research have significant implications for the development of explainable AI in audio processing, particularly in applications requiring robust reasoning capabilities, such as automated transcription services, audio analysis for accessibility, and interactive audio agents. By focusing on the reasoning process, this work contributes to enhancing the transparency and trustworthiness of AI systems in critical domains.
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
Primary: unknown
All Institutions: unknown
The paper presents Bengali-Loop, a significant contribution to the field of speech technology for the Bengali language, providing essential benchmarks for long-form ASR and speaker diarization. The methodology is sound, and the technical contributions are likely to foster further advancements in this under-resourced area, although some limitations and areas for improvement remain.
The methodology presented in the paper is robust, focusing on the collection and verification of long-form ASR and speaker diarization datasets. The use of a human-in-the-loop approach for transcript verification enhances the quality of the data, addressing common pitfalls in automated transcription. The standardized evaluation protocols and formats provided are essential for reproducibility and future research. However, the paper could benefit from a more detailed discussion on the specific challenges encountered during data collection and annotation, as well as the rationale behind the chosen methodologies.
The experimental evaluation is thorough, with clear baselines established for both ASR and diarization tasks. The reported results, including WER and DER, provide a solid foundation for assessing the performance of the proposed benchmarks. However, the paper lacks a comparative analysis with existing benchmarks in other languages, which could further contextualize the results and demonstrate the significance of the contributions made.
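For reference, WER as reported above follows the standard edit-distance definition; a minimal self-contained implementation is sketched below. The benchmark's own scoring scripts may apply Bangla-specific text normalization that is not reproduced here, and CER is computed the same way over characters rather than words.

```python
# Minimal reference implementation of word error rate (WER). The benchmark's own
# scoring scripts may add text normalization that is not reproduced here.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words: substitutions, insertions, deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(f"{wer('আমি ভাত খাই', 'আমি খাই'):.2f}")   # one deletion over three words -> 0.33
```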
The authors emphasize reproducibility by providing detailed descriptions of the data collection process, annotation guidelines, and evaluation protocols. They also plan to release scripts for standardizing audio and running baseline evaluations, which is commendable. However, the lack of a publicly available code repository limits the ease with which other researchers can reproduce the results.
The paper acknowledges several limitations, including the limited dialectal diversity of the datasets and the simplification of the diarization overlap policy. Additionally, the focus on specific types of media (e.g., Bangla drama) may not fully represent the diversity of spoken Bengali in other contexts. These limitations should be addressed in future work to enhance the applicability of the benchmarks.
The development of Bengali-Loop has significant implications for the advancement of speech technology in under-resourced languages. By providing high-quality datasets and standardized evaluation protocols, this work can facilitate further research and development in Bangla ASR and speaker diarization. The benchmarks can also serve as a foundation for community-driven efforts to improve speech technology for other low-resource languages, potentially leading to broader accessibility and inclusion in technology.
We present Eureka-Audio, a compact yet high-performance audio language model that achieves competitive performance against models that are 4 to 18 times larger across a broad range of audio understanding benchmarks. Despite containing only 1.7B parameters, Eureka-Audio demonstrates strong performance on automatic speech recognition (ASR), audio understanding, and dense audio captioning, matching or surpassing multiple 7B to 30B audio and omni-modal baselines. The model adopts a unified end-to-end architecture composed of a lightweight language backbone, a Whisper-based audio encoder, and a sparsely activated Mixture-of-Experts (MoE) adapter that explicitly accounts for audio heterogeneity and alleviates cross-modal optimization conflicts under limited capacity. To further enhance paralinguistic reasoning, we introduce DataFlux, a closed-loop audio instruction data synthesis and verification pipeline that constructs high-quality, logically consistent supervision from raw audio. Extensive evaluations across ASR, knowledge reasoning, safety, instruction following, and paralinguistic benchmarks demonstrate that Eureka-Audio achieves an efficient balance between computational cost and performance. These results establish Eureka-Audio as a strong and practical baseline for lightweight audio understanding models.
Primary: Inner Mongolia University
All Institutions: Baidu Inc., College of Computer Science, Inner Mongolia University, Tsinghua Shenzhen International Graduate School, Tsinghua University
The main contribution of this paper is the introduction of Eureka-Audio, a compact audio language model that achieves competitive performance against much larger models while employing innovative techniques for audio understanding and data synthesis. This work represents a meaningful advancement in the field of audio processing, particularly in developing efficient models that maintain high performance.
The methodology presented in the paper is robust, featuring a unified end-to-end architecture that integrates a lightweight language backbone with a Whisper-based audio encoder and a Mixture-of-Experts (MoE) adapter. This approach effectively addresses audio heterogeneity and cross-modal optimization conflicts, which are common challenges in audio processing tasks. The introduction of the DataFlux pipeline for synthesizing and verifying audio instruction data is particularly innovative, as it enhances the model's ability to reason about paralinguistic features. The model's architecture is well-justified, and the combination of techniques appears to be a significant advancement in the field of audio language models.
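To clarify what a sparsely activated MoE adapter does in this setting, the sketch below shows a schematic forward pass that routes each audio token to its top-k experts before projecting into the language model's embedding space. The expert count, routing rule, and dimensions are illustrative assumptions, not Eureka-Audio's actual configuration.

```python
# Schematic forward pass of a sparsely activated Mixture-of-Experts adapter
# mapping audio-encoder features into a language model's embedding space.
# All sizes and the top-k routing below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_audio, d_llm, n_experts, top_k = 1280, 2048, 8, 2

W_gate = rng.normal(scale=0.02, size=(d_audio, n_experts))
experts = [(rng.normal(scale=0.02, size=(d_audio, d_llm)),   # per-expert projection
            np.zeros(d_llm)) for _ in range(n_experts)]

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_adapter(h_audio: np.ndarray) -> np.ndarray:
    """h_audio: (tokens, d_audio) -> (tokens, d_llm), each token routed to top-k experts."""
    logits = h_audio @ W_gate                         # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]     # indices of the k best experts per token
    out = np.zeros((h_audio.shape[0], d_llm))
    for t in range(h_audio.shape[0]):
        weights = softmax(logits[t, top[t]])          # renormalize over selected experts only
        for w, e in zip(weights, top[t]):
            W, b = experts[e]
            out[t] += w * (h_audio[t] @ W + b)        # only k of n experts run per token
    return out

tokens = rng.normal(size=(4, d_audio))                # e.g., Whisper-style frame features
print(moe_adapter(tokens).shape)                      # (4, 2048)
```

Only k of the n experts run per token, which is how such adapters add capacity for heterogeneous audio inputs without a proportional increase in per-token compute.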
The experimental evaluation is comprehensive, covering a wide range of benchmarks including ASR, audio understanding, and dense audio captioning. The results demonstrate that Eureka-Audio outperforms or matches larger models, which is a significant achievement given its compact size of 1.7B parameters. The paper provides detailed comparisons with various baselines, and the metrics used for evaluation are appropriate and well-explained. However, the lack of real-world application scenarios in the experiments could limit the practical understanding of the model's performance.
The paper includes a project URL that suggests the availability of code and models, which is crucial for reproducibility. However, the paper does not provide extensive details on the training procedures, hyperparameters, or datasets used, which could hinder full reproducibility by other researchers. More transparency in these areas would enhance the paper's contribution to the community.
One limitation of the study is the potential overfitting to the benchmarks used for evaluation, as the model's performance is primarily reported on standard datasets. Additionally, the reliance on a closed-loop data synthesis approach may introduce biases or limitations in the quality of the generated data. The paper could also explore the model's performance in diverse real-world scenarios beyond the controlled benchmarks.
Eureka-Audio has the potential to significantly impact various applications in audio understanding, including accessibility technologies, voice-activated systems, and interactive AI agents. Its compact size makes it suitable for deployment in resource-constrained environments, which could broaden the accessibility of advanced audio processing capabilities. The advancements in paralinguistic reasoning could also lead to more nuanced interactions in human-computer communication.