Recent advances in spoken dialogue language models have shifted from turn-based to full-duplex designs, where the model continuously listens to the user while generating responses. However, existing duplex backbones still lack a native channel for in-conversation planning and tool calling, leaving real-time agentic behaviour either tied to turn boundaries or relegated to an external cascade. We propose DuplexSLA, a native full-duplex Speech-Language-Action foundation model that decodes assistant audio together with a structured action stream on a shared 160 ms chunk timeline. DuplexSLA is built on a dual-stream three-channel formulation: a continuous user audio channel, a discrete assistant audio channel, and a rate-limited textual action channel, all decoded jointly by a single backbone, so that listening, speaking, planning, and tool calling unfold on one shared clock. Two capabilities define the model: (1) semantic-driven turn-taking control, where interruption, pause, and backchannel are handled inside the same backbone instead of by an external semantic VAD; and (2) in-conversation planning and tool calling, where planning text and structured tool calls are emitted on the action channel without halting assistant audio, so that multi-action and backchannel-triggered tool use are interleaved with ongoing speech. To evaluate these capabilities together, we further construct DuplexSLA-Bench, a duplex benchmark covering pause, interrupt, and backchannel turn-taking together with three styles of in-conversation tool calling. Our project page, interactive demos, and the DuplexSLA-Bench evaluation suite are publicly available at https://github.com/hyzhang24/DuplexSLA.
Primary: StepFun
All Institutions: StepFun, Peking University, Nanyang Technological University, Shanghai Jiao Tong University, University of New South Wales, Imperial College London
The main contribution of this paper is the introduction of DuplexSLA, a novel full-duplex Speech-Language-Action model that enables seamless integration of listening, speaking, and action planning in real-time dialogue systems. This work represents a significant step forward in the field of spoken dialogue systems, addressing key limitations of traditional turn-based models and paving the way for more natural and efficient human-computer interactions.
The methodology proposed in DuplexSLA is innovative, leveraging a dual-stream three-channel architecture that integrates user audio, assistant audio, and an action channel into a synchronous framework. This design allows for real-time interaction without the latency typically associated with turn-based systems. The paper effectively outlines the architecture and the rationale behind the choice of chunk size and channel design, which are crucial for achieving seamless interaction in spoken dialogue systems. The integration of semantic-driven turn-taking and in-conversation planning is a significant advancement, as it allows for more natural interactions compared to existing models.
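As a rough illustration of the shared-clock idea, the sketch below steps three channels along a 160 ms chunk timeline. Only the chunk duration and the three channel roles come from the abstract. The 16 kHz sample rate, the `Chunk` type, the `build_timeline` helper, and the per-chunk action budget are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

CHUNK_MS = 160            # chunk duration stated in the abstract
SAMPLE_RATE_HZ = 16_000   # assumed sample rate (not from the paper)
SAMPLES_PER_CHUNK = SAMPLE_RATE_HZ * CHUNK_MS // 1000
MAX_ACTION_TOKENS = 4     # hypothetical rate limit for the action channel


@dataclass
class Chunk:
    """One tick of the shared clock: all three channels share one index."""
    index: int
    user_audio: List[float]          # continuous user audio samples
    assistant_tokens: List[int]      # discrete assistant audio tokens
    action_text: List[str] = field(default_factory=list)  # rate-limited text

    @property
    def start_ms(self) -> int:
        return self.index * CHUNK_MS


def build_timeline(user_samples, assistant_tokens_per_chunk, pending_actions):
    """Slice user audio into chunks and attach the other two channels."""
    actions = list(pending_actions)
    timeline = []
    for i in range(len(user_samples) // SAMPLES_PER_CHUNK):
        lo = i * SAMPLES_PER_CHUNK
        audio = user_samples[lo:lo + SAMPLES_PER_CHUNK]
        tokens = (assistant_tokens_per_chunk[i]
                  if i < len(assistant_tokens_per_chunk) else [])
        # Emit at most MAX_ACTION_TOKENS per chunk; the remainder waits.
        emitted = actions[:MAX_ACTION_TOKENS]
        actions = actions[MAX_ACTION_TOKENS:]
        timeline.append(Chunk(i, audio, tokens, emitted))
    return timeline


timeline = build_timeline(
    user_samples=[0.0] * (SAMPLES_PER_CHUNK * 3),
    assistant_tokens_per_chunk=[[11, 12], [13], [14, 15]],
    pending_actions=["call", "weather", "(", "city", "=", "Paris", ")"],
)
for chunk in timeline:
    print(chunk.start_ms, chunk.assistant_tokens, chunk.action_text)
```

In this toy run the seven action tokens are spread over the first two chunks while assistant audio tokens keep flowing in every chunk, which is the property the review describes as tool calling without halting speech.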
The experimental evaluation is robust, utilizing the newly created DuplexSLA-Bench to assess the model's performance across various scenarios, including pause, interrupt, and backchannel turn-taking, as well as tool calling. The results demonstrate that DuplexSLA achieves sub-second latency with competitive accuracy compared to traditional ASR + LLM cascades, which is a notable achievement. The comprehensive evaluation metrics, including accuracy and delay, provide a clear picture of the model's performance.
The paper provides a detailed description of the training process, data construction, and evaluation protocols, which enhances reproducibility. The availability of the project page and code repository further supports efforts to replicate the findings. However, the complexity of the model and the specific training data used may pose challenges for complete reproducibility without access to the same datasets.
One limitation of the study is the reliance on a specific chunk size (160 ms), which may not generalize well to all conversational contexts or languages. Additionally, while the model shows promise in handling interruptions and backchannels, its performance in more complex multi-turn dialogues or less structured conversations remains to be evaluated. The paper does not address the potential biases in the training data, which could affect the model's performance in real-world applications.
The implications of DuplexSLA are significant for the development of more interactive and responsive spoken dialogue systems. By enabling real-time planning and tool calling, this model could enhance user experience in various applications, including virtual assistants, customer service bots, and interactive voice response systems. The ability to handle natural conversational phenomena like interruptions and backchannels could lead to more human-like interactions, thereby increasing user satisfaction and engagement.
Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.
Primary: Peking University
All Institutions: Nanjing University, Peking University, Renmin University of China, Shanghai Jiao Tong University, Tsinghua University
The main contribution of this paper is the development of LatentOmni, a framework that enhances audio-visual reasoning by integrating textual and latent representations, leading to improved performance on multimodal reasoning tasks. This work addresses critical limitations in existing models and sets a foundation for future research in unified multimodal understanding.
The paper introduces LatentOmni, a novel framework that integrates audio-visual reasoning within a unified latent space. The methodology is well-structured, emphasizing the importance of maintaining temporal consistency and grounding reasoning in sensory evidence. The introduction of feature-level supervision and Omni-Sync Position Embedding (OSPE) represents a significant advancement in addressing the limitations of traditional text-based chain-of-thought (CoT) methods. The approach is innovative, as it combines textual reasoning with continuous latent states, allowing for a more nuanced understanding of multimodal interactions.
The authors conduct extensive experiments across multiple audio-visual reasoning benchmarks, showing that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline. The evaluation covers various reasoning tasks that highlight the framework's strengths in audio-visual alignment and commonsense reasoning. The results provide strong evidence for the effectiveness of the proposed method.
The paper provides detailed implementation details, including the training process, dataset construction, and hyperparameter settings. However, the absence of a publicly available project URL limits the reproducibility of the results, as external researchers cannot directly access the code or datasets used in the experiments.
While the framework shows promise, it shares common limitations with current multimodal systems, particularly in terms of modality coverage. The authors acknowledge the challenge of extending the framework to incorporate additional sensory modalities beyond audio and visual, which could limit its applicability in more complex real-world scenarios.
The implications of this work are significant for fields requiring robust multimodal understanding, such as robotics, autonomous systems, and human-computer interaction. By improving the reasoning capabilities of models in a unified latent space, the research could enhance applications in video analysis, interactive AI systems, and assistive technologies.
Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
Primary: StepFun-Audio Team
All Institutions: StepFun-Audio Team
Coordinate-conditioned neural networks can generate head-tracked personal sound zone (PSZ) loudspeaker filters in real time, but they are sensitive to localization uncertainty. Small fluctuations in estimated listener coordinates, caused by optical distortion, temporary occlusions, or tracking jitter, may produce large filter changes even when listeners are physically stationary. This paper proposes neighbor-consistent neural filters that regularize the coordinate-to-filter mapping by penalizing filter differences at randomly perturbed neighboring coordinates during training. To evaluate robustness against tracking noise, we introduce a decoupled protocol that fixes the acoustic transfer functions at a physical anchor while perturbing only the coordinate inputs used for filter generation. Isolation quality and local stability are evaluated using neighborhood median and lower-tail statistics of inter-zone and inter-program isolation, together with spatial variation rates that quantify metric sensitivity within a coordinate neighborhood. In simulation with a split-band woofer-tweeter system and 25 randomly sampled anchor positions, neighbor consistency reduces the root-mean-square (RMS) variation rate by up to 55.9% in the woofer band and 30.3% in the tweeter band while largely preserving isolation quality and improving lower-tail robustness. In in-situ measurements using a 24-driver array and two stationary head-and-torso simulators, the proposed regularization improves worst-case neighborhood isolation by up to 16.9% and reduces spatial variation rates by up to 61.8%. These results demonstrate that neighbor-consistency regularization effectively stabilizes PSZ rendering under localization uncertainty.
Primary: Princeton University
All Institutions: Princeton University
The main contribution of this paper is the introduction of neighbor-consistent neural filters that effectively stabilize personal sound zone rendering under localization uncertainty. This work represents a meaningful advancement in audio processing, addressing a critical challenge in the field and providing a foundation for future research on robust audio systems.
The paper introduces a novel approach to enhance the robustness of personal sound zones (PSZs) against localization uncertainty through neighbor-consistent neural filters. The methodology employs a regularization technique that penalizes filter differences at perturbed neighboring coordinates, effectively stabilizing the coordinate-to-filter mapping. This innovative approach is well-grounded in existing literature, yet it distinguishes itself by addressing the specific issue of localization uncertainty in PSZ systems, which has not been sufficiently tackled in prior works. The proposed decoupled evaluation protocol is also a significant methodological advancement, allowing for a clearer assessment of robustness independent of physical listener motion.
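The regularizer described above can be sketched in a few lines. This is a minimal illustration, assuming uniform perturbations inside a box of half-width `radius` and a mean squared difference between filters. The function names, the perturbation distribution, the distance measure, and the `weight` parameter are assumptions, not the paper's exact formulation.

```python
import random


def neighbor_consistency_penalty(filter_fn, coord, radius, n_neighbors, rng):
    """Mean squared filter difference between a coordinate and random neighbors."""
    base = filter_fn(coord)
    total = 0.0
    for _ in range(n_neighbors):
        # Perturb only the coordinate input, as tracking noise would.
        neighbor = [c + rng.uniform(-radius, radius) for c in coord]
        perturbed = filter_fn(neighbor)
        total += sum((a - b) ** 2 for a, b in zip(base, perturbed)) / len(base)
    return total / n_neighbors


def total_loss(task_loss, penalty, weight):
    """Combine the main objective with the (hypothetically weighted) regularizer."""
    return task_loss + weight * penalty


# Toy coordinate-to-filter mappings standing in for a neural network.
def smooth_map(c):
    return [0.1 * c[0], 0.1 * c[1]]


def jumpy_map(c):
    return [10.0 * c[0], 10.0 * c[1]]


def constant_map(c):
    return [1.0, 1.0]


anchor = [0.5, -0.2]
smooth_penalty = neighbor_consistency_penalty(
    smooth_map, anchor, 0.01, 8, random.Random(0))
jumpy_penalty = neighbor_consistency_penalty(
    jumpy_map, anchor, 0.01, 8, random.Random(0))
constant_penalty = neighbor_consistency_penalty(
    constant_map, anchor, 0.01, 8, random.Random(0))
print(smooth_penalty, jumpy_penalty, constant_penalty)
```

A mapping that changes quickly with the coordinates receives a larger penalty than a smooth one, and a mapping that ignores the coordinates receives none, so in practice the weight would have to trade stability against isolation quality.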
The experiments conducted in both simulation and in-situ measurements provide a comprehensive evaluation of the proposed method. The use of a split-band woofer-tweeter system and the detailed metrics for isolation quality and spatial stability demonstrate a rigorous approach to validating the effectiveness of the neighbor-consistency regularization. The results indicate substantial improvements in stability metrics while preserving isolation quality, showcasing the practical applicability of the proposed method in real-world scenarios.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed hyperparameter settings beyond the sensitivity analysis. While the methodology is well described, the absence of a public repository or demo limits the ability of other researchers to replicate the findings independently.
One limitation of the study is the focus on a specific configuration of the loudspeaker system, which may not generalize to all PSZ applications. Additionally, while the proposed method shows promise in improving robustness, the long-term effects of neighbor-consistency regularization on overall audio quality and listener experience remain unexplored. The paper also does not address potential computational overhead introduced by the additional regularization during training.
The findings of this research have significant implications for the development of robust audio systems in shared environments, such as vehicles and public spaces, where accurate sound localization is critical. The proposed method could enhance user experience by providing stable and high-fidelity audio zones, thereby facilitating applications in personal audio systems and immersive sound environments.
User-defined keyword spotting (KWS) is crucial for personalized voice interaction, yet existing methods face several challenges: (1) insufficient discriminability among confusable words, (2) performance inconsistency across speakers with varying pronunciations, and (3) high data cost to ensure reliable wake-word performance. In this paper, we introduce DMA-KWS, an efficient and robust framework for user-defined keyword spotting. First, it adopts a dual-stage matching pipeline: CTC decoding with streaming phoneme search to locate candidate segments, followed by QbyT with a phoneme matcher for fine-grained verification, enabling it to better distinguish confusable words. Next, multi-modal enrollment fuses user-specific speech with text embeddings to further improve accuracy for registered users. Finally, a parameter-efficient continual adaptation mechanism performs lightweight updates using synthetic and real data. Extensive experiments demonstrate the superior performance of DMA-KWS. On the LibriPhrase Hard subset, it achieves 97.85% AUC and 6.13% EER, reaching state-of-the-art performance. In speaker-dependent settings, DMA-KWS consistently outperforms text-only enrollment, demonstrating significant performance gains. Moreover, the proposed parameter-efficient fine-tuning mechanism adapts DMA-KWS with only 187k updated parameters, further enhancing KWS performance while ensuring suitability for on-device deployment.
Primary: Shanghai University
All Institutions: Shanghai University, Xi'an Jiaotong-Liverpool University, New York University
The DMA-KWS framework effectively addresses critical challenges in user-defined keyword spotting through innovative methodologies and extensive experimental validation, marking a significant contribution to the field of audio processing and machine learning.
The proposed DMA-KWS framework introduces a novel dual-stage matching architecture that combines CTC-based phoneme search with a QbyT phoneme matcher, enhancing keyword spotting accuracy, particularly in distinguishing confusable words. The integration of multi-modal enrollment allows for improved performance by leveraging user-specific audio and text embeddings, while the continual adaptation mechanism ensures efficient updates with minimal data. This comprehensive approach addresses key challenges in user-defined keyword spotting, such as speaker variability and data scarcity.
The authors conducted extensive experiments across multiple datasets, including LibriPhrase, GSC, and Qcomm, reporting AUC and EER. The results indicate that DMA-KWS consistently outperforms existing state-of-the-art methods, particularly in challenging scenarios involving confusable keywords and speaker-dependent settings. Both metrics are appropriate for the task and provide a clear picture of the model's effectiveness.
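For readers unfamiliar with the two metrics, the sketch below computes them from raw detection scores. It is a simple reference implementation using a threshold sweep and pairwise ranking, not the evaluation code used by the authors, and the toy scores are invented for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate EER: the error rate where false accepts equal false rejects."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    best_gap, best_eer = None, None
    # Sweep every observed score as a threshold, plus one that rejects all.
    for thr in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= thr for s in negatives) / len(negatives)
        frr = sum(s < thr for s in positives) / len(positives)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer


def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(positives) * len(negatives))


# Toy scores: label 1 means the keyword is present.
labels = [1, 1, 0, 0]
separable = [0.9, 0.8, 0.2, 0.1]
overlapping = [0.9, 0.4, 0.6, 0.1]
print(equal_error_rate(separable, labels), auc(separable, labels))
print(equal_error_rate(overlapping, labels), auc(overlapping, labels))
```

Higher AUC and lower EER are better, which is the direction of the 97.85% AUC and 6.13% EER figures quoted in the abstract.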
The paper provides detailed descriptions of the model architecture, training procedures, and datasets used, which enhances reproducibility. Additionally, the authors have made the code available on GitHub, further supporting the community's ability to replicate and build upon their work.
While the paper presents a robust framework, it does not extensively address the computational efficiency of the dual-stage matching during real-time inference, which could be a concern for on-device applications. Furthermore, the reliance on synthetic data for continual adaptation may introduce biases if the synthetic data does not accurately reflect real-world scenarios.
The advancements in user-defined keyword spotting have significant implications for personalized voice interaction technologies, enhancing user experience in smart devices and applications. The framework's ability to adapt to individual user characteristics could lead to broader adoption of voice-activated systems in various domains, including home automation, accessibility, and personal assistants.
Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
Primary: UC San Diego
All Institutions: UC San Diego, MIT, Adobe
This paper presents a novel approach to interactive music generation through the introduction of Live Music Diffusion Models, which enhance the efficiency and applicability of diffusion models for real-time performance. The methodology is innovative, addressing key challenges in the field, and the experimental results demonstrate significant technical contributions that could impact both research and practical applications in music technology.
The paper introduces Live Music Diffusion Models (LMDMs), which modify existing diffusion models to enable efficient block-wise KV caching and a novel post-training method called ARC-Forcing. The methodology addresses the inference inefficiencies of block-wise outpainting diffusion, which the paper identifies as strictly less efficient than discrete autoregressive counterparts, while maintaining real-time performance suitable for consumer hardware. The proposed routing mechanism and attention masking techniques are well-justified and demonstrate a clear understanding of the limitations of prior models.
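To make the block-wise KV caching idea concrete, here is a minimal single-head sketch in plain Python. It assumes that keys and values of finished blocks are frozen and reused while the current block is being denoised. The `BlockKVCache` class, its method names, and this commit policy are illustrative assumptions and may differ from the paper's mechanism.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for lists of vectors (single head)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


class BlockKVCache:
    """Keeps keys/values of finished blocks so they are never recomputed."""

    def __init__(self):
        self.keys = []
        self.values = []

    def attend_block(self, q_block, k_block, v_block):
        # The current block sees all cached past blocks plus itself.
        return attend(q_block, self.keys + k_block, self.values + v_block)

    def commit(self, k_block, v_block):
        # Freeze a finished block for reuse by later blocks.
        self.keys.extend(k_block)
        self.values.extend(v_block)


cache = BlockKVCache()
first = cache.attend_block([[1.0, 0.0]], [[1.0, 0.0]], [[2.0, 3.0]])
cache.commit([[1.0, 0.0]], [[2.0, 3.0]])
second = cache.attend_block([[1.0, 0.0]], [[1.0, 0.0]], [[4.0, 5.0]])
print(first, second)
```

Under this assumption, each denoising step of a new block only computes keys and values for that block, while everything earlier is read from the cache.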
The experiments are comprehensive, comparing LMDMs against state-of-the-art models in various creative domains, including text-conditioned generation and live musician interaction. The use of multiple evaluation metrics (FD, KL, CLAP) provides a robust framework for assessing model performance. The results indicate that LMDMs achieve competitive quality and significantly reduced latency, highlighting their practical applicability in real-world scenarios.
The paper provides sufficient details regarding the training and inference setup, including model parameters and datasets used. However, the absence of a publicly available code repository may hinder full reproducibility. The authors do provide a demo URL, which partially mitigates this concern by allowing users to experience the model's capabilities firsthand.
The authors acknowledge limitations regarding genre bias in training data and the model's responsiveness to text features compared to clean audio content. Additionally, the need for further latency reduction is identified as a critical area for future work, particularly for achieving seamless real-time interaction.
The development of LMDMs represents a significant advancement in interactive music generation, potentially transforming live performance and co-creation experiences. By making high-quality generative models accessible on consumer hardware, this work could democratize music creation and inspire new forms of artistic collaboration between musicians and AI.
While flow-matching text-to-speech (TTS) achieves strong zero-shot speaker similarity and naturalness, it remains susceptible to content fidelity issues, particularly skip and repeat errors from imperfect alignment. We propose RobustSpeechFlow, a training strategy that improves alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations. Requiring no external aligners or preference data, our method directly penalizes realistic failure modes and readily integrates into existing pipelines. On Seed-TTS-eval, it reduces the word error rate (WER) from 1.44 to 1.38 using only 0.06B parameters. On our ZERO500 benchmark, it delivers consistent intelligibility improvements across diverse speaker and prosody conditions; at NFE=24, it reduces English character error rate (CER) from 0.48% to 0.35% and Korean CER from 0.81% to 0.57%. Audio samples: https://robustspeechflow.github.io/
Primary: Supertone Inc, South Korea
All Institutions: Supertone Inc, Independent Researcher
The main contribution of this paper is the introduction of RobustSpeechFlow, a novel training strategy that enhances alignment robustness in TTS systems through the use of augmentation-based contrastive flow matching. This approach effectively addresses common content fidelity issues, demonstrating substantial improvements in performance across diverse conditions, thereby advancing the state of the art in TTS technology.
The methodology presented in RobustSpeechFlow is innovative, leveraging contrastive flow matching with latent augmentations to address specific alignment issues in TTS systems. The approach is well-structured, utilizing hard negatives that simulate realistic TTS failure modes, which is a significant advancement over traditional methods that rely on random negatives. This targeted strategy enhances the model's ability to maintain content fidelity without the need for additional external models or datasets, making it practical for deployment.
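One way to picture the length-preserving repeat and skip augmentations is shown below. This is a sketch under stated assumptions: the frame sequence is a plain list, a repeat is truncated back to the original length, and a skip is padded with the last frame. How the paper restores the length is not stated in the text above, so both choices, and the function names, are hypothetical.

```python
def repeat_augment(latents, start, length):
    """Replay latents[start:start+length] once, then truncate to the original length."""
    if length <= 0 or start < 0 or start + length > len(latents):
        raise ValueError("segment must lie inside the sequence")
    segment = latents[start:start + length]
    out = latents[:start + length] + segment + latents[start + length:]
    return out[:len(latents)]


def skip_augment(latents, start, length):
    """Drop latents[start:start+length], then pad with the last frame to keep the length."""
    if length <= 0 or start < 0 or start + length > len(latents):
        raise ValueError("segment must lie inside the sequence")
    out = latents[:start] + latents[start + length:]
    if not out:
        raise ValueError("cannot skip the whole sequence")
    return out + [out[-1]] * (len(latents) - len(out))


# Integers stand in for latent frames so the effect is easy to read.
frames = [0, 1, 2, 3, 4, 5]
repeated = repeat_augment(frames, start=1, length=2)
skipped = skip_augment(frames, start=1, length=2)
print(repeated, skipped)
```

Sequences like `repeated` and `skipped` would then play the role of hard negatives in the contrastive objective, since they mimic the repeat and skip errors the method aims to suppress.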
The experiments conducted are robust, utilizing both the Seed-TTS-eval benchmark and the newly introduced ZERO500 benchmark, which reflects a diverse set of conditions. The reported improvements in word error rate (WER) and character error rate (CER) are consistent across languages and speaker conditions, for example an English CER reduction from 0.48% to 0.35% and a Korean CER reduction from 0.81% to 0.57% at NFE=24. The use of objective metrics such as WER and CER provides a clear quantitative assessment of the model's performance.
The paper provides detailed implementation details, including training data specifications, model architecture, and hyperparameter settings. However, the lack of a publicly available code repository limits full reproducibility. Future work could benefit from sharing the model and training code to allow for independent validation of results.
One limitation noted is the trade-off between alignment robustness and speaker similarity, which could affect the model's performance in applications requiring high fidelity to speaker characteristics. Additionally, the reliance on objective ASR-based metrics may introduce biases, and the paper acknowledges the need for subjective assessments in future evaluations.
The proposed method has significant implications for the TTS field, particularly in applications requiring high content fidelity, such as virtual assistants, audiobooks, and automated customer service systems. By improving alignment robustness, this work can enhance user experience and trust in TTS technologies.
Audio context determines which sound components and sources are relevant and which can be perceived as irrelevant (noise) by listeners. For example, traffic noise is informative in urban surveillance but is noise for a phone call at the same location. Most current audio denoising systems apply fixed target-noise definitions, often removing useful components in one context while failing to suppress irrelevant components in another. To address this, we introduce the concept of automatic contextual audio denoising (ACAD), which defines target and noise based on the inferred context. In this work, we restrict context to be associated with an acoustic scene class. We label sound events outside the event distribution of a scene class (noise) as out-of-context (OC) and events typical for that scene as in-context (IC). We implement a deep learning method that automatically infers the context of the audio signal and removes OC components, and benchmark it against variants: without context inference, with oracle context, and with separately provided uninformative context. On paired clean/noisy data across diverse contexts, where OC components in one context may be IC in another, our proposed method outperforms other approaches across standard objective metrics, indicating that the model can infer context and that context-dependent processing can enhance denoising.
Primary: Tampere University
All Institutions: Tampere University, Nokia
The main contribution of this paper is the introduction of automatic contextual audio denoising (ACAD), which adapts audio denoising processes based on inferred context, significantly improving the relevance of retained audio components. This work represents a meaningful advancement in the field of audio processing, combining deep learning techniques with a novel approach to context inference, thereby enhancing the quality of audio signals in various applications.
The paper introduces a novel framework for audio denoising that incorporates context inference, which is a significant advancement over traditional fixed-target noise definitions. The methodology is well-structured, employing a two-stage training process that includes pretraining a context extractor followed by a denoising model. The use of deep learning techniques, specifically CRNN and UNet architectures, is appropriate for the task, and the integration of context through feature-wise linear modulation (FiLM) layers is innovative. However, the reliance on a fixed set of acoustic scene classes may limit the generalizability of the approach.
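Feature-wise linear modulation itself is simple: a context vector is mapped to a per-channel scale and shift that are applied to the feature maps. The sketch below shows the operation in plain Python. The weight values, the one-hot scene vector, and the helper names are made up for illustration and do not reflect the paper's network.

```python
def linear(vec, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Per-channel modulation: out[c][t] = gamma[c] * features[c][t] + beta[c]."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


# Hypothetical context: a one-hot acoustic scene embedding.
context = [1.0, 0.0]

# Hypothetical learned projections from context to scale and shift.
w_gamma, b_gamma = [[2.0, 1.0], [0.0, 1.0]], [0.0, 0.0]
w_beta, b_beta = [[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0]

gamma = linear(context, w_gamma, b_gamma)
beta = linear(context, w_beta, b_beta)

# Two feature channels, two time steps each.
features = [[1.0, 2.0], [3.0, 4.0]]
modulated = film(features, gamma, beta)
print(gamma, beta, modulated)
```

Here the first channel is amplified and the second is replaced by a constant, which shows how a different scene vector could steer the denoiser toward keeping or suppressing different components.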
The experiments are comprehensive, utilizing a well-constructed dataset that includes diverse acoustic scenes and out-of-context components. The evaluation metrics, such as SI-SDR and SDR, are standard in the field and provide a solid basis for comparison. The results demonstrate clear improvements over baseline models, indicating the effectiveness of the proposed method. However, further exploration of subjective evaluation metrics could enhance the robustness of the findings.
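For readers unfamiliar with SI-SDR, the sketch below computes the standard scale-invariant signal-to-distortion ratio in NumPy. The test signals, the noise levels, and the `eps` stabiliser are assumptions for illustration; the paper's own evaluation code may differ in details such as mean removal.

```python
import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and reference."""
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to find the optimally scaled target.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)


# Synthetic example: a clean signal plus two levels of additive noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mild = clean + 0.1 * noise
heavy = clean + 1.0 * noise

score_mild = si_sdr(mild, clean)
score_heavy = si_sdr(heavy, clean)
score_rescaled = si_sdr(2.0 * mild, clean)  # rescaling the estimate should not change the score
```

Because the estimate is projected onto the reference before the ratio is taken, a denoiser cannot improve its score merely by changing output gain, which is the property that distinguishes SI-SDR from plain SDR.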
The paper provides sufficient detail regarding the dataset construction and model architecture, which aids in reproducibility. The authors also release their dataset publicly, which is a positive aspect for future research. However, the absence of code or a detailed implementation guide may hinder some researchers from replicating the results exactly.
One limitation is the potential for the model to learn statistical mismatches rather than true contextual cues, as noted by the authors. Additionally, the fixed set of acoustic scene classes may not capture the full diversity of real-world audio contexts, which could affect the model's applicability in broader scenarios. The reliance on synthetic noise addition could also introduce biases that do not reflect real-world conditions.
The proposed method has significant implications for various applications, including urban surveillance, telecommunication, and any domain where audio clarity is crucial. By enhancing audio denoising based on contextual understanding, this research could lead to improved user experiences in consumer electronics, smart devices, and assistive technologies. The approach also opens avenues for future research in context-aware audio processing.
Multimodal Emotion Recognition (MER) focuses on identifying and interpreting emotions from modality-compound inputs. Closely mirroring human cognitive processes in real-world environments, MER has drawn substantial attention from both academia and industry. Recently, a paradigm shift has been unveiled in MER, from leveraging small-scale, task-specific models to Large Language Models (LLMs). We refer to the latter as the MER-with-LLMs paradigm, which offers unprecedented generality, spurring numerous empirical attempts, even alongside speculation about LLMs' potential to achieve general emotional intelligence. However, with these new opportunities come new challenges, including the scarcity of emotionally annotated data, the affective gap both within and across modalities, and the opacity of affective interpretation. To systematically review existing research and guide future exploration, this paper categorizes prior works according to their focus on addressing these challenges into three directions: Affective Data Augmentation, Multimodal Affective Representation, and Multimodal Affective Reasoning. By thoroughly tracing the development, emerging trends, and remaining issues within each direction, this paper aims to provide a clear academic map of the MER-with-LLMs paradigm and foster its structured advancement.
Primary: Tsinghua University
All Institutions: Tsinghua University, Academy of Cyber, Chinese Academy of Sciences, Harbin Institute of Technology, Nankai University
The main contribution of this paper is its comprehensive review and systematic categorization of the emerging MER-with-LLMs paradigm, highlighting key challenges and methodologies while providing a roadmap for future research in the field. The significance lies in its potential to unify disparate research efforts and stimulate further exploration of multimodal emotion recognition technologies.
The paper presents a comprehensive review of the Multimodal Emotion Recognition (MER) paradigm utilizing Large Language Models (LLMs). It categorizes existing research into three primary challenges: Affective Data Augmentation, Multimodal Affective Representation, and Multimodal Affective Reasoning. The authors propose a systematic taxonomy that not only highlights the methodologies but also emphasizes the importance of bridging gaps in emotional understanding across modalities. The approach is well-structured and offers a clear academic map for future research, although it primarily focuses on reviewing existing literature rather than presenting novel experimental methodologies.
The paper does not present original experiments or datasets but instead synthesizes existing research findings and benchmarks. It discusses various methodologies and their performance metrics, providing a comparative analysis of different approaches in the MER domain. While the lack of original experimental results limits the paper's impact, the thorough review of existing benchmarks and methodologies is valuable for guiding future research.
The paper lacks specific implementation details or code repositories that would enhance reproducibility. As it primarily serves as a survey, it does not provide new algorithms or models that could be reproduced. However, it does summarize existing works, which can serve as a reference for researchers looking to replicate or build upon the findings discussed.
The paper's limitations include its focus on reviewing existing literature rather than presenting new experimental results or methodologies. Additionally, the challenges identified, such as data scarcity and the affective gap, are broad and may require more specific solutions or methodologies to address them effectively. The paper also does not delve into the practical implications of the proposed MER-with-LLMs paradigm in real-world applications.
The paper has significant implications for both academia and industry, as it addresses a critical area of research in emotion recognition, which has applications in various fields such as healthcare, education, and human-computer interaction. By providing a structured overview of the MER-with-LLMs paradigm, it can guide future research directions and foster advancements in emotional intelligence in AI systems.
Reasoning has become a defining capability of modern foundation models, yet its development in the audio modality remains limited. Audio poses challenges that are distinct from those of text and vision. It is continuous, temporally dense, and contains linguistic, paralinguistic, and environmental information at multiple time scales. As a result, audio reasoning models must align acoustic signals with the discrete semantic space of large language models, while still preserving fine-grained information needed for reliable inference. Progress is also limited by three major obstacles: the scarcity of genuinely audio-grounded reasoning data, shortcut learning and modality hallucination, and the tension between reasoning depth and real-time latency in spoken interaction. In this paper, we present the first dedicated survey of audio reasoning. We provide a unified formulation that distinguishes direct predictive modeling from reasoning-augmented generation, review the architectural and training foundations of audio reasoning models, and systematically organize recent advances in Audio-to-Text, Audio-to-Speech, Audio-Visual Reasoning and Agentic Audio Reasoning. We further examine emerging paradigms such as Chain-of-Thought prompting, supervised fine-tuning, reinforcement learning, and latency-aware spoken interaction, and discuss evaluation practices, open challenges, and future directions. Our goal is to offer a coherent roadmap for developing robust, efficient, and natively grounded audio reasoning systems.
Primary: The Chinese University of Hong Kong (CUHK)
All Institutions: The Chinese University of Hong Kong (CUHK), The University of Hong Kong (HKU), National Taiwan University, The Hong Kong University of Science and Technology (HKUST)
The main contribution of this paper is its comprehensive survey of audio reasoning in multimodal foundation models, providing a structured framework for understanding the unique challenges and opportunities in this emerging field. The detailed taxonomy and critical review of existing methodologies lay the groundwork for future advancements in audio reasoning systems, although the lack of original empirical contributions limits its immediate impact.
The paper presents a comprehensive survey of audio reasoning in multimodal foundation models, establishing a unified formulation and taxonomy that distinguishes various reasoning paradigms. It critically reviews model architectures and training methods, emphasizing the unique challenges posed by audio reasoning compared to text and vision. The authors effectively categorize existing works and identify gaps in the literature, providing a structured approach to understanding the state of the field. However, the survey lacks experimental validation or novel empirical results, which could have strengthened its contributions.
The paper does not include original experiments or datasets, focusing instead on reviewing existing literature and methodologies. While it provides a thorough overview of current techniques and challenges, the lack of empirical results limits the assessment of the proposed frameworks. The discussion on evaluation practices is insightful but would benefit from concrete examples or case studies.
As a survey paper, it does not present original experiments or code implementations, which limits reproducibility. However, the comprehensive review of existing methods and datasets allows researchers to build upon the findings. The paper could enhance reproducibility by providing clearer references to datasets and methodologies discussed.
The primary limitation is the absence of original experimental results or benchmarks, which would have provided empirical evidence for the claims made. Additionally, while the survey identifies challenges in audio reasoning, it does not propose concrete solutions or methodologies to address these issues.
The paper has significant implications for the development of audio reasoning systems, which are crucial for advancing human-computer interaction and AI applications in real-world scenarios. By highlighting the unique challenges and opportunities in audio reasoning, it sets the stage for future research that could lead to more robust and efficient multimodal systems.
This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Frechet Audio Distance, CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.
Primary: National Taiwan University
All Institutions: National Taiwan University, University of Michigan
The paper presents the ICME 2026 Grand Challenge on Academic Text-to-Music Generation, establishing a rigorous framework for academic research in TTM. By focusing on reproducibility and providing a novel evaluation metric, it significantly advances the field and encourages broader participation in generative audio research.
The paper introduces a structured framework for the Academic Text-to-Music Generation Grand Challenge, emphasizing reproducibility and transparency in generative model training. The requirement for participants to train models from scratch using a curated dataset is a significant methodological innovation that addresses the barriers posed by proprietary datasets. The introduction of the Concept Coverage Score (CCS) as a novel evaluation metric adds depth to the assessment of generated music, allowing for a more nuanced understanding of model performance in relation to specific musical concepts.
The experimental setup is robust, featuring a multi-stage evaluation process that combines objective metrics (FAD, CLAP, CCS) with subjective listening tests. This dual approach ensures that both quantitative and qualitative aspects of the generated music are assessed. The challenge's design, which includes distinct tracks for efficiency and performance, allows for a wide range of contributions and innovations from participants, enhancing the overall experimental rigor.
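As background on the FAD metric named above, the sketch below computes the Fréchet distance between Gaussians fitted to two sets of embeddings, which is the quantity FAD is built on. It is not the challenge's evaluation code: the audio embedding model is left out entirely, and the random arrays stand in for embeddings that a real pipeline would extract from reference and generated audio.

```python
import numpy as np


def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two (n, dim) embedding sets."""
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # for positive semi-definite covariances (up to rounding error).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Placeholder embeddings (assumption): 200 clips, 8-dimensional features.
rng = np.random.default_rng(0)
reference_emb = rng.standard_normal((200, 8))
shifted_emb = reference_emb + 3.0  # same covariance, mean moved by 3 in every dimension

same = frechet_distance(reference_emb, reference_emb)
shifted = frechet_distance(reference_emb, shifted_emb)
```

Lower values mean the generated set is statistically closer to the reference set. The metric compares distributions rather than individual clips, which is why challenges of this kind typically pair it with a text-alignment score such as CLAP and a listening test.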
The authors have made significant efforts to ensure reproducibility by providing open-source baselines, preprocessing pipelines, and evaluation code. This transparency is crucial for academic research, allowing other researchers to replicate the experiments and build upon the findings. The detailed description of the dataset curation and preprocessing steps further supports reproducibility.
One limitation is the focus solely on instrumental music, which may restrict the applicability of the findings to broader music generation contexts that include vocals. Additionally, while the CCS provides a novel evaluation metric, its effectiveness and reliability in capturing the nuances of musical quality may require further validation across diverse datasets and contexts.
The establishment of a fair-play benchmark for text-to-music generation has the potential to democratize access to cutting-edge generative models, enabling academic researchers to contribute to the field without the need for extensive computational resources. This initiative could foster innovation in music generation, leading to new applications in education, entertainment, and creative industries.
Text-to-music generation has advanced rapidly, with modern autoregressive and diffusion-based models producing convincing music from natural-language prompts. However, much of this progress relies on large-scale training data and external pretraining, making it difficult to isolate which design choices remain effective when data and pretraining are controlled. We study this setting using a Diffusion Transformer backbone with lyric and timbre conditioning, adapted to an instrumental-only text-to-music task in which the auxiliary lyric and timbre branches receive only degenerate conditioning signals. Through controlled ablations, we find that models retrained without these branches score lower across AudioBox aesthetics, LLM-as-judge, and human MOS, and that reinvesting the saved parameters as additional DiT depth recovers only marginally. This suggests the auxiliary branches may act as training-time architectural anchors whose contribution goes beyond their explicit conditioning content. We validate the same model through comparisons with external instrumental baselines and through our submission to the ICME 2026 Academic Text-to-Music (ATTM) Grand Challenge, where our Performance submission ranked first under both the objective metrics and the subsequent organizer-administered MOS over 35 raters, attaining the highest overall MOS across all challenge submissions, while our Efficiency submission was a finalist that tied for second under the objective metrics.
Primary: Yonsei University
All Institutions: Yonsei University, MAAP, KRAFTON
The main contribution of this paper is the exploration of auxiliary conditioning branches in a Diffusion Transformer for instrumental text-to-music generation, revealing their critical role in enhancing model performance despite limited training data. This work provides valuable insights into architectural design and training strategies in generative audio systems, potentially influencing future research and applications in the field.
The paper employs a Diffusion Transformer architecture with innovative auxiliary conditioning branches for text-to-music generation. The use of lyric and timbre conditioning, even with degenerate inputs, is a novel approach that challenges existing paradigms in music generation. The controlled ablation studies provide a rigorous examination of the architectural contributions of these branches, revealing their role as training-time anchors, which is a significant insight into model design and training strategies.
The experiments are well-structured, utilizing both objective metrics (FAD, CLAP) and human evaluations (MOS) to assess model performance. The paper's submission to the ICME 2026 ATTM Grand Challenge, where it achieved top rankings, adds credibility to the findings. The comparative analysis with external baselines strengthens the results, demonstrating the effectiveness of the proposed model.
The paper provides detailed descriptions of the model architecture, training strategies, and evaluation metrics, which enhances reproducibility. However, the lack of publicly available code or demo URLs limits the ability for others to replicate the results easily.
The study is constrained by the limited dataset of 457 hours of audio, which may affect the generalizability of the findings. Additionally, the reliance on single-rater human evaluations for some metrics raises concerns about the robustness of the subjective assessments. The precise mechanisms behind the architectural benefits of the auxiliary branches remain unexplored, which could be a direction for future research.
The findings have significant implications for the field of generative music models, particularly in understanding the role of conditioning branches in multimodal architectures. This research could influence future designs of text-to-music systems and enhance the quality of generated audio, making it more applicable in creative industries such as music production, gaming, and film scoring.
We propose a deep beamforming framework for enhancing target speaker(s) in multi-speaker environments. A deep neural network (DNN) is trained to estimate beamforming weights directly from noisy multichannel inputs while satisfying linear spatial constraints through an adaptive multi-term loss inspired by the augmented Lagrangian framework. The loss combines signal reconstruction with penalties that enforce a distortionless response toward the target and suppress the interference subspace. The model is further guided by the target relative transfer function (RTF) and the estimated interference subspace. The proposed model can direct a beam toward the target speaker while directing nulls toward the interfering sources, achieving superior overall enhancement performance compared with the classical LCMV beamformer constructed from the same estimated spatial signatures. Furthermore, compared with the LCMV beamformer, the proposed model produces more controlled sidelobes and improved background-noise attenuation.
Primary: Bar-Ilan University
All Institutions: Bar-Ilan University
The paper presents a novel DNN-based beamforming framework that effectively enhances target speakers in multi-speaker environments by leveraging explicit spatial guidance. This work combines deep learning with classical beamforming principles, showcasing significant improvements in speech enhancement performance and spatial selectivity, thus contributing meaningfully to the field of audio signal processing.
The proposed methodology integrates deep learning with classical beamforming techniques, specifically using a DNN to estimate beamforming weights while enforcing linear spatial constraints. The use of an adaptive multi-term loss function inspired by the augmented Lagrangian framework is innovative, as it allows for a balance between signal reconstruction and interference suppression. The architecture employs a U-Net with attention mechanisms to fuse spatial guidance information, enhancing the model's ability to focus on the target speaker while nulling out interference. This approach effectively combines machine learning with traditional signal processing principles, making it a significant contribution to the field.
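The linear spatial constraints discussed above can be illustrated with the classical closed-form LCMV solution, together with a generic augmented-Lagrangian style penalty on constraint violation. This is a sketch under stated assumptions, not the paper's loss or network: the steering vectors and noise covariance are random placeholders, and `constraint_penalty`, `multipliers`, and `rho` are hypothetical names for a textbook penalty form.

```python
import numpy as np


def lcmv_weights(noise_cov, constraints, response):
    """Closed-form LCMV weights: w = R^-1 C (C^H R^-1 C)^-1 g."""
    rinv_c = np.linalg.solve(noise_cov, constraints)  # R^-1 C
    gram = constraints.conj().T @ rinv_c              # C^H R^-1 C
    return rinv_c @ np.linalg.solve(gram, response)


def constraint_penalty(weights, constraints, response, multipliers, rho):
    """Generic augmented-Lagrangian penalty on the violation C^H w - g."""
    violation = constraints.conj().T @ weights - response
    linear_term = np.vdot(multipliers, violation).real
    quadratic_term = 0.5 * rho * np.vdot(violation, violation).real
    return float(linear_term + quadratic_term)


rng = np.random.default_rng(0)
num_mics = 4


def random_complex(*shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


# Placeholder spatial signatures (assumption): one target RTF, one interferer.
target_rtf = random_complex(num_mics)
target_rtf = target_rtf / target_rtf[0]  # RTFs are relative to a reference microphone
interferer = random_complex(num_mics)
constraints = np.stack([target_rtf, interferer], axis=1)  # (mics, 2)
response = np.array([1.0, 0.0], dtype=complex)            # unit gain on target, null on interferer

# Random Hermitian positive-definite noise covariance.
mixing = random_complex(num_mics, num_mics)
noise_cov = mixing @ mixing.conj().T + np.eye(num_mics)

weights = lcmv_weights(noise_cov, constraints, response)
achieved = constraints.conj().T @ weights
penalty = constraint_penalty(weights, constraints, response, np.zeros(2, dtype=complex), rho=1.0)
```

The closed-form weights satisfy both constraints exactly, so the penalty vanishes. A learned beamformer trained with a penalty of this kind satisfies the constraints only approximately, in exchange for freedom to shape sidelobes and noise attenuation through the reconstruction term.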
The experimental evaluation is thorough, utilizing a well-defined dataset generation process that simulates realistic multi-speaker environments. The paper reports results using multiple metrics, including SI-SDR, SNR, and SIR, which are critical for assessing speech enhancement performance. The comparison against classical LCMV beamformers shows clear advantages in terms of performance, particularly in terms of noise attenuation and spatial selectivity. However, the paper could benefit from more extensive ablation studies to further validate the contributions of each component of the proposed method.
The paper provides a GitHub repository link for the implementation, which is a positive aspect for reproducibility. However, details on the exact training procedures, hyperparameter settings, and specific datasets used for training and evaluation are somewhat limited. Including these details would enhance the reproducibility of the results.
One limitation is the reliance on accurate RTF estimation, which may not always be feasible in practical scenarios. The model's performance could degrade in environments where spatial signatures are not accurately captured. Additionally, the experiments primarily focus on specific configurations (two and three speakers), which may not generalize to more complex scenarios with varying speaker dynamics.
The proposed framework has significant potential applications in real-time speech enhancement systems, particularly in environments such as conference calls, hearing aids, and assistive listening devices. By improving the clarity of target speakers in noisy environments, this research could enhance communication accessibility for individuals with hearing impairments and improve user experience in various audio applications.
Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-of-the-art closed-data TTS models, and Raon-OpenTTS-Pool, a large-scale open dataset for reproducible TTS training. Raon-OpenTTS-Pool consists of 615K hours of 240M speech segments aggregated from publicly available English speech corpora and web-sourced recordings. With a model-based filtering pipeline applied to Raon-OpenTTS-Pool, we derive Raon-OpenTTS-Core, a curated, high-quality subset of 510K hours and 194M speech segments. Using Raon-OpenTTS-Core, we train Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models from 0.3B to 1B parameters. On multiple benchmarks, Raon-OpenTTS-1B shows comparable performance to state-of-the-art models such as Qwen3-TTS and CosyVoice 3, which are trained on several million hours of proprietary speech data. Notably, on Seed-TTS-Eval, Raon-OpenTTS-1B achieves a word error rate (WER) of 1.78% and a speaker similarity (SIM) of 0.749, ranking second on WER and first on SIM among recent open-weight TTS baselines. On CV3-Hard-EN, Raon-OpenTTS-1B achieves a WER of 6.15% and a SIM of 0.775, ranking first on both metrics. Furthermore, to support robust evaluation, we introduce Raon-OpenTTS-Eval, a structured benchmark for assessing TTS robustness across diverse acoustic conditions including clean, noisy, in-the-wild, and expressive speech. On Raon-OpenTTS-Eval, Raon-OpenTTS-1B achieves the best average WER and SIM among all evaluated models, and the second-best human preference, as measured by comparative mean opinion score (CMOS). Our data pool, filtering pipeline, training code, and checkpoints are publicly available at https://github.com/krafton-ai/RAON-OpenTTS.
Primary: KRAFTON
All Institutions: KRAFTON, Ludo Robotics, Seoul National University, University of Wisconsin-Madison, Stanford University
The main contribution of this paper is the introduction of Raon-OpenTTS, a competitive open TTS model trained on a large-scale, high-quality dataset, along with a structured evaluation benchmark for assessing TTS robustness. This work addresses the critical need for open data and models in TTS research, facilitating reproducibility and advancing the state of the art in the field.
The paper presents a robust methodology for constructing a large-scale open TTS dataset, Raon-OpenTTS-Pool, and a high-quality subset, Raon-OpenTTS-Core, which are critical for training the Raon-OpenTTS models. The authors utilize a model-based filtering pipeline to ensure data quality, which is a significant improvement over existing methods that often rely on proprietary datasets. The use of diffusion transformers (DiT) for TTS synthesis is an innovative approach that enhances the model's performance across various benchmarks. The introduction of a structured evaluation benchmark, Raon-OpenTTS-Eval, further strengthens the methodology by enabling comprehensive assessments across diverse acoustic conditions.
The experimental evaluation is thorough, with multiple benchmarks used to assess the performance of Raon-OpenTTS-1B against state-of-the-art models. The results demonstrate competitive performance in terms of word error rate (WER) and speaker similarity (SIM), indicating that the model can effectively generalize across different acoustic environments. The inclusion of human preference evaluations adds a valuable subjective dimension to the assessment, reinforcing the model's practical applicability.
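Since WER is the headline intelligibility metric here, a minimal sketch of how it is conventionally computed may help: the word-level edit distance between a reference transcript and the ASR transcript of the synthesized speech, divided by the reference length. The example sentences are made up, and real benchmarks apply text normalization and a specific ASR model that this sketch does not reproduce.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up transcripts for illustration.
perfect = word_error_rate("the cat sat on the mat", "the cat sat on the mat")
one_deletion = word_error_rate("the cat sat on the mat", "the cat sat on mat")
sub_and_insert = word_error_rate("a b c", "a x c d")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind when reading results on hard or noisy test sets.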
The paper emphasizes reproducibility by providing access to the dataset, model weights, and training code. This transparency is crucial for the research community, as it allows others to replicate the study and build upon the findings. The detailed description of the data collection and filtering process enhances the reproducibility of the results.
One limitation is the focus on English speech data, which restricts the applicability of the findings to other languages. Additionally, while the filtering process improves data quality, it may also lead to the exclusion of potentially useful data. The authors acknowledge the need for future work to explore multilingual settings and more effective data mixing strategies.
The development of Raon-OpenTTS and its associated datasets has the potential to significantly impact the field of TTS by providing a high-quality, open-source alternative to proprietary models. This could democratize access to advanced TTS technology, enabling researchers and developers to create more inclusive and diverse applications. The structured evaluation benchmark also sets a precedent for future TTS research, promoting more rigorous assessments of model performance.
This study aims to enhance the quality of music generation using Transformers by incorporating meta-information. While Transformer-based approaches are effective at capturing long-term dependencies in musical compositions, the music they generate often suffers from issues such as excessive repetition or duplication of notes, leading to unnatural melodies. To address these limitations, we propose Musical Attention, a mechanism that incorporates meta-information such as bar numbers, key, signatures, and tempos into the attention process. Musical Attention explicitly leverages both the structural properties of music and its associated metadata, enabling the Transformer's attention mechanism to operate more effectively and thereby improving the quality of the generated output. In our framework, each musical note is represented as a combination of five events (pitch, bar number, onset, duration, and velocity) in addition to the three metadata elements. The attention mechanism is then modified to reflect the correlations among these eight features, allowing the model to better capture the inherent characteristics of musical composition. Experimental results demonstrate that the model incorporating Musical Attention outperforms prior methods, such as Full Attention and Strided Attention, in terms of musical coherence, variation, and overall quality. Notably, it significantly reduces repetition and enhances the model's ability to generate diverse, harmonically consistent melodies. Musical Attention thus represents a meaningful advancement in AI-driven music generation, facilitating the creation of more natural and expressive compositions.
Primary: Meiji University
All Institutions: Meiji University
The main contribution of this paper is the introduction of the Musical Attention mechanism, which enhances music generation quality by incorporating meta-information into the Transformer architecture. This advancement represents a meaningful step forward in AI-driven music generation, addressing key limitations of existing models and paving the way for more expressive and coherent musical compositions.
The proposed methodology introduces a novel attention mechanism, Musical Attention, which effectively integrates meta-information into the Transformer architecture for music generation. This approach is well-grounded in music theory, addressing the limitations of previous models by focusing on structural dependencies and contextual relationships among musical elements. The representation of musical notes as a combination of multiple features, alongside the incorporation of meta-information, enhances the model's ability to generate coherent and expressive music. However, the methodology could benefit from more detailed explanations of the attention patterns and their implementation.
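Because the attention patterns are not spelled out here, the following is only a hypothetical sketch of how an eight-feature note representation could drive a structured attention mask. The feature ordering, the `Note` class, and the masking rule (a token attends causally to tokens in its own note and to tokens of the same feature type in earlier notes) are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass, astuple

# Assumed ordering: five note events followed by three metadata elements.
FEATURES = ("pitch", "bar", "onset", "duration", "velocity",
            "key", "signature", "tempo")


@dataclass(frozen=True)
class Note:
    """Hypothetical container for one note's eight features."""
    pitch: int
    bar: int
    onset: int
    duration: int
    velocity: int
    key: int
    signature: int
    tempo: int


def flatten(notes):
    """Turn each note into eight consecutive (feature_name, value) tokens."""
    return [(name, value)
            for note in notes
            for name, value in zip(FEATURES, astuple(note))]


def feature_aware_mask(num_notes):
    """Build an assumed causal mask over the flattened token sequence.

    mask[i][j] is True when token i may attend to token j, i.e. when j is
    not in the future and either belongs to the same note as i or carries
    the same feature type as i.
    """
    k = len(FEATURES)
    n = num_notes * k
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # causal: only positions up to i
            same_note = i // k == j // k
            same_feature = i % k == j % k
            mask[i][j] = same_note or same_feature
    return mask
```

Under these assumptions, the pitch token of the second note can attend to the pitch token of the first note but not to the first note's bar token, which is one simple way to encode correlations among like features across notes.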
The experiments are thorough, utilizing a substantial dataset of MIDI files and comparing the proposed model against established baselines (Full Attention and Strided Attention). The evaluation metrics are well-defined, focusing on various aspects of musical generation quality. The results demonstrate clear improvements in key areas such as Bar Error and Key Error, indicating the effectiveness of the Musical Attention mechanism. However, the paper could enhance the evaluation by including qualitative assessments or human evaluations of the generated music.
The paper provides a reasonable level of detail regarding the model architecture and training setup, which aids in reproducibility. However, it lacks specific hyperparameter settings and the exact training environment details, which are crucial for ensuring that other researchers can replicate the results accurately. Additionally, the absence of a demo URL limits immediate access to generated samples for further validation.
The paper identifies key limitations, including the lack of dynamic variation in the generated music and occasional unnatural chord progressions. These limitations suggest that while the model performs well in certain metrics, it may not fully capture the nuances of musical expression and creativity. Future work should address these aspects to improve the model's overall musicality.
The proposed model has significant implications for the field of AI-driven music generation, particularly in enhancing the quality and expressiveness of generated compositions. By integrating music theory into the generation process, the research opens avenues for more sophisticated music generation systems that can cater to various musical styles and preferences. This work could also inspire further research into the intersection of machine learning and music theory, potentially leading to advancements in music education and composition tools.