Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
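The hierarchical Global-Sentence-Token annotation doubling as a structured control interface can be illustrated with a minimal sketch. All field names below are illustrative assumptions, not the paper's actual schema; the point is the layered shape, from scene semantics down to token-level detail.

```python
# Hypothetical Global-Sentence-Token command; field names are assumptions.
command = {
    "global": {                      # scene-level semantics
        "scene": "late-night radio show",
        "acoustic_env": "small studio, slight reverb",
    },
    "sentences": [
        {
            "speaker": "host",
            "emotion": "warm",       # sentence-level paralinguistic cue
            "text": "Welcome back to the show.",
            "tokens": [              # token-level phonetic detail
                {"word": "Welcome", "emphasis": 0.8},
                {"word": "back", "emphasis": 0.3},
            ],
        }
    ],
}

def flatten_text(cmd):
    """Recover the plain transcript from a structured command."""
    return " ".join(s["text"] for s in cmd["sentences"])
```

A front-end LLM would emit commands of this shape, which the synthesis engine consumes; the plain transcript remains recoverable from the structure.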
Primary: Nanjing University
All Institutions: Nanjing University, WeNet Open Source Community
The main contribution of this paper is the introduction of the Borderless Long Speech Synthesis framework, which innovatively integrates multi-dimensional annotations and contextual understanding into TTS systems, significantly advancing the state-of-the-art in audio synthesis. The technical contributions and proposed methodologies offer substantial improvements over existing systems, although further experimental validation and reproducibility efforts are necessary to solidify its impact in the field.
The proposed methodology introduces a novel framework for long-form speech synthesis that emphasizes the importance of global context and paralinguistic cues. The "Labeling over filtering/cleaning" strategy is innovative, as it challenges conventional practices in data preparation by advocating for the inclusion of complex, noisy data that reflects real-world speech dynamics. The Global-Sentence-Token hierarchical annotation schema is a significant advancement, enabling a structured approach to capturing the nuances of speech synthesis. The integration of Chain-of-Thought reasoning and Dimension Dropout enhances the model's ability to follow complex instructions, which is a notable methodological improvement over existing TTS systems.
The paper lacks quantitative evaluations of the proposed system's performance, particularly in terms of emotional arc coherence and multi-speaker interaction naturalness. While it discusses the challenges of evaluating borderless long audio synthesis, it does not provide concrete experimental results or comparisons with existing methods. The absence of benchmark results limits the ability to assess the system's effectiveness rigorously. Future work is needed to establish robust evaluation metrics that can capture the richness of the proposed framework.
The paper does not provide sufficient implementation details or access to code and datasets, which raises concerns about reproducibility. The lack of a demo or project URL further complicates the ability for other researchers to replicate the findings or build upon this work. Clearer documentation and shared resources would enhance reproducibility.
The system is currently optimized for content creation rather than real-time interactions, which limits its applicability in dynamic environments. Additionally, the training data is primarily speech-centric, and the system's emergent capabilities for sound effects and music are not fully developed. These limitations suggest that while the framework is promising, it requires further refinement and expansion to address broader applications.
The potential applications of this research extend beyond traditional TTS systems, offering possibilities for enhanced audio experiences in content creation, gaming, and virtual environments. The ability to synthesize speech with rich emotional and contextual cues could significantly improve user engagement and interaction quality in various multimedia applications. However, the challenges in real-time synthesis and the need for more diverse training data must be addressed to realize its full impact.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
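The shuffle product at the heart of this approach can be illustrated with a small sketch: it is the set of all interleavings of two sequences that preserve each sequence's internal order. The paper's FSAs encode this set compactly and score it with dynamic programming; the sketch below enumerates it explicitly, which is exponential and only for intuition.

```python
def shuffle(a, b):
    """Yield every interleaving of sequences a and b that preserves
    the internal order of each -- the shuffle product. Explicit
    enumeration for illustration only; FSAs represent this compactly."""
    if not a:
        yield tuple(b)
        return
    if not b:
        yield tuple(a)
        return
    # Either the next symbol comes from a, or it comes from b.
    for rest in shuffle(a[1:], b):
        yield (a[0],) + rest
    for rest in shuffle(a, b[1:]):
        yield (b[0],) + rest
```

For speaker-attributed transcription, the sequence elements would be (token, speaker) tuples, so each serialization carries its attribution; marginalizing a loss over all serializations corresponds to summing scores over this set.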
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, Johns Hopkins University
The main contribution of this paper is the introduction of a novel algorithm for the single-pass alignment of multi-talker recordings using shuffle products and partial order FSAs. This work represents a significant advancement in the field of speech processing, particularly in addressing the challenges posed by overlapped speech, and has the potential to influence future research and applications in audio processing.
The methodology presented in this paper is innovative in its application of shuffle products and partial order finite-state automata (FSAs) for modeling overlapped speech. The authors effectively leverage these mathematical constructs to create a framework for alignment and transcription of multi-talker recordings. The approach of using (token, speaker) tuples for speaker attribution is particularly noteworthy, as it directly addresses a significant challenge in the field of speech processing. The imposition of temporal constraints to reduce graph size is a practical consideration that enhances the efficiency of the proposed method.
The experiments conducted on synthetic LibriSpeech overlaps provide a solid basis for evaluating the proposed methods. The paper compares the performance of the shuffle product FSA against traditional methods, demonstrating a clear advantage in terms of alignment accuracy. However, the reliance on synthetic data may limit the generalizability of the results to real-world scenarios. The metrics used for evaluation are appropriate, but further validation on diverse datasets would strengthen the findings.
The paper mentions that all algorithms are implemented using k2 / Icefall, which is a positive aspect for reproducibility. However, the lack of a publicly available code repository or detailed implementation instructions may hinder other researchers from replicating the results. Providing a GitHub repository or similar resource would greatly enhance the reproducibility of the work.
One limitation of the study is the use of synthetic data for training and evaluation, which may not fully capture the complexities of real-world overlapped speech scenarios. Additionally, while the proposed method shows promise, the paper does not provide extensive comparisons with other state-of-the-art techniques, which could have offered more context regarding its performance.
The ability to accurately transcribe and attribute overlapped speech has significant implications for various applications, including automated transcription services, assistive technologies for the hearing impaired, and improvements in human-computer interaction. The proposed method could pave the way for advancements in multi-talker speech recognition systems, making them more robust and effective.
Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework: distinct information is embedded into each stem using unique keys but a shared structure, then the stems are mixed, separated, and decoded from each separation output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing that robustness to generic distortions does not ensure robustness to separation artifacts. To address this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting the embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality.
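The multi-stream setting (embed per stem with unique keys, mix, decode per output) can be sketched with a toy spread-spectrum scheme. This is not the paper's method, which uses a key-conditioned Conformer embedder and a Demucs separator trained jointly; the `embed`/`decode` functions and the carrier construction below are purely illustrative assumptions.

```python
import numpy as np

def embed(stem, bits, key, strength=0.05):
    """Toy spread-spectrum embedding: add key-seeded pseudo-random
    carriers whose signs encode the bits (illustrative only)."""
    rng = np.random.default_rng(key)
    carrier = rng.standard_normal((len(bits), stem.size))
    signs = 2 * np.asarray(bits) - 1          # map 0/1 -> -1/+1
    return stem + strength * (signs[:, None] * carrier).sum(axis=0), carrier

def decode(audio, carrier):
    """The sign of the correlation with each key-seeded carrier
    recovers the corresponding bit."""
    return ((audio @ carrier.T) > 0).astype(int).tolist()
```

Because carriers seeded by different keys are nearly orthogonal, each payload can be decoded from the mixture; in the real setting the decoder sees imperfect separation outputs instead, which is why joint training with the separator matters.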
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, The Chinese University of Hong Kong
The paper presents a novel approach to multi-stream audio watermarking that effectively addresses the challenges posed by source separation. By jointly training the watermarking and separation systems, the authors demonstrate substantial improvements in watermark recovery while maintaining audio quality, marking a significant contribution to the field of audio processing and copyright protection.
The paper introduces a novel separation-first, multi-stream audio watermarking framework that jointly trains a watermarking system and a source separator in an end-to-end manner. This approach addresses the challenge of preserving watermark cues during the separation process, which is often overlooked in traditional watermarking methods. The methodology is well-structured, with a clear problem setup and a detailed description of the joint training pipeline, including the use of a key-conditioned Conformer architecture for watermarking and the Demucs separator for audio separation. The approach is innovative in its integration of watermarking and separation, which is a significant advancement in the field.
The experiments are comprehensive, utilizing multiple datasets and evaluating the performance of the proposed method against several baselines. The results demonstrate substantial improvements in post-separation watermark recovery, with a significant reduction in bit error rates compared to existing methods. The evaluation metrics used, including average bit error rate and perceptual quality measures (e.g., SNR and ViSQOL), provide a robust assessment of the method's effectiveness. The experiments also highlight the importance of joint training in enhancing both watermark robustness and separation integrity.
The paper provides sufficient implementation details, including the architecture of the watermarking system and the separation network, as well as the training setup and loss functions. However, the reproducibility could be improved by providing access to the code and detailed instructions for replicating the experiments. The mention of hardware specifications and training duration is helpful, but a public repository would enhance transparency.
One limitation is that the framework is currently limited to two-stem mixtures, which may restrict its applicability in more complex audio scenarios. Additionally, while the joint training approach improves robustness, it may introduce trade-offs in terms of the imperceptibility of the watermark, as indicated by the results showing that separation-aware models do not outperform single-carrier baselines in direct encoding/decoding settings.
The proposed method has significant implications for copyright protection and content authenticity in the age of AI-generated audio. As audio content becomes increasingly mixed and generated from multiple sources, the ability to independently watermark and recover information from different stems is crucial. This research could pave the way for more secure and reliable audio watermarking techniques, potentially influencing industry standards in digital rights management.
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
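The calibration idea (a coarse CLAP similarity combined with a fine-grained LALM judgment) can be sketched as a weighted combination; the review below also notes a sliding-window + max-pooling scheme for the CLAP side. The convex-combination form, the fixed weight `alpha`, and the function names are assumptions about the metric's general shape, not its exact formula.

```python
def pooled_clap(window_sims):
    """Max-pool CLAP audio-text similarity over sliding windows of a
    long audio clip (illustrative reading of the windowing scheme)."""
    return max(window_sims)

def caf_score(clap_sim, lalm_score, alpha=0.5):
    """Illustrative calibration: blend CLAP's coarse semantic alignment
    with an LALM's fine-grained, syntax-aware judgment. The fixed
    weight alpha is an assumption, echoing the review's note that a
    fixed weighting may not suit all audio-caption pairs."""
    return alpha * clap_sim + (1.0 - alpha) * lalm_score
```

A caption that is semantically close but syntactically broken would score high on the CLAP term alone; the LALM term pulls its combined score down, which is the failure mode the metric targets.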
Primary: Sogang University
All Institutions: Sogang University
The main contribution of this paper is the introduction of CAF-Score, a novel reference-free metric for audio captioning evaluation that effectively combines the coarse-grained semantic alignment of CLAP with the fine-grained comprehension and syntactic awareness of LALMs. This work represents a significant step forward in the evaluation of audio captioning systems, addressing key challenges in the field and providing a foundation for future research and development.
The methodology presented in this paper is innovative, combining the strengths of CLAP and LALMs to create a reference-free evaluation metric for audio captioning. The use of a sliding-window approach with max pooling to enhance alignment accuracy is particularly noteworthy, as is the adaptation of the FLEUR metric for audio evaluation. The hybrid design effectively addresses the limitations of both models, allowing for more nuanced assessments of audio-text alignment.
The experiments conducted on the BRACE benchmark are extensive and well-structured, demonstrating the effectiveness of CAF-Score in comparison to both reference-based and existing reference-free metrics. The paper provides a thorough analysis of the performance across multiple models and configurations, showcasing the robustness of the proposed metric in various scenarios, including hallucination detection.
The implementation details are clearly outlined, and the authors provide a GitHub repository with code and results, enhancing the reproducibility of the study. However, the reliance on specific model configurations and the computational overhead of LALMs may pose challenges for some researchers attempting to replicate the results.
The paper acknowledges that the performance of CAF-Score is bounded by the capabilities of the underlying models, and instances of simultaneous misalignment between CLAP and LALMs can lead to failures in evaluation. Additionally, the fixed weighting parameter may not be optimal for all audio-caption pairs, suggesting a need for further exploration of adaptive strategies.
The proposed CAF-Score metric has significant implications for the field of audio captioning, providing a scalable and robust evaluation framework that does not rely on costly ground-truth annotations. This advancement could facilitate the development of more effective audio understanding and captioning systems, ultimately enhancing the accessibility and usability of audio content across various applications.
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
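A Structured Temporal Script is described as a set of captions tied to short temporal segments, with Bi-Frame Sound Synthesis distinguishing in-frame from out-of-frame sources. A minimal sketch of such a structure follows; the field names and the in-frame flag are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass

@dataclass
class ScriptSegment:
    """One entry of a Structured Temporal Script: a caption bound to a
    short temporal segment. Field names are illustrative assumptions."""
    start: float       # segment start, seconds
    end: float         # segment end, seconds
    caption: str       # what should sound during this segment
    in_frame: bool     # bi-frame flag: is the source visible on screen?

def active_captions(sts, t):
    """Return the captions whose segment covers time t; overlapping
    segments model concurrent events."""
    return [s.caption for s in sts if s.start <= t < s.end]
```

Such a script gives the generator explicit timing for events the video alone under-specifies, e.g. an off-screen car marked `in_frame=False`.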
Primary: Zhejiang University
All Institutions: Zhejiang University, The State Key Lab of Brain-Machine Intelligence
FoleyDirector introduces a novel framework for fine-grained temporal control in video-to-audio generation, significantly advancing the state-of-the-art in this domain. The combination of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in FoleyDirector is innovative, particularly with the introduction of Structured Temporal Scripts (STS) and the Script-Guided Temporal Fusion Module. These components allow for fine-grained temporal control in video-to-audio generation, addressing a significant gap in existing methods that struggle with complex audio generation scenarios. The integration of Bi-Frame Sound Synthesis further enhances the capability to manage both in-frame and out-of-frame audio, showcasing a thoughtful approach to improving controllability in audio synthesis. The methodology is well-structured and provides a clear framework for implementation.
The experimental section demonstrates a robust evaluation of the proposed framework. The construction of the DirectorSound dataset and the introduction of evaluation benchmarks (VGGSoundDirector and DirectorBench) are commendable, as they provide necessary resources for training and evaluation. The experiments effectively illustrate the improvements in temporal controllability and audio fidelity, with results that substantiate the claims made in the paper. However, details on the evaluation metrics used and their significance could be elaborated further to enhance clarity.
While the paper outlines the methodology and experiments, it lacks explicit details regarding the implementation and availability of the code or datasets, which could hinder reproducibility. Providing a link to a project repository or supplementary materials would greatly enhance the paper's reproducibility and allow other researchers to build upon this work.
One limitation is the potential complexity in user interaction with the system, as fine-grained control may require a steep learning curve for users unfamiliar with audio synthesis. Additionally, the paper does not address the scalability of the framework in real-world applications or the computational resources required for training and inference.
The advancements made in FoleyDirector have significant implications for various applications, including film production, video game development, and virtual reality, where precise audio generation is critical. By empowering users to act as Foley directors, the framework can enhance the creative process in multimedia content creation, potentially leading to more immersive experiences.
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
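TTSD-eval measures speaker attribution accuracy from forced alignment rather than diarization. One plausible reading is: force-align the synthesized audio against the script, attribute each aligned word to a speaker, and count agreement with the script's speaker tags. The sketch below implements that counting step only; the pairing scheme is an illustrative assumption, not the framework's exact definition.

```python
def speaker_attribution_accuracy(aligned_words):
    """Fraction of force-aligned words attributed to the intended
    speaker. Input: (reference_speaker, predicted_speaker) pairs, one
    per aligned word. Illustrative reading of TTSD-eval's attribution
    metric, not its exact specification."""
    if not aligned_words:
        return 0.0
    correct = sum(ref == hyp for ref, hyp in aligned_words)
    return correct / len(aligned_words)
```

Because the reference pairing comes from the script's explicit speaker tags, no diarization system is needed, avoiding the error compounding the paper attributes to diarization-based metrics.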
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-TTSD presents a significant advancement in spoken dialogue generation, effectively addressing key challenges in the field. The comprehensive evaluation framework and the model's capabilities for long-form synthesis and multi-party interactions mark a notable contribution to the audio processing landscape.
The methodology presented in MOSS-TTSD is robust and well-structured, addressing significant challenges in spoken dialogue generation. The use of a fully discrete speech generation paradigm, combined with a multi-head delay pattern for autoregressive prediction, is innovative. The model's ability to handle long-form synthesis and multi-party dialogue through explicit speaker tagging and zero-shot voice cloning is a notable advancement. The introduction of the TTSD-eval framework for objective evaluation is a significant contribution, as it addresses the limitations of existing metrics that rely on speaker diarization.
The experiments conducted are comprehensive, utilizing both objective and subjective evaluation methods. The paper provides a clear comparison against strong open-source and proprietary baselines, demonstrating the superiority of MOSS-TTSD in terms of speaker consistency and intelligibility. The use of diverse test sets and the detailed description of the evaluation metrics enhance the credibility of the results.
The paper lacks specific URLs for the code and models, which hinders reproducibility. While the methodology is described in detail, the absence of a public repository makes it difficult for other researchers to replicate the results. Providing access to the code and trained models would significantly improve the reproducibility of the findings.
One limitation is the reliance on high-quality training data, which may not be readily available for all languages and scenarios. Additionally, while the model supports multiple languages, the performance across less common languages is not thoroughly evaluated. The potential for biases in the voice cloning process, particularly with limited reference audio, is another area that could be explored further.
The implications of MOSS-TTSD are substantial, particularly in applications such as podcasts, dynamic commentary, and entertainment content. The ability to generate coherent and natural multi-party dialogues opens new avenues for automated content creation and enhances user interaction in various multimedia applications. The model's multilingual capabilities also contribute to its broader applicability in global contexts.
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Gesture2Speech, a multimodal TTS framework that leverages hand gestures to enhance prosody in synthesized speech, showcasing a novel approach to integrating visual cues in speech synthesis. This work represents a significant step forward in the field of expressive speech synthesis, combining advanced machine learning techniques with insights from human communication to create more natural and engaging speech outputs.
The proposed Gesture2Speech framework introduces a novel multimodal TTS architecture that integrates hand gestures as dynamic control signals for prosody modulation in synthesized speech. The use of a Mixture-of-Experts (MoE) architecture to dynamically fuse linguistic and gesture features is innovative, allowing for flexible and context-aware speech synthesis. The introduction of a gesture-speech alignment loss to ensure temporal synchrony between gestures and prosodic contours is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the training process and the specific configurations of the MoE modules.
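As a rough illustration of softmax-gated expert fusion of linguistic and gesture features: the paper's expert count, feature dimensions, and gating network are not given, so everything below is an assumed minimal configuration with linear experts.

```python
import numpy as np

def moe_fuse(ling, gest, experts_W, gate_W):
    """Softmax-gated mixture of linear experts over concatenated features.
    ling: (B, Dl), gest: (B, Dg); experts_W: list of (Dl+Dg, H) matrices;
    gate_W: (Dl+Dg, E) gating matrix. Returns fused (B, H) features."""
    x = np.concatenate([ling, gest], axis=-1)            # (B, D)
    logits = x @ gate_W                                  # (B, E)
    gates = np.exp(logits - logits.max(-1, keepdims=True))
    gates /= gates.sum(-1, keepdims=True)                # per-sample softmax
    expert_out = np.stack([x @ W for W in experts_W])    # (E, B, H)
    return np.einsum("be,ebh->bh", gates, expert_out)    # gate-weighted sum
```

The per-sample gate is what makes the fusion "dynamic": samples with strong gesture cues can route to experts that weight gesture features more heavily.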
The experiments conducted on the PATS dataset demonstrate the effectiveness of the Gesture2Speech framework in improving speech naturalness and gesture-speech synchrony compared to state-of-the-art baselines. The use of both objective metrics (e.g., WER, CER, UTMOS) and subjective evaluations (Mean Opinion Scores) provides a comprehensive assessment of the model's performance. The results indicate that the proposed multimodal approach significantly enhances prosodic expressiveness and alignment, although further exploration of different datasets and real-world applications could strengthen the findings.
The paper provides a clear description of the experimental setup, including the dataset, model configurations, and evaluation metrics, which aids reproducibility. However, the lack of a publicly available code repository limits the ability for others to replicate the results directly. Including implementation details such as hyperparameters and training procedures would further enhance reproducibility.
One notable limitation is the reliance on the PATS dataset, which may not encompass a diverse range of cultural and emotional expressions. Additionally, the framework's performance in real-world scenarios, where full-body visibility or high-resolution hand tracking may not be feasible, remains uncertain. The paper also does not address potential computational overhead associated with the MoE architecture, which could impact deployment in resource-constrained environments.
The Gesture2Speech framework has significant implications for applications in areas such as virtual assistants, dubbing, and interactive storytelling, where expressive speech synthesis is crucial. By incorporating hand gestures into TTS systems, the research paves the way for more natural and engaging human-computer interactions. Furthermore, the findings could inspire future research into multimodal communication and the integration of additional non-verbal cues.
Puns represent a typical linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Beyond text and images, audio plays a central role in human communication, yet datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.
Primary: University of Auckland
All Institutions: University of Auckland
This paper introduces APUN-Bench, a pioneering benchmark for evaluating audio pun understanding in large audio language models, significantly advancing the field of multimodal language processing. The comprehensive methodology and rigorous experimental evaluation highlight the challenges faced by current models, providing actionable insights for future research.
The paper presents a novel approach to benchmarking audio pun understanding through the creation of APUN-Bench, which includes a comprehensive dataset of 4,434 audio samples annotated across three distinct stages: pun recognition, pun word location, and pun meaning inference. The methodology is robust, utilizing both synthetic and real-world data, and incorporates human verification to ensure data quality. The multi-stage evaluation framework is well-structured and addresses a significant gap in the understanding of audio puns, making it a valuable contribution to the field.
The experiments conducted on 10 state-of-the-art large audio language models (LALMs) provide a thorough analysis of their performance across the three evaluation stages. The results reveal substantial performance gaps, particularly in pun word location and meaning inference, highlighting the limitations of current models. The use of statistical tests to validate findings adds rigor to the experimental evaluation.
The paper provides sufficient detail regarding the dataset construction, evaluation metrics, and model configurations, which facilitates reproducibility. However, the lack of publicly available URLs for the dataset and models limits the ease of access for other researchers.
The study acknowledges several limitations, including the restricted scope of pun types examined, the focus on single-sentence instances rather than multi-turn dialogues, and the limited size of the real-world corpus. These factors may constrain the generalizability of the findings.
The research has significant implications for advancing audio understanding in natural language processing, particularly in applications related to humor, education, and voice assistants. By addressing the complexities of audio puns, the work paves the way for more sophisticated models that can better understand and generate humor in spoken language.
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of FLAC, a novel probabilistic approach to few-shot acoustic synthesis that effectively captures the uncertainty of scene acoustics, establishing a new direction for data-efficient audio generation. The paper's methodology, experimental results, and potential applications highlight its significance in advancing the field of machine learning for audio synthesis.
The paper introduces a novel approach to few-shot acoustic synthesis using a probabilistic framework that leverages flow-matching and diffusion transformers. This methodology is significant as it addresses the limitations of deterministic models by capturing the uncertainty in room impulse responses (RIRs) under sparse context. The integration of multimodal cues (spatial, geometric, and acoustic) enhances the generation process, making it more adaptable to novel environments. The use of a flow-matching objective is a fresh perspective in the domain of acoustic synthesis, providing a robust foundation for future research.
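The flow-matching objective itself is standard and fits in a few lines. The sketch below uses the usual linear noise-to-data path with a constant velocity target; FLAC's diffusion transformer and its spatial/geometric/acoustic conditioning are not reproduced, and `model` is a stand-in for any velocity predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, cond):
    """Conditional flow matching (sketch): sample noise x0 and a time t,
    form the linear interpolant x_t = (1 - t) x0 + t x1, and regress the
    model's output onto the path's constant velocity x1 - x0.
    x1: (B, D) batch of target signals (e.g. RIRs); cond: conditioning."""
    x0 = rng.standard_normal(x1.shape)          # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))      # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                          # velocity of the linear path
    v_pred = model(xt, t, cond)
    return np.mean((v_pred - v_target) ** 2)
```

At inference, integrating the learned velocity field from noise to `t = 1` yields a sample, which is what makes the method probabilistic: different noise draws give different plausible RIRs for the same sparse context.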
The authors conducted experiments using two datasets, AcousticRooms and Hearing Anything Anywhere, demonstrating that their method outperforms existing eight-shot baselines given only a single context measurement (one-shot). The introduction of AGREE, a joint acoustic-geometry embedding for evaluation, is a valuable contribution that allows for a more nuanced assessment of generated RIRs. The results are compelling, showcasing the effectiveness of FLAC in generating acoustically consistent outputs, although further details on the statistical significance of the results would strengthen the claims.
The paper lacks detailed implementation specifics, which are crucial for reproducibility. While the methodology is well-articulated, the absence of code or supplementary materials limits the ability of other researchers to replicate the findings. Including a GitHub repository or supplementary materials would significantly enhance the reproducibility of the work.
One limitation is the reliance on specific datasets, which may not generalize across all acoustic environments. Additionally, the method's performance in highly complex or dynamic scenes remains untested. The paper could also benefit from a more thorough exploration of the computational efficiency of the proposed method, especially in real-time applications.
This research has the potential to significantly impact fields such as virtual reality, gaming, and architectural acoustics by enabling more realistic sound generation in immersive environments. The ability to synthesize acoustic responses with minimal data requirements can lead to broader applications in audio engineering and sound design, making it easier to create immersive experiences without extensive recording setups.
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.
Primary: Concordia University
All Institutions: Concordia University, Mila-Quebec AI Institute
The main contribution of this paper is the introduction of a reinforcement learning-based timestamp grounding strategy for large audio-language models, which enhances reasoning accuracy and model interpretability in multimodal tasks. This work represents a meaningful advancement in the integration of temporal awareness in audio processing, addressing a critical gap in existing methodologies and paving the way for future research in this domain.
The paper introduces a reinforcement learning-based strategy for timestamp grounding in large audio-language models, which is a novel approach in the context of multimodal reasoning. The methodology is well-structured, detailing how the model utilizes explicit timestamp annotations to enhance reasoning outputs. The integration of grounding mechanisms is a significant contribution, as it addresses a gap in existing models that often lack temporal awareness. The proposed method is theoretically sound and builds upon existing frameworks, yet it also innovatively extends them by incorporating timestamp grounding, which is a fresh perspective in the field.
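The paper's reward design is not spelled out in this review, but an RL reward that couples answer correctness with well-formed timestamp citations might look like the following hypothetical sketch (the span format, bonus value, and validity rule are all assumptions):

```python
def grounding_reward(pred_spans, duration, answer_correct):
    """Hypothetical reward for timestamp-grounded reasoning: base reward
    for a correct answer, plus a bonus when every cited (start, end) span
    is well-ordered and lies inside the audio's duration (seconds)."""
    valid = all(0.0 <= s < e <= duration for s, e in pred_spans)
    bonus = 0.5 if (valid and pred_spans) else 0.0
    return (1.0 if answer_correct else 0.0) + bonus
```

A reward of this shape pressures the policy to emit timestamps at all (empty citation lists earn no bonus) and to keep them physically plausible, which is consistent with the paper's finding that grounding shifts attention toward audio tokens.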
The experiments are comprehensive, utilizing four benchmark datasets that are relevant to the task of speech-based reasoning. The results demonstrate a clear improvement over both zero-shot reasoning and fine-tuning approaches without timestamp grounding. The evaluation metrics used are appropriate, and the authors provide a thorough analysis of the model's performance across different scenarios. However, the paper could benefit from additional qualitative assessments to complement the quantitative results, such as user studies or case analyses.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training duration, and hardware specifications, which are crucial for reproducibility. While the methodology is clearly described, the absence of a code repository or supplementary materials limits the ability of other researchers to replicate the findings. Including such details would significantly enhance the paper's impact and utility.
One notable limitation is the reliance on timestamp annotations, which may not be universally applicable across all audio tasks. Additionally, the paper does not address potential scalability issues when applying the proposed method to larger datasets or more complex audio scenarios. The authors also do not discuss the computational overhead introduced by the reinforcement learning component, which could be a concern in real-time applications.
The proposed approach has the potential to significantly advance the field of multimodal reasoning and audio processing. By grounding reasoning in temporal audio segments, it opens avenues for applications in areas such as automated transcription, audio-visual content analysis, and interactive voice response systems. The implications for improving model interpretability and reliability in audio tasks are substantial, making this research relevant for both academic and industrial applications.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, Johns Hopkins University
The main contribution of this paper is the introduction of a novel algorithm for the single-pass alignment of multi-talker recordings using shuffle products and partial order FSAs. This work represents a significant advancement in the field of speech processing, particularly in addressing the challenges posed by overlapped speech, and has the potential to influence future research and applications in audio processing.
The methodology presented in this paper is innovative in its application of shuffle products and partial order finite-state automata (FSAs) for modeling overlapped speech. The authors effectively leverage these mathematical constructs to create a framework for alignment and transcription of multi-talker recordings. The approach of using (token, speaker) tuples for speaker attribution is particularly noteworthy, as it directly addresses a significant challenge in the field of speech processing. The imposition of temporal constraints to reduce graph size is a practical consideration that enhances the efficiency of the proposed method.
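The shuffle product at the heart of the method enumerates all order-preserving interleavings of two sequences. A naive recursive sketch makes the combinatorics concrete; the paper instead encodes this set compactly as an FSA in k2 (and prunes it with temporal constraints) rather than materializing the list:

```python
def shuffle(a, b):
    """All interleavings of sequences a and b that preserve each
    sequence's internal order (the shuffle product). Exponential in
    general: there are C(len(a)+len(b), len(a)) interleavings."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return [[a[0]] + s for s in shuffle(a[1:], b)] + \
           [[b[0]] + s for s in shuffle(a, b[1:])]
```

Marginalizing a loss over this set, as the paper does via total scores on the shuffle-product FSA, amounts to summing over every possible serialization of the two speakers' token streams.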
The experiments conducted on synthetic LibriSpeech overlaps provide a solid basis for evaluating the proposed methods. The paper compares the performance of the shuffle product FSA against traditional methods, demonstrating a clear advantage in terms of alignment accuracy. However, the reliance on synthetic data may limit the generalizability of the results to real-world scenarios. The metrics used for evaluation are appropriate, but further validation on diverse datasets would strengthen the findings.
The paper mentions that all algorithms are implemented using k2 / Icefall, which is a positive aspect for reproducibility. However, the lack of a publicly available code repository or detailed implementation instructions may hinder other researchers from replicating the results. Providing a GitHub repository or similar resource would greatly enhance the reproducibility of the work.
One limitation of the study is the use of synthetic data for training and evaluation, which may not fully capture the complexities of real-world overlapped speech scenarios. Additionally, while the proposed method shows promise, the paper does not provide extensive comparisons with other state-of-the-art techniques, which could have offered more context regarding its performance.
The ability to accurately transcribe and attribute overlapped speech has significant implications for various applications, including automated transcription services, assistive technologies for the hearing impaired, and improvements in human-computer interaction. The proposed method could pave the way for advancements in multi-talker speech recognition systems, making them more robust and effective.
Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or losses. However, they overlook the fact that each modality has finite information capacity. In this work, we propose IIBalance, a multimodal learning framework that aligns the modality contributions with Intrinsic Information Budgets (IIB). We propose a task-grounded estimator of each modality's IIB, transforming its capacity into a global prior over modality contributions. Anchored by the highest-budget modality, we design a prototype-based relative alignment mechanism that corrects semantic drift only when weaker modalities deviate from their budgeted potential, rather than forcing imitation. During inference, we propose a probabilistic gating module that integrates the global budgets with sample-level uncertainty to generate calibrated fusion weights. Experiments on three representative benchmarks demonstrate that IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues. Our code is available at: https://github.com/XiongZechang/IIBalance.
Primary: Alibaba Group
All Institutions: Alibaba Group, Beijing Jiaotong University
The main contribution of this paper is the introduction of IIBalance, a multimodal learning framework that utilizes Intrinsic Information Budgets to optimize modality contributions, leading to improved performance in scenarios with imbalanced modalities. This work significantly advances the understanding of modality interplay in multimodal systems and offers a practical solution to a common challenge in the field.
The paper introduces a novel framework, IIBalance, that addresses the issue of modality dominance in multimodal learning by proposing the concept of Intrinsic Information Budgets (IIB). This approach emphasizes the importance of recognizing each modality's information capacity and adapting their contributions accordingly. The methodology is well-structured, with a clear two-stage process that includes prototype-guided relative alignment and uncertainty-aware Bayesian fusion. The use of a dataset-level prior for modality contributions is particularly innovative, allowing for a more nuanced understanding of how different modalities should contribute based on their intrinsic capabilities.
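One way to picture the inference-time gating is a fusion weight that combines each modality's global budget with its per-sample uncertainty. The sketch below is a hypothetical simplification (the paper's actual gating is a learned probabilistic module; the discounting rule and names here are assumptions):

```python
import numpy as np

def budgeted_fuse(logits_per_mod, budgets, uncertainties):
    """Hypothetical calibrated fusion: each modality's weight is its global
    information budget discounted by its per-sample uncertainty, then
    renormalized; fused logits are the weighted sum."""
    w = np.array(budgets) / (1.0 + np.array(uncertainties))
    w = w / w.sum()
    return sum(wi * li for wi, li in zip(w, logits_per_mod))
```

The key property this preserves from the paper is that a high-budget modality keeps a larger prior share of the fusion weight, but an uncertain sample from it can still cede influence to weaker modalities.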
The experimental validation is robust, employing three representative benchmarks (Kinetics-Sounds, CREMA-D, and AVE) to demonstrate the effectiveness of IIBalance. The results indicate consistent improvements over state-of-the-art methods, showcasing not only higher overall accuracy but also better performance in weaker modalities. The paper provides a thorough analysis of the contributions of various components of the proposed method, reinforcing the value of the IIB concept and its implementation.
The paper includes sufficient implementation details, such as training procedures, model architectures, and hyperparameter settings, which facilitate reproducibility. The authors have also made their code publicly available, further enhancing the potential for others to replicate and build upon their work.
While the proposed method shows promising results, the paper does not extensively discuss the scalability of the approach to more complex multimodal scenarios or its performance in real-world applications. Additionally, the reliance on a fixed IIB prior during training may limit adaptability in dynamic environments where modality reliability can change rapidly.
The implications of this work extend to various applications in audio-visual recognition, human-computer interaction, and any domain where multimodal data is prevalent. By improving how models leverage complementary information from different modalities, this research could enhance the robustness and accuracy of systems in fields such as robotics, surveillance, and multimedia content analysis.
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
The main contribution of this paper is the introduction of MOSS-TTS, a scalable speech generation model that emphasizes control and efficiency in audio synthesis. The technical contribution is significant, addressing current limitations in speech generation while providing a flexible framework for future research and application in the audio domain.
The methodology presented in MOSS-TTS is well-structured, leveraging a causal Transformer tokenizer and autoregressive modeling to achieve efficient speech generation. The introduction of MOSS-Audio-Tokenizer for compressing audio and the dual generator architecture (MOSS-TTS and MOSS-TTS-Local-Transformer) demonstrates a thoughtful approach to scalability and control in speech synthesis. The focus on zero-shot voice cloning and token-level control adds significant value to the framework, indicating a robust understanding of current challenges in the field.
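Residual vector quantization, the mechanism behind MOSS-Audio-Tokenizer's variable-bitrate scheme, can be sketched as greedy nearest-neighbor stages over successive residuals (codebook sizes, dimensions, and training are not specified in the review and are assumptions here):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ (sketch): stage i quantizes the residual left by stage
    i-1 against its own codebook. Variable bitrate falls out naturally:
    decoding any prefix of the codes gives a coarser reconstruction.
    x: (D,) frame embedding; codebooks: list of (K, D) arrays."""
    residual = x.copy()
    codes = []
    for cb in codebooks:
        d = ((residual[None, :] - cb) ** 2).sum(-1)  # distance to each entry
        idx = int(d.argmin())
        codes.append(idx)
        residual = residual - cb[idx]                # pass remainder onward
    return codes, x - residual                       # codes, reconstruction
```

Because each stage only refines what earlier stages left over, dropping the last stages degrades quality gracefully rather than catastrophically, which is what makes a single tokenizer usable at multiple bitrates.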
The paper outlines a comprehensive evaluation across multilingual and open-domain settings, which is essential for demonstrating the model's versatility. However, the lack of detailed quantitative results or comparisons with existing state-of-the-art models limits the assessment of its performance. The empirical characteristics are mentioned, but more rigorous benchmarking against established metrics would strengthen the claims of superiority.
While the paper provides a clear design and training recipe, it lacks specific implementation details and code availability, which are critical for reproducibility. The absence of a project URL further complicates the ability for other researchers to replicate the results or build upon this work.
The paper does not adequately address potential limitations of the MOSS-TTS framework, such as the computational resources required for training and inference or the potential biases in voice cloning across different demographics. Additionally, the evaluation metrics used for assessing model performance could be more thoroughly discussed.
MOSS-TTS has the potential to significantly impact various applications, including virtual assistants, content creation, and accessibility tools for individuals with speech impairments. The ability to perform zero-shot voice cloning and nuanced control over speech generation could lead to more personalized and engaging user experiences.
Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin on the challenge's reasoning quality metric.
Primary: Tallinn University of Technology
All Institutions: Tallinn University of Technology
The paper presents a novel multi-source evidence fusion approach for audio question answering, achieving top performance in reasoning quality while addressing the challenges of reliability and transparency in LALMs. The comprehensive methodology and strong experimental results contribute significantly to the field of audio understanding and reasoning, paving the way for future advancements in multimodal AI systems.
The paper presents a robust multi-source ensemble pipeline that effectively combines two large audio language models (LALMs) with a tiered reliability framework for acoustic tools. The methodology emphasizes dual-source evidence fusion and a structured contradiction detection mechanism, which enhances the reasoning quality of the system. The approach of grounding in reliability-tagged evidence is innovative and addresses the common issue of hallucination in LALMs, making the reasoning process more transparent and verifiable.
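The tiered-evidence idea can be made concrete with a small sketch. The tier names, weights, and resolution rule below are illustrative assumptions, not the system's actual implementation; the point is that each tool observation carries a reliability weight, and conflicting claims are flagged as contradictions rather than silently merged:

```python
# Hypothetical reliability tiers and weights (illustrative only).
TIER_WEIGHT = {"high": 1.0, "medium": 0.6, "low": 0.3}

def fuse(observations):
    """Fuse (claim, value, tier) tuples from multiple acoustic tools.

    For each claim, the highest-reliability value wins, and any
    disagreement between sources sets a contradiction flag so a
    downstream reasoner can inspect the conflicting evidence.
    """
    fused = {}
    for claim, value, tier in observations:
        w = TIER_WEIGHT[tier]
        if claim not in fused:
            fused[claim] = {"value": value, "weight": w, "contradiction": False}
            continue
        entry = fused[claim]
        if value != entry["value"]:
            entry["contradiction"] = True      # conflicting evidence detected
            if w > entry["weight"]:            # higher-reliability source wins
                entry["value"], entry["weight"] = value, w
        else:
            entry["weight"] = max(entry["weight"], w)
    return fused
```

A low-tier tool claiming two speakers and a high-tier tool claiming three would resolve to three speakers, with the contradiction recorded for the text-only reasoning model to cross-check.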
The evaluation is conducted on the Interspeech 2026 Audio Reasoning Challenge dataset, which is comprehensive and includes a diverse range of audio scenarios. The reported results demonstrate a strong performance, with the system achieving the highest reasoning quality score and competitive accuracy. Ablation studies provide statistical significance to the improvements gained from the dual-source evidence fusion, reinforcing the effectiveness of the proposed methodology.
The paper provides detailed implementation details, including the models and tools used, which enhances reproducibility. However, the reliance on empirical tuning of reliability weights and confidence caps without a data-driven approach may pose challenges for complete reproducibility in other contexts.
The system's end-to-end latency of 8-10 minutes per sample limits its applicability in real-time scenarios. Additionally, while the architecture is well-suited for the challenge, its generalizability to other reasoning tasks remains to be fully validated. The empirical tuning of parameters may also restrict the adaptability of the system to different datasets or tasks.
The proposed system has significant implications for audio understanding and reasoning, particularly in applications such as automated audio analysis, content moderation, and interactive audio systems. By improving the transparency and reliability of audio question answering, it opens avenues for more trustworthy AI applications in various domains, including education, entertainment, and accessibility.
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Quebec Artificial Intelligence Institute, Université de Montréal
The main contribution of this paper is the introduction of FLAIR, a framework that enables full-duplex spoken dialogue models to perform latent reasoning concurrently with speech perception, enhancing response quality and conversational dynamics. This innovative approach addresses a critical gap in current dialogue systems, allowing for more human-like interactions and setting a foundation for future research in the field.
The proposed methodology, FLAIR, introduces a novel approach to full-duplex spoken dialogue systems by integrating latent reasoning during the listening phase. This is achieved through an Evidence Lower Bound (ELBO)-based objective that allows for efficient supervised fine-tuning without requiring explicit reasoning annotations. The use of a Global-aware Expert model to derive latent embeddings is innovative, as it leverages the entire dialogue context to enhance response generation. The recursive feeding of latent embeddings during user speech is a significant departure from traditional autoregressive models, allowing for continuous reasoning without latency.
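The recursive latent-feeding pattern can be sketched in miniature. The blending update below is a toy stand-in for the model's actual transformation; it only illustrates how each listening step consumes the previous step's latent output, so reasoning stays strictly causal and adds no decoding latency:

```python
def latent_step(prev_latent, frame_feature, mix=0.9):
    """One 'think-while-listening' step: fold the incoming audio frame's
    feature into the running latent state (toy linear blend)."""
    return [mix * p + (1.0 - mix) * f for p, f in zip(prev_latent, frame_feature)]

def listen(frames, dim=4):
    """Recursively feed each step's latent output into the next step,
    updating the internal state while the user is still speaking."""
    latent = [0.0] * dim
    for frame in frames:
        latent = latent_step(latent, frame)
    return latent
```

Because the update at step t depends only on steps up to t, the latent "thought" is ready the moment the user stops speaking, which is the property FLAIR exploits.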
The experimental results demonstrate the effectiveness of FLAIR across multiple benchmarks, showcasing improvements in response quality and conversational dynamics. The paper provides a comprehensive evaluation on various tasks, including factual knowledge and multi-turn question answering, and compares FLAIR against existing models. The results indicate that FLAIR achieves competitive performance, particularly in scenarios requiring reasoning, which underscores its practical applicability in real-world dialogue systems.
The paper outlines the training process and architecture in detail, including the data generation methods and the training pipeline. However, the lack of a publicly available code repository or demo limits reproducibility, as external researchers cannot easily implement or test the proposed model.
One limitation is the reliance on synthetic data for training, which may not fully capture the nuances of real-world conversational interactions. Additionally, the paper does not address potential biases in the generated datasets or the implications of using large-scale models in diverse applications. The absence of a demo or project URL also hinders practical engagement with the work.
The advancements presented in this paper have the potential to significantly enhance human-computer interaction, making conversational agents more responsive and capable of handling complex dialogue scenarios. This could lead to more natural and efficient communication in various applications, including customer service, virtual assistants, and educational tools.
Reliable Sound Source Localization (SSL) plays an essential role in many downstream tasks, where informed decision making depends not only on accurate localization but also on the confidence in each estimate. This need for reliability becomes even more pronounced in challenging conditions, such as reverberant environments and multi-source scenarios. However, existing SSL methods typically provide only point estimates, offering limited or no Uncertainty Quantification (UQ). We leverage the Conformal Prediction (CP) framework and its extensions for controlling general risk functions to develop two complementary UQ approaches for SSL. The first assumes that the number of active sources is known and constructs prediction regions that cover the true source locations. The second addresses the more challenging setting where the source count is unknown, first reliably estimating the number of active sources and then forming corresponding prediction regions. We evaluate the proposed methods on extensive simulations and real-world recordings across varying reverberation levels and source configurations. Results demonstrate reliable finite-sample guarantees and consistent performance for both known and unknown source-count scenarios, highlighting the practical utility of the proposed frameworks for uncertainty-aware SSL.
Primary: Tel-Aviv University
All Institutions: Tel-Aviv University
The main contribution of this paper is the development of a robust framework for uncertainty quantification in multi-speaker sound source localization, leveraging Conformal Prediction methods to provide reliable prediction regions and risk control. This work significantly advances the field by addressing the critical need for confidence measures in SSL, enabling more informed decision-making in complex acoustic environments.
The paper presents a novel approach to uncertainty quantification (UQ) in multi-speaker sound source localization (SSL) using the Conformal Prediction (CP) framework. It introduces two complementary methods: one for known source counts that constructs prediction regions around estimates, and another for unknown source counts that estimates the number of sources while forming corresponding prediction regions. The methodology is well-grounded in statistical theory and effectively addresses the limitations of existing SSL methods that typically provide only point estimates without quantifying uncertainty. The integration of risk control into the UQ framework is a significant advancement, allowing for more informed decision-making in practical applications.
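The split-conformal construction underlying both methods is simple to state: compute nonconformity scores (for instance, the angular error of direction-of-arrival estimates) on a held-out calibration set, take the finite-sample-corrected quantile, and include every candidate location whose score falls below it. A minimal sketch, where the score function and candidate grid are placeholders rather than the paper's actual likelihood maps:

```python
import math

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal quantile with the finite-sample correction:
    the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_region(candidate_angles, score_fn, q_hat):
    """All candidate source directions whose nonconformity score
    falls at or below the calibrated threshold."""
    return [a for a in candidate_angles if score_fn(a) <= q_hat]
```

With exchangeable calibration and test data, the resulting region contains the true source location with probability at least 1 - alpha, which is the finite-sample guarantee the paper reports.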
The experimental evaluation is comprehensive, utilizing both simulated environments and real-world recordings to assess the proposed methods under varying conditions of reverberation and source configurations. The results demonstrate the effectiveness of the proposed frameworks, showing reliable finite-sample guarantees and consistent performance across different scenarios. The use of both classical and deep learning-based likelihood maps strengthens the validity of the findings. However, the paper could benefit from more detailed comparisons with existing state-of-the-art methods to contextualize the improvements achieved.
The authors provide a GitHub repository with the code, which enhances reproducibility. The detailed description of the experimental setup, including datasets and calibration processes, allows for other researchers to replicate the experiments. However, the paper could improve by including more specific instructions for running the code and potentially providing a demo or example outputs.
While the proposed methods show promise, there are limitations regarding the reliance on the accuracy of the likelihood maps generated by the underlying SSL methods. The performance may degrade in highly complex acoustic environments or with significant noise interference. Additionally, the paper does not address the computational efficiency of the proposed methods, which could be a concern for real-time applications.
The research has significant implications for various applications in audio signal processing, robotics, and human-computer interaction, where reliable sound source localization is critical. By providing a framework for uncertainty-aware SSL, this work could enhance the robustness of systems that rely on accurate localization, such as autonomous vehicles and assistive technologies for the hearing impaired. The integration of UQ into SSL methods could also pave the way for more advanced applications in augmented reality and immersive audio experiences.
Modern audio is created by mixing stems from different sources, raising the question: can we independently watermark each stem and recover all watermarks after separation? We study a separation-first, multi-stream watermarking framework: embedding distinct information into stems using unique keys but a shared structure, then mixing, separating, and decoding from each output. A naive pipeline (robust watermarking + off-the-shelf separation) yields poor bit recovery, showing that robustness to generic distortions does not ensure robustness to separation artifacts. To address this, we jointly train the watermark system and the separator in an end-to-end manner, encouraging the separator to preserve watermark cues while adapting embedding to separation-specific distortions. Experiments on speech+music and vocal+accompaniment mixtures show substantial gains in post-separation recovery while maintaining perceptual quality.
Primary: Duke Kunshan University
All Institutions: Duke Kunshan University, The Chinese University of Hong Kong
The paper presents a novel approach to multi-stream audio watermarking that effectively addresses the challenges posed by source separation. By jointly training the watermarking and separation systems, the authors demonstrate substantial improvements in watermark recovery while maintaining audio quality, marking a significant contribution to the field of audio processing and copyright protection.
The paper introduces a novel separation-first, multi-stream audio watermarking framework that jointly trains a watermarking system and a source separator in an end-to-end manner. This approach addresses the challenge of preserving watermark cues during the separation process, which is often overlooked in traditional watermarking methods. The methodology is well-structured, with a clear problem setup and a detailed description of the joint training pipeline, including the use of a key-conditioned Conformer architecture for watermarking and the Demucs separator for audio separation. The approach is innovative in its integration of watermarking and separation, which is a significant advancement in the field.
The experiments are comprehensive, utilizing multiple datasets and evaluating the performance of the proposed method against several baselines. The results demonstrate substantial improvements in post-separation watermark recovery, with a significant reduction in bit error rates compared to existing methods. The evaluation metrics used, including average bit error rate and perceptual quality measures (e.g., SNR and ViSQOL), provide a robust assessment of the method's effectiveness. The experiments also highlight the importance of joint training in enhancing both watermark robustness and separation integrity.
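The headline recovery metric, bit error rate over the decoded payload, is straightforward to compute; a minimal sketch with illustrative bit lists:

```python
def bit_error_rate(embedded_bits, decoded_bits):
    """Fraction of payload bits flipped between embedding into a stem
    and decoding after mixing and separation."""
    assert len(embedded_bits) == len(decoded_bits)
    errors = sum(e != d for e, d in zip(embedded_bits, decoded_bits))
    return errors / len(embedded_bits)
```

A BER near 0.5 means the decoded payload is no better than random guessing, which is roughly what the naive robust-watermark-plus-separator pipeline produces; joint training pushes it toward zero.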
The paper provides sufficient implementation details, including the architecture of the watermarking system and the separation network, as well as the training setup and loss functions. However, the reproducibility could be improved by providing access to the code and detailed instructions for replicating the experiments. The mention of hardware specifications and training duration is helpful, but a public repository would enhance transparency.
One limitation is that the framework is currently limited to two-stem mixtures, which may restrict its applicability in more complex audio scenarios. Additionally, while the joint training approach improves robustness, it may introduce trade-offs in terms of the imperceptibility of the watermark, as indicated by the results showing that separation-aware models do not outperform single-carrier baselines in direct encoding/decoding settings.
The proposed method has significant implications for copyright protection and content authenticity in the age of AI-generated audio. As audio content becomes increasingly mixed and generated from multiple sources, the ability to independently watermark and recover information from different stems is crucial. This research could pave the way for more secure and reliable audio watermarking techniques, potentially influencing industry standards in digital rights management.
The rapid proliferation of AI-Generated Content (AIGC) has necessitated robust metrics for perceptual quality assessment. However, automatic Mean Opinion Score (MOS) prediction models are often compromised by data scarcity, predisposing them to learn spurious correlations, such as dataset-specific acoustic signatures, rather than generalized quality features. To address this, we leverage domain adversarial training (DAT) to disentangle true quality perception from these nuisance factors. Unlike prior works that rely on static domain priors, we systematically investigate domain definition strategies ranging from explicit metadata-driven labels to implicit data-driven clusters. Our findings reveal that there is no "one-size-fits-all" domain definition; instead, the optimal strategy is highly dependent on the specific MOS aspect being evaluated. Experimental results demonstrate that our aspect-specific domain strategy effectively mitigates acoustic biases, significantly improving correlation with human ratings and achieving superior generalization on unseen generative scenarios.
Primary: National Taiwan Normal University
All Institutions: National Taiwan Normal University, Academia Sinica, E.SUN Financial Holding Co., Ltd., United Link Co., Ltd.
The paper presents a novel approach to audio quality assessment by leveraging domain adversarial training to disentangle quality perception from spurious correlations, significantly enhancing the reliability of automatic MOS prediction models. The comprehensive methodology and rigorous experimental validation contribute to its significance in the field, addressing a pressing challenge in evaluating AI-generated audio content.
The paper introduces a robust framework for Mean Opinion Score (MOS) prediction using Domain Adversarial Training (DAT) to mitigate spurious correlations in audio quality assessment. The methodology is well-structured, employing three distinct domain definition strategies: explicit metadata-driven labels, implicit K-means clustering, and random assignment, which are systematically analyzed for their effectiveness. The use of a pre-trained SSL feature extractor and a MultiGauss backbone for quality prediction adds depth to the approach, ensuring that the model captures intrinsic quality features while remaining invariant to domain-specific biases.
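The implicit, data-driven domain strategy can be illustrated with a tiny 1-D k-means. Real systems cluster high-dimensional SSL embeddings rather than scalars, so this is purely a sketch of how domain labels would be assigned without metadata:

```python
def kmeans_1d(values, k=2, iters=50):
    """Toy 1-D k-means: assign each value a data-driven 'domain' label.
    Initialization spreads the starting centers across the sorted data."""
    centers = sorted(values)[:: max(1, len(values) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            idx = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    labels = [min(range(k), key=lambda i: abs(v - centers[i])) for v in values]
    return labels, centers
```

The resulting cluster indices play the role of domain labels for the adversarial branch, which is then trained to make the quality features uninformative about them.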
The experiments are comprehensive, utilizing the AES-Natural dataset with a well-defined split protocol for training, validation, and evaluation. The results demonstrate significant improvements in correlation with human ratings across various aspects of audio quality, showcasing the effectiveness of the proposed domain strategies. The statistical significance of the results is validated through rigorous testing, including t-tests, which strengthens the findings.
The paper provides sufficient details regarding the model architecture, training setup, and evaluation metrics, which facilitates reproducibility. The inclusion of a GitHub repository for code access further enhances the potential for others to replicate the study.
While the study presents a robust framework, it may be limited by the dataset used, which could affect the generalizability of the findings. Additionally, the reliance on specific domain definitions may not universally apply to all audio quality assessment scenarios, suggesting a need for further exploration of domain strategies across diverse datasets.
The implications of this research are significant, as it addresses a critical challenge in the evaluation of AI-generated audio content, which is increasingly relevant in the context of content creation and multimedia applications. The findings could influence the development of more reliable audio quality assessment tools, potentially impacting industries such as entertainment, broadcasting, and AI content generation.
Human listeners exhibit the remarkable ability to segregate a desired sound from complex acoustic scenes through selective auditory attention, motivating the study of Targeted Sound Detection (TSD). The task requires detecting and localizing a target sound in a mixture when a reference audio of that sound is provided. Prior approaches rely on generating a sound-discriminative conditional embedding vector for the reference and pairing it with a mixture encoder, jointly optimized via multi-task learning. In this work, we propose a unified encoder architecture that processes both the reference and mixture audio within a shared representation space, promoting stronger alignment while reducing architectural complexity. This design choice not only simplifies the overall framework but also enhances generalization to unseen classes. Following the multi-task training paradigm, our method achieves substantial improvements over prior approaches, surpassing existing methods and establishing a new state-of-the-art benchmark for targeted sound detection, with a segment-level F1 score of 83.15% and an overall accuracy of 95.17% on the URBAN-SED dataset.
Primary: Indian Institute of Technology Hyderabad
All Institutions: Indian Institute of Technology Hyderabad
The paper presents a unified encoder framework for reference-guided targeted sound detection, achieving state-of-the-art performance and demonstrating robustness in real-world applications. The methodology and results contribute meaningfully to the field of audio machine learning, particularly in enhancing sound event detection capabilities.
The proposed methodology introduces a unified encoder architecture that processes both reference and mixture audio in a shared representation space. This approach reduces architectural complexity and enhances feature alignment, which is a significant improvement over previous dual-branch designs. The methodology is well-structured, leveraging ConvNeXt for representation extraction and employing diverse fusion strategies, which are systematically evaluated. The multi-task learning paradigm further strengthens the model's performance by combining clip-level classification with frame-level detection, showcasing a comprehensive understanding of the task requirements.
The experimental evaluation is robust, utilizing well-defined datasets (URBAN-SED and UrbanSound8K) and establishing new benchmarks for performance metrics, particularly segment-level F1 scores. The results demonstrate substantial improvements over prior methods, indicating the effectiveness of the proposed approach. The evaluation also includes cross-domain generalization tests, which add depth to the findings and confirm the model's resilience to distributional shifts.
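Segment-level F1 itself is easy to pin down: time is divided into fixed-length segments, each marked active or inactive for the target sound, and F1 is computed over those binary decisions. A minimal sketch (the binary segment labels are illustrative):

```python
def segment_f1(reference, prediction):
    """Segment-level F1: each entry is 1 if the target sound is active
    in that segment, 0 otherwise."""
    tp = sum(r == 1 and p == 1 for r, p in zip(reference, prediction))
    fp = sum(r == 0 and p == 1 for r, p in zip(reference, prediction))
    fn = sum(r == 1 and p == 0 for r, p in zip(reference, prediction))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because it scores coarse segments rather than exact onset/offset boundaries, this metric tolerates small timing errors while still penalizing missed and spurious detections.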
The paper provides sufficient implementation details, including architecture specifications, training configurations, and data augmentation strategies, which facilitate reproducibility. However, the absence of a public code repository or demo URL limits the ease of reproduction for external researchers.
One identified limitation is the reliance on specific datasets, which may not fully represent the diversity of real-world acoustic environments. Additionally, while the model shows strong performance on the benchmark datasets, its effectiveness in more complex or noisy real-world scenarios remains to be thoroughly validated.
The research has significant implications for various applications, including surveillance, multimedia retrieval, and smart assistants, where targeted sound detection is crucial. The ability to generalize to unseen classes enhances the model's applicability in real-world scenarios, potentially leading to advancements in audio processing technologies.
Multimodal generative models have shown remarkable progress in single-modality video and audio synthesis, yet truly joint audio-video generation remains an open challenge. In this paper, I make four key contributions to advance this field. First, I release two high-quality, paired audio-video datasets: 13 hours of video-game clips and 64 hours of concert performances, each segmented into consistent 34-second samples to facilitate reproducible research. Second, I train the MM-Diffusion architecture from scratch on these datasets, demonstrating its ability to produce semantically coherent audio-video pairs and quantitatively evaluating alignment on rapid actions and musical cues. Third, I investigate joint latent diffusion by leveraging pretrained video and audio encoder-decoders, uncovering challenges and inconsistencies in the multimodal decoding stage. Finally, I propose a sequential two-step text-to-audio-video generation pipeline: first generating video, then conditioning on both the video output and the original prompt to synthesize temporally synchronized audio. My experiments show that this modular approach yields high-fidelity audio-video generations.
Primary: Duke University
All Institutions: Duke University
This paper presents a significant advancement in the field of joint audio-video generation through the introduction of novel methodologies and high-quality datasets. The contributions are well-aligned with current challenges in multimodal generative models, making it a valuable addition to the literature.
The paper trains the existing MM-Diffusion architecture from scratch on newly released datasets, which is a solid methodological contribution. The sequential two-step text-to-audio-video generation pipeline is particularly innovative, as it addresses the challenge of synchronizing audio and video outputs effectively. The use of pretrained encoder-decoder models for joint latent diffusion adds depth to the methodology, although the paper could benefit from a more detailed explanation of the architecture and the training process.
The experiments are well-structured, utilizing high-quality datasets that enhance the validity of the results. The quantitative evaluation of alignment between audio and video is a strong point, although the paper could improve by including more comprehensive qualitative assessments, such as user studies or comparisons with existing state-of-the-art methods. The results demonstrate high fidelity in generated outputs, which is promising for future applications.
The paper mentions the release of datasets and code, which is crucial for reproducibility. However, it lacks detailed implementation specifics, such as hyperparameter settings and training configurations, which would aid other researchers in replicating the experiments effectively.
One limitation is the reliance on the quality of the datasets, which may not generalize well across different types of audio-video content. Additionally, the paper does not address potential biases in the datasets, nor does it explore the scalability of the proposed methods to larger or more diverse datasets. The challenges uncovered in the multimodal decoding stage could also benefit from more in-depth analysis.
The potential applications of this research are significant, particularly in entertainment, gaming, and educational content generation. By improving joint audio-video generation, the work could enhance user experiences in multimedia applications. However, ethical considerations around content generation and potential misuse should be addressed in future work.