Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
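The layered control idea can be made concrete with a small sketch. The field names below (`global`, `sentences`, `tokens`, `emotion`) are illustrative assumptions, not the paper's actual Global-Sentence-Token schema; the point is only that a hierarchical annotation doubles as a structured message an agent can emit:

```python
import json

# Hypothetical sketch of a Global-Sentence-Token control message, the kind of
# "structured semantic interface" a front-end LLM agent might emit. All field
# names are illustrative assumptions.
def build_control_message(scene, sentences):
    return {
        "global": scene,  # scene-level semantics (environment, emotional arc)
        "sentences": [
            {
                "text": s["text"],
                "speaker": s.get("speaker", "narrator"),
                "emotion": s.get("emotion", "neutral"),
                # token level: the layer where phonetic detail would attach
                "tokens": [{"token": t} for t in s["text"].split()],
            }
            for s in sentences
        ],
    }

msg = build_control_message(
    {"environment": "cafe", "ambience": "noisy"},
    [{"text": "Wait, what?", "speaker": "A", "emotion": "surprised"}],
)
print(json.dumps(msg, indent=2))
```

In this view, text really is a "wide-band control channel": any modality the agent ingests is flattened into one structured message the synthesis engine consumes.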
Primary: Nanjing University
All Institutions: Nanjing University, WeNet Open Source Community
The main contribution of this paper is the introduction of the Borderless Long Speech Synthesis framework, which innovatively integrates multi-dimensional annotations and contextual understanding into TTS systems, significantly advancing the state-of-the-art in audio synthesis. The technical contributions and proposed methodologies offer substantial improvements over existing systems, although further experimental validation and reproducibility efforts are necessary to solidify its impact in the field.
The proposed methodology introduces a novel framework for long-form speech synthesis that emphasizes the importance of global context and paralinguistic cues. The "Labeling over filtering/cleaning" strategy is innovative, as it challenges conventional practices in data preparation by advocating for the inclusion of complex, noisy data that reflects real-world speech dynamics. The Global-Sentence-Token hierarchical annotation schema is a significant advancement, enabling a structured approach to capturing the nuances of speech synthesis. The integration of Chain-of-Thought reasoning and Dimension Dropout enhances the model's ability to follow complex instructions, which is a notable methodological improvement over existing TTS systems.
The paper lacks quantitative evaluations of the proposed system's performance, particularly in terms of emotional arc coherence and multi-speaker interaction naturalness. While it discusses the challenges of evaluating borderless long audio synthesis, it does not provide concrete experimental results or comparisons with existing methods. The absence of benchmark results limits the ability to assess the system's effectiveness rigorously. Future work is needed to establish robust evaluation metrics that can capture the richness of the proposed framework.
The paper does not provide sufficient implementation details or access to code and datasets, which raises concerns about reproducibility. The lack of a demo or project URL further complicates the ability for other researchers to replicate the findings or build upon this work. Clearer documentation and shared resources would enhance reproducibility.
The system is currently optimized for content creation rather than real-time interactions, which limits its applicability in dynamic environments. Additionally, the training data is primarily speech-centric, and the system's emergent capabilities for sound effects and music are not fully developed. These limitations suggest that while the framework is promising, it requires further refinement and expansion to address broader applications.
The potential applications of this research extend beyond traditional TTS systems, offering possibilities for enhanced audio experiences in content creation, gaming, and virtual environments. The ability to synthesize speech with rich emotional and contextual cues could significantly improve user engagement and interaction quality in various multimedia applications. However, the challenges in real-time synthesis and the need for more diverse training data must be addressed to realize its full impact.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
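The shuffle product at the heart of this formulation is easy to state in code: it enumerates every interleaving of two sequences that preserves each sequence's internal order. A minimal sketch with (token, speaker) tuples, showing only the combinatorial object and none of the FSA machinery:

```python
def shuffle(a, b):
    """All interleavings of sequences a and b that preserve each
    sequence's internal order (the shuffle product)."""
    if not a:
        return [list(b)]
    if not b:
        return [list(a)]
    return ([[a[0]] + rest for rest in shuffle(a[1:], b)] +
            [[b[0]] + rest for rest in shuffle(a, b[1:])])

# (token, speaker) tuples keep speaker attribution explicit
# in every possible serialization of the overlapped segment.
spk1 = [("hi", 1), ("there", 1)]
spk2 = [("yes", 2)]
paths = shuffle(spk1, spk2)
# C(2+1, 1) = 3 interleavings of a 2-token and a 1-token stream
print(len(paths))  # 3
```

The training loss marginalizes over exactly this set of serializations; the partial order FSAs in the paper prune it with temporal constraints so the graph stays tractable.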
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, Johns Hopkins University
The main contribution of this paper is the introduction of a novel algorithm for the single-pass alignment of multi-talker recordings using shuffle products and partial order FSAs. This work represents a significant advancement in the field of speech processing, particularly in addressing the challenges posed by overlapped speech, and has the potential to influence future research and applications in audio processing.
The methodology presented in this paper is innovative in its application of shuffle products and partial order finite-state automata (FSAs) for modeling overlapped speech. The authors effectively leverage these mathematical constructs to create a framework for alignment and transcription of multi-talker recordings. The approach of using (token, speaker) tuples for speaker attribution is particularly noteworthy, as it directly addresses a significant challenge in the field of speech processing. The imposition of temporal constraints to reduce graph size is a practical consideration that enhances the efficiency of the proposed method.
The experiments conducted on synthetic LibriSpeech overlaps provide a solid basis for evaluating the proposed methods. The paper compares the performance of the shuffle product FSA against traditional methods, demonstrating a clear advantage in terms of alignment accuracy. However, the reliance on synthetic data may limit the generalizability of the results to real-world scenarios. The metrics used for evaluation are appropriate, but further validation on diverse datasets would strengthen the findings.
The paper mentions that all algorithms are implemented using k2 / Icefall, which is a positive aspect for reproducibility. However, the lack of a publicly available code repository or detailed implementation instructions may hinder other researchers from replicating the results. Providing a GitHub repository or similar resource would greatly enhance the reproducibility of the work.
One limitation of the study is the use of synthetic data for training and evaluation, which may not fully capture the complexities of real-world overlapped speech scenarios. Additionally, while the proposed method shows promise, the paper does not provide extensive comparisons with other state-of-the-art techniques, which could have offered more context regarding its performance.
The ability to accurately transcribe and attribute overlapped speech has significant implications for various applications, including automated transcription services, assistive technologies for the hearing impaired, and improvements in human-computer interaction. The proposed method could pave the way for advancements in multi-talker speech recognition systems, making them more robust and effective.
Integrating Federated Learning (FL) with self-supervised learning (SSL) enables privacy-preserving fine-tuning for speech tasks. However, federated environments exhibit significant heterogeneity: clients differ in computational capacity, causing straggler effects under unified fine-tuning, while diverse downstream tasks require different representation depths, making full-model updates inefficient. To address these challenges, we propose an adaptive federated fine-tuning framework with early exits. Lightweight prediction heads are inserted at intermediate layers of the SSL backbone, allowing clients to terminate computation based on local constraints and task requirements. We further introduce a layer-wise, depth-aware partial aggregation strategy to better utilize representations from different network depths. Experiments show that the framework reduces edge overhead, supports heterogeneous hardware, and maintains competitive performance in resource-constrained federated environments.
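The depth-aware partial aggregation can be sketched with scalar "layers": each layer is averaged over only the clients whose early exit was deep enough to have updated it. This is a simplified stand-in for the paper's strategy, not its exact weighting:

```python
def depthwise_aggregate(client_updates):
    """Layer-wise partial aggregation: layer i is averaged over only
    the clients that trained at least i+1 layers before exiting."""
    max_depth = max(len(u) for u in client_updates)
    aggregated = []
    for layer in range(max_depth):
        contribs = [u[layer] for u in client_updates if layer < len(u)]
        aggregated.append(sum(contribs) / len(contribs))
    return aggregated

# three clients exit at depths 2, 3, and 1 (one scalar "weight" per layer)
updates = [[1.0, 2.0], [3.0, 4.0, 5.0], [5.0]]
print(depthwise_aggregate(updates))  # [3.0, 3.0, 5.0]
```

Shallow layers receive contributions from every client, while deeper layers are updated only by the clients with capacity to reach them, which is what lets heterogeneous hardware participate in the same round.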
Primary: University of Cambridge
All Institutions: University of Cambridge, Electronic Information School, Flower Labs, University of Auckland, University of Melbourne, Wuhan University
This paper presents a novel adaptive federated fine-tuning framework that effectively addresses the challenges of heterogeneous environments in self-supervised speech representation learning. The technical contributions, particularly in the areas of early exits and layer-wise aggregation, represent a meaningful advancement in the field of federated learning for audio applications.
The proposed adaptive federated fine-tuning framework introduces innovative mechanisms such as early exits and layer-wise partial aggregation, which effectively address the challenges posed by heterogeneity in federated learning environments. The methodology is well-structured, leveraging an elastic multi-branch architecture that allows clients to dynamically select their training depth based on local resources and task complexity. This approach not only enhances computational efficiency but also maintains performance across diverse speech tasks. The integration of lightweight prediction heads and depth-aware aggregation strategies is a significant advancement in federated learning for speech applications.
The experiments are comprehensive, covering five diverse downstream tasks that span various aspects of speech understanding. The results demonstrate the effectiveness of the proposed framework in reducing computational overhead while achieving competitive performance compared to centralized training. The evaluation metrics used, including word error rates and classification error rates, are appropriate for the tasks at hand. However, the paper could benefit from additional comparisons with existing state-of-the-art methods to further contextualize the results.
The paper provides a detailed description of the experimental setup, including datasets, model architectures, and training configurations, which aids reproducibility. However, the lack of a publicly available code repository limits the ease with which others can replicate the experiments. Including a link to the implementation would significantly enhance reproducibility.
One limitation is the reliance on a specific backbone model (Wav2Vec 2.0), which may not generalize to all speech tasks or architectures. Additionally, while the framework addresses resource constraints, it does not fully explore the implications of data heterogeneity beyond the basic partitioning strategy employed. The paper could also discuss potential trade-offs between performance and computational efficiency in more detail.
The proposed framework has significant implications for deploying speech recognition systems in privacy-sensitive environments, such as mobile devices and personal assistants. By enabling efficient fine-tuning without compromising user data privacy, this work contributes to the growing field of privacy-preserving machine learning. The methodology could be adapted to other domains where federated learning is applicable, potentially influencing future research in decentralized learning systems.
Recent advances in generative models, such as diffusion and flow matching, have shown strong performance in audio tasks. However, speech enhancement (SE) models are typically trained on limited datasets and evaluated under narrow conditions, limiting real-world applicability. To address this, we propose DiT-Flow, a flow matching-based SE framework built on the latent Diffusion Transformer (DiT) backbone and trained for robustness across diverse distortions, including noise, reverberation, and compression. DiT-Flow operates on compact latent features derived from variational autoencoders (VAEs). We validated our approach on StillSonicSet, a synthetic yet acoustically realistic dataset composed of LibriSpeech, FSD50K, FMA, and 90 Matterport3D scenes. Experiments show that DiT-Flow consistently outperforms state-of-the-art generative SE models, demonstrating the effectiveness of flow matching in multi-condition speech enhancement. Despite ongoing efforts to improve the realism of synthetic data, a persistent bottleneck in SE is the inevitable mismatch between training and deployment conditions. By integrating LoRA with a Mixture-of-Experts (MoE) framework, we achieve parameter-efficient, high-performance training: DiT-Flow remains robust to multiple distortions and achieves better performance on five unseen distortions while updating only 4.9% of the total parameters.
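The flow matching objective underlying frameworks like DiT-Flow can be illustrated numerically. The sketch below assumes the common linear (rectified) probability path, interpolating a noise sample toward a data sample with target velocity equal to their difference; this is illustrative, not the paper's exact training recipe:

```python
def cfm_sample(x0, x1, t):
    """Linear probability path used in flow matching: interpolate
    noise x0 toward data x1 at time t; the regression target for
    the velocity network is simply x1 - x0."""
    xt = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return xt, v_target

def fm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_pred)

x0 = [0.0, 0.0]   # noise sample
x1 = [2.0, 4.0]   # data point (e.g. a clean-speech latent)
xt, v = cfm_sample(x0, x1, 0.5)
print(xt, v)                    # [1.0, 2.0] [2.0, 4.0]
print(fm_loss([2.0, 4.0], v))   # 0.0 for a perfect prediction
```

At inference, the learned velocity field is integrated from noise to data; operating in a VAE latent space, as DiT-Flow does, keeps the vectors above compact.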
Primary: Johns Hopkins University
All Institutions: Johns Hopkins University, Technion Israel Institute of Technology, University of Haifa
The main contribution of this paper is the development of DiT-Flow, a novel speech enhancement framework that effectively utilizes flow matching and latent representations to improve robustness against multiple distortions. This work represents a significant step forward in the field of audio processing, addressing common challenges faced in real-world applications and demonstrating the potential for future advancements in speech enhancement technologies.
The methodology of DiT-Flow is robust, leveraging flow matching and latent Diffusion Transformers to enhance speech under multiple distortions. The integration of LoRA with the Mixture-of-Experts framework is particularly innovative, allowing for parameter-efficient adaptation to varying acoustic conditions. The use of a synthetic dataset, StillSonicSet, designed to simulate realistic conditions, further strengthens the approach. However, the paper could benefit from clearer descriptions of hyperparameter choices and training procedures.
The experiments are comprehensive, validating DiT-Flow against state-of-the-art models across various conditions. The use of multiple evaluation metrics, including PESQ, ESTOI, and DNSMOS, provides a well-rounded assessment of performance. The results demonstrate significant improvements over baseline models, particularly in challenging scenarios, indicating the effectiveness of the proposed methods. However, the paper lacks detailed comparisons with a broader range of existing methods, which could provide more context for its contributions.
The paper includes sufficient detail regarding the model architecture and training process, but lacks a clear link to code or datasets, which hampers reproducibility. Providing access to the StillSonicSet dataset and the trained models would enhance reproducibility and facilitate further research.
One limitation is the reliance on synthetic data, which may not fully capture the complexities of real-world audio environments. Additionally, while the model shows robustness to multiple distortions, its performance in extreme or novel conditions remains to be tested. The computational efficiency of the model in real-time applications also needs further exploration.
The advancements in speech enhancement presented in this paper have significant implications for real-world applications, particularly in telecommunication, virtual meetings, and assistive technologies. The ability to enhance speech quality in diverse acoustic environments can improve communication clarity and accessibility for users in various settings.
Animal vocalizations provide crucial insights for wildlife assessment, particularly in complex environments such as forests, aiding species identification and ecological monitoring. Recent advances in deep learning have enabled automatic species classification from their vocalizations. However, classifying species unseen during training remains challenging. To address this limitation, we introduce AnimalCLAP, a taxonomy-aware language-audio framework comprising a new dataset and model that incorporate hierarchical biological information. Specifically, our vocalization dataset consists of 4,225 hours of recordings covering 6,823 species, annotated with 22 ecological traits. The AnimalCLAP model is trained on this dataset to align audio and textual representations using taxonomic structures, improving the recognition of unseen species. We demonstrate that our proposed model effectively infers ecological and biological attributes of species directly from their vocalizations, achieving superior performance compared to CLAP. Our dataset, code, and models will be publicly available at https://dahlian00.github.io/AnimalCLAP_Page/.
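One way to inject hierarchical biological information into a language-audio model is to serialize the taxonomic lineage and ecological traits into the text prompt that the text encoder sees. The template below is an illustrative assumption, not AnimalCLAP's actual prompt format:

```python
def taxonomy_prompt(common_name, taxonomy, traits):
    """Build a caption that embeds the full taxonomic lineage and
    annotated traits; the wording of the template is a hypothetical
    example, not the dataset's real annotation scheme."""
    lineage = " > ".join(taxonomy)
    trait_str = ", ".join(traits)
    return (f"A vocalization of {common_name}, "
            f"taxonomy: {lineage}; traits: {trait_str}.")

p = taxonomy_prompt(
    "common blackbird",
    ["Animalia", "Chordata", "Aves", "Passeriformes", "Turdidae"],
    ["diurnal", "forest-dwelling"],
)
print(p)
```

Because an unseen species still shares genus- or family-level text with seen species, captions built this way give the contrastive objective a path to generalize beyond the training labels.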
Primary: Institute of Science Tokyo
All Institutions: Institute of Science Tokyo, The University of Osaka, The University of Tokyo
The main contribution of this paper is the introduction of AnimalCLAP, a taxonomy-aware language-audio pretraining framework that significantly improves species recognition and trait inference from animal vocalizations. This work represents a meaningful advancement in the application of machine learning to ecological monitoring, with a robust methodology and promising results that could influence future research and practices in wildlife assessment.
The methodology presented in AnimalCLAP is innovative, leveraging a taxonomy-aware framework that integrates hierarchical biological information into the model's training process. The authors introduce a substantial dataset of animal vocalizations, which is a critical asset for training and evaluating the model. The alignment of audio and textual representations through taxonomic structures is a novel approach that enhances the model's ability to generalize to unseen species, which is a significant challenge in the field. The use of contrastive learning techniques is well-justified and effectively applied to the task of species recognition and trait inference.
The experiments are comprehensive, utilizing a large dataset of 4,225 hours of recordings from 6,823 species, which is a considerable contribution to the field. The results demonstrate that AnimalCLAP outperforms existing models, including CLAP, in recognizing unseen species and inferring ecological traits. The evaluation metrics used are appropriate, and the authors provide a clear comparison of their model's performance against baseline methods, showcasing the effectiveness of their approach.
The authors commit to making their dataset, code, and models publicly available, which is crucial for reproducibility. However, the paper would benefit from a more detailed description of the experimental setup, including hyperparameter settings and training procedures, to facilitate replication by other researchers.
One limitation of the study is the potential bias in the dataset, which may not cover all ecological contexts or species diversity adequately. Additionally, the model's performance on edge cases or species with very similar vocalizations may not be thoroughly addressed. The reliance on taxonomic structures may also limit the model's applicability in more complex ecological scenarios where such hierarchies are not well defined.
The implications of this research are significant for wildlife conservation and ecological monitoring, as it provides a tool for non-invasive species identification and trait inference from vocalizations. This could enhance biodiversity assessments and inform conservation strategies. The methodology could also be adapted for other domains where audio classification is relevant, such as environmental monitoring or even human-related vocalizations.
Audio-Visual Semantic Segmentation (AVSS) aligns audio and video at the pixel level but requires costly per-frame annotations. We introduce Weakly Supervised Audio-Visual Semantic Segmentation (WSAVSS), which uses only video-level labels to generate per-frame semantic masks of sounding objects. We decompose WSAVSS into looking, listening, and segmentation, and propose Progressive Cross-modal Alignment for Semantics (PCAS) with two modules: *Looking-before-Listening* and *Listening-before-Segmentation*. PCAS builds a classification task to train the audio-visual encoder using video labels, injects visual semantic prompts to enhance frame-level audio understanding, and then applies progressive contrastive alignment to map audio categories to image regions without mask annotations. Experiments show PCAS achieves state-of-the-art performance among weakly supervised methods on AVS and remains competitive with fully supervised baselines on AVSS, validating its effectiveness.
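At inference time, token-wise alignment of this kind reduces to scoring each image-region token against an audio category embedding and keeping the high-similarity regions as the mask. A deliberately simple sketch (the fixed threshold is an assumption; PCAS learns this alignment with contrastive training rather than hard-coding it):

```python
def weak_mask(region_sims, threshold=0.5):
    """Map an audio category to image regions by thresholding
    audio-region similarity scores. A stand-in for learned
    token-wise alignment; the threshold value is an assumption."""
    return [1 if s >= threshold else 0 for s in region_sims]

# similarity of a "dog bark" embedding to a flattened 2x3 grid of region tokens
sims = [0.9, 0.2, 0.7, 0.1, 0.6, 0.3]
print(weak_mask(sims))  # [1, 0, 1, 0, 1, 0]
```

No per-pixel annotation appears anywhere in this step: the video-level label supervises the embeddings, and the mask falls out of the similarity map.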
Primary: Beijing Institute of Technology
All Institutions: Beijing Institute of Technology
The main contribution of this paper is the introduction of a novel weakly supervised framework for audio-visual semantic segmentation that effectively aligns audio and visual features without requiring dense annotations. This work represents a significant step forward in the field of audio-visual understanding, providing a robust methodology and promising results that could influence future research and applications.
The methodology presented in this paper is innovative, particularly in its decomposition of the WSAVSS task into three distinct phases: looking, listening, and segmentation. The introduction of Temporal Visual Prompting (TVP) to enhance audio understanding through visual cues is a novel approach that leverages the inherent relationships between audio and visual modalities. The Progressive Cross-modal Alignment for Semantics (PCAS) framework, which combines instance-wise and token-wise contrastive learning, is well-conceived and addresses the challenge of aligning audio and visual features without requiring dense annotations. This progressive alignment strategy is a significant advancement over existing methods, making it a valuable contribution to the field.
The experiments are comprehensive, demonstrating the effectiveness of the proposed method through comparisons with both weakly supervised and fully supervised baselines. The use of multiple datasets and the reporting of mean IoU and F-score metrics provide a robust evaluation of the model's performance. The ablation studies effectively highlight the contributions of each module within the proposed framework, reinforcing the claims of improved performance. However, the absence of a demo or project URL limits the accessibility of the results for further validation by the community.
While the paper provides a detailed description of the methodology and experimental setup, it lacks specific implementation details such as code availability or dataset access instructions. The absence of these resources may hinder reproducibility. Clearer guidelines on how to replicate the experiments would enhance the paper's impact.
One limitation of the study is the reliance on video-level labels, which, while reducing annotation costs, may not capture the full complexity of audio-visual interactions. Additionally, the paper does not address potential biases in the datasets used, which could affect the generalizability of the results. The performance on more complex scenes with overlapping sounds and visuals could also be explored further.
The proposed WSAVSS framework has significant implications for applications in multimedia content analysis, human-computer interaction, and assistive technologies. By reducing the need for extensive annotations, this research can facilitate advancements in real-time audio-visual processing systems, enhancing accessibility and user experience in various domains. The approach could also inspire further research into weakly supervised learning paradigms across different modalities.
This paper presents SelfTTS, a text-to-speech (TTS) model designed for cross-speaker style transfer that eliminates the need for external pre-trained speaker or emotion encoders. The architecture achieves emotional expressivity in neutral speakers through an explicit disentanglement strategy utilizing Gradient Reversal Layers (GRL) combined with cosine similarity loss to decouple speaker and emotion information. We introduce Multi Positive Contrastive Learning (MPCL) to induce clustered representations of speaker and emotion embeddings based on their respective labels. Furthermore, SelfTTS employs a self-refinement strategy via Self-Augmentation, exploiting the model's voice conversion capabilities to enhance the naturalness of synthesized speech. Experimental results demonstrate that SelfTTS achieves superior emotional naturalness (eMOS) and robust stability in target timbre and emotion compared to state-of-the-art baselines.
Primary: Universidade Estadual de Campinas (UNICAMP)
All Institutions: Universidade Estadual de Campinas (UNICAMP)
The main contribution of this paper is the development of SelfTTS, a robust TTS framework that achieves high-quality cross-speaker style transfer through innovative embedding disentanglement and self-refinement strategies. This work represents a meaningful advancement in the field of speech synthesis, addressing key challenges related to emotional expressivity and speaker identity while providing a solid experimental foundation to support its claims.
The paper introduces SelfTTS, a novel TTS framework that effectively decouples speaker and emotion embeddings without relying on external encoders. The methodology employs Gradient Reversal Layers (GRL) and Multi Positive Contrastive Learning (MPCL) to achieve disentanglement and clustering of embeddings, which is a significant advancement over existing methods that often suffer from speaker leakage. The self-refinement strategy through Self-Augmentation is particularly innovative, leveraging the model’s voice conversion capabilities to enhance the naturalness of synthesized speech. This approach is well-justified and clearly articulated, demonstrating a solid understanding of the challenges in TTS.
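To make the GRL mechanism concrete: a gradient reversal layer is an identity map in the forward pass whose backward pass flips the sign of the gradient, so the encoder is trained to *remove* speaker-predictive information while the adversarial classifier still learns to find it. The toy scalar chain below is an illustrative sketch, not SelfTTS's actual implementation.

```python
def grl_forward(x):
    # Identity in the forward pass: features flow unchanged to the classifier.
    return x

def grl_backward(grad, lam=1.0):
    # Backward pass flips and scales the gradient reaching the encoder.
    return -lam * grad

def encoder_grad_through_grl(w, x, target, lam=1.0):
    """Toy scalar example: feature f = w*x passes through the GRL before an
    adversarial loss L = (f - target)^2. Returns dL/dw as the encoder sees it,
    i.e. the reversed gradient."""
    f = grl_forward(w * x)
    dL_df = 2.0 * (f - target)          # ordinary gradient at the loss
    dL_df = grl_backward(dL_df, lam)    # sign-flipped by the GRL
    return dL_df * x                    # chain rule back to the encoder weight
```

In a real system the same pattern is implemented as a custom autograd op sitting between the encoder and the speaker/emotion classifier.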
The experimental setup is robust, utilizing both subjective (eMOS, nMOS, sMOS) and objective metrics (UTMOS, WER, SECS, EECS) to evaluate performance. The results indicate that SelfTTS outperforms state-of-the-art models in emotional naturalness and stability, which is a crucial aspect of TTS systems. The use of cross-corpus experiments adds to the credibility of the findings, although the paper could benefit from more extensive comparisons with additional baselines.
The paper provides adequate implementation details, including the architecture, training procedures, and evaluation metrics, which facilitate reproducibility. The authors have made their code publicly available, enhancing the likelihood that other researchers can replicate the results. However, some hyperparameters and specific configurations could be more explicitly detailed to ensure complete clarity.
One limitation noted is the model's performance in cross-corpus scenarios, where emotional adherence is lower due to the differences in recording conditions. Additionally, while the Self-Augmentation strategy shows promise, its effectiveness may vary based on the quality of synthetic samples generated, which could introduce artifacts into the training process.
The advancements presented in SelfTTS have significant implications for the development of expressive TTS systems, particularly in applications requiring emotional expressivity and speaker identity preservation. This work could benefit various fields, including virtual assistants, audiobooks, and gaming, where natural and emotionally engaging speech synthesis is essential.
Speech emotion recognition (SER) systems can exhibit gender-related performance disparities, but how such bias manifests in multilingual speech LLMs across languages and modalities is unclear. We introduce a novel multilingual, multimodal benchmark built on MELD-ST, spanning English, Japanese, and German, to quantify language-specific SER performance and gender gaps. We find bias is strongly language-dependent, and multimodal fusion does not reliably improve fairness. To address these issues, we propose ERM-MinMaxGAP, a fairness-informed training objective that augments empirical risk minimization (ERM) with an adaptive fairness weight mechanism and a novel MinMaxGAP regularizer on the maximum male-female loss gap within each language and modality. Building upon the Qwen2-Audio backbone, our ERM-MinMaxGAP approach improves multilingual SER performance by 5.5% and 5.0% while reducing the overall gender bias gap by 0.1% and 1.4% in the unimodal and multimodal settings, respectively.
Primary: Kyoto University
All Institutions: Kyoto University, Agency for Science, Technology and Research (A*STAR)
This paper presents a novel benchmark and a fairness-aware training objective for mitigating gender bias in multilingual multimodal speech emotion recognition systems. The technical contributions and methodology are robust, addressing a pressing issue in the field of machine learning and AI.
The proposed methodology, ERM-MinMaxGAP, is a significant advancement in addressing gender bias in multilingual multimodal speech emotion recognition (SER). The integration of empirical risk minimization with a fairness regularization term that focuses on the maximum male-female loss gap is innovative. The adaptive fairness weight mechanism further enhances the robustness of the training process, allowing for dynamic adjustments based on the model's performance. The detailed description of the MinMaxGAP regularizer and its implementation demonstrates a thorough understanding of the complexities involved in SER tasks, particularly in a multilingual context.
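The objective described above can be sketched in a few lines: an ERM term plus a penalty on the largest male-female loss gap over (language, modality) groups. The exact averaging, and the fixed `lam` standing in for the paper's adaptive fairness weight, are assumptions for illustration.

```python
def minmax_gap_objective(group_losses, lam=0.5):
    """group_losses: dict mapping (language, modality) -> {'male': l, 'female': l}.
    Returns a mean loss over all group entries (the ERM term) plus lam times
    the maximum absolute male-female gap across groups (the MinMaxGAP term).
    lam is a fixed stand-in for the paper's adaptive fairness weight."""
    all_losses = [v for g in group_losses.values() for v in g.values()]
    erm = sum(all_losses) / len(all_losses)
    max_gap = max(abs(g['male'] - g['female']) for g in group_losses.values())
    return erm + lam * max_gap
```

Because only the worst group gap is penalized, the regularizer concentrates pressure on whichever language/modality is currently most biased.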
The experimental setup is well-structured, utilizing the MELD-ST dataset to benchmark the proposed method against existing models. The results indicate that ERM-MinMaxGAP not only improves SER performance but also reduces gender disparity effectively across different languages and modalities. The ablation studies provide valuable insights into the contributions of each component of the proposed method, reinforcing the effectiveness of the MinMaxGAP regularization approach.
The paper states that all code, data, and models will be released upon acceptance, which is a positive aspect for reproducibility. In addition, specific implementation details regarding the training process, hyperparameters, and dataset preparation are provided, which aids in replicating the experiments. The clarity in methodology and results presentation further supports reproducibility.
One limitation is that while the proposed method shows improvements in SER and fairness, it does not achieve the minimum post-hoc gender gap in every setting, indicating that the approach may not be universally applicable across all datasets or languages. Additionally, the reliance on a specific dataset (MELD-ST) may limit the generalizability of the findings.
The implications of this research are significant, as it addresses a critical issue of fairness in AI systems, particularly in emotion recognition, which has applications in various fields such as mental health assessment, customer service, and human-computer interaction. By improving fairness in SER systems, this work contributes to the development of more equitable AI technologies that can better serve diverse populations.
Composing coherent long-form music remains a significant challenge due to the complexity of modeling long-range dependencies and the prohibitive memory and computational requirements associated with lengthy audio representations. In this work, we propose a simple yet powerful trick: we assume that AI models can understand and generate time-accelerated (speeded-up) audio at rates such as 2x, 4x, or even 8x. By first generating a high-speed version of the music, we greatly reduce the temporal length and resource requirements, making it feasible to handle long-form music that would otherwise exceed memory or computational limits. The generated audio is then restored to its original speed, recovering the full temporal structure. This temporal speed-up and slow-down strategy naturally follows the principle of hierarchical generation from abstract to detailed content, and can be conveniently applied to existing music generation models to enable long-form music generation. We instantiate this idea in SqueezeComposer, a framework that employs diffusion models for generation in the accelerated domain and refinement in the restored domain. We validate the effectiveness of this approach on two tasks: long-form music generation, which evaluates temporal-wise control (including continuation, completion, and generation from scratch), and whole-song singing accompaniment generation, which evaluates track-wise control. Experimental results demonstrate that our simple temporal speed-up trick enables efficient, scalable, and high-quality long-form music generation. Audio samples are available at https://SqueezeComposer.github.io/.
Primary: Peking University
All Institutions: Peking University, The State Key Laboratory of Multimedia Information Processing, The Hong Kong University of Science and Technology
The main contribution of this paper is the introduction of SqueezeComposer, a novel framework for long-form music generation that utilizes temporal speed-up to enhance computational efficiency while preserving musical coherence. This work represents a significant advancement in the field of audio generation, addressing key challenges and opening avenues for future research in scalable music composition.
The methodology presented in SqueezeComposer is innovative, leveraging a temporal speed-up approach to address the challenges of long-form music generation. By generating music in an accelerated domain and restoring it to normal speed, the authors effectively reduce computational requirements while maintaining musical coherence. The hierarchical generation paradigm is well-structured, allowing for both abstract and detailed content generation. The use of diffusion models for both generation and refinement is a strong choice, aligning with current trends in audio synthesis. However, the paper could benefit from a more detailed explanation of the implementation specifics and the choice of hyperparameters.
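The length bookkeeping behind the speed-up trick can be shown with a deliberately crude sketch: compress by keeping every `factor`-th sample, then restore by linear interpolation. Real systems would operate on latent frames or pitch-preserving representations rather than raw decimation; this is only an illustration of why the accelerated sequence is `factor` times shorter.

```python
def speed_up(samples, factor):
    """Crude time compression: keep every `factor`-th sample, shrinking the
    sequence (and hence the model's context cost) by roughly `factor`."""
    return samples[::factor]

def restore(samples, factor):
    """Crude restoration by linear interpolation back to roughly the original
    length; a refinement model would then add back fine detail."""
    out = []
    for a, b in zip(samples, samples[1:]):
        for k in range(factor):
            out.append(a + (b - a) * k / factor)
    out.append(samples[-1])
    return out
```

A diffusion model generating in the compressed domain thus only ever sees the short sequence; the restored signal is where the refinement stage operates.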
The experiments are comprehensive, utilizing a variety of datasets and evaluation metrics, including Fréchet Audio Distance (FAD) and AudioBox-Aesthetics metrics. The results demonstrate that SqueezeComposer outperforms existing methods in terms of generation efficiency and quality, particularly in long-form music generation tasks. The comparison against established baselines is robust, showcasing the framework's effectiveness across different music generation scenarios. However, further qualitative assessments through user studies could enhance the evaluation of generated audio quality.
The paper provides a clear algorithmic description of the SqueezeComposer framework, but it lacks detailed implementation specifics, such as the exact architectures used for the diffusion models and the training process. Including code or a more thorough description of the experimental setup would improve reproducibility.
One limitation is the potential degradation in audio quality when using accelerated audio representations, which could affect the fidelity of the generated music. Additionally, while the framework shows promise for long-form music generation, the scalability to even longer compositions or more complex musical structures is not fully explored. The reliance on existing vocoders without retraining may also limit the potential for achieving the highest audio quality.
SqueezeComposer has the potential to significantly impact the field of music generation by enabling efficient production of long-form compositions, which could be beneficial for various applications in music production, film scoring, and interactive media. The approach could also inspire further research into hierarchical generation techniques and the use of accelerated representations in other domains of generative modeling.
Multimodal Large Language Models (MLLMs) excel in Open-Vocabulary (OV) emotion recognition but often neglect fine-grained acoustic modeling. Existing methods typically use global audio encoders, failing to capture subtle, local temporal dynamics like micro-prosody and intonation shifts within individual utterances. To address this, we propose AcoustEmo, a time-sensitive MLLM featuring a novel Utterance-Aware Acoustic Q-Former. Our approach utilizes a timestamp-synchronized sliding window to dynamically extract segment-level audio tokens instead of coarse global representations. This enables the model to explicitly trace the temporal evolution of subtle acoustic clues and capture deep contextual dependencies in dialogues. Experiments on the Explainable Multimodal Emotion Recognition (EMER) task show that AcoustEmo significantly enhances complex emotion reasoning, outperforming baselines while maintaining robust contextual accuracy.
Primary: The University of Osaka
All Institutions: The University of Osaka, The University of Tokyo
The paper presents AcoustEmo, a time-sensitive MLLM that significantly enhances open-vocabulary emotion reasoning by capturing local acoustic dynamics through a novel Utterance-Aware Acoustic Q-Former. This work is a meaningful contribution to the field of multimodal emotion recognition, addressing critical gaps in existing methodologies and demonstrating substantial technical advancements.
The proposed methodology introduces a novel architecture, AcoustEmo, which leverages an Utterance-Aware Acoustic Q-Former to address the limitations of traditional global audio encoders in capturing fine-grained acoustic details. The use of a timestamp-synchronized sliding window for dynamic extraction of segment-level audio tokens is innovative, allowing the model to maintain semantic coherence between audio and text modalities. This approach is well-justified and effectively targets the nuances of emotion conveyed through micro-prosody and intonation shifts, which are critical for accurate emotion recognition.
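The timestamp-synchronized sliding window amounts to mapping utterance time spans onto frame-index windows before segment-level tokens are extracted. The helper below is an illustrative sketch of that indexing only; the window/hop granularity and names are assumptions, not the paper's exact specification.

```python
def segment_windows(num_frames, frame_rate, start, end, win, hop):
    """Return (lo, hi) frame-index pairs for sliding windows of `win` seconds
    with `hop` seconds stride, clipped to an utterance spanning [start, end)
    seconds over a feature sequence of `num_frames` frames."""
    spans = []
    t = start
    while t < end:
        lo = int(t * frame_rate)
        hi = min(int((t + win) * frame_rate), int(end * frame_rate), num_frames)
        if hi > lo:
            spans.append((lo, hi))
        t += hop
    return spans
```

Each (lo, hi) slice would then be fed to the Q-Former to yield one group of segment-level audio tokens, keeping tokens aligned with the utterance timeline.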
The experiments conducted on the Explainable Multimodal Emotion Recognition (EMER) task demonstrate the effectiveness of the proposed model. The paper provides a comprehensive evaluation against multiple baseline models, showcasing significant improvements in performance metrics. The ablation studies further validate the necessity of the proposed components, reinforcing the claims made regarding the importance of local acoustic dynamics and timestamp synchronization.
The paper includes sufficient implementation details, including the architecture, optimization strategies, and dataset descriptions, which facilitate reproducibility. However, the absence of a publicly available code repository or demo URL limits the ability for other researchers to directly replicate the results.
While the model shows promising results, it occasionally misclassifies ambiguous emotional states, particularly in sarcastic utterances. Additionally, the performance can degrade in low-SNR scenarios due to background noise interference. These limitations highlight areas for future improvement, particularly in enhancing robustness against challenging acoustic conditions.
The advancements presented in AcoustEmo have significant implications for applications in empathetic conversational agents, mental health monitoring, and human-computer interaction. By improving the accuracy of emotion recognition in multimodal contexts, the model can contribute to more socially aware AI systems, enhancing user experiences in various interactive settings.
Recent advancements in text-to-speech technologies enable generating high-fidelity synthetic speech nearly indistinguishable from real human voices. While recent studies show the efficacy of self-supervised learning-based speech encoders for deepfake detection, these models struggle to generalize across unseen speakers. Our quantitative analysis suggests these encoder representations are substantially influenced by speaker information, causing detectors to exploit speaker-specific correlations rather than artifact-related cues. We call this phenomenon speaker entanglement. To mitigate this reliance, we introduce SNAP, a speaker-nulling framework. We estimate a speaker subspace and apply orthogonal projection to suppress speaker-dependent components, isolating synthesis artifacts within the residual features. By reducing speaker entanglement, SNAP encourages detectors to focus on artifact-related patterns, leading to state-of-the-art performance.
Primary: NAVER Cloud
All Institutions: NAVER Cloud
The main contribution of this paper is the introduction of the SNAP framework, which effectively mitigates speaker entanglement in deepfake detection by employing orthogonal projection techniques to isolate synthesis artifacts. This innovative approach not only achieves state-of-the-art performance but also demonstrates robust generalization capabilities across unseen speakers and TTS models, marking a significant advancement in the field of audio deepfake detection.
The proposed SNAP framework introduces a novel approach to disentangle speaker identity from synthetic speech detection by employing orthogonal projection techniques. This mathematical decomposition of the feature space into speaker-dependent and artifact subspaces is innovative and effectively addresses the identified issue of speaker entanglement. The use of a simple logistic regression classifier on the refined features demonstrates a practical application of the method, emphasizing efficiency without compromising performance.
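The core projection is standard linear algebra: given an orthonormal basis for the estimated speaker subspace, subtract each feature's component along that basis, leaving the artifact-bearing residual. The sketch below assumes the basis is already given (in practice it would come from, e.g., PCA over speaker embeddings).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def null_subspace(x, basis):
    """Remove the component of feature vector `x` lying in the span of an
    orthonormal `basis` (list of unit vectors): x <- x - sum_u (x . u) u.
    Equivalent to applying the projector I - U U^T."""
    out = list(x)
    for u in basis:
        c = dot(out, u)
        out = [o - c * ui for o, ui in zip(out, u)]
    return out
```

After projection the residual is, by construction, orthogonal to every speaker direction, which is exactly the property the detector is meant to exploit.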
The experiments are well-structured, utilizing established datasets such as ASVspoof 2019 and 2021, and the In-the-Wild benchmark. The results show a clear improvement in detection performance, with significant reductions in equal error rates (EER) across various conditions, including unseen speakers and TTS models. The quantitative analysis of speaker entanglement through silhouette scores adds depth to the evaluation, reinforcing the effectiveness of the SNAP method.
The paper provides a clear description of the methodology, including feature extraction, subspace projection, and classification processes. However, the absence of a publicly available code repository or demo limits reproducibility. Future work should consider sharing implementation details to facilitate independent validation of results.
While the SNAP framework shows promising results, it primarily focuses on speaker nulling, which may not address other potential confounding factors in deepfake detection. Additionally, the reliance on logistic regression may limit the exploration of more complex models that could further enhance performance. The generalization to unseen TTS models is commendable, but the robustness across all possible variations in synthetic speech generation remains to be fully evaluated.
The implications of this research extend beyond deepfake detection, as the speaker-nulling framework could be applied to other areas of audio processing, such as emotion recognition and speaker-independent speech recognition. The ability to isolate artifacts from speaker identity can enhance the reliability of various speech-related applications, contributing to the development of more secure and trustworthy audio technologies.
Large Language Models (LLMs) have advanced audio generation through discrete representation learning. However, most existing neural codecs focus on speech and emphasize reconstruction fidelity, overlooking unified low frame rate modeling across diverse audio domains, including speech, music, and general sound. Moreover, high reconstruction quality does not necessarily yield semantically informative representations, limiting effectiveness in downstream generation tasks. We propose OmniCodec, a universal neural audio codec tailored for low frame rates. It adopts a hierarchical multi-codebook design with semantic-acoustic decoupling by leveraging the audio encoder of the pre-trained understanding model, along with a self-guidance strategy to improve codebook utilization and reconstruction. Compared with the Mimi codec, experiments show that OmniCodec achieves outstanding performance at the same bitrate, delivering superior reconstruction quality while also providing more semantically informative representations that benefit downstream generation tasks. Our model and code will be open-sourced, and a demo page is available.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Shanghai Lingguang Zhaxian Technology
The main contribution of this paper is the introduction of OmniCodec, a universal neural audio codec that effectively combines low frame rate modeling with semantic-acoustic decoupling, achieving superior reconstruction quality and semantic representation across diverse audio domains. This work significantly advances the state of audio codecs, particularly in their application to large language models and generative tasks, and sets a foundation for future research in audio representation learning.
The methodology presented in this paper is innovative, particularly with its hierarchical multi-codebook design and the semantic-acoustic decoupling approach. The use of a pre-trained understanding model's audio encoder to enhance semantic representation is a novel contribution that addresses the limitations of existing codecs. The self-guidance strategy to improve codebook utilization is also a noteworthy addition, demonstrating a thoughtful approach to enhancing reconstruction quality while maintaining low frame rates.
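A hierarchical multi-codebook design is typically a residual scheme: the first (semantic) codebook quantizes the vector, and each later (acoustic) codebook quantizes what remains. The sketch below shows that residual mechanism in miniature; treating the first codebook as the semantic level is an assumption for illustration, not OmniCodec's exact architecture.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` under squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def residual_quantize(vec, codebooks):
    """Hierarchical multi-codebook quantization: codebook 0 (standing in for
    the semantic level) quantizes the vector; each subsequent (acoustic)
    codebook quantizes the residual. Returns chosen indices and the
    reconstruction."""
    residual = list(vec)
    indices, recon = [], [0.0] * len(vec)
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        recon = [r + c for r, c in zip(recon, cb[i])]
        residual = [x - c for x, c in zip(residual, cb[i])]
    return indices, recon
```

The self-guidance strategy would then shape how codewords are selected and updated during training, which this sketch does not attempt to model.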
The experiments are robust, utilizing a comprehensive dataset of approximately 160,000 hours of audio across various domains (speech, music, and general sound). The evaluation metrics are well-chosen, including both objective measures (PESQ, STOI, Mel distance) and subjective assessments (N-MOS, S-MOS). The results indicate that OmniCodec outperforms existing models, particularly in the music and general sound domains, which validates the effectiveness of the proposed architecture.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameters, which facilitates reproducibility. The open-sourcing of the model and code further enhances the potential for other researchers to replicate and build upon this work.
One limitation noted is the performance disparity in the speech domain compared to other models, which may be attributed to the structure of the WavLM model used for semantic supervision. Additionally, the paper acknowledges challenges in achieving optimal semantic decoupling for speech, suggesting that future work will be needed to address these issues.
The proposed OmniCodec has significant implications for audio generation tasks across various domains, including speech synthesis and music generation. Its ability to provide semantically informative representations can enhance applications in multimedia content creation, real-time audio processing, and interactive systems. The open-source nature of the project encourages further exploration and innovation in the field.
Most existing text-to-speech (TTS) systems either synthesize speech sentence by sentence and stitch the results together, or drive synthesis from plain-text dialogues alone. Both approaches leave models with little understanding of global context or paralinguistic cues, making it hard to capture real-world phenomena such as multi-speaker interactions (interruptions, overlapping speech), evolving emotional arcs, and varied acoustic environments. We introduce the Borderless Long Speech Synthesis framework for agent-centric, borderless long audio synthesis. Rather than targeting a single narrow task, the system is designed as a unified capability set spanning VoiceDesigner, multi-speaker synthesis, Instruct TTS, and long-form text synthesis. On the data side, we propose a "Labeling over filtering/cleaning" strategy and design a top-down, multi-level annotation schema we call Global-Sentence-Token. On the model side, we adopt a backbone with a continuous tokenizer and add Chain-of-Thought (CoT) reasoning together with Dimension Dropout, both of which markedly improve instruction following under complex conditions. We further show that the system is Native Agentic by design: the hierarchical annotation doubles as a Structured Semantic Interface between the LLM Agent and the synthesis engine, creating a layered control protocol stack that spans from scene semantics down to phonetic detail. Text thereby becomes an information-complete, wide-band control channel, enabling a front-end LLM to convert inputs of any modality into structured generation commands, extending the paradigm from Text2Speech to borderless long speech synthesis.
Primary: Nanjing University
All Institutions: Nanjing University, WeNet Open Source Community
The main contribution of this paper is the introduction of the Borderless Long Speech Synthesis framework, which innovatively integrates multi-dimensional annotations and contextual understanding into TTS systems, significantly advancing the state-of-the-art in audio synthesis. The technical contributions and proposed methodologies offer substantial improvements over existing systems, although further experimental validation and reproducibility efforts are necessary to solidify its impact in the field.
The proposed methodology introduces a novel framework for long-form speech synthesis that emphasizes the importance of global context and paralinguistic cues. The "Labeling over filtering/cleaning" strategy is innovative, as it challenges conventional practices in data preparation by advocating for the inclusion of complex, noisy data that reflects real-world speech dynamics. The Global-Sentence-Token hierarchical annotation schema is a significant advancement, enabling a structured approach to capturing the nuances of speech synthesis. The integration of Chain-of-Thought reasoning and Dimension Dropout enhances the model's ability to follow complex instructions, which is a notable methodological improvement over existing TTS systems.
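To illustrate how a Global-Sentence-Token hierarchy could double as a structured semantic interface, here is a hypothetical annotation instance and a top-down walk that flattens it into control commands. All field names are illustrative assumptions; the paper's actual schema and interface are not reproduced here.

```python
# A hypothetical Global-Sentence-Token annotation; field names are illustrative.
annotation = {
    "global": {"scene": "podcast", "speakers": ["A", "B"], "ambience": "studio"},
    "sentences": [
        {"speaker": "A", "emotion": "excited", "text": "Welcome back!",
         "tokens": [{"word": "Welcome", "pitch": "high"},
                    {"word": "back!", "pitch": "falling"}]},
    ],
}

def flatten_commands(ann):
    """Walk the hierarchy top-down and emit flat (level, key, value) control
    commands, mimicking how a layered protocol stack could hand controls from
    scene semantics down to token-level detail to a synthesis engine."""
    cmds = [("global", k, v) for k, v in ann["global"].items()]
    for s in ann["sentences"]:
        cmds += [("sentence", k, v) for k, v in s.items() if k != "tokens"]
        for t in s["tokens"]:
            cmds += [("token", k, v) for k, v in t.items()]
    return cmds
```

Under this view, a front-end LLM's job is to populate such a structure from arbitrary input modalities, making text the information-complete control channel the paper describes.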
The paper lacks quantitative evaluations of the proposed system's performance, particularly in terms of emotional arc coherence and multi-speaker interaction naturalness. While it discusses the challenges of evaluating borderless long audio synthesis, it does not provide concrete experimental results or comparisons with existing methods. The absence of benchmark results limits the ability to assess the system's effectiveness rigorously. Future work is needed to establish robust evaluation metrics that can capture the richness of the proposed framework.
The paper does not provide sufficient implementation details or access to code and datasets, which raises concerns about reproducibility. The lack of a demo or project URL further complicates the ability for other researchers to replicate the findings or build upon this work. Clearer documentation and shared resources would enhance reproducibility.
The system is currently optimized for content creation rather than real-time interactions, which limits its applicability in dynamic environments. Additionally, the training data is primarily speech-centric, and the system's emergent capabilities for sound effects and music are not fully developed. These limitations suggest that while the framework is promising, it requires further refinement and expansion to address broader applications.
The potential applications of this research extend beyond traditional TTS systems, offering possibilities for enhanced audio experiences in content creation, gaming, and virtual environments. The ability to synthesize speech with rich emotional and contextual cues could significantly improve user engagement and interaction quality in various multimedia applications. However, the challenges in real-time synthesis and the need for more diverse training data must be addressed to realize its full impact.
While Large Audio-Language Models (LALMs) have advanced audio captioning, robust evaluation remains difficult. Reference-based metrics are expensive and often fail to assess acoustic fidelity, while Contrastive Language-Audio Pretraining (CLAP)-based approaches frequently overlook syntactic errors and fine-grained details. We propose CAF-Score, a reference-free metric that calibrates CLAP's coarse-grained semantic alignment with the fine-grained comprehension and syntactic awareness of LALMs. By combining contrastive audio-text embeddings with LALM reasoning, CAF-Score effectively detects syntactic inconsistencies and subtle hallucinations. Experiments on the BRACE benchmark demonstrate that our approach achieves the highest correlation with human judgments, even outperforming reference-based baselines in challenging scenarios. These results highlight the efficacy of CAF-Score for reference-free audio captioning evaluation. Code and results are available at https://github.com/inseong00/CAF-Score.
Primary: Sogang University
All Institutions: Sogang University
The main contribution of this paper is the introduction of CAF-Score, a novel reference-free metric for audio captioning evaluation that effectively combines the coarse-grained semantic alignment of CLAP with the fine-grained comprehension and syntactic awareness of LALMs. This work represents a significant step forward in the evaluation of audio captioning systems, addressing key challenges in the field and providing a foundation for future research and development.
The methodology presented in this paper is innovative, combining the strengths of CLAP and LALMs to create a reference-free evaluation metric for audio captioning. The use of a sliding-window approach with max pooling to enhance alignment accuracy is particularly noteworthy, as is the adaptation of the FLEUR metric for audio evaluation. The hybrid design effectively addresses the limitations of both models, allowing for more nuanced assessments of audio-text alignment.
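The hybrid scoring idea can be made concrete with a minimal sketch. This is an assumption about the general shape of the metric, not the paper's exact implementation: the function names, the window-level embeddings, and the fixed weight `w` are all illustrative.

```python
import numpy as np

def clap_window_score(audio_embs, text_emb):
    """Sliding-window alignment: cosine similarity between the caption
    embedding and each windowed audio embedding, max-pooled over windows."""
    sims = []
    for win in audio_embs:  # one CLAP embedding per audio window
        sims.append(np.dot(win, text_emb) /
                    (np.linalg.norm(win) * np.linalg.norm(text_emb)))
    return max(sims)  # max pooling picks the best-aligned window

def caf_score(audio_embs, text_emb, lalm_score, w=0.5):
    """Calibrate coarse CLAP alignment with a fine-grained LALM judgment.
    `lalm_score` is assumed normalized to [0, 1]; `w` is a fixed weight."""
    clap = clap_window_score(audio_embs, text_emb)
    return w * clap + (1.0 - w) * lalm_score
```

The fixed `w` mirrors the weighting parameter the review flags as potentially suboptimal; an adaptive per-pair weight would be the natural extension.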
The experiments conducted on the BRACE benchmark are extensive and well-structured, demonstrating the effectiveness of CAF-Score in comparison to both reference-based and existing reference-free metrics. The paper provides a thorough analysis of the performance across multiple models and configurations, showcasing the robustness of the proposed metric in various scenarios, including hallucination detection.
The implementation details are clearly outlined, and the authors provide a GitHub repository with code and results, enhancing the reproducibility of the study. However, the reliance on specific model configurations and the computational overhead of LALMs may pose challenges for some researchers attempting to replicate the results.
The paper acknowledges that the performance of CAF-Score is bounded by the capabilities of the underlying models, and instances of simultaneous misalignment between CLAP and LALMs can lead to failures in evaluation. Additionally, the fixed weighting parameter may not be optimal for all audio-caption pairs, suggesting a need for further exploration of adaptive strategies.
The proposed CAF-Score metric has significant implications for the field of audio captioning, providing a scalable and robust evaluation framework that does not rely on costly ground-truth annotations. This advancement could facilitate the development of more effective audio understanding and captioning systems, ultimately enhancing the accessibility and usability of audio content across various applications.
A multi-task learning framework is proposed for optimizing a single deep neural network (DNN) for joint noise reduction (NR) and hearing loss compensation (HLC). A distinct training objective is defined for each task, and the DNN predicts two time-frequency masks. During inference, the amounts of NR and HLC can be adjusted independently by exponentiating each mask before combining them. In contrast to recent approaches that rely on training an auditory-model emulator to define a differentiable training objective, we propose an auditory model that is inherently differentiable, thus allowing end-to-end optimization. The audiogram is provided as an input to the DNN, thereby enabling listener-specific personalization without the need for retraining. Results show that the proposed approach not only allows adjusting the amounts of NR and HLC individually, but also improves objective metrics compared to optimizing a single training objective. It also outperforms a cascade of two DNNs that were separately trained for NR and HLC, and shows competitive HLC performance compared to a traditional hearing-aid prescription. To the best of our knowledge, this is the first study that uses an auditory model to train a single DNN for both NR and HLC across a wide range of listener profiles.
Primary: Technical University of Denmark
All Institutions: Technical University of Denmark
This paper presents a novel multi-task learning framework for joint noise reduction and hearing loss compensation using a single deep neural network. The approach's innovative use of a differentiable auditory model and listener-specific personalization is a significant contribution to the field, with promising experimental results that could lead to practical applications in hearing aids and auditory processing technologies.
The paper introduces a multi-task learning framework that optimizes a single DNN for joint noise reduction (NR) and hearing loss compensation (HLC). The distinct training objectives for each task are well-defined, and the use of a differentiable auditory model for end-to-end optimization is a significant methodological advance. The incorporation of listener-specific audiograms as input for personalization without retraining is particularly innovative, showcasing a practical approach to tailoring solutions for individual users. The methodology is sound, but further details on the architecture and training process would enhance understanding.
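The independent-adjustment mechanism from the abstract (exponentiating each predicted mask before combining) can be sketched in a few lines. The function name and the choice of magnitude-domain masks are assumptions for illustration:

```python
import numpy as np

def combine_masks(noisy_mag, mask_nr, mask_hlc, alpha=1.0, beta=1.0):
    """Apply NR and HLC time-frequency masks to a noisy magnitude
    spectrogram. The exponents alpha/beta scale each effect independently:
    an exponent of 0 disables that mask (mask**0 == 1), 1 applies it fully,
    and intermediate values interpolate the processing strength."""
    return noisy_mag * (mask_nr ** alpha) * (mask_hlc ** beta)
```

For example, `alpha=0.0` bypasses noise reduction entirely while leaving hearing loss compensation active, which is what allows listener-side tuning without retraining.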
The experiments conducted demonstrate a clear comparison between the proposed method and existing approaches, including a cascade of two separately trained DNNs and traditional hearing-aid prescriptions. The results indicate improvements in objective metrics, which are crucial for validating the effectiveness of the proposed framework. However, the paper could benefit from more extensive subjective evaluations (e.g., user studies) to complement the objective metrics and provide a holistic view of performance.
The paper lacks detailed implementation specifics, such as hyperparameters, training data characteristics, and the exact architecture of the DNN. This omission makes it challenging to fully assess reproducibility. Including a supplementary material section with code, datasets, and configuration files would significantly enhance reproducibility.
One limitation is the reliance on the audiogram as an input, which may not be available for all users. Additionally, while the results are promising, the paper does not address potential scalability issues or the performance of the model in real-world scenarios with diverse acoustic environments. The generalizability of the findings across different populations and hearing profiles also warrants further investigation.
The proposed framework has significant implications for the field of audiology and assistive technologies, potentially improving the quality of life for individuals with hearing loss. By enabling personalized adjustments to noise reduction and hearing compensation, this research could lead to more effective hearing aids and auditory devices. The integration of machine learning in this domain represents a step forward in the intersection of health technology and artificial intelligence.
Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse STS features coherently. To handle complex multi-event scenarios, we further propose Bi-Frame Sound Synthesis, enabling parallel in-frame and out-of-frame audio generation and improving controllability. To support training and evaluation, we construct the DirectorSound dataset and introduce VGGSoundDirector and DirectorBench. Experiments demonstrate that FoleyDirector substantially enhances temporal controllability while maintaining high audio fidelity, empowering users to act as Foley directors and advancing V2A toward more expressive and controllable generation.
Primary: Zhejiang University
All Institutions: Zhejiang University, The State Key Lab of Brain-Machine Intelligence, Zhejiang University
FoleyDirector introduces a novel framework for fine-grained temporal control in video-to-audio generation, significantly advancing the state-of-the-art in this domain. The combination of innovative methodologies and comprehensive experimental validation positions this work as a meaningful contribution to the field of machine learning and audio synthesis.
The methodology presented in FoleyDirector is innovative, particularly with the introduction of Structured Temporal Scripts (STS) and the Script-Guided Temporal Fusion Module. These components allow for fine-grained temporal control in video-to-audio generation, addressing a significant gap in existing methods that struggle with complex audio generation scenarios. The integration of Bi-Frame Sound Synthesis further enhances the capability to manage both in-frame and out-of-frame audio, showcasing a thoughtful approach to improving controllability in audio synthesis. The methodology is well-structured and provides a clear framework for implementation.
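A Structured Temporal Script pairs short temporal segments with captions; a minimal representation (the class and field names are assumptions, not the paper's interface) might look like:

```python
from dataclasses import dataclass

@dataclass
class ScriptSegment:
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds
    caption: str     # sound-event description for this segment

# Example script: an on-screen event followed by an off-screen one,
# the kind of case plain video conditioning struggles with.
sts = [
    ScriptSegment(0.0, 1.5, "a dog barks twice"),
    ScriptSegment(1.5, 4.0, "distant thunder rumbles off-screen"),
]
```

Encoding each segment's caption separately is what gives the Temporal Script Attention module per-interval conditioning rather than a single global caption.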
The experimental section demonstrates a robust evaluation of the proposed framework. The construction of the DirectorSound dataset and the introduction of evaluation benchmarks (VGGSoundDirector and DirectorBench) are commendable, as they provide necessary resources for training and evaluation. The experiments effectively illustrate the improvements in temporal controllability and audio fidelity, with results that substantiate the claims made in the paper. However, details on the evaluation metrics used and their significance could be elaborated further to enhance clarity.
While the paper outlines the methodology and experiments, it lacks explicit details regarding the implementation and availability of the code or datasets, which could hinder reproducibility. Providing a link to a project repository or supplementary materials would greatly enhance the paper's reproducibility and allow other researchers to build upon this work.
One limitation is the potential complexity in user interaction with the system, as fine-grained control may require a steep learning curve for users unfamiliar with audio synthesis. Additionally, the paper does not address the scalability of the framework in real-world applications or the computational resources required for training and inference.
The advancements made in FoleyDirector have significant implications for various applications, including film production, video game development, and virtual reality, where precise audio generation is critical. By empowering users to act as Foley directors, the framework can enhance the creative process in multimedia content creation, potentially leading to more immersive experiences.
Spoken dialogue generation is crucial for applications like podcasts, dynamic commentary, and entertainment content, but poses significant challenges compared to single-utterance text-to-speech (TTS). Key requirements include accurate turn-taking, cross-turn acoustic consistency, and long-form stability, which current models often fail to address due to a lack of dialogue context modeling. To bridge this gap, we present MOSS-TTSD, a spoken dialogue synthesis model designed for expressive, multi-party conversational speech across multiple languages. With enhanced long-context modeling, MOSS-TTSD generates long-form spoken conversations from dialogue scripts with explicit speaker tags, supporting up to 60 minutes of single-pass synthesis, multi-party dialogue with up to 5 speakers, and zero-shot voice cloning from a short reference audio clip. The model supports various mainstream languages, including English and Chinese, and is adapted to several long-form scenarios. Additionally, to address limitations of existing evaluation methods, we propose TTSD-eval, an objective evaluation framework based on forced alignment that measures speaker attribution accuracy and speaker similarity without relying on speaker diarization tools. Both objective and subjective evaluation results show that MOSS-TTSD surpasses strong open-source and proprietary baselines in dialogue synthesis.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
MOSS-TTSD presents a significant advancement in spoken dialogue generation, effectively addressing key challenges in the field. The comprehensive evaluation framework and the model's capabilities for long-form synthesis and multi-party interactions mark a notable contribution to the audio processing landscape.
The methodology presented in MOSS-TTSD is robust and well-structured, addressing significant challenges in spoken dialogue generation. The use of a fully discrete speech generation paradigm, combined with a multi-head delay pattern for autoregressive prediction, is innovative. The model's ability to handle long-form synthesis and multi-party dialogue through explicit speaker tagging and zero-shot voice cloning is a notable advancement. The introduction of the TTSD-eval framework for objective evaluation is a significant contribution, as it addresses the limitations of existing metrics that rely on speaker diarization.
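TTSD-eval's speaker attribution measure is described only at a high level; one simple diarization-free form (the function and segment format are assumptions) compares the speaker identified for each force-aligned script turn against the scripted speaker tag:

```python
def speaker_attribution_accuracy(segments):
    """Each segment is a (hyp_speaker, ref_speaker) pair for one
    force-aligned script turn: the speaker inferred from the synthesized
    audio versus the speaker tag in the dialogue script. Accuracy is the
    fraction of turns rendered in the correct voice."""
    if not segments:
        return 0.0
    correct = sum(1 for hyp, ref in segments if hyp == ref)
    return correct / len(segments)
```

Because the reference turns come from the script itself via forced alignment, no external speaker diarization system is needed, which is the stated motivation for the framework.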
The experiments conducted are comprehensive, utilizing both objective and subjective evaluation methods. The paper provides a clear comparison against strong open-source and proprietary baselines, demonstrating the superiority of MOSS-TTSD in terms of speaker consistency and intelligibility. The use of diverse test sets and the detailed description of the evaluation metrics enhance the credibility of the results.
The paper lacks specific URLs for the code and models, which hinders reproducibility. While the methodology is described in detail, the absence of a public repository makes it difficult for other researchers to replicate the results. Providing access to the code and trained models would significantly improve the reproducibility of the findings.
One limitation is the reliance on high-quality training data, which may not be readily available for all languages and scenarios. Additionally, while the model supports multiple languages, the performance across less common languages is not thoroughly evaluated. The potential for biases in the voice cloning process, particularly with limited reference audio, is another area that could be explored further.
The implications of MOSS-TTSD are substantial, particularly in applications such as podcasts, dynamic commentary, and entertainment content. The ability to generate coherent and natural multi-party dialogues opens new avenues for automated content creation and enhances user interaction in various multimedia applications. The model's multilingual capabilities also contribute to its broader applicability in global contexts.
Human communication seamlessly integrates speech and bodily motion, where hand gestures naturally complement vocal prosody to express intent, emotion, and emphasis. While recent text-to-speech (TTS) systems have begun incorporating multimodal cues such as facial expressions or lip movements, the role of hand gestures in shaping prosody remains largely underexplored. We propose a novel multimodal TTS framework, Gesture2Speech, that leverages visual gesture cues to modulate prosody in synthesized speech. Motivated by the observation that confident and expressive speakers coordinate gestures with vocal prosody, we introduce a multimodal Mixture-of-Experts (MoE) architecture that dynamically fuses linguistic content and gesture features within a dedicated style extraction module. The fused representation conditions an LLM-based speech decoder, enabling prosodic modulation that is temporally aligned with hand movements. We further design a gesture-speech alignment loss that explicitly models their temporal correspondence to ensure fine-grained synchrony between gestures and prosodic contours. Evaluations on the PATS dataset show that Gesture2Speech outperforms state-of-the-art baselines in both speech naturalness and gesture-speech synchrony. To the best of our knowledge, this is the first work to utilize hand gesture cues for prosody control in neural speech synthesis. Demo samples are available at https://research.sri-media-analysis.com/aaai26-beeu-gesture2speech/
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of Gesture2Speech, a multimodal TTS framework that leverages hand gestures to enhance prosody in synthesized speech, showcasing a novel approach to integrating visual cues in speech synthesis. This work represents a significant step forward in the field of expressive speech synthesis, combining advanced machine learning techniques with insights from human communication to create more natural and engaging speech outputs.
The proposed Gesture2Speech framework introduces a novel multimodal TTS architecture that integrates hand gestures as dynamic control signals for prosody modulation in synthesized speech. The use of a Mixture-of-Experts (MoE) architecture to dynamically fuse linguistic and gesture features is innovative, allowing for flexible and context-aware speech synthesis. The introduction of a gesture-speech alignment loss to ensure temporal synchrony between gestures and prosodic contours is a significant methodological advancement. However, the paper could benefit from a more detailed explanation of the training process and the specific configurations of the MoE modules.
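The gesture-speech alignment loss is not specified in detail here; a plausible minimal form (an assumption, not the paper's exact loss) penalizes low cosine similarity between time-aligned gesture and prosody feature frames:

```python
import numpy as np

def alignment_loss(gesture_feats, prosody_feats, eps=1e-8):
    """Frame-wise cosine-similarity loss between time-aligned gesture and
    prosody features, both of shape (T, D). The loss is 1 minus the mean
    per-frame cosine similarity, so perfectly parallel frame pairs give 0."""
    g = gesture_feats / (np.linalg.norm(gesture_feats, axis=1, keepdims=True) + eps)
    p = prosody_feats / (np.linalg.norm(prosody_feats, axis=1, keepdims=True) + eps)
    cos = np.sum(g * p, axis=1)  # per-frame cosine similarity
    return 1.0 - cos.mean()
```

Minimizing such a term pushes prosodic contours to co-vary in time with hand movements, which is the fine-grained synchrony the paper targets.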
The experiments conducted on the PATS dataset demonstrate the effectiveness of the Gesture2Speech framework in improving speech naturalness and gesture-speech synchrony compared to state-of-the-art baselines. The use of both objective metrics (e.g., WER, CER, UTMOS) and subjective evaluations (Mean Opinion Scores) provides a comprehensive assessment of the model's performance. The results indicate that the proposed multimodal approach significantly enhances prosodic expressiveness and alignment, although further exploration of different datasets and real-world applications could strengthen the findings.
The paper provides a clear description of the experimental setup, including the dataset, model configurations, and evaluation metrics, which aids reproducibility. However, the lack of a publicly available code repository limits the ability for others to replicate the results directly. Including implementation details such as hyperparameters and training procedures would further enhance reproducibility.
One notable limitation is the reliance on the PATS dataset, which may not encompass a diverse range of cultural and emotional expressions. Additionally, the framework's performance in real-world scenarios, where full-body visibility or high-resolution hand tracking may not be feasible, remains uncertain. The paper also does not address potential computational overhead associated with the MoE architecture, which could impact deployment in resource-constrained environments.
The Gesture2Speech framework has significant implications for applications in areas such as virtual assistants, dubbing, and interactive storytelling, where expressive speech synthesis is crucial. By incorporating hand gestures into TTS systems, the research paves the way for more natural and engaging human-computer interactions. Furthermore, the findings could inspire future research into multimodal communication and the integration of additional non-verbal cues.
While Large Audio-Language Models (LALMs) have been shown to exhibit degraded instruction-following capabilities, their ability to infer task patterns from in-context examples under audio conditioning remains unstudied. To address this gap, we present ALICE, a three-stage framework that progressively reduces textual guidance to systematically evaluate LALMs' in-context learning ability under audio conditioning. Evaluating six LALMs across four audio understanding tasks under two output constraint categories, we uncover a consistent asymmetry across all stages and LALMs: in-context demonstrations reliably improve format compliance but fail to improve, and often degrade, the core task performance. This suggests that LALMs can glean surface-level formatting patterns from demonstrations but may struggle to leverage cross-modal semantic grounding to reliably infer task objectives from audio-conditioned examples, highlighting potential limitations in current cross-modal integration.
Primary: National Taiwan University
All Institutions: National Taiwan University
This paper presents ALICE, a novel framework for evaluating the in-context learning ability of large audio-language models, revealing critical insights into their limitations in cross-modal integration and task inference. The methodology is robust, and the findings contribute meaningfully to the understanding of LALMs, although further exploration in more diverse settings is warranted.
The paper introduces a novel three-stage evaluation framework (ALICE) that systematically reduces textual guidance to assess the in-context learning (ICL) ability of large audio-language models (LALMs) under audio conditioning. The methodology is well-structured, allowing for controlled experiments that isolate the effects of textual cues on task performance and format compliance. The use of diverse audio understanding tasks and the careful selection of models enhances the robustness of the findings.
The experiments are comprehensive, evaluating six LALMs across four audio understanding tasks with two output constraint categories. The results reveal a consistent asymmetry where in-context demonstrations improve format compliance but do not enhance core task performance, providing valuable insights into the limitations of current LALMs. The evaluation metrics are appropriate, and the analysis of results is thorough, although the paper could benefit from more detailed statistical analysis to support the claims.
The paper provides a GitHub repository link for the inference code and related resources, which is a positive aspect for reproducibility. However, the paper lacks detailed implementation specifics regarding the models and datasets used, which may hinder full reproducibility.
The study is limited to format-constrained audio understanding tasks, which may not generalize to other domains or more complex tasks. Additionally, the reliance on surface-level pattern matching for format inference suggests that the models may not be fully leveraging the potential of cross-modal integration, indicating a gap in their capabilities.
The findings have significant implications for the development of LALMs and highlight the need for improved training paradigms that better integrate auditory information with task objectives. This research could inform future work in multimodal AI systems, particularly in enhancing instruction-following and task inference capabilities in audio-language models.
With the advancements in AI speech synthesis, it is easier than ever before to generate realistic audio in a target voice. One only needs a few seconds of reference audio from the target, quite literally putting words in the target person's mouth. This imposes a new set of forensics-related challenges on speech-based authentication systems, videoconferencing, and audio-visual broadcasting platforms, where we want to detect synthetic speech. At the same time, leveraging AI speech synthesis can enhance the different modes of communication through features such as low-bandwidth communication and audio enhancements - leading to ever-increasing legitimate use-cases of synthetic audio. In this case, we want to verify if the synthesized voice is actually spoken by the user. This will require a mechanism to verify whether a given synthetic audio is driven by an authorized identity, or not. We term this task audio avatar fingerprinting. As a step towards audio forensics in these new and emerging situations, we analyze and extend an off-the-shelf speaker verification model developed outside of forensics context for the task of fake speech detection and audio avatar fingerprinting, the first experimentation of its kind. Furthermore, we observe that no existing dataset allows for the novel task of verifying the authorized use of synthetic audio - a limitation which we address by introducing a new speech forensics dataset for this novel task.
Primary: Fort George G. Meade MD
All Institutions: Fort George G. Meade MD
The main contribution of this paper is the introduction of a novel framework for verifying the authorized use of synthetic audio through audio avatar fingerprinting. This work addresses a critical need in the evolving landscape of AI-generated content and has the potential to significantly impact the fields of audio forensics and security.
The paper proposes a novel approach termed "audio avatar fingerprinting," which extends existing speaker verification models to detect synthetic audio. The methodology is well-structured, leveraging off-the-shelf models while introducing a new dataset specifically designed for the task. The authors provide a clear rationale for their approach, addressing a significant gap in the current literature regarding the verification of synthetic speech. However, the paper could benefit from a more detailed explanation of the model's architecture and the specific modifications made to the existing verification model.
The authors introduce a new dataset for the task, which is a crucial contribution as it enables future research in this area. The experiments conducted demonstrate the effectiveness of the proposed method in distinguishing between authorized and unauthorized synthetic audio. The results are promising, showcasing the potential of the approach, although the paper lacks a comprehensive comparison with other state-of-the-art methods in the domain of audio forensics.
The paper does not provide sufficient details regarding the implementation of the proposed methods or the dataset creation process, which may hinder reproducibility. Including code repositories or detailed experimental setups would enhance the ability of other researchers to replicate the findings.
One notable limitation is the reliance on a single dataset, which may not capture the full diversity of synthetic audio scenarios. Additionally, the paper does not address potential adversarial attacks on the proposed method, which could be a significant concern in real-world applications.
The implications of this research are substantial, particularly in the context of audio forensics and security. As synthetic audio becomes more prevalent, the ability to authenticate voice recordings is crucial for preventing misuse. This work could pave the way for more secure communication systems and enhance trust in audio-based interactions.
Puns are a classic linguistic phenomenon that exploits polysemy and phonetic ambiguity to generate humour, posing unique challenges for natural language understanding. Alongside text and images, audio plays a central role in human communication, yet datasets and systematic resources for spoken puns remain scarce, leaving this crucial modality largely underexplored. In this paper, we present APUN-Bench, the first benchmark dedicated to evaluating large audio language models (LALMs) on audio pun understanding. Our benchmark contains 4,434 audio samples annotated across three stages: pun recognition, pun word location and pun meaning inference. We conduct a deep analysis of APUN-Bench by systematically evaluating 10 state-of-the-art LALMs, uncovering substantial performance gaps in recognizing, localizing, and interpreting audio puns. This analysis reveals key challenges, such as positional biases in audio pun location and error cases in meaning inference, offering actionable insights for advancing humour-aware audio intelligence.
Primary: University of Auckland
All Institutions: University of Auckland
This paper introduces APUN-Bench, a pioneering benchmark for evaluating audio pun understanding in large audio language models, significantly advancing the field of multimodal language processing. The comprehensive methodology and rigorous experimental evaluation highlight the challenges faced by current models, providing actionable insights for future research.
The paper presents a novel approach to benchmarking audio pun understanding through the creation of APUN-Bench, which includes a comprehensive dataset of 4,434 audio samples annotated across three distinct stages: pun recognition, pun word location, and pun meaning inference. The methodology is robust, utilizing both synthetic and real-world data, and incorporates human verification to ensure data quality. The multi-stage evaluation framework is well-structured and addresses a significant gap in the understanding of audio puns, making it a valuable contribution to the field.
The experiments conducted on 10 state-of-the-art large audio language models (LALMs) provide a thorough analysis of their performance across the three evaluation stages. The results reveal substantial performance gaps, particularly in pun word location and meaning inference, highlighting the limitations of current models. The use of statistical tests to validate findings adds rigor to the experimental evaluation.
The paper provides sufficient detail regarding the dataset construction, evaluation metrics, and model configurations, which facilitates reproducibility. However, the lack of publicly available URLs for the dataset and models limits the ease of access for other researchers.
The study acknowledges several limitations, including the restricted scope of pun types examined, the focus on single-sentence instances rather than multi-turn dialogues, and the limited size of the real-world corpus. These factors may constrain the generalizability of the findings.
The research has significant implications for advancing audio understanding in natural language processing, particularly in applications related to humor, education, and voice assistants. By addressing the complexities of audio puns, the work paves the way for more sophisticated models that can better understand and generate humor in spoken language.
Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. FLAC outperforms state-of-the-art eight-shot baselines with one-shot on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding, enabling geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
Primary: unknown
All Institutions: unknown
The main contribution of this work is the introduction of FLAC, a novel probabilistic approach to few-shot acoustic synthesis that effectively captures the uncertainty of scene acoustics, establishing a new direction for data-efficient audio generation. The paper's methodology, experimental results, and potential applications highlight its significance in advancing the field of machine learning for audio synthesis.
The paper introduces a novel approach to few-shot acoustic synthesis using a probabilistic framework that leverages flow-matching and diffusion transformers. This methodology is significant as it addresses the limitations of deterministic models by capturing the uncertainty in room impulse responses (RIRs) under sparse context. The integration of multimodal cues (spatial, geometric, and acoustic) enhances the generation process, making it more adaptable to novel environments. The use of a flow-matching objective is a fresh perspective in the domain of acoustic synthesis, providing a robust foundation for future research.
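The flow-matching objective FLAC is trained with is not spelled out in this summary, but the generic straight-path training target it builds on can be sketched in a few lines. The toy setup below (one fixed "data" vector standing in for an RIR, a hand-written oracle velocity field) is purely illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v, x1, n=256):
    """Monte-Carlo flow-matching objective: regress a velocity field
    v(x_t, t) onto the straight-line target x1 - x0, where
    x_t = (1 - t) * x0 + t * x1 and x0 is Gaussian noise."""
    total = 0.0
    for _ in range(n):
        x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
        t = rng.uniform()                    # random time in [0, 1)
        xt = (1 - t) * x0 + t * x1           # point on the straight path
        total += np.mean((v(xt, t) - (x1 - x0)) ** 2)
    return total / n

x1 = np.ones(8)                              # stand-in for one "data" sample
oracle = lambda xt, t: (x1 - xt) / (1 - t)   # algebraically recovers x1 - x0
null = lambda xt, t: np.zeros_like(xt)       # ignores the target entirely
print(flow_matching_loss(oracle, x1), flow_matching_loss(null, x1))
```

A learned model replaces the oracle in practice; conditioning on spatial, geometric, and acoustic cues, as FLAC does, amounts to giving `v` extra inputs.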
The authors conducted experiments using two datasets, AcousticRooms and Hearing Anything Anywhere, demonstrating that their method outperforms existing eight-shot baselines given only one-shot context. The introduction of AGREE, a joint acoustic-geometry embedding for evaluation, is a valuable contribution that allows for a more nuanced assessment of generated RIRs. The results are compelling, showcasing the effectiveness of FLAC in generating acoustically consistent outputs, although further details on the statistical significance of the results would strengthen the claims.
The paper lacks detailed implementation specifics, which are crucial for reproducibility. While the methodology is well-articulated, the absence of code or supplementary materials limits the ability of other researchers to replicate the findings. Including a GitHub repository or supplementary materials would significantly enhance the reproducibility of the work.
One limitation is the reliance on specific datasets, which may not generalize across all acoustic environments. Additionally, the method's performance in highly complex or dynamic scenes remains untested. The paper could also benefit from a more thorough exploration of the computational efficiency of the proposed method, especially in real-time applications.
This research has the potential to significantly impact fields such as virtual reality, gaming, and architectural acoustics by enabling more realistic sound generation in immersive environments. The ability to synthesize acoustic responses with minimal data requirements can lead to broader applications in audio engineering and sound design, making it easier to create immersive experiences without extensive recording setups.
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.
Primary: Concordia University
All Institutions: Concordia University, Mila-Quebec AI Institute
The main contribution of this paper is the introduction of a reinforcement learning-based timestamp grounding strategy for large audio-language models, which enhances reasoning accuracy and model interpretability in multimodal tasks. This work represents a meaningful advancement in the integration of temporal awareness in audio processing, addressing a critical gap in existing methodologies and paving the way for future research in this domain.
The paper introduces a reinforcement learning-based strategy for timestamp grounding in large audio-language models, which is a novel approach in the context of multimodal reasoning. The methodology is well-structured, detailing how the model utilizes explicit timestamp annotations to enhance reasoning outputs. The integration of grounding mechanisms is a significant contribution, as it addresses a gap in existing models that often lack temporal awareness. The proposed method is theoretically sound and builds upon existing frameworks, yet it also innovatively extends them by incorporating timestamp grounding, which is a fresh perspective in the field.
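The paper's actual RL reward is not reproduced in this summary, but a grounding reward of this flavor could, for instance, score the fraction of cited timestamp spans that overlap gold-relevant audio segments. The function below is a hypothetical sketch, not the authors' formulation:

```python
def overlap(a, b):
    """Length of the intersection of two (start, end) spans in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def grounding_reward(cited, relevant):
    """Hypothetical reward: fraction of cited timestamp spans that
    overlap at least one gold-relevant audio segment."""
    if not cited:
        return 0.0
    hits = sum(any(overlap(c, r) > 0 for r in relevant) for c in cited)
    return hits / len(cited)

# Two of the three cited spans fall inside the relevant region [2.0, 5.0].
print(grounding_reward([(2.1, 2.9), (4.0, 4.5), (7.0, 8.0)], [(2.0, 5.0)]))
```

A reward of this shape directly penalizes reasoning steps that cite audio regions unrelated to the evidence, which is the behavior the grounding mechanism is meant to encourage.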
The experiments are comprehensive, utilizing four benchmark datasets that are relevant to the task of speech-based reasoning. The results demonstrate a clear improvement over both zero-shot reasoning and fine-tuning approaches without timestamp grounding. The evaluation metrics used are appropriate, and the authors provide a thorough analysis of the model's performance across different scenarios. However, the paper could benefit from additional qualitative assessments to complement the quantitative results, such as user studies or case analyses.
The paper lacks detailed implementation specifics, such as hyperparameter settings, training duration, and hardware specifications, which are crucial for reproducibility. While the methodology is clearly described, the absence of a code repository or supplementary materials limits the ability of other researchers to replicate the findings. Including such details would significantly enhance the paper's impact and utility.
One notable limitation is the reliance on timestamp annotations, which may not be universally applicable across all audio tasks. Additionally, the paper does not address potential scalability issues when applying the proposed method to larger datasets or more complex audio scenarios. The authors also do not discuss the computational overhead introduced by the reinforcement learning component, which could be a concern in real-time applications.
The proposed approach has the potential to significantly advance the field of multimodal reasoning and audio processing. By grounding reasoning in temporal audio segments, it opens avenues for applications in areas such as automated transcription, audio-visual content analysis, and interactive voice response systems. The implications for improving model interpretability and reliability in audio tasks are substantial, making this research relevant for both academic and industrial applications.
We propose to model parallel streams of data, such as overlapped speech, using shuffles. Specifically, this paper shows how the shuffle product and partial order finite-state automata (FSAs) can be used for alignment and speaker-attributed transcription of overlapped speech. We train using the total score on these FSAs as a loss function, marginalizing over all possible serializations of overlapping sequences at subword, word, and phrase levels. To reduce graph size, we impose temporal constraints by constructing partial order FSAs. We address speaker attribution by modeling (token, speaker) tuples directly. Viterbi alignment through the shuffle product FSA directly enables one-pass alignment. We evaluate performance on synthetic LibriSpeech overlaps. To our knowledge, this is the first algorithm that enables single-pass alignment of multi-talker recordings. All algorithms are implemented using k2 / Icefall.
Primary: Carnegie Mellon University
All Institutions: Brno University of Technology, Carnegie Mellon University, Johns Hopkins University
The main contribution of this paper is the introduction of a novel algorithm for the single-pass alignment of multi-talker recordings using shuffle products and partial order FSAs. This work represents a significant advancement in the field of speech processing, particularly in addressing the challenges posed by overlapped speech, and has the potential to influence future research and applications in audio processing.
The methodology presented in this paper is innovative in its application of shuffle products and partial order finite-state automata (FSAs) for modeling overlapped speech. The authors effectively leverage these mathematical constructs to create a framework for alignment and transcription of multi-talker recordings. The approach of using (token, speaker) tuples for speaker attribution is particularly noteworthy, as it directly addresses a significant challenge in the field of speech processing. The imposition of temporal constraints to reduce graph size is a practical consideration that enhances the efficiency of the proposed method.
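The shuffle product the paper relies on is a standard construction: all interleavings of two sequences that preserve each sequence's internal order. A minimal Python enumeration (the authors instead compose FSAs in k2, which marginalizes over these serializations without materializing them) looks like:

```python
from itertools import combinations

def shuffle_product(a, b):
    """All interleavings of sequences a and b that preserve the
    internal order of each sequence (the shuffle product)."""
    n, m = len(a), len(b)
    results = []
    for positions in combinations(range(n + m), n):
        pos_set = set(positions)            # output slots taken by `a`
        out, ai, bi = [], 0, 0
        for i in range(n + m):
            if i in pos_set:
                out.append(a[ai]); ai += 1
            else:
                out.append(b[bi]); bi += 1
        results.append(tuple(out))
    return results

# Two streams of lengths 1 and 2 yield C(3, 1) = 3 serializations.
print(shuffle_product(["hi"], ["a", "b"]))
```

The combinatorial growth here (C(n+m, n) serializations) is exactly why the paper imposes temporal constraints via partial order FSAs to keep graph size manageable.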
The experiments conducted on synthetic LibriSpeech overlaps provide a solid basis for evaluating the proposed methods. The paper compares the performance of the shuffle product FSA against traditional methods, demonstrating a clear advantage in terms of alignment accuracy. However, the reliance on synthetic data may limit the generalizability of the results to real-world scenarios. The metrics used for evaluation are appropriate, but further validation on diverse datasets would strengthen the findings.
The paper mentions that all algorithms are implemented using k2 / Icefall, which is a positive aspect for reproducibility. However, the lack of a publicly available code repository or detailed implementation instructions may hinder other researchers from replicating the results. Providing a GitHub repository or similar resource would greatly enhance the reproducibility of the work.
One limitation of the study is the use of synthetic data for training and evaluation, which may not fully capture the complexities of real-world overlapped speech scenarios. Additionally, while the proposed method shows promise, the paper does not provide extensive comparisons with other state-of-the-art techniques, which could have offered more context regarding its performance.
The ability to accurately transcribe and attribute overlapped speech has significant implications for various applications, including automated transcription services, assistive technologies for the hearing impaired, and improvements in human-computer interaction. The proposed method could pave the way for advancements in multi-talker speech recognition systems, making them more robust and effective.
Multimodal models often converge to a dominant-modality solution, in which a stronger, faster-converging modality overshadows weaker ones. This modality imbalance causes suboptimal performance. Existing methods attempt to balance different modalities by reweighting gradients or losses. However, they overlook the fact that each modality has finite information capacity. In this work, we propose IIBalance, a multimodal learning framework that aligns the modality contributions with Intrinsic Information Budgets (IIB). We propose a task-grounded estimator of each modality's IIB, transforming its capacity into a global prior over modality contributions. Anchored by the highest-budget modality, we design a prototype-based relative alignment mechanism that corrects semantic drift only when weaker modalities deviate from their budgeted potential, rather than forcing imitation. During inference, we propose a probabilistic gating module that integrates the global budgets with sample-level uncertainty to generate calibrated fusion weights. Experiments on three representative benchmarks demonstrate that IIBalance consistently outperforms state-of-the-art balancing methods and achieves better utilization of complementary modality cues. Our code is available at: https://github.com/XiongZechang/IIBalance.
Primary: Alibaba Group
All Institutions: Alibaba Group, Beijing Jiaotong University
The main contribution of this paper is the introduction of IIBalance, a multimodal learning framework that utilizes Intrinsic Information Budgets to optimize modality contributions, leading to improved performance in scenarios with imbalanced modalities. This work significantly advances the understanding of modality interplay in multimodal systems and offers a practical solution to a common challenge in the field.
The paper introduces a novel framework, IIBalance, that addresses the issue of modality dominance in multimodal learning by proposing the concept of Intrinsic Information Budgets (IIB). This approach emphasizes the importance of recognizing each modality's information capacity and adapting their contributions accordingly. The methodology is well-structured, with a clear two-stage process that includes prototype-guided relative alignment and uncertainty-aware Bayesian fusion. The use of a dataset-level prior for modality contributions is particularly innovative, allowing for a more nuanced understanding of how different modalities should contribute based on their intrinsic capabilities.
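The inference-time gating could look roughly like the sketch below, which softmax-combines a global per-modality budget prior with per-sample uncertainty. The exact combination rule is an assumption for illustration only, not the paper's formula:

```python
import numpy as np

def fusion_weights(budgets, uncertainties, tau=1.0):
    """Toy gating: softmax over log-budget minus per-sample uncertainty,
    so a large global budget can be overridden by high uncertainty."""
    logits = (np.log(budgets) - uncertainties) / tau
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()

# Audio carries the larger budget, but high sample-level uncertainty
# on this input shifts the fusion weight toward the visual stream.
w = fusion_weights(np.array([0.7, 0.3]), np.array([2.0, 0.1]))
print(w)
```

The point of any such rule is the interplay the paper describes: the budget acts as a dataset-level prior, while uncertainty modulates it per sample.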
The experimental validation is robust, employing three representative benchmarks (Kinetics-Sounds, CREMA-D, and AVE) to demonstrate the effectiveness of IIBalance. The results indicate consistent improvements over state-of-the-art methods, showcasing not only higher overall accuracy but also better performance in weaker modalities. The paper provides a thorough analysis of the contributions of various components of the proposed method, reinforcing the value of the IIB concept and its implementation.
The paper includes sufficient implementation details, such as training procedures, model architectures, and hyperparameter settings, which facilitate reproducibility. The authors have also made their code publicly available, further enhancing the potential for others to replicate and build upon their work.
While the proposed method shows promising results, the paper does not extensively discuss the scalability of the approach to more complex multimodal scenarios or its performance in real-world applications. Additionally, the reliance on a fixed IIB prior during training may limit adaptability in dynamic environments where modality reliability can change rapidly.
The implications of this work extend to various applications in audio-visual recognition, human-computer interaction, and any domain where multimodal data is prevalent. By improving how models leverage complementary information from different modalities, this research could enhance the robustness and accuracy of systems in fields such as robotics, surveillance, and multimedia content analysis.
This technical report presents MOSS-TTS, a speech generation foundation model built on a scalable recipe: discrete audio tokens, autoregressive modeling, and large-scale pretraining. Built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations, we release two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context/control-oriented deployment, and MOSS-TTS-Local-Transformer, which introduces a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio. Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation. This report summarizes the design, training recipe, and empirical characteristics of the released models.
Primary: Shanghai Innovation Institute
All Institutions: Shanghai Innovation Institute, MOSI Intelligence, Fudan University
The main contribution of this paper is the introduction of MOSS-TTS, a scalable speech generation model that emphasizes control and efficiency in audio synthesis. The technical contribution is significant, addressing current limitations in speech generation while providing a flexible framework for future research and application in the audio domain.
The methodology presented in MOSS-TTS is well-structured, leveraging a causal Transformer tokenizer and autoregressive modeling to achieve efficient speech generation. The introduction of MOSS-Audio-Tokenizer for compressing audio and the dual generator architecture (MOSS-TTS and MOSS-TTS-Local-Transformer) demonstrates a thoughtful approach to scalability and control in speech synthesis. The focus on zero-shot voice cloning and token-level control adds significant value to the framework, indicating a robust understanding of current challenges in the field.
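The residual vector quantization (RVQ) used by MOSS-Audio-Tokenizer layers several codebooks, each quantizing whatever the previous stage left over, so later codebooks add finer acoustic detail. A minimal sketch with random, untrained codebooks (dimensions and sizes chosen arbitrarily for illustration):

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the residual
    left by the previous stage, so later codebooks add finer detail."""
    residual, codes = x.copy(), []
    for cb in codebooks:                     # cb has shape (codebook_size, dim)
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]
    return codes, residual

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                   # stand-in for one frame embedding
codebooks = [rng.standard_normal((256, 4)) for _ in range(3)]
codes, residual = rvq_encode(x, codebooks)
# By construction, x equals the sum of the chosen codewords plus the residual.
print(codes)
```

Variable-bitrate RVQ, as in the tokenizer, then amounts to truncating the list of stages: fewer codes per frame at lower bitrates, more at higher ones.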
The paper outlines a comprehensive evaluation across multilingual and open-domain settings, which is essential for demonstrating the model's versatility. However, the lack of detailed quantitative results or comparisons with existing state-of-the-art models limits the assessment of its performance. The empirical characteristics are mentioned, but more rigorous benchmarking against established metrics would strengthen the claims of superiority.
While the paper provides a clear design and training recipe, it lacks specific implementation details and code availability, which are critical for reproducibility. The absence of a project URL further complicates the ability for other researchers to replicate the results or build upon this work.
The paper does not adequately address potential limitations of the MOSS-TTS framework, such as the computational resources required for training and inference or the potential biases in voice cloning across different demographics. Additionally, the evaluation metrics used for assessing model performance could be more thoroughly discussed.
MOSS-TTS has the potential to significantly impact various applications, including virtual assistants, content creation, and accessibility tools for individuals with speech impairments. The ability to perform zero-shot voice cloning and nuanced control over speech generation could lead to more personalized and engaging user experiences.
Large audio language models (LALMs) can answer questions about speech, music, and environmental sounds, yet their internal reasoning is largely opaque and difficult to validate. We describe TalTech's solution to the Agent Track of the Interspeech 2026 Audio Reasoning Challenge, in which systems are evaluated on reasoning process quality, specifically the factual accuracy, logical soundness, and completeness of their reasoning chains. Our multi-source ensemble pipeline uses two LALMs that generate independent observations, while a separate text-only reasoning model cross-checks these against outputs from 25 acoustic tools organized into reliability tiers. By grounding every inference step in explicit, reliability-tagged evidence, the system produces dense, verifiable reasoning chains. Our system ranked first in the challenge, outperforming all competing systems by a wide margin on the challenge's reasoning quality metric.
Primary: Tallinn University of Technology
All Institutions: Tallinn University of Technology
The paper presents a novel multi-source evidence fusion approach for audio question answering, achieving top performance in reasoning quality while addressing the challenges of reliability and transparency in LALMs. The comprehensive methodology and strong experimental results contribute significantly to the field of audio understanding and reasoning, paving the way for future advancements in multimodal AI systems.
The paper presents a robust multi-source ensemble pipeline that effectively combines two large audio language models (LALMs) with a tiered reliability framework for acoustic tools. The methodology emphasizes dual-source evidence fusion and a structured contradiction detection mechanism, which enhances the reasoning quality of the system. The approach of grounding in reliability-tagged evidence is innovative and addresses the common issue of hallucination in LALMs, making the reasoning process more transparent and verifiable.
The evaluation is conducted on the Interspeech 2026 Audio Reasoning Challenge dataset, which is comprehensive and includes a diverse range of audio scenarios. The reported results demonstrate a strong performance, with the system achieving the highest reasoning quality score and competitive accuracy. Ablation studies confirm the statistical significance of the improvements gained from the dual-source evidence fusion, reinforcing the effectiveness of the proposed methodology.
The paper provides detailed implementation details, including the models and tools used, which enhances reproducibility. However, the reliance on empirical tuning of reliability weights and confidence caps without a data-driven approach may pose challenges for complete reproducibility in other contexts.
The system's end-to-end latency of 8-10 minutes per sample limits its applicability in real-time scenarios. Additionally, while the architecture is well-suited for the challenge, its generalizability to other reasoning tasks remains to be fully validated. The empirical tuning of parameters may also restrict the adaptability of the system to different datasets or tasks.
The proposed system has significant implications for audio understanding and reasoning, particularly in applications such as automated audio analysis, content moderation, and interactive audio systems. By improving the transparency and reliability of audio question answering, it opens avenues for more trustworthy AI applications in various domains, including education, entertainment, and accessibility.
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University, Quebec Artificial Intelligence Institute, Université de Montréal
The main contribution of this paper is the introduction of FLAIR, a framework that enables full-duplex spoken dialogue models to perform latent reasoning concurrently with speech perception, enhancing response quality and conversational dynamics. This innovative approach addresses a critical gap in current dialogue systems, allowing for more human-like interactions and setting a foundation for future research in the field.
The proposed methodology, FLAIR, introduces a novel approach to full-duplex spoken dialogue systems by integrating latent reasoning during the listening phase. This is achieved through an Evidence Lower Bound (ELBO)-based objective that allows for efficient supervised fine-tuning without requiring explicit reasoning annotations. The use of a Global-aware Expert model to derive latent embeddings is innovative, as it leverages the entire dialogue context to enhance response generation. The recursive feeding of latent embeddings during user speech is a significant departure from traditional autoregressive models, allowing for continuous reasoning without latency.
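The recursive latent feedback can be pictured with a toy recurrence: each incoming frame updates a latent state that also receives the previous step's latent, so reasoning accumulates strictly causally while listening, with no post-hoc decoding pass. The linear/tanh update below is illustrative only; FLAIR's actual latent module is a learned model, not this toy:

```python
import numpy as np

rng = np.random.default_rng(0)
W_in = 0.1 * rng.standard_normal((8, 8))  # toy projection of an audio frame
W_h = 0.1 * rng.standard_normal((8, 8))   # toy latent-to-latent recursion

def think_while_listening(frames):
    """Causal latent loop: each incoming frame updates a latent state
    that also receives the previous step's latent, so 'thinking'
    accumulates during listening without any extra decoding passes."""
    h = np.zeros(8)
    for x in frames:                      # strictly left-to-right: causal
        h = np.tanh(W_in @ x + W_h @ h)
    return h

frames = [rng.standard_normal(8) for _ in range(5)]
h = think_while_listening(frames)
print(h.shape)
```

The key property mirrored here is the one the paper emphasizes: the latent at step t depends only on frames up to t, so the mechanism adds no latency at response time.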
The experimental results demonstrate the effectiveness of FLAIR across multiple benchmarks, showcasing improvements in response quality and conversational dynamics. The paper provides a comprehensive evaluation on various tasks, including factual knowledge and multi-turn question answering, and compares FLAIR against existing models. The results indicate that FLAIR achieves competitive performance, particularly in scenarios requiring reasoning, which underscores its practical applicability in real-world dialogue systems.
The paper outlines the training process and architecture in detail, including the data generation methods and the training pipeline. However, the lack of a publicly available code repository or demo limits reproducibility, as external researchers cannot easily implement or test the proposed model.
One limitation is the reliance on synthetic data for training, which may not fully capture the nuances of real-world conversational interactions. Additionally, the paper does not address potential biases in the generated datasets or the implications of using large-scale models in diverse applications. The absence of a demo or project URL also hinders practical engagement with the work.
The advancements presented in this paper have the potential to significantly enhance human-computer interaction, making conversational agents more responsive and capable of handling complex dialogue scenarios. This could lead to more natural and efficient communication in various applications, including customer service, virtual assistants, and educational tools.
Reliable Sound Source Localization (SSL) plays an essential role in many downstream tasks, where informed decision making depends not only on accurate localization but also on the confidence in each estimate. This need for reliability becomes even more pronounced in challenging conditions, such as reverberant environments and multi-source scenarios. However, existing SSL methods typically provide only point estimates, offering limited or no Uncertainty Quantification (UQ). We leverage the Conformal Prediction (CP) framework and its extensions for controlling general risk functions to develop two complementary UQ approaches for SSL. The first assumes that the number of active sources is known and constructs prediction regions that cover the true source locations. The second addresses the more challenging setting where the source count is unknown, first reliably estimating the number of active sources and then forming corresponding prediction regions. We evaluate the proposed methods on extensive simulations and real-world recordings across varying reverberation levels and source configurations. Results demonstrate reliable finite-sample guarantees and consistent performance for both known and unknown source-count scenarios, highlighting the practical utility of the proposed frameworks for uncertainty-aware SSL.
Primary: Tel-Aviv University
All Institutions: Tel-Aviv University
The main contribution of this paper is the development of a robust framework for uncertainty quantification in multi-speaker sound source localization, leveraging Conformal Prediction methods to provide reliable prediction regions and risk control. This work significantly advances the field by addressing the critical need for confidence measures in SSL, enabling more informed decision-making in complex acoustic environments.
The paper presents a novel approach to uncertainty quantification (UQ) in multi-speaker sound source localization (SSL) using the Conformal Prediction (CP) framework. It introduces two complementary methods: one for known source counts that constructs prediction regions around estimates, and another for unknown source counts that estimates the number of sources while forming corresponding prediction regions. The methodology is well-grounded in statistical theory and effectively addresses the limitations of existing SSL methods that typically provide only point estimates without quantifying uncertainty. The integration of risk control into the UQ framework is a significant advancement, allowing for more informed decision-making in practical applications.
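The known-source-count idea can be illustrated with a minimal split-conformal sketch. This is not the paper's implementation: it assumes 1-D direction-of-arrival (DOA) estimates with synthetic Gaussian localization error, and the names `conformal_radius` and `prediction_region` are illustrative. The calibration quantile uses the standard finite-sample rank, which yields coverage of at least 1 − α for the true angle.

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_radius(cal_errors, alpha=0.1):
    """Finite-sample quantile of calibration nonconformity scores.

    Uses the split-conformal rank ceil((n + 1) * (1 - alpha)), which
    guarantees P(true angle lies in the region) >= 1 - alpha.
    """
    n = len(cal_errors)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return np.inf  # too few calibration points for this alpha
    return np.sort(cal_errors)[rank - 1]

def prediction_region(estimate_deg, radius_deg):
    """Angular interval (a 1-D 'ball') centred on the point estimate."""
    return (estimate_deg - radius_deg, estimate_deg + radius_deg)

# Synthetic calibration set: point estimates vs. true DOAs (degrees).
true_doa = rng.uniform(0, 180, size=500)
est_doa = true_doa + rng.normal(0, 2.0, size=500)  # localizer noise
cal_errors = np.abs(est_doa - true_doa)            # nonconformity score

radius = conformal_radius(cal_errors, alpha=0.1)
lo, hi = prediction_region(90.0, radius)

# Empirical coverage on fresh test data should be close to 90%.
test_true = rng.uniform(0, 180, size=2000)
test_est = test_true + rng.normal(0, 2.0, size=2000)
coverage = np.mean(np.abs(test_est - test_true) <= radius)
print(f"radius = {radius:.2f} deg, coverage = {coverage:.3f}")
```

In the paper's setting the nonconformity score would instead be derived from the SSL likelihood map, and the unknown-count variant first bounds the risk of mis-estimating the number of sources before forming per-source regions; the calibration-quantile mechanism above stays the same.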
The experimental evaluation is comprehensive, utilizing both simulated environments and real-world recordings to assess the proposed methods under varying conditions of reverberation and source configurations. The results demonstrate the effectiveness of the proposed frameworks, showing reliable finite-sample guarantees and consistent performance across different scenarios. The use of both classical and deep learning-based likelihood maps strengthens the validity of the findings. However, the paper could benefit from more detailed comparisons with existing state-of-the-art methods to contextualize the improvements achieved.
The authors provide a GitHub repository with the code, which enhances reproducibility. The detailed description of the experimental setup, including datasets and calibration processes, allows other researchers to replicate the experiments. However, the paper would benefit from more specific instructions for running the code, and potentially a demo or example outputs.
While the proposed methods show promise, they rely on the accuracy of the likelihood maps generated by the underlying SSL methods, so performance may degrade in highly complex acoustic environments or under significant noise interference. Additionally, the paper does not address the computational efficiency of the proposed methods, which could be a concern for real-time applications.
The research has significant implications for various applications in audio signal processing, robotics, and human-computer interaction, where reliable sound source localization is critical. By providing a framework for uncertainty-aware SSL, this work could enhance the robustness of systems that rely on accurate localization, such as autonomous vehicles and assistive technologies for the hearing impaired. The integration of UQ into SSL methods could also pave the way for more advanced applications in augmented reality and immersive audio experiences.