We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Nanyang Technological University, Hunyuan Team, Tencent, Tianjin University, ZODA, Peking University, Fudan University
MMAE introduces a meticulously designed benchmark for instruction-based audio editing, addressing a critical gap in the field. The core methodology revolves around a comprehensive, multi-dimensional taxonomy and a novel rubric-based evaluation paradigm. The taxonomy is a strong point, categorizing tasks along three orthogonal dimensions: Modality (7 types, including mixed modalities like sound-music-speech), Complexity (6 levels from 'Single' to 'Multi-hop' and 'Multi-round'), and Operation (8 types, local and global edits). This systematic classification allows for fine-grained analysis of model capabilities across a wide range of real-world scenarios, moving beyond the fragmented and restricted scope of previous benchmarks. The distributions across these dimensions are presented, indicating a diverse dataset. The rubric-based evaluation paradigm is particularly innovative and robust for open-ended generative tasks. It decomposes free-form editing tasks into 17,741 atomic, verifiable criteria, assessed along two dimensions: Instruction Following (IF) and Consistency (CR). The four principles guiding rubric design (Completeness, Atomicity, Orthogonality, and Objectivity) are well-articulated and essential for ensuring the reliability and interpretability of the evaluation. Using an external, high-performance audio language model (Qwen3-Omni) as a judge, with multiple queries and majority voting, adds a layer of objectivity and scalability that traditional human MOS ratings often lack for such complex tasks. The Exact Match Rate (EMR) serves as a stringent, holistic metric for perfect execution. The data curation pipeline is rigorous, involving five stages from expert brainstorming to iterative quality inspection.
The human-agent collaborative annotation workflow, leveraging an agentic pipeline (Omni-Detective) for detailed audio captions and LLMs for initial rubric drafts, followed by human refinement, is a pragmatic approach to balance efficiency and quality for a large-scale, complex dataset. This hybrid approach is crucial for generating high-fidelity samples and rubrics.
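The rubric scoring described above can be illustrated with a minimal sketch. It rests on assumptions not confirmed by this summary: each criterion is tagged as IF or CR, each judge query returns a pass/fail verdict, IF and CR rates are micro-averaged over criteria, and a sample counts toward EMR only if every one of its criteria passes. The `Criterion` class and `score_benchmark` function are hypothetical names, not part of the MMAE release.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Criterion:
    dimension: str      # "IF" (instruction following) or "CR" (consistency)
    votes: List[bool]   # one pass/fail verdict per judge query


def majority(votes: List[bool]) -> bool:
    """True when strictly more than half of the judge queries say 'pass'."""
    return sum(votes) * 2 > len(votes)


def score_benchmark(samples: List[List[Criterion]]) -> Dict[str, float]:
    """Aggregate per-criterion majority verdicts into IF, CR, and EMR."""
    passed: Counter = Counter()
    total: Counter = Counter()
    exact = 0
    for criteria in samples:
        verdicts = [(c.dimension, majority(c.votes)) for c in criteria]
        for dim, ok in verdicts:
            total[dim] += 1
            passed[dim] += ok
        # Assumed EMR rule: a sample matches only if every criterion passes.
        exact += all(ok for _, ok in verdicts)

    def rate(dim: str) -> float:
        return passed[dim] / total[dim] if total[dim] else 0.0

    return {
        "IF": rate("IF"),
        "CR": rate("CR"),
        "EMR": exact / len(samples) if samples else 0.0,
    }


# Two toy samples, three judge queries per criterion.
toy = [
    [Criterion("IF", [True, True, False]), Criterion("CR", [True, True, True])],
    [Criterion("IF", [False, False, True]), Criterion("CR", [True, False, True])],
]
scores = score_benchmark(toy)
```

Under these assumptions a single failed criterion is enough to exclude a sample from EMR, which is one plausible reading of why EMR can sit far below the per-criterion IF and CR rates.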
The paper benchmarks five leading audio editing models (Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, SmartDJ) along with two baselines (Identity, Noise). The inclusion of baselines provides essential context for interpreting model performance. A notable detail is the acknowledgment of input length constraints for some models, leading to evaluation on a subset of samples (801 out of 2,000) for MMEdit, Audio-Omni, and SmartDJ, while others are evaluated on the full set. This pragmatic approach ensures fair comparison given current model limitations. The results are striking and provide clear diagnostic insights. The consistently low Exact Match Rate (EMR) across all models (below 5%, and 0% for complex mixed-modality tasks) unequivocally demonstrates that current systems are far from achieving reliable, flawless edits. This highlights a significant challenge for the field. The detailed analysis by task complexity and modality reveals universal performance degradation for more complex and mixed-modality tasks, exposing a lack of structural robustness and cross-domain synchronization. The observed trade-off between Instruction Following Rate (IF
AI glasses present a compelling platform for AI agents to serve as personalized memory assistants. To be genuinely useful, such systems must move beyond short-term video comprehension and address memory gaps that humans experience for practical, personal, or social purposes over longitudinal egocentric video streams. However, existing egocentric datasets predominantly focus on action recognition or generic QAs from short clips, measuring perceptual capabilities rather than realistic human memory needs. We introduce SuperMemory-VQA, an egocentric visual question answering (VQA) dataset for evaluating AI assistants on practical, long-horizon memory tasks. It contains 52.9 hours of everyday activities recorded with AI glasses, including synchronized RGB video, audio transcription, eye gaze, IMU, and SLAM trajectories. Through a human-verified annotation pipeline, we construct 4,853 grounded question-answer pairs that span object and location memory, intent recall, visual scene recall, timeline reconstruction, conversational memory, and in-context retrieval. Each question is posed as multiple-choice with an explicit "unanswerable" option to test hallucination robustness. Benchmarking leading agentic frameworks and LLM backbones reveals that existing systems remain far from reliable on real-world memory tasks, highlighting the need for new architectures for grounded AI memory that can answer only when evidence is sufficient. A participant survey further supports that our questions are realistic, useful, and aligned with everyday memory needs.
Primary: The Ohio State University
All Institutions: The Ohio State University, Meta
The methodology for creating SuperMemory-VQA is robust and well-conceived. The dataset is designed to address a critical gap in egocentric VQA: long-horizon memory tasks that reflect realistic human needs. The six defined memory tasks (Object & Location, Conversational, Visual Scene, In-Context Retrieval, Timeline Reconstruction, Intent Recall) are highly relevant and cover a broad spectrum of memory challenges. The use of Gen 1 Meta Aria Glasses for data collection ensures rich, synchronized multimodal data (RGB, SLAM, eye-tracking, 7-channel audio, IMU, Magnetometer, Barometer), which is crucial for grounded memory tasks. The IRB-approved protocol and anonymization procedures (WhisperX transcription with manual removal of sensitive info, EgoBlur for faces/plates) demonstrate ethical data handling. A significant methodological strength is the human-in-the-
MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces 12.5 Hz temporal representations, the adapter projects them into the decoder space, and the decoder generates autoregressive text outputs. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are further retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising understanding foundation for future voice agents.
Primary: OpenMOSS Team
All Institutions: OpenMOSS Team
MOSS-Audio presents a unified audio-language model that achieves state-of-the-art performance across various audio understanding tasks, leveraging innovative architectural choices and a comprehensive training methodology. The technical contributions, particularly in temporal grounding and feature injection, position it as a significant advancement in the field of audio processing and multimodal AI systems.
The methodology presented in MOSS-Audio is robust and innovative, employing a modular architecture that integrates a dedicated audio encoder, modality adapter, and a large language model. The introduction of DeepStack cross-layer feature injection is particularly notable, as it enhances the model's ability to capture multi-level acoustic features, which is crucial for understanding complex audio inputs. Additionally, the incorporation of explicit time markers for temporal grounding is a significant advancement, allowing the model to handle timestamped transcription and time-aware question answering more effectively. The event-preserving audio annotation pipeline is also a strong methodological contribution, ensuring that the model is trained on coherent audio segments rather than arbitrary cuts, which is essential for real-world audio understanding tasks.
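A rough sketch of what inserting timestamp markers into an audio-token stream could look like is given below. Only the 12.5 Hz frame rate comes from the abstract; the marker format `<t=...s>`, the 2-second interval, and the function name `insert_time_markers` are illustrative assumptions, and strings stand in for what would be encoder embeddings in the real model.

```python
from typing import List

FRAME_RATE_HZ = 12.5  # encoder frame rate stated in the abstract


def insert_time_markers(frames: List[str], interval_s: float = 2.0) -> List[str]:
    """Interleave timestamp markers into a frame sequence every `interval_s` seconds.

    The marker text and the interval are assumptions made for illustration.
    """
    frames_per_marker = int(round(interval_s * FRAME_RATE_HZ))
    if frames_per_marker < 1:
        raise ValueError("interval_s is shorter than one encoder frame")
    out: List[str] = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            # Elapsed time of frame i, derived from the frame rate.
            out.append(f"<t={i / FRAME_RATE_HZ:.1f}s>")
        out.append(frame)
    return out


# 60 frames at 12.5 Hz cover 4.8 seconds of audio.
stream = insert_time_markers([f"a{i}" for i in range(60)])
```

With a marker every 25 frames, the decoder sees an explicit time reference alongside the acoustic frames, which is the kind of cue that timestamped transcription and time-aware question answering can condition on.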
The empirical results demonstrate that MOSS-Audio achieves state-of-the-art performance across a variety of benchmarks, including general audio understanding, speech captioning, ASR, and timestamped ASR. The model's performance is particularly impressive given its relatively compact size compared to other state-of-the-art models, indicating efficient scaling. The evaluation methodology is thorough, utilizing a range of benchmarks and metrics that provide a comprehensive assessment of the model's capabilities. However, the paper could benefit from more detailed comparisons against a wider array of existing models to further contextualize its performance.
The paper provides a detailed description of the architecture, training pipeline, and evaluation metrics, which supports reproducibility. However, the absence of publicly available code or model weights limits the practical reproducibility of the results. Future work should consider releasing the model and training code to facilitate independent validation and experimentation by the research community.
While MOSS-Audio shows strong performance, the paper does not address potential limitations related to the model's generalization to unseen audio types or its performance in noisy environments. Additionally, the reliance on large-scale annotated datasets may raise concerns about the model's applicability in low-resource settings. The paper could also explore the computational costs associated with training and deploying such models, which can be significant.
MOSS-Audio has the potential to significantly impact various applications, including voice assistants, automated transcription services, and audio analysis tools in diverse fields such as healthcare, entertainment, and security. By providing a unified framework for audio understanding, the model can enhance user interactions with technology and improve accessibility for individuals with hearing impairments. The advancements in temporal reasoning and audio-grounded reasoning could also lead to more sophisticated AI systems capable of understanding and responding to complex audio cues in real-time.
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.
Primary: Johannes Kepler University
All Institutions: Johannes Kepler University, Sony Computer Science Laboratories
LiveBand proposes a robust and well-thought-out methodology for real-time, causal music accompaniment generation. The core innovation lies in training a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision. This design choice directly addresses the critical challenges of real-time systems: strict causality, low latency, and maintaining musical coherence without future lookahead. A key strength is the elimination of teacher forcing and its associated exposure bias. By feeding the generator only independent per-step Gaussian noise and causally available mix context, the training and inference conditions are matched by design. This allows for parallel computation during training while ensuring that the model's internal state accurately reflects its own generated history during autoregressive inference, avoiding the drift common in teacher-forced models. This is a significant improvement over prior autoregressive approaches that struggle with error accumulation. The use of a pre-trained *causal* audio autoencoder (a retrained CoDiCodec variant) is crucial, as it ensures that the latent representations themselves respect causality, which is fundamental to the real-time constraint. Operating in a continuous latent space, as opposed to discrete tokens, likely contributes to the high-fidelity claim and potentially smoother generation. Adversarial sequence-level supervision from a non-causal 1D convolutional discriminator is another strong methodological choice. This encourages the generator to produce holistically realistic and coherent sequences, rather than focusing on frame-level accuracy which can be overly rigid for musical tasks requiring continuous realignment. The discriminator's non-causal nature is justified by its training-only role, allowing it to provide a richer signal. 
Finally, the Adaptive Gradient Penalty (AdaGP) is a practical and elegant solution to stabilize adversarial training. By dynamically adjusting the gradient penalty weight to maintain a fixed discriminator advantage, it reduces the burden of hyperparameter tuning and improves training robustness, a common pain point in GANs. The generator's architecture, a pre-norm transformer with adaptive layer normalization, query-key normalization, Rotary Position Embeddings, and FlexAttention with scalar attention sink, incorporates modern best practices for stable and efficient transformer training.
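The claim that training and inference computations are matched can be made concrete with a small single-head attention sketch. This is not LiveBand's generator: it omits the noise input, normalization layers, rotary embeddings, and the attention sink, and the cache here grows without bound whereas the paper describes a rolling attention state. All function names are hypothetical. The point is only that a causally masked pass over the whole sequence and a step-by-step streaming pass with a key/value cache produce the same outputs.

```python
import math
from typing import List

Vector = List[float]


def attend(q_t: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Scaled dot-product attention of one query over the given keys/values."""
    d = len(q_t)
    scores = [sum(a * b for a, b in zip(q_t, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [
        sum(w * v[j] for w, v in zip(weights, values)) / z
        for j in range(len(values[0]))
    ]


def masked_pass(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Training-style pass: step t may only see positions 0..t (causal mask)."""
    return [attend(q[t], k[: t + 1], v[: t + 1]) for t in range(len(q))]


def streaming_pass(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Inference-style pass: inputs arrive one step at a time and are cached."""
    k_cache: List[Vector] = []
    v_cache: List[Vector] = []
    out: List[Vector] = []
    for q_t, k_t, v_t in zip(q, k, v):
        k_cache.append(k_t)
        v_cache.append(v_t)
        out.append(attend(q_t, k_cache, v_cache))
    return out


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x_future_changed = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]
```

Because step t never reads positions after t, altering a later frame leaves earlier outputs untouched, and the masked pass can be computed for all steps at once during training.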
The experimental evaluation is comprehensive and well-structured, providing both objective and subjective evidence for LiveBand's superiority. **Datasets and Metrics**: Training and evaluation on Slakh2100, a standard multi-instrument dataset, ensures comparability. The use of an internal corpus for one variant (LiveBand$_{\text{int}}$) adds a dimension of real-world data. The chosen objective metrics (FAD$_{\text{vgg/CLAP}}$, BA$_{\text{F1}}$, COCOLA) are highly relevant for assessing audio quality, rhythmic accuracy, and mix adherence in music generation. Evaluating over two 10-second segments to measure drift is particularly insightful for streaming models. **Baselines**: Comparing against StreamMusicGen (SMG), a recent and relevant causal autoregressive model, provides a strong benchmark. The inclusion of a bidirectional upper bound (LiveBand$_{\text{bid}}$) and ground truth offers valuable context for potential performance ceilings. **Results**: LiveBand consistently outperforms SMG across all objective metrics (FAD, BA, COCOLA) and anticipation settings (0s, 0.1s, 1s). This is a significant finding, especially the ability to maintain strong performance even with 1-second anticipation, demonstrating that the model can generate meaningfully ahead of the
Deep learning has advanced pathological voice detection rapidly, yet rare laryngeal diseases remain underexplored due to data scarcity. Recurrent Respiratory Papillomatosis (RRP) exemplifies this gap: an HPV-induced disease of the larynx in which patients oscillate between recurrence and post-surgical remission over the years. RRP demands continuous voice monitoring that existing cross-sectional corpora cannot support. We introduce the first longitudinal voice dataset for RRP, comprising recordings from 26 patients with up to ten years of follow-up. Each session pairs sustained vowels with sentence-level utterances, which are annotated by otolaryngologists and confirmed synchronously with laryngoscopy. Building on this resource, we establish a systematic benchmark spanning handcrafted features, end-to-end deep networks, self-supervised pretrained models, and recent audio large language models, all evaluated under session-level cross-validation with patient-level audit. Per-subject longitudinal analyses further confirm that the cross-sectional discriminative signal reflects laryngoscopic disease state rather than stable speaker attributes. This work lays a foundation for rare longitudinal pathological voice tasks in low-resource clinical settings.
Primary: National Taiwan University
All Institutions: National Taiwan University, National Taiwan Normal University, Academia Sinica, Massachusetts Institute of Technology, Far Eastern Memorial Hospital, Yuan Ze University, University of Southern California, Taipei Municipal Zhongshan Girls High School
The paper introduces RRP-Voice, the first longitudinal voice corpus for Recurrent Respiratory Papillomatosis, providing a critical resource for advancing voice pathology detection. The comprehensive benchmarking and longitudinal analysis contribute significantly to the field, addressing gaps in existing research and offering a foundation for future studies in rare disease diagnostics.
The methodology is robust, introducing a longitudinal dataset that addresses a significant gap in the study of rare laryngeal diseases. The systematic benchmarking across various representation families, including handcrafted features and modern deep learning approaches, demonstrates a comprehensive approach to evaluating voice pathology detection. The use of synchronous laryngoscopic labels adds credibility to the dataset and ensures that the results are clinically relevant.
The experimental setup is thorough, employing a well-structured cross-validation approach that preserves session integrity. The results show clear distinctions between different methods, particularly highlighting the effectiveness of self-supervised models over traditional supervised baselines. The longitudinal analysis provides valuable insights into the dynamics of voice pathology, which is often overlooked in cross-sectional studies.
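One way to read "session-level cross-validation with patient-level audit" is sketched below. It assumes that all recordings from a session are kept in the same fold and that the audit reports which patients appear on both sides of a split. The paper's exact protocol is not given in this summary, so the round-robin assignment, the tuple layout, and both function names are hypothetical.

```python
from typing import List, Set, Tuple

Recording = Tuple[str, str, str]  # (recording_id, session_id, patient_id)


def session_folds(recordings: List[Recording], n_folds: int) -> List[List[Recording]]:
    """Assign whole sessions to folds round-robin so no session is split."""
    sessions = sorted({session for _, session, _ in recordings})
    fold_of = {session: i % n_folds for i, session in enumerate(sessions)}
    folds: List[List[Recording]] = [[] for _ in range(n_folds)]
    for rec in recordings:
        folds[fold_of[rec[1]]].append(rec)
    return folds


def patient_audit(folds: List[List[Recording]], test_idx: int) -> Set[str]:
    """Patients present in both the held-out fold and the training folds."""
    test_patients = {rec[2] for rec in folds[test_idx]}
    train_patients = {
        rec[2] for i, fold in enumerate(folds) if i != test_idx for rec in fold
    }
    return test_patients & train_patients


recordings = [
    ("r1", "s1", "p1"), ("r2", "s1", "p1"), ("r3", "s2", "p1"),
    ("r4", "s3", "p2"), ("r5", "s4", "p3"), ("r6", "s4", "p3"),
]
folds = session_folds(recordings, n_folds=2)
shared = patient_audit(folds, test_idx=0)
```

In a longitudinal corpus the same patient contributes sessions in different disease states, so session-level splits will often share patients across train and test. Reporting that overlap explicitly is what lets a reader judge whether a classifier could be exploiting speaker identity.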
The paper provides sufficient details on the experimental setup, including model architectures, training procedures, and evaluation metrics, which enhances reproducibility. However, the absence of a publicly accessible dataset or code repository limits the ability for other researchers to replicate the findings directly.
The dataset is relatively small, with only 26 patients contributing to the recordings, which may limit the generalizability of the findings. Additionally, the study focuses primarily on cross-sectional classification without delving deeply into longitudinal predictive modeling, which could be a valuable extension.
This work has significant implications for clinical practices in monitoring rare laryngeal diseases, potentially leading to improved patient outcomes through better diagnostic tools. The introduction of a longitudinal dataset also sets a precedent for future research in low-resource clinical settings, encouraging the exploration of other rare diseases using similar methodologies.
We propose UniVocal, a unified framework that implicitly infers vocal modes from text context to pioneer Speech-Singing Code-Switching (SCS) Synthesis, a task where transitions are autonomously driven by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems relying on switching-control tags, our proposed UniVocal implicitly infers vocal modes solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. Addressing data scarcity, we introduce a scalable pipeline to synthesize diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation for planning prosody before content generation, effectively enhancing empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
Primary: Alibaba Group
All Institutions: Alibaba Group, Independent Researcher
The paper introduces UniVocal, a pioneering framework for Speech-Singing Code-Switching synthesis, utilizing innovative methodologies to enhance vocal synthesis capabilities. The comprehensive approach to addressing data scarcity and improving acoustic modeling positions this work as a significant advancement in the field of audio synthesis, with potential applications across various domains.
The methodology presented in the paper is robust and innovative, introducing a unified framework for Speech-Singing Code-Switching (SCS) synthesis. The two-stage curriculum learning strategy is a significant contribution, allowing the model to progressively learn complex vocal transitions driven by text semantics. The integration of a refined cent token and Chain-of-Thought (CoT) generation enhances prosodic and melodic control, addressing limitations in existing models that rely on semantic tokenizers. The scalable data synthesis pipeline for generating diverse code-switching data is another noteworthy aspect, demonstrating a practical approach to overcoming data scarcity in this domain.
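The summary does not define the refined cent token, so the sketch below only illustrates the standard pitch-to-cents conversion that such a token would plausibly build on: an interval in cents is 1200 times the base-2 logarithm of a frequency ratio. The 440 Hz reference, the 50-cent bin width, the token spelling, and the function names are all assumptions for illustration, not UniVocal's tokenizer.

```python
import math
from typing import Optional

REF_HZ = 440.0   # assumed reference pitch (A4)
BIN_CENTS = 50   # assumed token resolution (a quarter tone)


def hz_to_cents(f0_hz: float, ref_hz: float = REF_HZ) -> float:
    """Pitch offset from the reference in cents: 1200 * log2(f0 / ref)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def cent_token(f0_hz: Optional[float], bin_cents: int = BIN_CENTS) -> str:
    """Map an F0 value to a discrete pitch token; unvoiced frames get their own token."""
    if f0_hz is None or f0_hz <= 0:
        return "<unvoiced>"
    index = round(hz_to_cents(f0_hz) / bin_cents)
    return f"<cent_{index:+d}>"
```

For example, `cent_token(880.0)` lands 24 bins above the reference because one octave is 1200 cents. Working in cents rather than Hz makes equal musical intervals equally spaced, which is convenient when a model has to plan melody as a sequence of discrete tokens.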
The experimental evaluation is thorough, with results demonstrating state-of-the-art performance on the newly introduced SCSBench benchmark. The paper provides a comprehensive analysis of the model's capabilities across different tasks, including empathetic speech generation and singing. The use of both objective metrics (e.g., WER, SIM, UTMOS) and subjective evaluations (e.g., human ratings for empathy and musicality) strengthens the validity of the results. The ablation studies effectively highlight the contributions of specific components, such as the refined cent token and the curriculum learning strategy.
The paper includes detailed descriptions of the model architecture, training procedures, and evaluation methodologies, which support reproducibility. The authors provide links to the code and datasets, facilitating further research and validation of their findings. However, the complexity of the model and the need for substantial computational resources may pose challenges for complete replication.
The paper acknowledges limitations related to the quality of synthetic singing data and the potential for artifacts in generated audio. Additionally, the reliance on explicit semantic triggers for robust generalization in real-world scenarios indicates that the model may struggle with purely implicit transitions. These limitations suggest areas for future improvement, particularly in enhancing the model's adaptability to diverse contexts.
The implications of this research extend to various applications in audio synthesis, including entertainment, education, and assistive technologies. The ability to seamlessly switch between speech and singing could enhance user experiences in interactive media, storytelling, and educational tools. However, the potential for misuse in generating deepfakes raises ethical considerations that must be addressed in future developments.
We address the challenge of generating high-fidelity, long-form soundtracks that remain coherent across scene transitions. Existing AI music systems are mainly designed for short, isolated clips and lack mechanisms to ensure narrative continuity. We present JenBridge, a modular and interpretable framework for adaptive long-form video soundtracking that ensures both high-fidelity audio generation and transition naturalness. The core architecture is a Transformer-based generative model trained with a flow-matching objective, following a two-stage paradigm: pretraining on large-scale text-audio corpora to establish robust musical priors, then adapting to the video domain with dual text-visual conditioning for precise cross-modal alignment. Crucially, to achieve long-form coherence across diverse scene changes, JenBridge incorporates a novel adaptive transition mechanism. This system features a versatile toolkit of transition styles, including a generative transition method, and uniquely employs a Large Language Model (LLM) Agent that acts as a director to select the most appropriate transition for each narrative shift intelligently. To rigorously assess this task, we propose the LVS Benchmark, a new benchmark that includes a curated dataset and novel evaluation metrics focusing on holistic and transition-aware assessment. Extensive experiments on the proposed benchmark demonstrate that JenBridge significantly outperforms existing methods in both objective and subjective metrics, particularly in terms of transition naturalness and overall narrative coherence. JenBridge represents a significant step towards fully automated, professional-quality video soundtracking.
Primary: Jen Music AI
All Institutions: Jen Music AI
JenBridge represents a significant advancement in the field of adaptive long-form video soundtracking, integrating innovative methodologies and robust evaluation frameworks to address the challenges of coherence and narrative continuity in AI-generated music. The comprehensive analysis of its technical contributions and the proposed benchmarks positions this work as a valuable resource for future research and applications in creative AI.
The methodology presented in JenBridge is robust, employing a two-stage training paradigm that integrates a Transformer-based generative model with a flow-matching objective. The segmentation of video into semantically coherent clips allows for a more manageable approach to music generation, while the dual conditioning mechanism (text and visual) enhances cross-modal alignment. The introduction of an LLM agent as a director for adaptive transitions is particularly innovative, showcasing a sophisticated understanding of narrative coherence in soundtracking. The modularity and interpretability of the framework are significant strengths, allowing for creative control over the soundtracking process.
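The flow-matching objective mentioned above can be illustrated with a minimal sketch. It assumes the common linear-path formulation (interpolate between a noise sample and a data sample, then regress the constant velocity between them); the paper's exact variant, conditioning, and network are not described here, so `predict_velocity` and the helper names are hypothetical stand-ins.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) and its velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]  # constant along a linear path
    return x_t, v_target


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity for one sample.

    predict_velocity(x_t, t) is a hypothetical model call, not the paper's API.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x1)


# Usage: a trivial predictor that always outputs zero velocity.
loss = flow_matching_loss(lambda x_t, t: [0.0] * len(x_t), [2.0, 4.0], random.Random(0))
```

In a real system the predictor would be the Transformer and the loss would be averaged over a batch; the sketch only shows what is being regressed.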
The experimental evaluation is thorough, utilizing both objective and subjective metrics to assess performance on the newly proposed LVS Benchmark. The results demonstrate a clear superiority of JenBridge over existing methods, particularly in transition naturalness and overall coherence. The use of a user study adds a valuable layer of validation to the findings, reinforcing the model's effectiveness in real-world applications. The ablation studies further substantiate the importance of each component in the framework, providing insight into the contributions of various design choices.
The authors have committed to making the inference code and the LVS Benchmark publicly available, which is a positive step towards reproducibility. However, the foundational text-to-music model's weights will not be released due to licensing constraints, so the reported results cannot be fully reproduced. The detailed descriptions of the training procedures partly compensate by clarifying the implementation.
The paper acknowledges limitations related to the quality of the video-music training datasets, which could affect the final output's fidelity. Additionally, the LLM agent's current scope is limited to local decision-making without a global narrative understanding, which could lead to suboptimal musical choices in complex scenes. The model also does not account for original audio elements, such as dialogue, which may impact the overall coherence of the soundtrack.
JenBridge has the potential to significantly impact the field of automated soundtracking, providing tools that can enhance the creative processes of video producers and filmmakers. By bridging the gap between automated generation and professional-quality production, it opens avenues for new applications in multimedia content creation. The ethical considerations regarding data usage and the intent to empower human creators rather than replace them are commendable and reflect a responsible approach to AI development.
Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
Primary: Nagoya University
All Institutions: Nagoya University, Beihang University, TARVO, Inc.
The main contribution of this paper is the development of a novel representation learning framework that integrates speech and text for enhancing electrolaryngeal speech. This work significantly advances the field of voice conversion by addressing the inherent challenges faced by laryngectomees, providing a practical and extensible solution that can greatly improve assistive communication technologies.
The paper proposes a novel seq2seq-based framework for electrolaryngeal (EL) speech enhancement by integrating speech and text representations. The methodology is well-structured, comprising a two-stage approach that includes representation integration and reconstruction training. The introduction of three distinct fusion strategies (middle-, input-, and hybrid-level) is innovative, allowing for enhanced learning of speech-text representations. The use of auxiliary text information to improve mapping quality is a significant advancement over traditional methods that rely solely on speech representations. The incorporation of an autoencoder-style reconstruction strategy is also a notable contribution, as it maintains model simplicity while improving performance.
The experimental evaluation is robust, utilizing multiple small-scale EL2SP datasets that reflect real-world challenges in data collection for laryngectomees. The results demonstrate consistent improvements over baseline models, indicating the effectiveness of the proposed methods. The use of both objective metrics (MCD, CER, F0 RMSE) and subjective evaluations (MOS for naturalness and intelligibility) provides a comprehensive assessment of the system's performance. The statistical significance of the results strengthens the claims made by the authors regarding the superiority of their approach.
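Of the objective metrics listed above, mel-cepstral distortion (MCD) is the least self-explanatory, so a minimal sketch may help. It uses the standard per-frame definition, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), skips the 0th (energy) coefficient, and assumes the reference and converted frames are already time-aligned (for example by dynamic time warping, which is not shown). The function name and the feature extraction are assumptions, not the authors' evaluation code.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean MCD in dB over frames that are assumed to be already time-aligned.

    Each frame is a list of mel-cepstral coefficients; index 0 is skipped.
    """
    if not ref_frames or len(ref_frames) != len(conv_frames):
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Usage: one frame that differs by 1.0 in a single non-energy coefficient.
mcd = mel_cepstral_distortion([[0.0, 1.0, 0.0]], [[5.0, 0.0, 0.0]])
```

Lower values indicate that the converted speech is spectrally closer to the reference.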
The paper provides sufficient details regarding the implementation of the proposed systems, including the architecture, training procedures, and evaluation metrics. The use of established frameworks like ESPnet and the clear description of datasets and experimental protocols enhance reproducibility. However, the absence of publicly available code or detailed hyperparameter settings may hinder full reproducibility for some researchers.
One limitation of the study is the reliance on small-scale datasets, which may not fully capture the variability present in natural speech. Additionally, while the proposed methods show improvements, the performance may still not reach the level of natural human speech, indicating room for further enhancement. The complexity of the model increases with the incorporation of multiple fusion strategies, which may pose challenges in practical implementations.
The proposed methods have significant implications for assistive communication technologies, particularly for individuals who rely on electrolaryngeal devices. By improving the naturalness and intelligibility of EL speech, the research can enhance communication quality for laryngectomees, thus potentially improving their quality of life. The integration of speech and text representations may also inspire further research in multimodal speech processing and voice conversion applications.
Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
Primary: Shanghai University
All Institutions: Shanghai University, XJTLU Entrepreneur College (Taicang), Osaka University
The main contribution of this paper is the introduction of C2GA, a class-controllable generative augmentation framework that effectively addresses the challenges of data scarcity and class imbalance in respiratory sound classification. This innovative approach combines advanced generative modeling techniques with a focus on clinical relevance, significantly enhancing the performance of machine learning models in a critical healthcare domain.
The proposed C2GA framework introduces a novel approach to generative data augmentation for respiratory sound classification by leveraging a conditional Vector-Quantized Variational Autoencoder (VQ-VAE) and a Transformer-based autoregressive model. This two-stage method effectively constructs a semantically rich discrete latent space and generates high-fidelity Mel-spectrograms that maintain class semantics, addressing the limitations of existing augmentation techniques that often distort critical features. The methodology is well-structured, with clear descriptions of each stage, and emphasizes the importance of class conditioning and temporal dynamics in generating clinically relevant audio samples.
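The discrete latent space described above rests on vector quantization: each encoder output is replaced by the index of its nearest codebook entry. The sketch below shows that lookup and one possible way to fuse the selected codes with a class prototype before decoding. The additive fusion and all names are illustrative assumptions; the paper's actual fusion operator is not specified here.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, entry)) for entry in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def fuse_with_prototype(tokens, codebook, prototype):
    """Hypothetical fusion: add a global class prototype to every local code vector."""
    return [[c + p for c, p in zip(codebook[k], prototype)] for k in tokens]


# Usage: three latent frames, a three-entry codebook, and one class prototype.
codebook = [[0, 0], [1, 1], [4, 0]]
tokens = quantize([[0.1, -0.2], [0.9, 1.2], [3.5, 0.4]], codebook)
decoder_input = fuse_with_prototype(tokens, codebook, [10, 20])
```

Keeping the class information in a separate prototype is what lets the token sequence stay class-agnostic while the decoder input remains label-specific.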
The experimental evaluation is robust, utilizing two distinct respiratory sound datasets that reflect real-world challenges such as noise and class imbalance. The authors provide comprehensive results demonstrating significant improvements in classification performance, particularly for minority classes, with clear metrics (accuracy, recall, F1-score) that validate the effectiveness of the C2GA framework compared to traditional and state-of-the-art methods. The ablation studies further reinforce the contributions of individual components within the framework, showcasing the importance of each element in achieving the reported gains.
The paper documents the implementation in detail, including architecture specifications, training procedures, and hyperparameter settings, which enhances reproducibility. However, the absence of a publicly available code repository limits the ability of others to replicate the results directly.
One limitation is the reliance on specific datasets, which may not fully represent the diversity of respiratory sounds encountered in clinical practice. Additionally, while the framework shows promise, its performance in extremely noisy environments or with highly imbalanced datasets may require further validation. The lack of a demo or project URL also hinders accessibility for interested researchers.
The C2GA framework has significant implications for clinical practice, particularly in improving the robustness and accuracy of automated respiratory sound classification systems. By enhancing the ability to detect subtle pathological features in noisy and imbalanced datasets, this research could lead to better diagnostic tools and improved patient outcomes in respiratory health monitoring.
Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
Primary: Texas A&M University
All Institutions: Texas A&M University, Massachusetts Institute of Technology
The paper presents a novel parameter-efficient dual-encoder framework for underwater acoustic classification that leverages both waveform and spectrogram representations. This comprehensive analysis highlights the technical contributions, innovative methodology, and potential impact on the field of machine learning and underwater acoustics.
The proposed dual-encoder architecture is innovative in its approach to simultaneously process both waveform and spectrogram representations, addressing the limitations of single-representation models in underwater acoustic classification. The introduction of the Choquet integral for decision-level fusion is a significant methodological advancement, allowing for non-linear interactions between features. The use of parameter-efficient fine-tuning techniques further enhances the model's adaptability to domain-specific challenges while minimizing computational costs. The soft-sort gating mechanism is a clever solution to the differentiability issue associated with the Choquet integral, enabling end-to-end training.
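To make the fusion step concrete, the following sketch computes a discrete Choquet integral over two sources. It uses the textbook definition with a hard descending sort, whereas the paper is described as using a soft-sort to keep the operation differentiable, so this is a simplified illustration. The source names and measure values are made up for the example.

```python
def choquet_integral(scores, measure):
    """Discrete Choquet integral of per-source scores w.r.t. a fuzzy measure.

    scores:  dict mapping source name -> confidence for one class.
    measure: dict mapping frozenset of source names -> g value, with
             g(all sources) = 1 and g monotone under set inclusion.
    """
    order = sorted(scores, key=scores.get, reverse=True)  # hard sort, not soft-sort
    total, previous_g, subset = 0.0, 0.0, frozenset()
    for source in order:
        subset = subset | {source}  # grow the set of top-ranked sources
        g = measure[subset]
        total += scores[source] * (g - previous_g)
        previous_g = g
    return total


# Usage: illustrative numbers only; the result is approximately 0.55.
measure = {
    frozenset({"waveform"}): 0.3,
    frozenset({"spectrogram"}): 0.6,
    frozenset({"waveform", "spectrogram"}): 1.0,
}
fused = choquet_integral({"waveform": 0.9, "spectrogram": 0.4}, measure)
```

When both singleton measures are 0 the integral reduces to the minimum of the scores, and when both are 1 it reduces to the maximum, which is why inspecting the learned measure values can indicate which representation the model relies on.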
The experiments conducted on the DeepShip and ShipsEar datasets provide a solid empirical foundation for the proposed methodology. The results demonstrate clear improvements over single-encoder baselines, validating the effectiveness of the dual-encoder architecture and the Choquet integral fusion mechanism. However, the paper could benefit from additional comparative analysis against more recent state-of-the-art methods in underwater acoustic classification to strengthen claims of superiority.
The paper outlines the architecture and methodology in sufficient detail, but lacks specific implementation details such as hyperparameter settings and training procedures. Providing a code repository or supplementary material would enhance reproducibility and allow other researchers to validate the findings.
One limitation is the reliance on two specific datasets, which may not fully represent the diversity of underwater acoustic environments. Additionally, while the Choquet integral provides interpretability, the complexity of the model may pose challenges in understanding the learned fuzzy measures without further analysis.
The proposed framework has significant implications for various oceanic applications, including maritime security and environmental monitoring. By improving classification accuracy in complex underwater environments, this research contributes to advancements in autonomous underwater vehicles and acoustic monitoring systems, potentially enhancing our understanding of marine ecosystems.
Recent advances in Automatic Speech Recognition (ASR) and Large Language Models (LLMs) have significantly improved speech understanding capabilities. However, multi-speaker speech transcription remains a challenging task, constrained by highly similar speaker voices, rapid turn-taking transitions, overlapping utterances, and inaccurate speaker boundary segmentation. These challenges become particularly pronounced in real-world conversational audio, where speaker dynamics and acoustic conditions are highly variable. This technical report presents SoulX-Transcriber, a unified multi-speaker transcription system that jointly models speaker diarization (SD) and ASR within an LLM-based framework. SoulX-Transcriber adopts a two-stage training strategy to improve both speaker discrimination and transcription robustness. In the first stage, speaker-aware multi-task continuous pre-training enhances speaker representation learning and boundary perception. In the second stage, supervised fine-tuning further optimizes the model for accurate end-to-end speaker-attributed transcription under complex multi-speaker conditions. SoulX-Transcriber delivers strong performance and robustness across multiple public benchmarks, including AliMeeting, AISHELL-4, and AMI, while maintaining high adaptability to multi-domain scenarios.
Primary: Northwestern Polytechnical University
All Institutions: Northwestern Polytechnical University, Soul AI Lab, Moonstep AI
SoulX-Transcriber presents a novel end-to-end framework for multi-speaker transcription that effectively integrates speaker diarization and ASR, demonstrating strong performance across various benchmarks. The innovative methodology, comprehensive evaluation, and potential for real-world applications position this work as a significant contribution to the field of machine learning and audio processing.
The methodology presented in SoulX-Transcriber is robust, employing a two-stage training framework that effectively combines speaker diarization and automatic speech recognition within a unified model. The use of speaker-aware multi-task continuous pre-training followed by supervised fine-tuning is particularly innovative, enhancing speaker representation and transcription accuracy. The complementary data engineering pipeline, which includes both pseudo-labeled and simulated data, addresses the challenges of acquiring high-quality training data in multi-speaker scenarios. This dual approach allows for better generalization and robustness in real-world applications.
The experimental evaluation is comprehensive, utilizing multiple public benchmarks (AliMeeting, AISHELL-4, AMI) to demonstrate the model's effectiveness across different scenarios. The results indicate significant improvements in key metrics such as Diarization Error Rate (DER) and Word Error Rate (WER), showcasing the model's capability to handle both short-form and long-form audio. The inclusion of internal benchmarks further strengthens the evaluation by testing generalization across diverse conversational contexts.
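For readers unfamiliar with the transcription metric cited above, here is a minimal sketch of Word Error Rate as word-level edit distance divided by the reference length. It assumes whitespace tokenization and no text normalization, and it is not the report's scoring pipeline. Diarization Error Rate needs time-stamped speaker segments and is not covered.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1] / len(ref)


# Usage: one substitution and one deletion against a six-word reference.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The same routine gives a character-level error rate if the inputs are split into characters instead of words.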
The paper provides sufficient details regarding the training data, model architecture, and evaluation metrics, which supports reproducibility. However, the reliance on proprietary datasets and the complexity of the data generation pipeline may pose challenges for independent replication. The availability of the project URL and demo enhances the potential for others to reproduce the results.
One limitation is the potential for label noise in the pseudo-labeled data, which could affect the model's performance in high-precision tasks. Additionally, while the model shows strong performance in Mandarin-centric datasets, its adaptability to other languages or dialects may require further validation. The complexity of the model and training process could also limit accessibility for practitioners without extensive resources.
The SoulX-Transcriber framework has significant implications for industries relying on accurate multi-speaker transcription, such as customer service, meeting documentation, and media production. Its ability to handle complex conversational dynamics can enhance communication efficiency and accessibility. Furthermore, the integration of LLMs with audio processing opens avenues for future research in multimodal AI applications.
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code will be released upon acceptance.
Primary: City University of Hong Kong
All Institutions: City University of Hong Kong, Leibniz Research Center, Huawei
The main contribution of this paper is the introduction of SpeechEditBench, a comprehensive benchmark for instruction-guided speech editing that addresses the lack of unified evaluation frameworks in the field. This work significantly enhances the understanding of model capabilities and limitations, paving the way for future advancements in speech editing technologies.
The paper introduces a novel benchmark, SpeechEditBench, which systematically evaluates instruction-guided speech editing across multiple attributes. The methodology is robust, employing an anchor-based evaluation protocol that differentiates between target success and preservation success, which is critical for understanding the capabilities of Speech LLMs. The inclusion of both atomic and compositional editing tasks enhances the benchmark's comprehensiveness, allowing for nuanced assessments of model performance.
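The three metrics can be sketched as simple rates over per-sample judgments. The sketch assumes each edited sample has already been reduced to two booleans (target attribute changed as instructed, untargeted attributes preserved) and that joint success requires both; the exact definition and the anchor-based judging procedure are the paper's and are not reproduced here.

```python
def success_rates(results):
    """Compute target, preservation, and joint success rates.

    results: list of (target_ok, preserved_ok) booleans, one pair per sample.
    Joint success is assumed to mean both conditions hold for the same sample.
    """
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "target": sum(1 for t, _ in results if t) / n,
        "preservation": sum(1 for _, p in results if p) / n,
        "joint": sum(1 for t, p in results if t and p) / n,
    }


# Usage: four hypothetical samples.
rates = success_rates([(True, True), (True, False), (False, True), (True, True)])
```

Under this assumption the joint rate can never exceed either individual rate, which is consistent with the reported difficulty of achieving high joint success.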
The experimental evaluation is thorough, involving eight Speech LLMs and specialized systems across various tasks. The results are well-documented, revealing significant insights into model performance, particularly the challenges of compositional editing and preservation of untargeted attributes. The findings are significant, indicating that no single model excels across all tasks, which highlights the fragmentation in current capabilities.
The paper mentions that data and code will be released upon acceptance, which is a positive step towards reproducibility. However, specific implementation details and the exact nature of the datasets used could be better elaborated to facilitate independent verification of results.
The benchmark currently only supports English and Chinese, limiting its applicability to a broader range of languages. The reliance on automatic metrics for evaluation may not fully capture the subjective quality of the edits, and the focus on single-turn instructions excludes multi-turn editing scenarios, which are crucial for real-world applications.
The development of SpeechEditBench has the potential to significantly advance the field of speech editing and LLMs by providing a structured framework for evaluation. This can lead to improved models that better understand and execute complex speech editing tasks, with implications for applications in content creation, accessibility, and interactive voice technologies. The findings regarding model performance fragmentation can guide future research directions and model development.
Speakers in dialogue continuously adapt their communicative behavior across acoustic, lexical, and semantic dimensions, a phenomenon known as conversational entrainment. Modeling this process requires representations that capture the global structure of interaction, yet prior approaches fail to disentangle dyad-specific patterns from speaker-specific traits, limiting their ability to capture true conversational adaptation. We address this with the Dyadic Distance Matrix (DDM), which encodes all pairwise similarities between the turns of two speakers over an entire conversation, capturing long-range cross-speaker dependencies. This raises a key question: does the DDM represent genuine interaction, or merely reflect individual speaker characteristics? We propose the speaker-switch test, a principled control in which one speaker's turns are replaced with those from an unrelated speaker drawn from a different conversation. This preserves turn-level statistics while disrupting the original dyadic coadaptation. The ability to distinguish real from switched DDMs thus directly evaluates whether the representation encodes interaction-specific structure. Across four embedding types and classifiers including ResNet-50 on the CANDOR corpus, real DDMs are consistently distinguishable from their switched counterparts. Comparisons with LibriSpeech show higher discriminability in read speech, highlighting the role of prosodic variability in naturalistic conversations. GradCAM analysis further reveals distinct structural signatures driving classification. These results establish the speaker-switch test as a robust diagnostic for validating representations of dyadic conversational interaction.
Primary: Indian Institute of Technology Guwahati
All Institutions: Indian Institute of Technology Guwahati
The paper presents a systematic framework for evaluating dyadic interactions in conversational speech through the introduction of the Dyadic Distance Matrix and the speaker-switch test. This work significantly contributes to the understanding of conversational dynamics and has the potential to improve the design of more responsive and context-aware dialogue systems.
The paper introduces the Dyadic Distance Matrix (DDM) as a novel representation for capturing dyadic interactions in conversational speech. The methodology is well-structured, employing a speaker-switch test to validate the DDM's ability to distinguish genuine conversational dynamics from speaker-specific traits. The use of multiple embedding types and classifiers, including ResNet-50, enhances the robustness of the approach. The systematic evaluation across different modalities and the cross-corpus analysis provide a comprehensive understanding of the model's performance in varied contexts.
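A minimal sketch of the representation and its control follows. It assumes each turn is already a non-zero embedding vector and uses cosine distance as the pairwise measure; the paper's embedding types and distance function may differ, and all names are illustrative.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms


def dyadic_distance_matrix(turns_a, turns_b):
    """Entry [i][j] compares speaker A's i-th turn with speaker B's j-th turn."""
    return [[cosine_distance(a, b) for b in turns_b] for a in turns_a]


# Usage: a real dyad versus a speaker-switched control.
turns_a = [[1.0, 0.0], [0.0, 1.0]]          # speaker A, conversation 1
turns_b = [[1.0, 0.0], [1.0, 1.0]]          # speaker B, conversation 1
turns_other = [[0.0, 1.0], [-1.0, 0.0]]     # unrelated speaker, other conversation
real_ddm = dyadic_distance_matrix(turns_a, turns_b)
switched_ddm = dyadic_distance_matrix(turns_a, turns_other)
```

A classifier is then trained to tell `real_ddm`-style matrices from `switched_ddm`-style ones; if it succeeds, the matrix carries structure that depends on the two speakers actually having interacted.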
The experiments are thorough, utilizing the CANDOR corpus and LibriSpeech to assess the effectiveness of the DDM in capturing interaction-specific structures. The classification results demonstrate strong discriminability between real and switched DDMs, particularly with semantic embeddings. The GradCAM analysis adds interpretability, revealing the structural features that contribute to classification decisions. The results are statistically significant and provide valuable insights into the nature of conversational entrainment.
The paper provides sufficient details on the methodology, including data preprocessing, model architectures, and evaluation metrics, which facilitates reproducibility. However, the lack of publicly available code or datasets limits the ease with which others can replicate the findings.
One limitation is the reliance on the CANDOR corpus, which may not generalize to all conversational contexts. Additionally, while the speaker-switch test is a robust evaluation method, it may not capture all nuances of dyadic interaction. The paper could also benefit from a more extensive discussion on the implications of the findings for practical applications in dialogue systems.
The findings have significant implications for advancing conversational AI and dialogue systems by providing a framework to better understand and model dyadic interactions. The ability to distinguish genuine conversational dynamics could enhance applications in areas such as automated dialogue systems, sentiment analysis, and social robotics.
Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.
Primary: SooHwan
All Institutions: SooHwan, Mark
The main contribution of this paper is the introduction of SiamCTC, a novel framework that leverages monotonic temporal alignment to enhance speech representation learning, demonstrating significant improvements over existing self-supervised methods. The comprehensive analysis of the technical contributions, methodology, and experimental results underscores its significance in the field of audio processing and speech technology.
The proposed SiamCTC framework innovatively combines Siamese networks with Connectionist Temporal Classification (CTC) to address the challenge of temporal alignment in speech representation learning. By allowing flexible, monotonic alignments rather than strict frame-wise correspondences, it effectively captures linguistic invariance across varying speaking styles and rates. The integration of multiple loss components (CTC loss, KL divergence loss, and Temporal InfoNCE loss) is well-justified and enhances the robustness of the learned representations. The methodology is sound, with a clear rationale for each component and its contribution to the overall objective.
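The reason CTC removes the need for frame-wise correspondence can be seen in a small sketch of the standard CTC forward recursion: the same label sequence can be scored against outputs of any length. How SiamCTC obtains its targets is not specified here, so the `pseudo_labels` sequence, the two random "views", and the function name are assumptions for illustration only.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Standard CTC forward algorithm in the log domain.

    log_probs: (T, V) per-frame log-probabilities.
    target: list of label ids without blanks.
    Returns -log p(target | log_probs), summed over all monotonic alignments.
    """
    T = log_probs.shape[0]
    ext = [blank]
    for label in target:              # interleave blanks: b, l1, b, l2, b, ...
        ext += [label, blank]
    S = len(ext)
    alpha = np.full((T, S), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            candidates = [alpha[t - 1, s]]                 # stay
            if s >= 1:
                candidates.append(alpha[t - 1, s - 1])     # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[t - 1, s - 2])     # skip a blank
            alpha[t, s] = np.logaddexp.reduce(candidates) + log_probs[t, ext[s]]
    final = [alpha[T - 1, S - 1]]
    if S > 1:
        final.append(alpha[T - 1, S - 2])
    return -np.logaddexp.reduce(final)

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
pseudo_labels = [2, 1, 3]                           # hypothetical shared content labels
view_slow = log_softmax(rng.normal(size=(12, 4)))   # e.g. slowed-down view, 12 frames
view_fast = log_softmax(rng.normal(size=(8, 4)))    # e.g. sped-up view, 8 frames
loss_slow = ctc_neg_log_likelihood(view_slow, pseudo_labels)
loss_fast = ctc_neg_log_likelihood(view_fast, pseudo_labels)

# Sanity check on a case small enough to enumerate by hand: two frames,
# uniform over {blank, 1}; three of the four paths collapse to [1].
toy_loss = ctc_neg_log_likelihood(np.log(np.full((2, 2), 0.5)), [1])
```

Both views are scored against the same target even though their frame counts differ, which is the property that lets speed-perturbed views be compared without explicit alignment.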
The experiments are comprehensive, utilizing the LibriSpeech dataset, which is a standard benchmark in the field. The results demonstrate significant improvements in phoneme error rates (PER) compared to existing models like HuBERT and WavLM. The ablation studies effectively highlight the importance of each loss component, providing clear evidence of the framework's efficacy. However, the paper could benefit from additional metrics and comparisons with more recent models to further validate its performance.
The paper provides sufficient implementation details, including model architecture, training procedures, and hyperparameter settings, which should facilitate reproducibility. However, the absence of a publicly available code repository limits the ease with which other researchers can replicate the results. Including a link to a GitHub repository or similar would enhance reproducibility.
The authors acknowledge the sensitivity of the model to hyperparameters, particularly regarding augmentation strategies and temperature settings. This sensitivity could hinder the model's generalizability across different datasets or applications. Additionally, while the framework shows promise, training from scratch rather than fine-tuning pre-trained models may yield different results, which remains unexplored in this work.
The SiamCTC framework has the potential to significantly advance the field of self-supervised speech representation learning, particularly in applications requiring robustness to variations in speaking styles and rates, such as automatic speech recognition and speaker verification. Its flexible alignment approach could also inspire further research into more adaptive models in related domains.
As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary "AI-or-human" paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate "AI Music Tracking": the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of the HAIM dataset and a novel tracking framework that enables granular analysis of AI involvement in music production. This work significantly advances the field by moving beyond binary classifications to a more detailed understanding of hybrid human-AI collaborations in music, thereby addressing a critical gap in current detection methodologies.
The methodology presented in this paper is robust, introducing the HAIM dataset that categorizes music tracks based on the roles of AI and human contributions across various stages of music production. The multi-faceted taxonomy and diverse data sourcing are commendable, as they address the limitations of existing binary classification systems. The use of a modified Fusion Segment Transformer (MuQ-FST) for multilabel tracking is innovative, allowing for a more nuanced understanding of AI involvement in music production.
The experiments conducted are thorough, evaluating multiple existing detection systems against the HAIM dataset. The results highlight the systemic flaws in current detectors when faced with hybrid scenarios, demonstrating the effectiveness of the proposed approach. The performance metrics are well-defined, and the results are presented clearly, showcasing the advantages of the new benchmark.
The paper provides sufficient detail regarding the dataset creation, model architecture, and training procedures, which supports reproducibility. However, the lack of specific URLs for code or data repositories limits the ease of access for other researchers wishing to replicate the study.
The paper acknowledges several limitations, including the imbalance in category sizes, potential overfitting of the model to specific templates, and the need for more diverse mixing and mastering styles. Additionally, the complexity of human roles in music production is not fully captured, which may hinder the model's ability to generalize across different scenarios.
The implications of this research are significant, as it addresses the growing intersection of AI and music production, providing tools for better understanding and tracking AI contributions. This has potential applications in copyright law, music production, and the development of more sophisticated AI detection systems.
Kinship verification (KV) from voice, the task of determining whether two speakers are biologically related, has received little attention. Our work establishes a foundational basis for this emerging frontier, contributing to both performance evaluation and detection methodologies. First, leveraging the speech recordings of the large-scale audio-visual dataset, KAN-AV, we propose a revised evaluation protocol that controls for various confounders and adopts a family-disjoint train--test split to address open-set KV. Second, we analyze the close connection between speaker verification and KV, showing that genealogical similarity of speaker pairs plays opposite roles in the two tasks. Third, we tackle KV using three neural speaker embedding extractors (ECAPA-TDNN, WavLM-ECAPA, and ReDimNet) combined with various back-ends. In zero-shot KV including same-speaker target trials, ReDimNet achieves the lowest equal error rate (EER) of $20.8\%$; however, performance degrades to $39.7\%$ under strict kin trials, where same-speaker target trials are excluded. Our best trainable back-end, which applies asymmetric processing of the embedding pair to mitigate age-difference effects, obtains an EER of $32.0\%$ ($18.6\%$ with speaker target trials included). These results highlight the difficulty of KV while showing that speaker embeddings encode familial cues, offering a promising foundation for voice-based kinship analysis.
Primary: University of Eastern Finland
All Institutions: University of Eastern Finland
The paper establishes a foundational basis for voice-based kinship verification, contributing significantly to the field by addressing methodological gaps and proposing innovative solutions. The comprehensive analysis of kinship cues in voice, coupled with rigorous experimental validation, positions this work as a meaningful advancement in audio-based machine learning research.
The paper introduces a novel approach to kinship verification (KV) using voice, leveraging a large-scale audio-visual dataset (KAN-AV) and proposing a revised evaluation protocol that addresses confounding factors. The authors articulate a clear distinction between speaker verification (SV) and KV, emphasizing the unique challenges posed by familial voice similarities. They employ three advanced neural speaker embedding extractors and develop a lightweight asymmetric processing backend to mitigate age-difference effects, showcasing a thoughtful methodology that integrates both theoretical and practical considerations.
The experiments are robust, utilizing a well-curated dataset and a family-disjoint train-test split to evaluate generalization to unseen families. The results demonstrate the effectiveness of the proposed methods, with detailed performance metrics, including equal error rates (EER) for various configurations. The benchmarking against existing methods provides a solid foundation for assessing the contributions of the proposed approaches.
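Since the reported results are all equal error rates, a short sketch of how an EER can be computed from trial scores may help in reading them. This is a generic threshold sweep, not the paper's scoring code; the function name and the toy scores are assumptions.

```python
import numpy as np

def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER: operating point where miss rate and false-accept rate meet.

    Higher scores are assumed to mean "more likely kin". The threshold is swept
    over all observed scores and the point with the smallest gap between the
    two error rates is returned as their average.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        miss_rate = np.mean(target_scores < th)             # kin pairs rejected
        false_accept_rate = np.mean(nontarget_scores >= th)  # non-kin pairs accepted
        gap = abs(miss_rate - false_accept_rate)
        if gap < best_gap:
            best_gap, eer = gap, (miss_rate + false_accept_rate) / 2.0
    return eer

# Toy scores (hypothetical): perfectly separated trials give an EER of zero.
separable_eer = equal_error_rate([0.9, 0.8], [0.1, 0.2])

# Overlapping score distributions give an EER strictly between 0 and 1.
rng = np.random.default_rng(0)
overlap_eer = equal_error_rate(rng.normal(1.0, 1.0, 500), rng.normal(0.0, 1.0, 500))
```

Under this definition, adding same-speaker pairs to the target set would typically add high-scoring targets and lower the EER, which is consistent with the gap the paper reports between the two protocols.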
The paper provides sufficient detail regarding the experimental setup, including the data filtering process and the evaluation protocol. However, the absence of a publicly accessible code repository limits reproducibility. Clear descriptions of the models and training conditions are provided, but without code, independent verification of results may be challenging.
The study acknowledges the inherent difficulties in KV, particularly in strict kin trials where performance degrades significantly. The reliance on a specific dataset (KAN-AV) may limit the generalizability of the findings to other contexts or populations. Additionally, while the proposed methods show promise, further validation on diverse datasets would strengthen the claims.
The implications of this research extend to various fields, including forensics, where non-invasive kinship verification could complement traditional DNA profiling methods. The findings may also influence future work in speaker verification and voice analysis, potentially leading to advancements in applications such as familial identification in multimedia content.
The localization of moving sound sources using a microphone array is typically based on modifying the signal to compensate for the Doppler effect. In the time domain, this compensation is done on a sample-by-sample basis. In the frequency domain, short time segments need to be used in which the Doppler effect is assumed to be approximately constant, and a discrete Fourier transform is done on each segment. In contrast, the authors developed an inverse 2.5D localization method for uniformly moving single-frequency sources that works in the spectral domain and allows for the use of longer windows. This was achieved by modifying the 2.5D forward model to directly compute the effect of the motion at the static observer position. The method requires neither modification of the measured signal nor quasi-stationarity of the measurements within the window used. Unfortunately, this approach is not directly suitable for broad-band stochastic sources, and in the present work we investigate how the statistical properties of a uniformly moving stochastic source change when observed by a static observer. Using a 2.5D setting, the relation between the power spectral density of the moving source and the Loève spectrum, which is a generalization of the cross-spectral density at the static receivers, was derived. Based on simulated data with speeds up to 100 m\,s$^{-1}$, the work presented here provides a proof of concept for a method based on multi-taper estimates for the Loève spectrum to localize moving broad-band stochastic sources. Currently, the method requires a stationary source signal and that the spectral density is flat within a certain range around the frequency of interest. Also, correlations between sources are currently not considered.
Primary: Acoustics Research Institute, Austrian Academy of Sciences
All Institutions: Acoustics Research Institute, Austrian Academy of Sciences
The paper presents a novel approach to localizing moving broadband noise sources using the Loève spectrum and a 2.5D framework, contributing significantly to the field of acoustic signal processing. The methodology is innovative, addressing key challenges in the localization of stochastic sources, and the experimental validation supports its potential applicability in real-world scenarios.
The paper introduces a novel inverse 2.5D localization method that operates in the spectral domain, allowing for longer window sizes and avoiding the need for signal modification. The authors derive a relationship between the power spectral density of moving sources and the Loève spectrum, which is a significant theoretical contribution. The methodology is well-structured, leveraging multi-taper estimates for spectral analysis, and effectively addresses the challenges associated with localizing moving stochastic sources. However, the method's assumptions, such as requiring a stationary source signal and a flat spectral density, may limit its applicability in more complex real-world scenarios.
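For readers unfamiliar with the quantity being estimated, here is a rough single-channel sketch of a multi-taper Loève spectrum estimate, that is, an estimate of the correlation between spectral components at two different frequencies. It uses sine tapers in place of the Slepian (DPSS) tapers usually associated with multi-taper analysis, purely to stay within NumPy; the taper choice, function names, and test signal are assumptions and not the authors' setup.

```python
import numpy as np

def sine_tapers(n_samples, n_tapers):
    """Orthonormal sine tapers (a swapped-in alternative to DPSS tapers)."""
    n = np.arange(1, n_samples + 1)
    k = np.arange(1, n_tapers + 1)[:, None]
    return np.sqrt(2.0 / (n_samples + 1)) * np.sin(np.pi * k * n / (n_samples + 1))

def loeve_spectrum(x, n_tapers=5):
    """Multi-taper estimate of the Loève spectrum gamma(f1, f2).

    Each taper yields one eigencoefficient spectrum; the estimate averages
    the outer product Y_k(f1) * conj(Y_k(f2)) over tapers.
    Returns an (F, F) complex matrix with F = len(x) // 2 + 1.
    """
    tapers = sine_tapers(len(x), n_tapers)            # (K, N)
    eigencoeffs = np.fft.rfft(tapers * x, axis=1)     # (K, F)
    return np.einsum("kf,kg->fg", eigencoeffs, eigencoeffs.conj()) / n_tapers

rng = np.random.default_rng(0)
signal = rng.normal(size=256)          # stationary white noise as a stand-in
gamma = loeve_spectrum(signal)
tapers = sine_tapers(256, 5)
```

The diagonal of `gamma` is the ordinary multi-taper power spectrum estimate. For a stationary signal the off-diagonal terms are expected to be small, whereas motion-induced non-stationarity at a static receiver is what places energy off the diagonal.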
The experiments utilize simulated data to validate the proposed localization method, demonstrating its effectiveness in distinguishing moving sources at high speeds. The results are presented clearly, showcasing the correlation between theoretical and estimated spectra across various conditions. The use of a 64-channel microphone array in one of the experiments adds practical relevance to the findings. However, the reliance on simulations may not fully capture the complexities of real-world acoustic environments.
The paper lacks specific implementation details that would facilitate reproducibility, such as code availability or detailed descriptions of the experimental setup. While the methodology is theoretically sound, the absence of a publicly available implementation limits the ability of other researchers to replicate the results.
The method currently requires assumptions that may not hold in all scenarios, such as the need for a stationary source signal and flat spectral density. Additionally, the impact of correlations between sources is not considered, which could affect the localization accuracy in practical applications. The reliance on simulated data also raises questions about the method's robustness in real-world conditions.
The proposed localization method has potential applications in various fields, including transportation noise monitoring, environmental acoustics, and sound source localization in urban settings. By improving the accuracy of moving source localization, this work could contribute to advancements in noise control strategies and urban planning.
Speech denoising is an often necessary step not only for human listening, but also for downstream processing by systems lacking robustness to noisy, real-world acoustic conditions. Unfortunately, denoising is a problem where conventional in-domain supervised training is not trivial, as the training targets cannot be annotated by humans: producing a clean version of a naturally-noisy speech recording is itself the task to solve. Supervised training is typically performed through the artificial addition of noise to clean speech recordings, which can only be sourced from controlled domains, a significant limitation due to the poor out-of-domain generalization of neural networks. An alternative is noisy target training (NyTT), which simply replaces the clean speech with in-domain noisy recordings, with the hope that learning to remove the artificial noise will extend to the natural. Though having shown promising results, NyTT's training objective is not minimized by clean speech estimates. We show that by estimating the artificial noise in addition to the naturally-noisy speech, the undesirable optimum can actually be exploited: the residual noise in the speech estimate can be canceled by the noise estimate via simple subtraction. Crucially, the optimum is fully compatible with conventional artificial mixtures, enabling joint training using both types of data with consistent optimization targets, opening the door to improved domain adaptability. The effectiveness of our approach is demonstrated through WHAM! and CHiME-3-based benchmarks.
Primary: Unknown
All Institutions: Unknown
The main contribution of this paper is the introduction of a novel approach to speech denoising that effectively exploits noise inseparability through Differential Noise Filtering, significantly improving performance in weakly-supervised settings. The technical contributions and methodology are well-articulated, showcasing a promising advancement in the field of audio processing.
The proposed methodology introduces Differential Noise Filtering (DNF), which innovatively combines noisy target training with conventional supervised training. This dual-output approach allows for the estimation of both the noisy speech and the noise itself, enabling effective noise cancellation through subtraction. The methodology is well-grounded in theoretical analysis, leveraging scale-invariance principles and providing a clear framework for joint training with synthetic data. The integration of these concepts is a notable strength of the paper.
The experiments conducted on the WHAM! and CHiME-3 datasets provide robust evidence of the effectiveness of the proposed method. The reported improvements in SI-SDR and DNSMOS metrics demonstrate the practical applicability of the DNF approach. However, the paper could benefit from a more extensive comparison with state-of-the-art methods and a clearer presentation of results in tabular form.
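As context for the SI-SDR figures, the metric can be sketched in a few lines. This follows the common scale-invariant definition with mean removal; it is not taken from the paper's evaluation code, and the synthetic signals are placeholders.

```python
import numpy as np

def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB.

    The estimate is projected onto the reference; the projection counts as
    target energy and the remainder as distortion. Mean removal is a common
    convention and is assumed here.
    """
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.dot(target, target) / np.dot(distortion, distortion))

rng = np.random.default_rng(0)
clean = rng.normal(size=4000)                     # placeholder for clean speech
noise = rng.normal(size=4000)
mild = si_sdr(clean + 0.1 * noise, clean)         # lightly corrupted estimate
heavy = si_sdr(clean + 1.0 * noise, clean)        # heavily corrupted estimate
rescaled = si_sdr(3.0 * (clean + 0.1 * noise), clean)
```

Because the projection absorbs any gain applied to the estimate, `rescaled` matches `mild`. Note that SI-SDR requires a clean reference, so it applies to the synthetic-mixture benchmarks rather than to naturally noisy recordings.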
While the paper outlines the model architecture and training configurations, it lacks specific implementation details and code availability, which may hinder reproducibility. Clearer documentation or a supplementary repository would enhance this aspect.
The paper acknowledges limitations in performance when compared to fully supervised methods, particularly in high-noise scenarios. Additionally, the potential for increased WER due to the cleaner outputs produced by DNF is a notable drawback. The reliance on the quality of the noisy data also poses challenges for generalization.
The proposed method has significant implications for real-world applications in speech processing, particularly in environments where clean speech data is scarce. By improving the robustness of speech denoising systems, this work could enhance communication technologies, assistive devices, and various AI-driven audio applications.
Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six self-supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.
Primary: PSL University
All Institutions: PSL University, Laboratoire d'Informatique et Systèmes, Université Aix-Marseille
The main contribution of this paper is the development of a context-aware model for distinguishing child-directed speech from adult-directed speech in long-form recordings, significantly enhancing the scalability and accuracy of analyses in children's language environments. The work represents a meaningful advancement in the intersection of machine learning and developmental linguistics, with potential applications in both research and practical settings.
The methodology employed in this paper is robust, utilizing self-supervised learning models specifically tailored for child-directed speech detection. The authors effectively leverage a multilingual dataset and incorporate contextual information into their models, which is a significant advancement over previous isolated utterance approaches. The context-aware fine-tuning strategy is particularly noteworthy, as it addresses the limitations of existing models by enhancing the input with surrounding audio, thereby improving classification performance. The use of multiple self-supervised models and a clear delineation of the addressee classification problem showcases a well-structured approach.
The experimental evaluation is comprehensive, involving a well-defined dataset comprising recordings from diverse languages and sociocultural contexts. The results demonstrate a substantial improvement in classification accuracy through the incorporation of contextual information, with an impressive absolute gain of 13.8% in average F1-score. The paper also contrasts the performance of various models, including both in-domain and out-of-domain pre-trained models, providing a thorough analysis of their effectiveness. However, the lack of a detailed comparison with other state-of-the-art methods could limit the contextual understanding of their results.
The implementation details provided are thorough, including specifics on model training, evaluation metrics, and the computational resources utilized. The authors have made their code available on GitHub, which enhances the reproducibility of their work. However, additional details on hyperparameter tuning and the specific configurations of the models could further aid in replicating the results.
One limitation noted in the paper is the reliance on automatic segmentation, which can introduce errors that propagate through the classification pipeline. Additionally, while the multilingual dataset is a strength, the authors acknowledge that their models may still be limited in their ability to generalize across all languages and sociocultural contexts. The computational cost associated with context-aware fine-tuning is also a concern, as it may hinder practical applications.
This research has significant implications for the field of developmental science, as it enables large-scale analysis of children's language environments without the need for extensive manual annotation. The ability to automatically detect child-directed speech could facilitate studies on language acquisition and development across diverse populations. Furthermore, by releasing their model and code, the authors contribute to the advancement of the field, promoting further research and development in this area.
Modern audio processing networks are commonly deployed on accelerators whose peak throughput is obtained through dense linear algebra, whereas conventional acoustic frontends -- a Short-Time Fourier Transform (STFT) followed by sparse Mel aggregation -- remain structurally heterogeneous. This mismatch can introduce memory-bandwidth, dispatch, and intermediate-allocation overheads on contemporary accelerator backends. This work introduces MelT, a single-stage frontend framework in which Mel-spaced Non-Uniform Discrete Fourier Transform (NDFT) bases are precomputed and applied to time-domain acoustic frames through dense General Matrix Multiplication (GEMM) operations. The contribution is not the NDFT operator itself; rather, it is the formulation of Mel-spaced NDFT projection as a GEMM-native audio frontend and its evaluation as a hardware-efficient alternative to conventional STFT+Mel pipelines. Evaluated across platforms ranging from Apple A18 Pro edge hardware to NVIDIA H100 datacenter acceleration, MelT attains up to a $3.75\times$ speedup in inference latency and a $3.52\times$ reduction in energy consumption while maintaining downstream classification accuracy.
Primary: Instituto de Ciências Matemáticas e de Computação, University of São Paulo
All Institutions: Instituto de Ciências Matemáticas e de Computação, University of São Paulo
The paper presents MelT, a novel GEMM-native audio frontend that significantly improves the efficiency of audio feature extraction by reformulating the conventional STFT and Mel aggregation into a single-stage process. This approach not only enhances computational performance but also reduces energy consumption, making it a valuable contribution to the field of audio processing in machine learning.
The methodology presented in the paper is innovative in reformulating the conventional STFT and Mel aggregation process into a single-stage GEMM-native framework. The authors leverage the mathematical foundation of the Non-Uniform Discrete Fourier Transform (NDFT) to directly compute Mel-spaced projections, which is a significant departure from traditional methods. The approach is well-justified, with clear explanations of how it avoids the inefficiencies of multi-stage processing. The integration of dense matrix multiplication into the audio frontend design is particularly noteworthy, as it aligns with modern hardware capabilities.
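The single-stage idea can be illustrated with a small NumPy sketch: sinusoids at Mel-spaced frequencies are precomputed once, and feature extraction reduces to one dense matrix product per batch of frames. The Mel formula, the Hann window folded into the basis, the log-power output, and all function names are assumptions made for illustration; the released MelT code may differ on each of these points.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_ndft_basis(frame_len, n_bins, sample_rate):
    """Precompute a real-valued NDFT basis at Mel-spaced frequencies.

    Returns a (frame_len, 2 * n_bins) matrix holding windowed cosine columns
    followed by windowed negative-sine columns, plus the bin frequencies in Hz.
    """
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_bins)
    freqs = mel_to_hz(mels)
    n = np.arange(frame_len)
    window = np.hanning(frame_len)                       # window folded into the basis
    phase = 2.0 * np.pi * np.outer(n, freqs) / sample_rate
    cos_part = window[:, None] * np.cos(phase)
    sin_part = -window[:, None] * np.sin(phase)
    return np.concatenate([cos_part, sin_part], axis=1), freqs

def melt_features(frames, basis):
    """One dense matrix product, then log power per Mel-spaced bin."""
    n_bins = basis.shape[1] // 2
    proj = frames @ basis                                # the only heavy operation (GEMM)
    power = proj[:, :n_bins] ** 2 + proj[:, n_bins:] ** 2
    return np.log(power + 1e-10)

sample_rate, frame_len, n_bins = 16000, 400, 40
basis, freqs = mel_ndft_basis(frame_len, n_bins, sample_rate)
t = np.arange(frame_len) / sample_rate
frames = np.stack([np.sin(2 * np.pi * freqs[30] * t),
                   np.sin(2 * np.pi * freqs[10] * t)])
features = melt_features(frames, basis)
```

The basis here has `frame_len * 2 * n_bins` entries, so the cost of the product grows linearly with the number of bins, which is in line with the diminishing returns at larger bin counts noted in the limitations.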
The experiments are robust, involving multiple hardware platforms (NVIDIA H100, V100, Apple M4 Pro, and A18 Pro) and demonstrating significant speedups and energy reductions. The benchmarks are comprehensive, covering various input durations and providing detailed latency and energy consumption metrics. The downstream task validation on VoxCeleb1 and SPIRA COVID-19 detection further strengthens the findings, showing that the new method maintains competitive performance with traditional approaches.
The paper provides a GitHub repository with source code, benchmark scripts, and configuration files, which enhances reproducibility. The detailed descriptions of experimental setups, including hardware configurations and statistical methodologies, allow other researchers to replicate the experiments effectively.
One limitation discussed is the scaling behavior of the proposed method, which shows diminishing returns as the number of Mel bins increases. The authors acknowledge that while the method is advantageous in the compact-bin regime, it may not perform as well in scenarios requiring a larger number of Mel bins. Additionally, the paper does not explore the potential for further optimization or adaptation to other audio processing tasks beyond the evaluated benchmarks.
The proposed MelT framework has significant implications for the efficiency of audio processing in machine learning applications, particularly in environments where computational resources are limited. By aligning audio feature extraction with the capabilities of modern accelerators, this work could lead to more efficient real-time audio applications, including speech recognition and classification tasks. The findings may inspire further research into hardware-optimized audio processing techniques, potentially influencing future designs of audio frontends in deep learning systems.
Multi-pitch estimation (MPE) typically predicts which pitches are active in a mixture, but not which instrument or source produced them. This paper investigates a lightweight slot-attention framework for multi-instrument MPE (MI-MPE), where a mixture CQT is mapped to an unordered set of source-like pitch maps. The model uses permutation-invariant Hungarian matching to avoid fixed output semantics and treats the number of slots as an upper bound on the number of active sources. We further study two modular extensions: a self-supervised timbre encoder that provides training-time targets for slot-level timbre embeddings, and a polyphony branch that regularizes the pitch density of mixture- and slot-level predictions. Experiments show that Hungarian matching substantially improves instrument family decomposition on URMP. Stem-level prediction remains more challenging: timbre and polyphony supervision improve selected configurations, but do not consistently resolve source assignment. The results suggest that slot-based architectures are a promising direction for source-aware MPE, while highlighting the need to couple auxiliary musical cues to slot identity more carefully.
Primary: Ilmenau University of Technology
All Institutions: Ilmenau University of Technology
The paper presents a novel lightweight slot-attention framework for MI-MPE, contributing significantly to the field by addressing the challenges of source decomposition and pitch estimation in complex audio mixtures. The methodology and experimental results indicate a promising direction for future research in music information retrieval.
The paper proposes a lightweight slot-attention framework for multi-instrument multi-pitch estimation (MI-MPE), which innovatively uses permutation-invariant Hungarian matching to allow for flexible output semantics. The methodology is well-structured, introducing a self-supervised timbre encoder and a polyphony branch to enhance the model's capabilities. The use of an unordered set of pitch maps is particularly noteworthy, as it addresses the challenges of fixed output semantics in traditional models. However, the complexity of the model's architecture may pose challenges for practical implementation and deployment.
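The permutation-invariant matching step can be illustrated with a small sketch. It is not the authors' implementation: the mean-squared-error cost, the flattened pitch maps, and the function names are assumptions made for illustration, and an exhaustive search over assignments stands in for the Hungarian algorithm (it returns the same optimal assignment for a handful of slots, whereas a real system would more likely call something like `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def pair_cost(pred, target):
    """Mean squared error between two flattened pitch maps (assumed cost)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def match_slots(slots, sources):
    """Find the slot-to-source assignment with the lowest total cost.

    slots:   list of K predicted pitch maps (flat lists of floats)
    sources: list of S <= K reference pitch maps
    Returns (assignment, total_cost), where assignment[j] is the index of
    the slot matched to source j. Surplus slots stay unmatched, which is
    how K can act as an upper bound on the number of active sources.
    """
    best_assignment, best_cost = None, float("inf")
    # Exhaustive search over ordered choices of S slots; fine for small K.
    for chosen in permutations(range(len(slots)), len(sources)):
        cost = sum(pair_cost(slots[i], sources[j]) for j, i in enumerate(chosen))
        if cost < best_cost:
            best_assignment, best_cost = chosen, cost
    return best_assignment, best_cost


# Toy example: three slots, two active sources, slot 0 is effectively empty.
slots = [[0.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
sources = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
assignment, cost = match_slots(slots, sources)
```

Because the loss is computed only after the best assignment is found, reordering the slots changes the returned indices but not the cost, which is what frees the model from fixed output semantics.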
The experiments are comprehensive, systematically evaluating the proposed model across various configurations and datasets, including URMP and mshoxxDB. The results indicate that the slot-based approach, particularly with the incorporation of timbre and polyphony supervision, shows promise in improving source decomposition and pitch estimation. However, the performance on stem-level predictions remains inconsistent, highlighting the need for further refinement in the model's design and training.
The paper provides a detailed description of the methodology, including the architecture, training protocols, and datasets used. However, the lack of publicly available code or a demo URL limits reproducibility. Clearer documentation or a supplementary material section could enhance the ability of other researchers to replicate the study.
The paper acknowledges that while the slot-based architecture shows potential, source assignment remains a significant challenge. The coupling between auxiliary objectives and slot decomposition is identified as a limitation, suggesting that further research is needed to disentangle these components. Additionally, the performance variability across different datasets indicates that the model may not generalize well to all types of music.
The proposed framework has the potential to advance the field of music information retrieval by enabling more accurate and flexible multi-pitch estimation in complex audio mixtures. This could have applications in automatic music transcription, music analysis, and even real-time audio processing systems. The lightweight nature of the model also suggests it could be deployed in resource-constrained environments, broadening its accessibility.
Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence-arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.
Primary: Columbia University
All Institutions: Columbia University
The main contribution of this paper is the introduction of Sympatheia, a voice-native framework for emotionally aligned speech dialogue that integrates implicit and explicit affect conditioning. This work represents a significant advancement in the development of empathetic voice assistants, providing a comprehensive approach to generating emotionally appropriate responses in spoken dialogue systems. The combination of a novel dataset, robust methodology, and thorough evaluation underscores its importance in the field of machine learning and audio processing.
The methodology presented in this paper is robust and innovative, combining implicit affect inference from user speech with explicit valence-arousal (VA) conditioning. The authors construct a novel dataset (Sympatheia-18k) that allows for the training of a speech-to-speech dialogue system capable of generating emotionally appropriate responses. The use of continuous VA coordinates as a conditioning mechanism is a significant advancement over traditional discrete emotion categories, allowing for more nuanced emotional responses. The integration of multimodal emotion sensing modules adds further depth to the system, making it adaptable to various input types. The architecture follows a well-established speech-language model (GLM-4-Voice) but enhances it with emotional conditioning, which is a thoughtful approach to improving empathetic dialogue systems.
The experimental evaluation is comprehensive, utilizing both automated and human assessments to evaluate the empathetic response quality of the Sympatheia system. The authors employ a variety of metrics, including empathy scores from an audio-capable LLM and a human Emotion Mean Opinion Score (MOS) study, which provides a well-rounded view of the model's performance. The results indicate that Sympatheia significantly outperforms baseline models in generating emotionally appropriate responses, validating the effectiveness of the proposed methods. The use of both emotional and neutral splits in the dataset allows for a thorough examination of the model's capabilities across different emotional contexts.
The paper provides detailed implementation details, including training configurations and dataset generation processes, which enhance reproducibility. The availability of the project code and dataset on GitHub and Hugging Face respectively further supports the ability of other researchers to replicate the study. However, the reliance on synthetic data for training may introduce variability that could affect reproducibility in real-world applications.
The paper acknowledges several limitations, including the synthetic nature of the training data, which may not fully capture the complexity of real-world conversations. Additionally, the fixed VA anchors used for emotional conditioning may not universally apply across different cultures or individual expressions of emotion. The authors also note that the current evaluation primarily relies on automated assessments, which may miss nuanced failures in empathy and appropriateness.
The potential applications of Sympatheia are significant, particularly in assistive technologies, education, and mental health support, where emotionally aware interactions can enhance user experience. However, the deployment of such systems raises ethical considerations regarding privacy and the potential for misuse in manipulative contexts. The authors emphasize the need for safeguards and responsible deployment practices to mitigate these risks.
We address the problem of out-of-distribution (OOD) detection for target observations embedded in a subspace of the high-dimensional data space. Using continuous normalizing flows (CNFs), we propose a Lagrangian sub-flow (LSF) framework designed to isolate and estimate the density of the relevant components in the representation while using the remaining components as context. Through experimentation with models for speech synthesis, we show that CNFs, similarly to other deep generative models (DGMs), are susceptible to the "likelihood paradox", where high likelihood is erroneously assigned to OOD samples. This is attributed to the inductive bias of DGMs, which prioritize low-level structural details over high-level semantic coherence. To mitigate this phenomenon, we propose a number of geometric diagnostic signals based on the velocity field over the sub-flow trajectory. Based on these signals, we design metrics for the challenging task of zero-shot phoneme-level mispronunciation detection. Finally, we demonstrate the superiority of these metrics compared to likelihood-based methods on a real-world mispronunciation detection benchmark.
Primary: Norwegian University of Science and Technology
All Institutions: Norwegian University of Science and Technology, Tsinghua University
This paper presents a novel framework for using continuous normalizing flows in out-of-distribution detection, significantly advancing the understanding and application of generative models in high-dimensional data analysis. The methodology is innovative, addressing key challenges in the field, and the experimental results demonstrate its effectiveness in a practical application.
The paper introduces a novel Lagrangian sub-flow (LSF) framework for out-of-distribution (OOD) detection using continuous normalizing flows (CNFs). The methodology is well-grounded in fluid dynamics principles, allowing for localized analysis of high-dimensional data while maintaining global context. The approach effectively addresses the "likelihood paradox" by isolating relevant components in the data representation, which is a significant advancement in the field of generative models. The proposed geometric diagnostic signals and metrics for phoneme-level mispronunciation detection are innovative and provide a fresh perspective on OOD detection.
The experiments are robust, utilizing a real-world dataset (CMU Kids) for zero-shot phoneme-level mispronunciation detection. The results demonstrate the superiority of the proposed metrics over traditional likelihood-based methods, highlighting the effectiveness of the LSF framework. The evaluation metrics, including ROC-AUC, are appropriate for the task, although further validation across diverse datasets would strengthen the findings.
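For readers unfamiliar with the metric, ROC-AUC can be read as the probability that a randomly chosen positive example (here, a mispronounced phoneme) receives a higher detector score than a randomly chosen negative one. The sketch below computes it directly from that definition; the scores and labels are made-up illustrative values, not results from the paper, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    scores: detector outputs, higher meaning "more likely mispronounced"
    labels: 1 for mispronounced, 0 for correctly pronounced
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative scores only: one positive (0.35) falls below one negative (0.6).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
```

In this toy case eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. Because the metric depends only on ranking, it allows likelihood-based scores and geometric scores to be compared without calibrating either to a common scale.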
The paper provides sufficient details on the experimental setup, including model training and evaluation processes. However, the lack of publicly available code or a demo limits reproducibility. Clear descriptions of the methods and metrics used contribute positively, but access to implementation details would enhance reproducibility.
The study is primarily focused on a specific application in speech synthesis, which may limit the generalizability of the findings. The authors acknowledge the need for further validation across other domains, indicating that the framework's applicability is yet to be fully explored. Additionally, the complexity of the proposed methods may pose challenges for practical implementation in real-time systems.
The proposed framework has the potential to significantly improve OOD detection in various applications beyond speech synthesis, such as computer vision and medical imaging. By enhancing the ability to detect mispronunciations and other anomalies, this work could lead to advancements in automated speech recognition and generative modeling, ultimately benefiting user experience and system reliability.
Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.
Primary: unknown
All Institutions: unknown
The main contribution of this paper is the introduction of QuAP, a prototype system that integrates content-based audio retrieval and procedural sound generation, thereby addressing the fragmentation in current sound design workflows. This work represents a significant advancement in audio processing, combining innovative methodologies with practical applications, and highlights the importance of user-centered design in the development of creative tools.
The methodology employed in the development of QuAP is robust, integrating a hybrid retrieval system with procedural audio synthesis and an intelligent parameter assistant. The use of MobileNet for audio embeddings and the feature-driven bottleneck framework for optimizing synthesis parameters demonstrates a thoughtful approach to addressing the challenges in sound design workflows. However, the paper could benefit from a more detailed description of the implementation specifics and the exact parameters used in the optimization process.
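As a rough illustration of what similarity-based retrieval over audio embeddings involves, the sketch below ranks a small library by cosine similarity to a query embedding. The three-dimensional vectors, the file names, and the choice of cosine similarity are all assumptions for illustration; the paper's actual embedding size, similarity measure, and indexing strategy are not described in this review.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, library, k=2):
    """Return the names of the k library sounds most similar to the query."""
    ranked = sorted(
        library.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical embeddings; a real system would obtain them from an audio encoder.
library = {
    "rain_light": [0.9, 0.1, 0.0],
    "rain_heavy": [0.8, 0.3, 0.1],
    "engine_idle": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
results = top_k(query, library, k=2)
```

A linear scan like this is adequate for small libraries; larger collections would typically precompute normalised embeddings and use an approximate nearest-neighbour index instead.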
The experimental evaluation is well-structured, utilizing a MUSHRA subjective evaluation to assess the quality of the synthesized audio and an ablation study to compare encoder architectures. The results indicate statistically significant improvements in sound quality for most models, which supports the effectiveness of the proposed system. However, the relatively small sample size in the user evaluation (16 participants) may limit the generalizability of the findings.
While the paper provides a project URL and mentions the use of established datasets and frameworks, it lacks detailed implementation instructions or code availability, which could hinder reproducibility. More explicit documentation on the setup and execution of experiments would enhance this aspect.
The study acknowledges limitations, particularly in the synthesis quality of certain models (e.g., Rocket and Jet) and the narrow scope of sound categories supported by QuAP. The reliance on subjective evaluations may also introduce biases, and the tool's performance in real-world scenarios remains to be fully validated.
QuAP has the potential to significantly impact sound design practices by streamlining workflows and enhancing creative exploration. By unifying retrieval and synthesis, it could facilitate more efficient sound design processes across various industries, including film, gaming, and music production. The focus on maintaining creative agency while providing intelligent assistance is particularly relevant in the context of increasing automation in creative fields.