Unifying speech, sound, and music generation in one model is hindered by tradeoffs between fidelity, end-to-end training, in-context conditioning, and variable-length synthesis that no current paradigm fully resolves. To address this challenge, we present AudioCALM, a universal audio generation framework that extends autoregressive (AR) next-token prediction from discrete tokens to continuous audio latents: a thin flow-matching head replaces the softmax to predict rectified-flow velocities at each position, and a block-causal AR-Flow attention pattern produces arbitrary-length output. Joint training of multiple audio generation tasks faces an asymmetric text--audio mismatch: speech transcripts align to specific time spans and demand tight, time-aligned attention, whereas sound and music captions describe only overall semantics and rely on diffuse, holistic attention; mixing the two disproportionately degrades sound and music generation. We address this asymmetry at two levels: a data reformulation strategy that unifies all three tasks under a single description-style conditioning interface, and a novel architecture Asymmetric Mixture-of-Modality-Experts (A-MoME), which adds a dedicated residual expert for speech while sound and music share the backbone, incurring no inference overhead on non-speech inputs. Experimental results demonstrate that AudioCALM matches modality-specific state-of-the-art and outperforms prior unified baselines on speech, sound, and music generation benchmarks.
Primary: Hong Kong University of Science and Technology (HKUST)
All Institutions: Hong Kong University of Science and Technology (HKUST), Alibaba Group
AudioCALM presents a compelling unified audio generation framework that effectively bridges the gap between discrete autoregressive modeling and continuous flow matching, achieving state-of-the-art performance across speech, sound, and music domains while introducing novel architectural and data-level solutions to cross-modal interference.
The paper proposes AudioCALM, a unified framework for text-to-speech, text-to-sound, and text-to-music generation. The core methodological innovation is "Continuous Autoregressive Language Modeling" (CALM), which replaces the discrete softmax output of standard autoregressive language models with a continuous flow-matching head that predicts rectified-flow velocities over VAE latents. This allows the model to leverage the streaming and in-context capabilities of AR models while avoiding the information bottleneck of discrete tokenization. Key technical components include: 1) AR-Flow Attention: A block-causal attention pattern that allows bidirectional flow matching within a block of latents while maintaining autoregressive commitment across blocks, enabling variable-length generation. 2) Asymmetric Mixture-of-Modality-Experts (A-MoME): A novel architectural design that adds a dedicated residual expert for speech (which requires tight local alignment) while sharing the backbone for sound and music (which rely on global semantics), addressing the identified "asymmetric mismatch" in joint training. 3) Description-Style Conditioning: A data reformulation strategy using an MLLM to generate long-form, modality-specific descriptions from short captions/transcripts, unifying the conditioning interface across modalities. The approach is theoretically sound and addresses specific pain points in unified audio generation (fidelity vs. flexibility, cross-modal interference).
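The block-causal pattern described for AR-Flow Attention can be illustrated with a small mask builder. This is a minimal sketch, not the authors' implementation: the function name `block_causal_mask` and the `block_size` parameter are assumptions, and details such as how text-conditioning tokens enter the sequence are omitted.

```python
def block_causal_mask(num_positions: int, block_size: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Positions in the same block see each other in both directions;
    across blocks, attention only looks backwards (causal at block level).
    """
    mask = []
    for i in range(num_positions):
        block_i = i // block_size
        # Position j is visible if its block is the same as, or earlier than, block_i.
        row = [(j // block_size) <= block_i for j in range(num_positions)]
        mask.append(row)
    return mask


mask = block_causal_mask(num_positions=6, block_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With six positions and blocks of two, the first two rows allow only the first block, the middle two rows allow the first two blocks, and the last two rows allow everything. Extending the sequence by one more block leaves earlier rows unchanged, which is the property that makes variable-length generation possible.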
The evaluation is comprehensive, covering three distinct audio modalities on standard benchmarks (LibriTTS, SeedTTS for speech; AudioCaps for sound; Song-Describer for music). The paper compares AudioCALM against both modality-specific state-of-the-art systems (e.g., CosyVoice 3.0, Stable Audio Open) and prior unified models (UniAudio, UniFlow-Audio). Results show that AudioCALM matches or exceeds SOTA on most metrics, particularly in sound and music generation (FAD, CLAP score) and speech intelligibility (WER). The ablation studies are particularly strong, effectively isolating the contributions of the continuous head, the description-style conditioning, and the A-MoME architecture. The finding that adding speech data disproportionately degrades non-speech generation (and vice versa) is a significant empirical insight that justifies the asymmetric design. The use of both objective metrics (FAD, WER, CLAP) and subjective evaluations (MOS) provides a robust assessment.
The paper provides detailed implementation details, including the VAE architecture (CNN-GAN with iSTFT head), training hyperparameters (AdamW, batch size, learning rate), and the specific prompts used for the MLLM captioning pipeline. The authors release code and weights, and provide the cached annotations for public datasets, which significantly aids reproducibility. The use of open-source datasets (LibriTTS, VGGSound, FMA, etc.) ensures that the training data is accessible. The only potential hurdle is the reliance on Gemini 3 Pro for the offline captioning step, but the authors mitigate this by releasing the prompts and the resulting captions.
The authors acknowledge several limitations: 1) The training data is restricted to English speech and public sound/music corpora, limiting generalization to non-English speech, singing voice, and rare audio events. 2) The backbone scale is limited to 1.7B-4B parameters, leaving open questions about scaling behavior. 3) Long-form generation coherence and termination are not deeply investigated, with the current system relying on a simple stop head. 4) The use of a closed-source MLLM for data preparation introduces a dependency that may not be fully reproducible by all researchers without access to similar models.
AudioCALM represents a significant step towards universal audio generation, which has broad applications in accessibility (TTS), creative industries (music/sound design), and research (data augmentation). However, the power of unified models to clone voices and generate realistic sound effects raises serious concerns about misuse, including impersonation, fraud, and disinformation. The authors address this by implementing safeguards in the license (prohibiting non-consensual cloning) and discussing the need for synthetic audio detection. The release of such a powerful model requires careful consideration of these risks.
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the "Reference Shortcut". During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
Primary: Lightricks
All Institutions: Lightricks, Tel Aviv University
The paper introduces ScenA, a flow-matching framework for multi-speaker audio scene generation that overcomes the "Reference Shortcut" via high-noise-biased training, enabling robust speaker binding and rich ambient audio generation from minimal natural language and reference inputs.
The paper proposes ScenA, a novel framework for multi-speaker audio scene generation that conditions a pre-trained text-to-audio flow-matching model on multiple reference voices and free-form natural language prompts. The core methodological innovation lies in the "Reference Shortcut" diagnosis and its mitigation. The authors identify that standard flow-matching training schedules allow the model to bypass text-based speaker binding by relying on acoustic similarity between the noisy target and the clean reference latents. To counter this, they introduce a high-noise-biased timestep distribution (Beta+Uniform mixture) that forces the model to rely on the text prompt for identity assignment during the critical early denoising steps. The architecture is notably minimalist, using concatenated reference latents with lightweight identity-aware positional encodings, avoiding complex identity encoders or structured supervision tags. This approach leverages the inherent capabilities of large-scale in-the-wild audio foundation models to generate ambient textures, overlapping speech, and paralinguistic events jointly with dialogue, a significant departure from traditional speech-only TTS pipelines.
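The idea of a high-noise-biased Beta+Uniform timestep mixture can be sketched as follows. The shape parameters, the mixing probability, and the convention that a value of 1.0 means pure noise are all assumptions made for illustration; the paper's actual formulation and values are not given in this summary.

```python
import random


def sample_noise_level(rng: random.Random,
                       alpha: float = 4.0,
                       beta: float = 1.0,
                       uniform_prob: float = 0.2) -> float:
    """Draw a noise level in [0, 1], where 1.0 is assumed to mean pure noise.

    With probability `uniform_prob` the draw is uniform, so every noise level
    is still visited; otherwise it comes from Beta(alpha, beta), which is
    skewed towards 1.0 when alpha > beta.
    """
    if rng.random() < uniform_prob:
        return rng.random()
    return rng.betavariate(alpha, beta)


rng = random.Random(0)
samples = [sample_noise_level(rng) for _ in range(10000)]
high_fraction = sum(s > 0.5 for s in samples) / len(samples)
print(f"fraction of draws above 0.5: {high_fraction:.2f}")
```

Under these hypothetical settings most training steps see heavily corrupted targets, where acoustic matching against the clean references is least informative, while the uniform component keeps low-noise steps in the training mix.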
The evaluation is rigorous and well-designed, focusing on both speaker binding fidelity and audio quality. The authors utilize the CoVoMix2-Dialogue benchmark, creating specific subsets (CoVoMix2-Dialogue-20s and CoVoMix2-Dialogue-WildRef) to test performance on studio-clean versus in-the-wild references. They compare ScenA against state-of-the-art multi-speaker dialogue TTS systems (MOSS-TTS, VibeVoice, ZipVoice, Dia). Results show ScenA outperforms baselines on binding-aware metrics (cpWER, cpSIM, ACC), particularly demonstrating robustness when references are noisy or from the wild, where baselines fail significantly. The inclusion of a "Reference Shortcut Probe" provides strong empirical evidence for their hypothesis, showing that the model can identify speakers from noisy targets at low noise levels, validating the need for the high-noise training bias. Human preference tests further support the qualitative superiority of the generated scenes.
The paper provides substantial implementation details, including the backbone architecture (LTX-2 audio stream), training hyperparameters (AdamW, batch size, learning rate schedule), and the specific mathematical formulation of the timestep distribution. The dataset construction pipeline is described in detail, including the use of diarization and captioning models to create the training data. The code is not explicitly linked in the text provided (only a project page URL), but the methodological description is sufficient for reproduction by researchers familiar with flow-matching and diffusion transformers. The ablation studies on positional encodings and augmentation strategies add to the reproducibility and robustness of the claims.
The authors acknowledge several limitations. The generation duration is capped at 20 seconds due to the backbone's constraints, although they note this can be extended with modest fine-tuning. The number of supported speakers is limited to $K_{max}=3$ in the current configuration, constrained by the linear growth of the self-attention sequence with the number of references. The reliance on a pre-trained foundation model means the quality is bound by the underlying model's capabilities and potential biases in the in-the-wild training data. Additionally, the "Reference Shortcut" phenomenon, while solved for this specific setup, highlights a general fragility in reference-conditioned generation that may require similar careful schedule design in other modalities.
This work significantly advances the field of generative audio by demonstrating that complex, structured multi-speaker interactions can be generated using minimal, natural language conditioning on top of general-purpose audio models. This reduces the need for complex, brittle pipeline architectures in dialogue TTS. The ability to generate realistic, ambient-rich conversational audio has applications in virtual reality, gaming, and accessible media creation. However, the ease of cloning voices and generating realistic dialogue raises concerns about deepfakes and misinformation, necessitating responsible use guidelines and watermarking techniques, which are not discussed in the paper.
Voice reconstruction using Text-to-Speech (TTS) offers a communication method for people with speech disorders, which aims to retain their speaker identity while improving intelligibility. Previous work generally relies on Mean Opinion Score (MOS) to evaluate naturalness and speaker similarity, but this has limited sensitivity and reliability. We propose an evaluation framework with subjective and objective components. Subjectively, we evaluate perceived intelligibility and speaker identity using Best Worst Scaling (BWS) with situational framing. Objectively, we demonstrate that standard measures fail to predict reconstruction success for highly unintelligible speakers, so we introduce a novel dual-reference distributional measure to assess the trade-off between intelligibility and speaker identity. By evaluating the output of 17 zero-shot TTS systems for 193 speakers, we show that our framework provides a reliable and task-aligned approach for assessing voice reconstruction.
Primary: The Centre for Speech Technology Research, University of Edinburgh
All Institutions: The Centre for Speech Technology Research, University of Edinburgh
This paper presents a rigorous and innovative evaluation framework for TTS voice reconstruction, introducing situational framing in subjective evaluation and a novel dual-reference distributional metric that effectively captures the trade-off between intelligibility and speaker identity, addressing critical gaps in current assessment methodologies for assistive speech technologies.
The paper proposes a comprehensive evaluation framework for Text-to-Speech (TTS) voice reconstruction, a task critical for assisting individuals with speech disorders. The methodology addresses two main gaps in current evaluation practices: the limitations of Mean Opinion Score (MOS) in sensitivity and reliability, and the failure of standard objective metrics to correlate with human perception in this specific domain. Subjectively, the authors employ Best Worst Scaling (BWS) with situational framing to isolate intelligibility from speaker identity reconstruction, a nuanced approach that acknowledges the distinct cognitive tasks involved. Objectively, they introduce a novel dual-reference distributional measure (TTSDS Mean) that combines distances to a high-intelligibility generic corpus and the original disordered speaker's reference. This approach attempts to quantify the trade-off between improving intelligibility and preserving speaker identity, addressing the lack of ground truth in this generative task. The methodological rigor is high, particularly in the experimental design of the subjective study and the innovative application of distributional metrics to a domain where they have not been standardly applied.
The experimental setup is robust, involving 17 zero-shot TTS systems evaluated on 193 speakers from the Speech Accessibility Project (SAP) dataset. The dataset covers diverse speech disorders (Parkinson's, Cerebral Palsy, ALS, Down Syndrome), enhancing the generalizability of the findings. The subjective evaluation involved a significant number of listeners (46-47 per condition) and used rigorous statistical modeling (Plackett-Luce). The results clearly demonstrate that standard objective metrics (WER, PER, Speaker Similarity, UTMOS) fail to predict reconstruction success, particularly for highly unintelligible speakers. The proposed TTSDS Mean metric shows strong correlation with subjective reconstruction rankings (rho=0.81 overall), outperforming speaker similarity (rho=0.75). The analysis of system performance reveals that while some systems (IndexTTS2, Qwen3-TTS) perform well on average, they struggle with severely disordered speech, highlighting the complexity of the task. The large-scale evaluation provides a solid empirical basis for the proposed framework.
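To make the Best Worst Scaling setup concrete, the sketch below scores systems with simple best-minus-worst counting. This is a deliberately simpler stand-in for the Plackett-Luce modeling the study uses, and the system names and trial data are invented for illustration only.

```python
from collections import defaultdict


def bws_scores(trials):
    """Compute best-minus-worst counting scores.

    Each trial is (items_shown, best_item, worst_item). The score of an item
    is (times chosen best - times chosen worst) / times shown, so it lies
    in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, best_item, worst_item in trials:
        for item in items:
            shown[item] += 1
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical listener judgments over four systems.
systems = ("sysA", "sysB", "sysC", "sysD")
trials = [
    (systems, "sysA", "sysD"),
    (systems, "sysB", "sysD"),
    (systems, "sysA", "sysC"),
]
scores = bws_scores(trials)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:+.2f}")
```

Counting scores are easy to inspect but treat every trial as equally informative; a ranking model such as Plackett-Luce additionally accounts for which items were competing in each trial.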
The paper provides detailed information on the datasets, systems, and evaluation protocols. The use of the public SAP dataset and well-known TTS systems enhances reproducibility. The authors provide a project page with audio examples and listening test instructions, which is crucial for verifying the subjective findings. However, the code for the specific implementation of the TTSDS Mean metric and the exact configuration of the 17 TTS systems (especially if they are open-source but require specific versions) might need clarification. The description of the BWS experimental design is sufficiently detailed for replication. The lack of a public code repository is a minor drawback, but the availability of stimuli and detailed methodology mitigates this.
The study focuses primarily on intelligibility and speaker identity, leaving out other important dimensions such as prosody, accent similarity, and naturalness, which are acknowledged as future work. The subjective evaluation relies on listeners who are not the end-users (people with speech disorders), which may introduce bias or lack of ecological validity. The distributional measure relies on the quality of the reference datasets (LibriTTS and the original disordered speech), and its performance may vary with different TTS architectures or training data distributions. The generalization of the TTSDS Mean metric to other languages or speech disorders not covered in the study is not tested. Additionally, the use of zero-shot TTS systems limits the scope to cloning-based approaches, excluding fine-tuned or specialized reconstruction models.
This work has significant potential impact on the development of assistive communication technologies. By providing a more reliable and task-aligned evaluation framework, it can guide researchers and developers in creating better voice reconstruction systems for people with speech disorders. The findings challenge the reliance on standard TTS metrics and advocate for more nuanced, use-case-specific evaluation methods. The framework can be adopted by the broader TTS and speech accessibility communities to standardize evaluations and facilitate fair comparisons between systems. Ultimately, this contributes to improving the quality of life for individuals with speech impairments by enabling more effective and personalized communication aids.
Recent end-to-end models for EEG-guided target speech extraction report impressive results, underscoring potential for neuro-steered hearing technologies. However, our analysis reveals that high within-trial performance can be driven by trial-specific EEG structure that acts as shortcuts for target selection, leading to poor generalization on unseen trials. To overcome this gap, we propose TRUST-TSE, a two-stage framework to mitigate shortcut learning. By introducing contrastive pretraining with attended-speaker negative sampling, we encourage the EEG encoder to capture fine-grained EEG--speech alignment while suppressing trial-identity cues. We also employ a confidence-weighted extraction objective based on EEG--source similarity to guide extraction using the learned representations. Experiments on KUL and DTU datasets show that TRUST-TSE outperforms end-to-end baselines under strict cross-trial protocols, addressing a key reliability bottleneck of existing approaches.
Primary: Seoul National University
All Institutions: Seoul National University, University of Iowa
This paper presents a critical analysis of shortcut learning in EEG-guided speech extraction and proposes a robust two-stage training framework (TRUST-TSE) that significantly improves cross-trial generalization, addressing a major reliability bottleneck in neuro-steered audio technologies.
The paper proposes TRUST-TSE, a two-stage framework designed to mitigate shortcut learning in EEG-guided target speech extraction (TSE). The core methodological contribution lies in the diagnosis that end-to-end models exploit trial-specific EEG artifacts (trial identity) rather than genuine attention signals. To counter this, Stage 1 employs contrastive pretraining with a novel "attended-speaker negative sampling" strategy. This forces the EEG encoder to align with specific speech segments within the same trial, thereby suppressing trial-level shortcuts. Stage 2 uses a confidence-weighted SI-SDR objective, where the weight is derived from the similarity between the frozen EEG embedding and the audio embeddings of the attended vs. ignored sources. This allows the extractor to handle ambiguous or contradictory guidance segments by weighting gradients accordingly. The approach is theoretically sound and addresses a critical flaw in current evaluation protocols for neuro-steered audio systems.
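The Stage 2 objective can be sketched as a standard SI-SDR term scaled by a confidence weight. The SI-SDR definition below is the usual one; the way the weight is computed (a sigmoid over the cosine-similarity margin, with a hypothetical `temperature`) is an assumption, since the summary only says the weight is derived from EEG-to-source similarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB."""
    # Project the estimate onto the reference to get the target component.
    scale = dot(estimate, reference) / (dot(reference, reference) + eps)
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    return 10.0 * math.log10((dot(target, target) + eps) / (dot(noise, noise) + eps))


def cosine(a, b, eps=1e-8):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def confidence(eeg_emb, attended_emb, ignored_emb, temperature=0.1):
    """Assumed weighting: sigmoid of the attended-minus-ignored similarity margin."""
    margin = cosine(eeg_emb, attended_emb) - cosine(eeg_emb, ignored_emb)
    return 1.0 / (1.0 + math.exp(-margin / temperature))


def weighted_loss(estimate, reference, eeg_emb, attended_emb, ignored_emb):
    """Negative SI-SDR, down-weighted when the EEG embedding is ambiguous."""
    weight = confidence(eeg_emb, attended_emb, ignored_emb)
    return -weight * si_sdr(estimate, reference)


reference = [1.0, -1.0, 0.5, 0.0]
estimate = [0.9, -1.1, 0.6, 0.1]
clear_weight = confidence([1.0, 0.0], [1.0, 0.1], [0.0, 1.0])
ambiguous_weight = confidence([1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
print(f"SI-SDR: {si_sdr(estimate, reference):.1f} dB")
print(f"clear weight: {clear_weight:.3f}, ambiguous weight: {ambiguous_weight:.3f}")
```

In this sketch a segment whose EEG embedding sits equally close to both sources gets a weight near 0.5, so it contributes a smaller gradient than a segment with a clear attended-speaker match.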
The authors conduct rigorous experiments on two public datasets, KUL and DTU, under strict cross-trial protocols. They demonstrate that standard end-to-end baselines (NeuroHeed, M3ANet) suffer significant performance drops when evaluated on unseen trials compared to within-trial evaluations, confirming the shortcut hypothesis. TRUST-TSE consistently outperforms these baselines in cross-trial selection accuracy and separation quality (SI-SDR). The paper includes extensive ablation studies validating the components: the specific negative sampling strategy, the confidence weighting mechanism, and the superiority of contrastive embeddings over envelope decoding. Stress tests (EEG shuffling, trial-wise permutation) further confirm that TRUST-TSE relies on meaningful EEG-audio alignment rather than shortcuts. The results are robust across different window lengths and show generalization to unseen subjects.
The paper provides detailed descriptions of the model architectures, training hyperparameters, and data preprocessing steps. The authors explicitly state that the source code is publicly available on GitHub, which significantly enhances reproducibility. The evaluation protocols are clearly defined, including the specific fold constructions to prevent data leakage. The inclusion of supplementary material with additional metrics (PESQ, STOI) and unseen-subject results adds to the transparency.
The primary limitation is the reliance on public datasets which, while standard, are relatively small in scale and diversity compared to large-scale consumer audio datasets. The performance gains, while statistically significant and methodologically important, are modest in absolute terms (e.g., ~15% accuracy gain on KUL). The method assumes a known-subject setting in the main experiments, although unseen-subject results are provided. The confidence weighting mechanism, while effective, introduces a dependency on the quality of the frozen EEG encoder; if Stage 1 fails to capture attention, Stage 2 may struggle.
This work has significant implications for the development of reliable neuro-steered hearing aids and brain-computer interfaces. By highlighting the fragility of current end-to-end models and providing a robust alternative, it pushes the field towards more rigorous evaluation standards and more reliable real-world deployment. It also contributes to the broader understanding of shortcut learning in multimodal representation learning, offering a template for ensuring that models learn task-relevant features rather than spurious correlations.
Aligner-Encoders are recently proposed seq2seq end-to-end ASR models that replace decoder attention by predicting the u-th token directly from the u-th encoder position, so the encoder must learn the alignment internally without cross-attention or a transducer lattice. In practice, this alignment often forms abruptly in the upper layers, making training sensitive and brittle on long utterances. We propose InterAligner, which adds an intermediate Aligner objective so alignment can form progressively across depth, together with an intermediate CTC loss (InterCTC) to stabilize optimization. On LibriSpeech with a 17-layer Conformer, a final-only Aligner reaches 5.0/7.8 WER (test-clean/other). InterCTC improves to 3.4/6.0, and InterAligner further reduces WER to 3.1/5.6 with the largest gains on long utterances.
Primary: NTT, Inc.
All Institutions: NTT, Inc.
The paper proposes InterAligner, an intermediate supervision method for Aligner-Encoders that progressively builds alignment across network depth, significantly improving robustness on long utterances in ASR tasks.
The paper addresses a specific and well-defined problem in Aligner-Encoder architectures: the brittleness of alignment formation in deep layers for long utterances. The proposed solution, InterAligner, introduces a hierarchical supervision strategy. By attaching an intermediate Aligner loss at an intermediate layer (layer 15) using a finer-grained tokenization (smaller vocabulary size) and an intermediate CTC loss (InterCTC) at an earlier layer (layer 12), the authors aim to create a "curriculum" for alignment. This approach is technically sound and leverages established concepts of intermediate supervision (common in deep learning) and multi-granularity learning. The novelty lies in the specific application to the structural constraints of Aligner-Encoders, where the one-to-one mapping requires careful management of sequence lengths and token granularities. The method is relatively simple to implement, adding auxiliary heads and losses without altering the core encoder architecture significantly.
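The Aligner objective and the way auxiliary losses are combined can be sketched in a few lines. This is an illustrative reading of the description above, not the paper's code: the loss weights are hypothetical, frames beyond the last token are simply ignored, and the InterCTC term is passed in as a precomputed number rather than implemented.

```python
import math


def log_softmax(logits):
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_sum for x in logits]


def aligner_loss(frame_logits, tokens):
    """Mean cross-entropy where token u is read directly from encoder position u."""
    if len(tokens) > len(frame_logits):
        raise ValueError("need at least one encoder position per token")
    nll = [-log_softmax(frame_logits[u])[tok] for u, tok in enumerate(tokens)]
    return sum(nll) / len(nll)


def total_loss(final_aligner, inter_aligner, inter_ctc,
               w_aligner=0.3, w_ctc=0.3):
    """Final Aligner loss plus weighted intermediate losses (weights are assumed)."""
    return final_aligner + w_aligner * inter_aligner + w_ctc * inter_ctc


# Three encoder positions, vocabulary of four, two target tokens.
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
peaked_logits = [[9.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 9.0], [0.0, 0.0, 0.0, 0.0]]
tokens = [0, 3]
print(f"uniform: {aligner_loss(uniform_logits, tokens):.3f}")
print(f"peaked:  {aligner_loss(peaked_logits, tokens):.3f}")
```

Only the final head is needed at inference, so the auxiliary terms in `total_loss` affect training cost but not decoding.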
The experimental evaluation is robust and comprehensive. The authors use standard benchmarks (LibriSpeech and Common Voice English) and a strong baseline (17-layer Conformer Aligner-Encoder). The results show consistent improvements: InterCTC provides a significant boost, and InterAligner provides further gains, particularly on long utterances (>21s), which validates the core hypothesis. The ablation studies are thorough, investigating the impact of vocabulary size, loss weights, and layer placement. The statistical significance testing adds credibility. The attention visualization provides qualitative support for the progressive alignment hypothesis. However, the gains, while consistent, are moderate in absolute terms: the final comparison is 3.1 vs. 5.0 WER on test-clean, and the paper itself notes difficulty reproducing the baseline. The comparison is primarily internal (ablation), with limited comparison to other state-of-the-art ASR systems like RNN-T or standard AEDs, though the paper claims competitiveness.
The paper provides sufficient detail for reproduction. The architecture (Conformer-L), training hyperparameters (learning rate, warmup, batch size), and dataset details are clearly stated. The specific layer indices for intermediate losses (12 and 15) and the tokenization sizes (256 vs 1024) are provided. The use of model averaging is standard. The code is not explicitly linked, but the methodology is described with enough precision that implementation should be feasible for researchers in the field.
The primary limitation is the incremental nature of the contribution. It improves an existing architecture but does not propose a fundamentally new paradigm. The gains are most pronounced on long utterances, suggesting limited utility for short-form speech. The method adds computational overhead during training due to the auxiliary heads and losses, though inference remains unchanged (using only the final head). The paper acknowledges the difficulty in reproducing the baseline Aligner-Encoder results, which might make the absolute WER numbers less comparable to other works if the baseline was under-optimized.
This work contributes to the field of Automatic Speech Recognition by making Aligner-Encoders more robust and practical, especially for long-form audio. This can benefit applications requiring low-latency or lightweight decoding where Aligner-Encoders are advantageous. The technique of progressive alignment supervision could potentially be applied to other sequence-to-sequence models with similar structural constraints.
Large audio-language models (LALMs) can reason about audio, yet it remains unclear whether they can perform comparative judgments between two speech signals along emotional, environmental, linguistic, prosodic, and interpersonal dimensions. We study this question in the context of speech emotion recognition (SER), where the model determines which utterance exhibits higher arousal, valence, or dominance. We introduce a reasoning-guided ordinal SER framework that conditions an LALM on paired speech inputs. The model is trained using reasoning traces generated from both semantic audio descriptions and acoustic evidence derived from GeMAPS features, enabling interpretable comparative decisions. Beyond direct supervision, we also employ direct preference optimization to encourage stronger separation for emotional differences. Experiments show that the proposed framework improves preference prediction while requiring only 5% of the training data used by conventional ordinal SER systems.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, The University of Texas at Dallas, NVIDIA
This paper presents a significant advancement in comparative audio reasoning by introducing a reasoning-guided ordinal SER framework that leverages both semantic and acoustic evidence, demonstrating superior data efficiency and cross-domain robustness compared to conventional SSL-based methods.
The paper proposes a novel framework for adapting Large Audio-Language Models (LALMs) to comparative speech emotion recognition (SER). The core innovation lies in the use of "reasoning-guided" training, where the model generates intermediate reasoning traces based on both semantic audio descriptions and discrete GeMAPS acoustic features before making a pairwise comparison. This approach integrates explicit acoustic evidence (GeMAPS) with high-level semantic reasoning, addressing a gap in current LALMs that often rely on implicit feature representations. The use of Direct Preference Optimization (DPO) with constructed correct and incorrect reasoning traces is a sophisticated application of preference learning to enhance both accuracy and interpretability. However, the reliance on a separate large reasoning model (Qwen3-Next-80B) to generate these traces introduces significant computational overhead and potential error propagation if the reasoning model hallucinates or misinterprets the acoustic features. The methodology is sound and theoretically well-motivated, bridging the gap between low-level acoustic analysis and high-level linguistic reasoning.
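For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss is sketched below. The inputs are summed log-probabilities of the correct ("chosen") and incorrect ("rejected") reasoning traces under the trained policy and a frozen reference model; the `beta` value is a common default and is an assumption here, not a setting taken from the paper.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin)).

    The margin measures how much more the policy prefers the chosen trace over
    the rejected one, relative to the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: no preference has been learned yet.
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has moved probability mass towards the chosen trace.
improved = dpo_loss(-5.0, -15.0, -10.0, -10.0)
# Policy has moved towards the rejected trace instead.
worsened = dpo_loss(-15.0, -5.0, -10.0, -10.0)
print(f"neutral={neutral:.3f} improved={improved:.3f} worsened={worsened:.3f}")
```

The loss starts at log 2 when the policy equals the reference and falls as the policy separates correct from incorrect traces, which matches the stated aim of encouraging stronger separation for emotional differences.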
The experimental setup is robust, utilizing the MSP-Podcast v2.0 corpus as the primary dataset and evaluating cross-domain generalization on BIIC-Podcast and WHiSER. The comparison against strong self-supervised learning (SSL) baselines (WavLM, HuBERT with RankNet/RankList) is appropriate and highlights the data efficiency of the LALM approach. The results demonstrate that the proposed DPO-CoT method significantly outperforms baselines in preference accuracy, particularly in cross-domain settings. The ablation studies effectively isolate the contributions of SFT, DPO, and reasoning traces. The finding that reasoning traces improve performance specifically under DPO (but not SFT) is an interesting and valuable insight into how preference optimization interacts with chain-of-thought reasoning. The data efficiency claim (5% of training data) is compelling and well-supported by the results.
The paper provides sufficient detail regarding the model architectures (Qwen2.5-Omni-3B, Qwen3-Next-80B), feature extraction (GeMAPS), and training protocols (LoRA ranks, DPO parameters). The dataset partitions and preprocessing steps for GeMAPS features are described. However, the specific prompts used for the reasoning model and the exact criteria for "discretizing" GeMAPS features into qualitative levels (low/medium/high) could be more explicitly defined to ensure exact reproducibility. The reliance on a proprietary or specific version of the "Qwen3-Next" model might also pose reproducibility challenges if the weights or specific inference configurations are not publicly available.
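One way such a discretization criterion could be stated explicitly is with tercile cut points computed on a reference set. The sketch below is purely illustrative: the tercile rule, the function names, and the toy feature values are assumptions, since the paper's actual criteria are not specified.

```python
def tercile_thresholds(values):
    """Return (lower, upper) cut points that split the sorted values into thirds."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]

def discretize(value, thresholds):
    """Map a continuous feature value to a qualitative level."""
    lower, upper = thresholds
    if value < lower:
        return "low"
    if value < upper:
        return "medium"
    return "high"

# Hypothetical toy values for one feature over a reference set (not real data)
reference_values = [22.0, 24.5, 25.0, 27.5, 28.0, 30.0, 31.5, 33.0, 35.0]
cuts = tercile_thresholds(reference_values)
levels = [discretize(v, cuts) for v in (23.0, 29.0, 34.0)]
```

Publishing the cut points (or the rule that produces them) per feature would be enough for others to regenerate identical qualitative descriptions.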
A primary limitation is the computational cost and latency introduced by generating reasoning traces using a large 80B parameter model. This makes the approach less suitable for real-time applications compared to direct SSL-based ranking. The quality of the reasoning traces is dependent on the capability of the reasoning model; if the reasoning model fails to correctly interpret the GeMAPS features or the audio description, the subsequent DPO training might reinforce incorrect reasoning patterns, although the paper attempts to mitigate this by regenerating traces. Furthermore, the evaluation is limited to emotional attributes; while the authors suggest generalizability, the specific mechanisms for handling non-emotional comparative tasks (e.g., speaker identity, environmental noise) are not empirically validated in this work.
This work contributes to the development of more interpretable and robust audio understanding systems. By enabling LALMs to perform comparative reasoning with explicit justifications, it enhances trust and diagnostic capabilities in applications such as mental health monitoring, human-computer interaction, and audio content moderation. The approach of grounding LLM reasoning in structured acoustic features could inspire similar frameworks for other audio domains, such as music analysis or environmental sound classification. However, the potential for bias in the reasoning traces, if the underlying models or GeMAPS feature interpretations are biased, remains a concern that should be addressed in future deployments.
Interactive music and live performance rely on real-time human expression, but modern generative music AI remains largely absent from this domain due to its prohibitive inference latency and offline rendering paradigm. To provide pioneering musicians with a novel medium for interactive composition, we must fundamentally transform these static models into dynamic, playable instruments. In this paper, we propose a framework that bridges this gap. To achieve the low latency required for live interaction without sacrificing structural coherence, we formulate distillation within a streaming autoregressive latent space. Our approach eliminates the need for expensive paired audio-latent datasets by utilizing prompt-only inputs to synthesize teacher-guided, chunk-wise trajectories on the fly. Because live instruments require high acoustic fidelity, we introduce music-aware consistency objectives, which combine latent, spectral, and temporal-difference losses, to preserve crucial qualities like timbre, transients, and rhythmic stability during accelerated single-step streaming generation. Implemented via parameter-efficient adaptation, our distillation reduces generation steps to achieve a low real-time factor. Crucially, by operating as a continuous autoregressive stream, the system can seamlessly assimilate dynamic human inputs on the fly, allowing users to instantly steer the musical trajectory without interrupting the audio flow. Ultimately, this work recontextualizes generative text-to-music models not as passive prompt-and-wait systems, but as responsive instruments, opening new frontiers for live human-AI musical co-creation.
Primary: ZhuoLab
All Institutions: ZhuoLab
This paper presents a compelling framework for real-time interactive music generation by combining data-free consistency distillation with music-aware objectives, effectively bridging the gap between high-quality offline generation and low-latency live performance.
The paper proposes a "data-free streaming consistency distillation" framework to enable real-time, interactive music generation. The core technical contribution lies in reformulating text-to-music generation as a continuous autoregressive process in latent space, where a frozen teacher model generates chunk-wise trajectories online (without precomputed paired data) to train a one-step student model. The methodology introduces "music-aware consistency objectives" that combine latent reconstruction loss with spectral (RFFT magnitude) and temporal-difference (L1 of first-order derivative) losses. This approach aims to preserve timbral fidelity and rhythmic stability during aggressive step reduction (from multi-step to single-step). The use of parameter-efficient adaptation (LoRA) for the student model is a standard but effective choice. The novelty is moderate; while the specific combination of data-free online teacher rollout with spectral/temporal consistency losses for *interactive* music streaming is a distinct contribution, the underlying techniques (consistency distillation, LoRA, streaming inference) are well-established in the broader generative AI community. The application to the specific domain of "playable instruments" rather than just fast offline rendering provides a clear problem-solution fit.
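The three-term objective described above can be sketched on a single latent channel as follows. This is a minimal illustration, not the authors' implementation: the mean-squared latent term, the loss weights, and the naive DFT (used only to keep the example dependency-free) are all assumptions.

```python
import cmath
import math

def rfft_magnitude(x):
    """Magnitudes of the non-negative-frequency DFT bins (naive O(n^2) DFT)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]

def mean_abs_diff(a, b):
    """Mean absolute (L1) difference between two equal-length sequences."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def first_difference(x):
    """First-order temporal difference x[t+1] - x[t]."""
    return [x[t + 1] - x[t] for t in range(len(x) - 1)]

def music_aware_loss(student, teacher, w_latent=1.0, w_spec=0.5, w_temp=0.5):
    """Latent + spectral + temporal-difference loss; weights are hypothetical."""
    latent = sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
    spectral = mean_abs_diff(rfft_magnitude(student), rfft_magnitude(teacher))
    temporal = mean_abs_diff(first_difference(student), first_difference(teacher))
    return w_latent * latent + w_spec * spectral + w_temp * temporal
```

The spectral term penalizes mismatched frequency content and the temporal-difference term penalizes mismatched frame-to-frame changes, which is where timbre and transient errors would be expected to show up.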
The experimental setup is rigorous for the domain. The authors evaluate on a SongDescriber-derived benchmark, using both objective metrics (CLAP, PaSST-KLD, OpenL3-FD) and subjective metrics (MOS for quality, responsiveness, steerability, co-creation). They provide a detailed ablation study on the loss components (latent-only vs. full music-aware) and chunk durations. The results demonstrate that the full music-aware objective significantly improves objective quality scores (e.g., CLAP score increase from 0.329 to 0.361) and subjective interaction scores compared to baselines. The latency analysis is thorough, showing a real-time factor (RTF) well below 1.0 and low startup latency. However, the reliance on subjective metrics for the "interactive" aspect, while necessary, introduces variance. The comparison to "Ground Truth" and "Teacher Offline" provides a good baseline for quality degradation due to distillation. The evaluation of "control latency" is a strong point, addressing a key requirement for the stated use case.
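For readers unfamiliar with the metric, the real-time factor is simply generation time divided by the duration of audio produced. The numbers below are hypothetical and are not results from the paper.

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    return processing_seconds / audio_seconds

# Hypothetical example: a 2.0 s audio chunk generated in 0.25 s
rtf = real_time_factor(0.25, 2.0)
keeps_up_with_playback = rtf < 1.0
```

An RTF below 1.0 means each chunk is ready before the previous one finishes playing, which is the minimum requirement for uninterrupted streaming.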
The paper provides sufficient implementation details for reproduction, including the base model (ACE-Step 1.5 XL-Turbo), LoRA hyperparameters (rank 64, scaling 128, dropout 0.1), training steps (2,000), and loss function formulations. The data-free nature of the training (using prompt-only online synthesis) removes the barrier of needing specific paired audio-latent datasets, which aids reproducibility. However, the exact "SongDescriber benchmark" filtering and the specific teacher rollout parameters (e.g., ODE solver steps) are mentioned but could be more precisely defined for exact replication. The code is not publicly linked in the provided text, which is a minor hindrance.
The paper acknowledges that the system operates at a "phrase-level" rather than note-level, which limits fine-grained musical control. The reliance on semantic prompts and high-level controls (energy, density) may not satisfy professional musicians seeking precise compositional control. The "data-free" aspect means the student is limited by the teacher's quality and the diversity of the prompt pool; it does not learn from a curated dataset of high-fidelity musical structures, which might limit the upper bound of quality compared to models trained on massive paired datasets. Additionally, the subjective evaluation sample size (N=20) is relatively small for robust statistical significance in human-computer interaction studies.
This work has significant potential impact on the field of human-AI musical co-creation. By reducing inference latency to real-time levels and enabling continuous steering, it transforms generative music models from static content generators into dynamic instruments. This could lower the barrier for live performance using AI and open new avenues for interactive art and therapy. However, it also raises questions about the devaluation of human musical skill if AI can instantly generate coherent accompaniment, and the potential for misuse in generating copyrighted-style music without clear attribution or compensation mechanisms, although the paper focuses on the technical feasibility.
We present ZONOS2 8B, our latest TTS model, which achieves state-of-the-art naturalness, prosody, and voice cloning fidelity. We improve upon Zonos-v0.1 across scale, data, and training recipe. We scale the model from 1.6B to 8B total parameters (900M active) with a novel mixture-of-experts (MoE) backbone, improving inference latency and throughput. We expand our training corpus from 200K to over 6M hours using a new data processing pipeline, and we simplify our post-training and conditioning recipes to improve naturalness and voice cloning fidelity. We evaluate ZONOS2 8B on quality, speaker similarity, WER, and ZTTS1-Eval, our novel TTS benchmark, where it performs competitively with state-of-the-art systems while maintaining good streaming latency. We release our model weights and example inference code under an Apache 2.0 license on GitHub and Hugging Face.
Primary: Zyphra
All Institutions: Zyphra
ZONOS2 is a significant engineering achievement in open-source TTS, combining MoE efficiency with robust multilingual and voice-cloning capabilities, while establishing a new, more rigorous evaluation standard for the field.
The paper presents ZONOS2, an 8B parameter (900M active) MoE transformer-based Text-to-Speech model. The methodology is a robust engineering synthesis rather than a radical theoretical breakthrough. Key technical components include: 1) **MoE Architecture**: Adapting Mixture-of-Experts (specifically the ZAYA router design) to TTS to balance scale and latency. The authors note significant instability in MoE balancing for audio data compared to text, mitigated by dense start/end layers and top-2 routing. 2) **Tokenization**: Use of DAC (Descript Audio Codec) with a specific delay pattern to handle multi-codebook autoregressive generation. 3) **Text Input**: Shift from phonemes to byte-level tokenization to improve multilingual robustness and avoid G2P errors, a pragmatic choice supported by scaling laws. 4) **Speaker Conditioning**: Use of ECAPA-TDNN embeddings projected via LDA to reduce nuisance factor leakage (noise, duration) and prevent overfitting/shortcutting. 5) **Conditioning**: Introduction of speaking-rate and quality tokens, along with data augmentation during training to decouple speaker identity from acoustic quality. The approach is technically sound and addresses known pain points in open-source TTS (latency, multilingual G2P failure, voice cloning fidelity).
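The delay pattern mentioned under tokenization can be illustrated with the commonly used variant in which codebook k is shifted right by k steps, so that at each autoregressive step the model predicts one token per codebook while coarser codebooks stay ahead of finer ones. Whether ZONOS2 uses exactly this shift is an assumption; the padding id and function names are hypothetical.

```python
PAD = -1  # hypothetical padding id for positions with no token yet

def apply_delay_pattern(codes, pad=PAD):
    """codes: K codebook rows of length T. Returns K rows of length T + K - 1,
    with row k shifted right by k positions."""
    num_codebooks = len(codes)
    return [[pad] * k + list(row) + [pad] * (num_codebooks - 1 - k)
            for k, row in enumerate(codes)]

def undo_delay_pattern(delayed):
    """Invert apply_delay_pattern, recovering the time-aligned codes."""
    num_codebooks = len(delayed)
    length = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + length] for k, row in enumerate(delayed)]

codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay_pattern(codes)
```

Reading `delayed` column by column gives the tokens emitted at each step; undoing the shift after generation restores the frames the codec decoder expects.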
The evaluation is comprehensive and addresses current gaps in TTS benchmarks. The introduction of **ZTTS1-Eval** is a significant contribution, providing a benchmark with 17 languages, in-the-wild spontaneous speech, and modern scoring models (Qwen3-ASR, ReDimNet, MSR-UTMOS) compared to outdated stacks in Seed-TTS-Eval. Results show ZONOS2 is competitive with closed-source leaders (ElevenLabs, Cartesia) and open-source peers (Fish S2, Qwen3 TTS) in speaker similarity and naturalness (UTMOS). It excels in prosodic distribution (TTSDS2) and diversity (DS-WED). However, Word Error Rate (WER) is notably higher than some competitors (e.g., Qwen3 TTS, ElevenLabs) in several languages, suggesting a trade-off between naturalness/cloning and strict intelligibility, which is acknowledged. The "Quality Mode" ablation shows a clear trade-off between intelligibility and speaker similarity.
The paper provides model weights, inference code, and the benchmark dataset on GitHub and Hugging Face under Apache 2.0. The training pipeline details (data sources, ASR ensemble filtering, multi-stage training) are described in sufficient detail for reproduction by a team with similar computational resources. The specific hyperparameters for the MoE balancing and the exact composition of the 6M-hour dataset are less transparent but the general recipe is clear.
1) **Intelligibility**: WER scores are suboptimal compared to top-tier closed-source models, particularly in non-English languages and "hard" English sets. 2) **MoE Instability**: The authors admit that MoE balancing for audio is significantly harder than for text, implying potential training fragility or sensitivity to hyperparameters. 3) **Latency Claims**: While MoE improves throughput, the absolute latency for an 8B model compared to smaller specialized models (1.7B) is not explicitly quantified in terms of RTF (Real-Time Factor) in the provided text, though claimed to be good for streaming. 4) **Benchmark Bias**: As a new benchmark, ZTTS1-Eval's adoption and community trust are unproven.
This work contributes to the democratization of high-quality TTS by releasing a strong open-source model. The new benchmark sets a higher standard for evaluation, pushing the field away from outdated metrics and single-language focus. It highlights the importance of prosody and diversity in evaluation. The release of such a large model also raises questions about the environmental cost of training and the potential for misuse in voice cloning/deepfakes, though the Apache 2.0 license and open nature encourage community oversight and safety research.
Existing Reinforcement Learning (RL) research for Text-to-Speech (TTS) focuses on large language models (LLMs), leaving Flow-Matching (FM) under-explored. We present FlowTTS-GRPO, an online RL framework for FM-based TTS. By converting ordinary differential equation (ODE) trajectories into stochastic differential equation (SDE) paths, our method enables direct fine-tuning of open-source FM models without auxiliary models. We show that a weighted reward combination converges faster than a probabilistic scheme, and identify three practical optimizations: omitting classifier-free guidance (CFG) during training accelerates convergence; synthesizing hard cases improves robustness; and applying RL to the FM component enhances audio-detail metrics. Experiments on CosyVoice 3.0 and F5-TTS demonstrate objective and subjective preference gains in speaker similarity and perceptual quality, with F5-TTS also improving intelligibility.
Primary: Alibaba Group
All Institutions: Alibaba Group, Tongyi Lab
The paper successfully adapts Group Relative Policy Optimization (GRPO) to Flow-Matching based TTS by introducing stochasticity via SDE conversion, demonstrating that online RL can significantly enhance speaker similarity and perceptual quality in both hybrid and pure FM architectures without auxiliary models.
The paper proposes FlowTTS-GRPO, an online reinforcement learning framework tailored for Flow-Matching (FM) based Text-to-Speech (TTS) models. The core technical novelty lies in adapting the Group Relative Policy Optimization (GRPO) algorithm, originally designed for Large Language Models (LLMs), to continuous diffusion/flow models. This is achieved by converting the deterministic Ordinary Differential Equation (ODE) sampling trajectory into a Stochastic Differential Equation (SDE) path, thereby introducing the necessary stochasticity for policy gradient estimation. The authors formulate the FM decoding process as a Markov Decision Process (MDP) where actions are velocity predictions. They employ a multi-objective reward structure combining speaker similarity (SS), ASR-based intelligibility (CER/WER), and perceptual quality (DNSMOS). A key methodological contribution is the analysis of reward fusion strategies, demonstrating that weighted combination with standard deviation normalization converges faster and more stably than probabilistic assignment. Additionally, they identify practical optimizations such as omitting Classifier-Free Guidance (CFG) during training to enhance exploration and using hard-case synthesis to improve robustness. The approach is model-agnostic regarding the FM backbone, successfully applied to both LLM-FM hybrid (CosyVoice 3.0) and pure FM (F5-TTS) architectures.
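A minimal sketch of how a weighted, standard-deviation-normalized reward combination can yield group-relative advantages is shown below. It assumes each reward is z-scored within the rollout group before the weighted sum; the paper's exact normalization order, the weights, and the function names are assumptions. Lower-is-better metrics such as WER would need to be negated before being passed in.

```python
import statistics

def zscore(values, eps=1e-8):
    """Normalize rewards within one rollout group to zero mean, unit std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]

def group_advantages(rewards_by_name, weights):
    """rewards_by_name: {reward name: [reward per rollout in the group]}.
    weights: {reward name: float}. Returns one advantage per rollout."""
    group_size = len(next(iter(rewards_by_name.values())))
    advantages = [0.0] * group_size
    for name, values in rewards_by_name.items():
        for i, z in enumerate(zscore(values)):
            advantages[i] += weights[name] * z
    return advantages

# Hypothetical rewards for a group of four rollouts of the same prompt
rewards = {"speaker_sim": [0.6, 0.7, 0.8, 0.9], "dnsmos": [3.0, 3.2, 3.1, 3.5]}
adv = group_advantages(rewards, {"speaker_sim": 0.5, "dnsmos": 0.5})
```

Because advantages are computed relative to the group, no learned value network is needed, which matches the review's point about avoiding auxiliary models.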
The experimental evaluation is comprehensive, covering two major open-source TTS systems (CosyVoice 3.0 and F5-TTS) and multiple languages (Chinese, English, and several European languages). The authors utilize the Seed-TTS-Eval benchmark, reporting improvements in speaker similarity (surpassing closed-source baselines like Seed-TTS on SS1 for Chinese), perceptual quality (DNSMOS), and intelligibility (WER/CER). The inclusion of subjective A/B preference tests strengthens the claims by correlating objective metrics with human judgment. Ablation studies effectively isolate the impact of reward combination strategies, CFG omission, and hard-case training. The results demonstrate that RL on the FM component primarily enhances acoustic details and timbre, while RL on the LLM component (as seen in comparative baselines) is more critical for semantic alignment, providing valuable architectural insights.
The paper provides sufficient detail for reproduction, including the MDP formulation, SDE conversion equations, reward definitions, and training hyperparameters (LoRA ranks, noise levels, window sizes). The use of widely available models (CosyVoice 3.0, F5-TTS, Whisper, Paraformer) and datasets (WenetSpeech4TTS, LibriTTS) facilitates replication. The implementation details of the SDE windowing and the exact weighting coefficients for the multi-objective reward are provided, but the code is not publicly linked in the text. The distinction between training and inference CFG usage is clearly explained.
The primary limitation is the computational cost associated with online RL, requiring multiple rollouts per prompt and significant GPU resources (8 GPUs mentioned). The method relies on proxy rewards (DNSMOS, ASR, SS embeddings) which may not perfectly align with all aspects of human perception, although subjective tests mitigate this concern. The improvement in intelligibility for the LLM-FM hybrid (CosyVoice) is limited because the semantic content is determined by the frozen LLM front-end; RL on the FM can only refine acoustic realization, not correct semantic errors. Furthermore, the "hard case" synthesis strategy, while effective, relies on heuristic augmentations that may not cover all edge cases in natural speech.
This work significantly advances the field of generative audio by bridging the gap between discrete token-based RL (used in LLMs) and continuous flow-based generation. It enables the fine-tuning of high-quality, open-source FM models without the need for complex auxiliary reward models or value networks, democratizing access to advanced RL techniques in TTS. The findings on reward conflicts and optimization strategies provide generalizable insights for other continuous generative tasks. The potential for improved voice cloning and natural speech synthesis has broad applications in accessibility, entertainment, and human-computer interaction, though it also raises concerns regarding voice impersonation and deepfakes.
Speech Language Models achieve reasoning capabilities, but are often hindered by massive parameter counts and a tendency to prioritize linguistic priors over acoustic features. While contrastive decoding enhances grounding by contrasting audio-aware and text-only logits, it increases inference latency. We propose Contrastive Audio-Aware Distillation (CAAD), a framework that internalizes the teacher's contrastive reasoning into the student model's weights. To overcome the high computational training overhead in the dual-path token-by-token contrastive distillation process, we introduce a synchronized teacher-forcing strategy. Anchored by unified Pseudo-Ground Truths, this mechanism enables simultaneous full-sequence generation of the teacher's contrastive distributions, allowing the student to distill the audio-aware signal efficiently. Overall, CAAD yields a ~8% relative gain over standard knowledge distillation on Dynamic-SUPERB and successfully reduces linguistic bias in MCR-BENCH.
Primary: National Taiwan University
All Institutions: Graduate Institute of Electrical Engineering, National Taiwan University, Graduate Institute of Communication Engineering, National Taiwan University, NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE), National Taiwan University
The paper presents a novel and effective method for distilling contrastive decoding into Speech Language Models, addressing critical efficiency and bias challenges in multimodal AI through a synchronized teacher-forcing strategy anchored by metadata.
The paper proposes Contrastive Audio-Aware Distillation (CAAD), a method to compress Speech Language Models (SLMs) by distilling the benefits of Contrastive Decoding (CD) into a student model's weights. The core innovation is a "synchronized teacher-forcing strategy" that uses a "Pseudo-Ground Truth" (Pseudo-GT) generated from text metadata to anchor both the audio-aware (positive) and text-only (negative) teacher passes. This allows for parallel training, avoiding the sequential bottleneck of standard autoregressive contrastive decoding. The approach effectively transforms a test-time inference technique (CD) into a training-time objective. While the concept of distilling decoding strategies is not entirely new, the specific mechanism of using metadata-anchored pseudo-GT to enable parallel contrastive distillation in SLMs is a novel and practical engineering contribution. It addresses a genuine computational bottleneck in applying CD to large models.
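The idea of turning contrastive decoding into a training target can be sketched per token position as follows. The sketch assumes the common contrastive form (1 + alpha) * audio-aware logits minus alpha * text-only logits and a KL distillation loss; the paper's exact formulation, the value of `alpha`, and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_target(audio_logits, text_only_logits, alpha=1.0):
    """Teacher distribution that amplifies what the audio-aware pass adds
    beyond the text-only pass (alpha is a hypothetical contrast weight)."""
    mixed = [(1 + alpha) * a - alpha * t
             for a, t in zip(audio_logits, text_only_logits)]
    return softmax(mixed)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), used here as the distillation loss at one position."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy logits over a 3-token vocabulary (illustrative only)
audio = [2.0, 1.0, 0.0]
text_only = [2.0, 0.0, 0.0]
target = contrastive_target(audio, text_only)
```

With teacher forcing on a shared pseudo-ground-truth sequence, both teacher passes can be run once over the full sequence, and this target can be computed at every position in parallel instead of token by token.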
The experimental evaluation is robust, utilizing the Dynamic-SUPERB benchmark and the MCR-BENCH for conflict resolution. The results demonstrate that the CAAD-distilled 3B student model significantly outperforms standard KD and even the greedy decoding baseline of the 8B teacher on several metrics, particularly in paralinguistic tasks (PAR) and conflict resolution (MCR-BENCH Shift). The ablation studies effectively validate the components of the method, showing that metadata-based Pseudo-GT outperforms audio-based synchronization and that the contrastive weight is crucial for mitigating linguistic bias. The comparison against Contrastive Decoding at inference time highlights the efficiency gain (single-path vs. dual-path) while acknowledging the performance gap, which is a fair and honest assessment.
The paper provides sufficient detail regarding the model architectures (Llama-3.2-8B teacher, Llama-3.2-3B student), training configurations (learning rate, optimizer, loss weights), and datasets (DeSTA2, Dynamic-SUPERB, MCR-BENCH). The code repository is linked, which significantly aids reproducibility. The description of the Pseudo-GT generation process is clear enough for replication.
The primary limitation is the dependency on the quality of the Pseudo-GT. If the metadata-derived text is inaccurate or lacks nuance, the distillation signal may be noisy. Additionally, the method assumes that the teacher model's contrastive reasoning is transferable via KL divergence, which may not capture all nuances of the teacher's decision boundary. The performance of the student, while improved, still lags behind the teacher's contrastive decoding performance, indicating that some information is lost in the distillation process. The paper also notes that the efficacy is bounded by the student's capacity.
This work contributes to the democratization of SLMs by enabling efficient, low-latency inference without sacrificing the robustness offered by contrastive methods. By mitigating linguistic bias, it promotes more reliable multimodal AI systems, which is crucial for applications in accessibility, customer service, and interactive agents where audio cues are critical. The method is generalizable to other multimodal LLMs beyond speech.
Continuous Variational Autoencoders (VAEs) serve as the fundamental continuous tokenizer for modern neural audio generation systems, enabling high-fidelity reconstruction while providing a compact, smooth latent space for downstream generative priors. However, continuous VAEs face a fundamental conflict among compression rate, reconstruction fidelity, and latent space topology, which we formalize as the Rate-Distortion-Regularity Trilemma. This trilemma stems from a topological mismatch: the isotropic Gaussian prior in standard VAEs imposes a flat latent geometry that fails to accommodate audio's hierarchical nature, where low-frequency components are structured and compressible while high-frequency components are stochastic and incompressible, leading to disordered information packing in which crucial semantic features are interleaved with high-entropy noise. To address this challenge, we propose Structured Topology-Aware Regularization (STAR), a general training strategy that reshapes latent space geometry by imposing a growth-based constraint field, routing structural and textural information into channel subspaces with matching capacities. STAR is applicable to any VAE architecture and effectively resolves the trilemma, as demonstrated in CNN-based VAEs. We further present STAR-VAE, which combines STAR with a hybrid CNN-Mamba architecture for local feature extraction and linear-complexity global context modeling, and STAR-Gen, an LLM-based Flow Matching framework that leverages STAR-VAE's structured latent space for high-fidelity generation without vector quantization artifacts. Experiments across diverse audio domains show that STAR-VAE achieves state-of-the-art reconstruction fidelity and enhanced semantic information preservation, while the structured latent space improves both traditional diffusion models and STAR-Gen for text-to-audio generation.
Primary: Hong Kong University of Science and Technology
All Institutions: Hong Kong University of Science and Technology, Tongyi Fun Team, Alibaba Group
The paper presents a significant advancement in continuous audio tokenization by introducing Structured Topology-Aware Regularization (STAR), which effectively resolves the Rate-Distortion-Regularity Trilemma in VAEs through a theoretically motivated capacity gradient, leading to state-of-the-art performance in both audio reconstruction and LLM-based flow matching generation.
The paper proposes Structured Topology-Aware Regularization (STAR), a novel regularization strategy for Variational Autoencoders (VAEs) that replaces the standard isotropic Gaussian prior with a channel-wise structured constraint field. The core theoretical contribution is the formalization of the "Rate-Distortion-Regularity Trilemma," arguing that isotropic priors cause "disordered information packing" in audio VAEs. STAR addresses this by imposing a Gamma-Growth function on the KL divergence weights, creating a "capacity gradient" that aligns latent channel capacity with the spectral hierarchy of audio (low-frequency structure vs. high-frequency texture). The authors combine this with a hybrid CNN-Mamba architecture (STAR-VAE) for efficient global context modeling and introduce STAR-Gen, an LLM-based Flow Matching framework for text-to-audio generation. The methodology is well-motivated, theoretically grounded in information theory (power-law decay), and technically sound, offering a generalizable solution to a known problem in continuous tokenization.
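To illustrate what a channel-wise capacity gradient on the KL term could look like, the sketch below weights each latent channel's KL divergence by a power-law growth schedule. The functional form, the direction of growth across channels, the weight range, and the `gamma` value are all hypothetical stand-ins; the paper's actual Gamma-Growth function is not reproduced here.

```python
import math

def growth_weights(num_channels, w_min=1e-4, w_max=1e-2, gamma=2.0):
    """Hypothetical per-channel KL weights rising from w_min to w_max.
    Channels with small weights are weakly regularized (more capacity)."""
    return [w_min + (w_max - w_min) * (c / (num_channels - 1)) ** gamma
            for c in range(num_channels)]

def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ) for a single latent channel."""
    return 0.5 * (mu * mu + math.exp(logvar) - 1.0 - logvar)

def weighted_kl(mus, logvars, weights):
    """Channel-weighted KL regularizer replacing the uniform (isotropic) one."""
    return sum(w * gaussian_kl(m, lv) for w, m, lv in zip(weights, mus, logvars))

weights = growth_weights(8)
```

Setting all weights equal recovers the usual isotropic KL penalty, which makes the contrast with the structured constraint easy to see.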
The experimental evaluation is comprehensive and robust. The authors provide extensive ablation studies validating the STAR regularization, including comparisons of different growth functions (Step, Linear, Gamma) and hyperparameters. They demonstrate that STAR-VAE achieves state-of-the-art reconstruction fidelity on AudioCaps and Song Describer datasets, outperforming strong baselines like Stable Audio Open and HiFi-VAE. Crucially, they validate the "Reconstruction Drift" phenomenon in high-capacity encoders without STAR, reinforcing their theoretical claims. For generation, STAR-Gen achieves SOTA performance on text-to-audio tasks, significantly outperforming diffusion-based baselines in perceptual quality (FD_openl3) and semantic alignment (CLAP). The inclusion of human evaluation (MOS) and linear probing for semantic information adds significant weight to the empirical claims.
The paper provides detailed implementation specifications, including dataset preprocessing steps, architectural details (ResNet blocks, Mamba dimensions, normalization strategies), and training configurations (loss weights, optimizer settings, hardware). The two-stage training strategy (pre-training with isotropic KL, fine-tuning with STAR) is clearly described. The project page URL is provided, suggesting code or demos may be available, though the GitHub link is not explicitly in the text. The level of detail is sufficient for reproduction by other researchers in the field.
The paper focuses primarily on audio generation and reconstruction. While STAR is claimed to be architecture-agnostic, the empirical validation is limited to audio domains. The integration of Mamba introduces linear complexity but may still face challenges with extremely long sequences compared to sparse attention mechanisms, though this is mitigated by the VAE compression. The STAR-Gen model relies on a large LLM backbone (Qwen3), which may limit its deployment on resource-constrained devices compared to smaller diffusion models. Additionally, the "Gamma-Growth" parameter requires tuning, although the paper suggests a default value.
This work advances the field of neural audio generation by providing a more robust and semantically rich continuous tokenization method. By resolving the trilemma between compression, fidelity, and regularity, STAR-VAE enables higher-quality audio synthesis with fewer artifacts. The integration with LLM-based Flow Matching opens new avenues for scalable and controllable audio generation. The potential applications in creative content production, sound design, and music composition are significant, though the authors appropriately note the risks associated with high-fidelity generative audio, such as misinformation and intellectual property concerns. The paper presents a significant advancement in continuous audio tokenization by introducing Structured Topology-Aware Regularization (STAR), which effectively resolves the Rate-Distortion-Regularity Trilemma in VAEs through a theoretically motivated capacity gradient, leading to state-of-the-art performance in both audio reconstruction and LLM-based flow matching generation.
Generative music systems can now produce impressive audio from text prompts, but audio outputs are difficult to inspect, edit, and diagnose as musical structure. We introduce Libretto, an agent-facing framework for symbolic music generation and revision. Libretto uses an LLM-native grammar with explicit onset slots, voices, and bar-level organization, then evaluates each piece in a corpus-calibrated statistical space over rhythm, harmony, melody, texture, form, and variation. The same structural axes support retrieval, diagnosis, copy-risk control, and iterative self-revision. Across gap filling, reference-guided full-piece generation, gradual morphing, and educational music generation, Libretto turns symbolic music from a raw token sequence into a measurable and editable object for language-model agents.
Primary: University of California, Berkeley
All Institutions: University of California, Berkeley
Libretto presents a structured, agent-centric framework for symbolic music generation that leverages corpus-calibrated structural metrics to enable interpretable diagnosis and iterative self-revision, offering a significant methodological contribution to controllable AI music composition despite lacking perceptual validation.
The paper introduces "Libretto," a framework designed to bridge the gap between generative audio models and symbolic music representation by creating an LLM-native grammar for symbolic music. The core methodological contribution is not a new neural architecture for generation, but rather a structured representation system and an evaluation loop. The grammar explicitly defines onset slots, voices, and bar-level organization, making the symbolic output directly editable and interpretable by an LLM agent. The evaluation mechanism relies on a "corpus-calibrated statistical cloud" comprising 29 structural axes (rhythm, harmony, melody, texture, form, variation) computed from the symbolic representation. These axes are mapped to percentiles against a reference corpus, allowing the agent to diagnose structural deviations (e.g., "too sparse," "harmonically unstable") rather than relying on black-box aesthetic scores. The agent loop involves generation, measurement against these axes, and iterative self-revision based on musician-readable feedback. This approach shifts the focus from end-to-end differentiable generation to a retrieval-augmented, self-correcting agentic workflow.
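The percentile-based diagnosis described above can be illustrated with a small sketch. It is not Libretto's implementation: the axis names, the corpus values, and the 10th/90th percentile thresholds are all hypothetical, and the mid-rank treatment of ties is one choice among several.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(corpus_values, value):
    """Percentile (0-100) of `value` within a reference corpus, mid-rank for ties."""
    ordered = sorted(corpus_values)
    below = bisect_left(ordered, value)         # corpus items strictly below
    at_or_below = bisect_right(ordered, value)  # corpus items below or equal
    return 100.0 * (below + at_or_below) / (2 * len(ordered))


def diagnose(piece_axes, corpus_axes, low=10.0, high=90.0):
    """Map each measured axis to (percentile, verdict) against the corpus."""
    report = {}
    for axis, value in piece_axes.items():
        pct = percentile_rank(corpus_axes[axis], value)
        if pct < low:
            verdict = "unusually low"
        elif pct > high:
            verdict = "unusually high"
        else:
            verdict = "within corpus range"
        report[axis] = (pct, verdict)
    return report


# Hypothetical reference statistics for two made-up axes.
corpus = {
    "note_density": [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5],
    "chord_change_rate": [0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 2.0, 2.0, 2.5, 3.0],
}
piece = {"note_density": 1.2, "chord_change_rate": 1.0}
report = diagnose(piece, corpus)
```

In this toy case the piece's note density falls below every corpus value, so an agent would receive a "too sparse"-style message for that axis while the harmonic axis passes.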
The authors evaluate the framework across four tasks: gap filling, reference-guided full-piece generation, gradual morphing, and educational music generation. They use a corpus of 314 MIDI files from the Lakh MIDI Dataset. The experiments demonstrate that the structural axes can distinguish between genres (e.g., Jazz vs. Folk on harmonic complexity) and that the agent loop improves pass rates for structural validity (e.g., gap-filling pass rate increased from 12% to significantly higher levels with the loop). The paper provides qualitative examples and quantitative metrics for copy-risk and structural degeneracy. However, the evaluation is largely internal and self-referential; it measures how well the generated pieces fit the *defined* structural axes and avoid copying, rather than assessing musical quality via human listening tests or comparison to state-of-the-art audio generation models (like Suno or Udio) in terms of perceptual quality. The dataset size (314 songs) is small for corpus calibration, though sufficient for the descriptive statistics used.
The paper provides a clear description of the grammar, the 29 axes, and the evaluation gates. The code and project website are linked, which enhances reproducibility. The reliance on a specific LLM (Claude Code with Opus 4.8) for the agent loop is noted, which allows other researchers to replicate the agentic behavior, though the specific prompts and retrieval mechanisms would need to be carefully reconstructed from the text and code. The definition of the structural axes is mathematically precise in the appendix.
A significant limitation is the lack of perceptual evaluation. The system optimizes for structural properties defined by the authors, but there is no evidence that these properties correlate with human judgments of musical quality or "goodness." The small reference corpus (314 songs) limits the generalizability of the statistical cloud, potentially biasing the "idiomatic" norms towards the specific genres present in that small set. The abstraction of the grammar (ignoring velocity, timbre, micro-timing) means it cannot capture expressive performance nuances, limiting its applicability to purely structural composition tasks. Furthermore, the reliance on a proprietary LLM (Claude) for the agent loop raises questions about the accessibility and cost of the method for broader research communities.
Libretto offers a novel perspective on AI-assisted music creation by treating symbolic music as a structured, editable object for LLM agents. This could empower musicians and educators by providing tools for targeted theory practice, gap-filling, and style exploration. It contributes to the broader field of AI creativity by demonstrating how structured representations can enhance the controllability and interpretability of generative models. However, the potential for generating high-quality, commercially viable music is limited by the lack of audio fidelity and expressive nuance in the current symbolic representation.
Diffusion models show potential for speech enhancement but lack linguistic guidance. We condition a diffusion-based model on wav2vec 2.0 features from noisy input, injected at the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). Phonetic representations derived from wav2vec 2.0 features of the degraded speech anchor the reverse diffusion process. While a frozen wav2vec 2.0 encoder extracts features, a learned FiLM generator produces scale and shift parameters that modulate the bottleneck with minimal overhead. Motivated by the optimal Bayesian causal estimator under a linear-Gaussian state-space model, FiLM coefficients are aggregated via exponential smoothing for temporal compression. Evaluation on VoiceBank-DEMAND and LibriMix shows competitive performance against the unconditioned baseline in PESQ, STOI, SI-SDR and DNSMOS. We consistently record a PESQ improvement of 0.4, suggesting self-supervised representations effectively condition diffusion-based speech enhancement.
Primary: University of Maryland
All Institutions: University of Maryland
This paper presents a theoretically motivated and empirically effective method for conditioning diffusion-based speech enhancement with self-supervised features, offering a compelling alternative to standard conditioning strategies despite a noted trade-off in source separation metrics.
The paper proposes a novel conditioning mechanism for diffusion-based speech enhancement by integrating self-supervised learning (SSL) features from wav2vec 2.0. The core technical contribution lies in the injection of these phonetic representations into the U-Net bottleneck via Feature-wise Linear Modulation (FiLM). A significant methodological strength is the theoretical derivation of the temporal aggregation strategy for the FiLM coefficients. By modeling the phonetic state as a random walk and the projected coefficients as noisy observations, the authors derive that the optimal causal estimator is a Kalman filter, which simplifies to exponential moving average (EMA) at steady state. This provides a principled, theoretically grounded alternative to ad-hoc pooling methods (like mean pooling) for handling the temporal mismatch between frame-level SSL features and the global context required at the diffusion bottleneck. The choice to apply FiLM only at the bottleneck, supported by ablation studies, demonstrates a nuanced understanding of feature abstraction levels in U-Nets.
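The claimed reduction from Kalman filter to exponential moving average can be checked numerically. The sketch below is not the authors' code: it assumes a scalar random-walk state x_t = x_{t-1} + w_t observed as y_t = x_t + v_t, with hypothetical process variance q and observation variance r. Once the gain K has converged, the update x_hat_t = x_hat_{t-1} + K (y_t - x_hat_{t-1}) is exactly an EMA with smoothing coefficient K.

```python
import math


def kalman_gains(q, r, steps=200, p0=1.0):
    """Gain sequence of a scalar Kalman filter for a random-walk state."""
    p = p0
    gains = []
    for _ in range(steps):
        p_pred = p + q             # predict: the random walk adds variance q
        k = p_pred / (p_pred + r)  # Kalman gain for observation variance r
        p = (1.0 - k) * p_pred     # posterior variance after the update
        gains.append(k)
    return gains


def steady_state_gain(q, r):
    """Closed-form limit of the gain, from p_pred**2 = q * p_pred + q * r."""
    p_pred = 0.5 * (q + math.sqrt(q * q + 4.0 * q * r))
    return p_pred / (p_pred + r)


def ema(observations, alpha):
    """Exponential moving average: s_t = (1 - alpha) * s_{t-1} + alpha * y_t."""
    state = observations[0]
    for y in observations[1:]:
        state = (1.0 - alpha) * state + alpha * y
    return state


q, r = 0.01, 1.0                # hypothetical variances
gains = kalman_gains(q, r)
alpha = steady_state_gain(q, r)  # EMA coefficient implied by q and r
```

Under these assumptions the smoothing coefficient is determined by the ratio q/r: slowly varying phonetic state (small q) or noisy projections (large r) both push the coefficient toward heavier smoothing.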
The experimental evaluation is conducted on two standard benchmarks: VoiceBank-DEMAND and LibriMix. The results show consistent improvements in perceptual metrics (PESQ, STOI, DNSMOS) compared to the unconditioned StoRM baseline, with a notable 0.4 improvement in PESQ on VB-DEMAND. However, the paper acknowledges a trade-off: a degradation in SI-SDR, attributed to aggressive noise suppression. This is a critical observation; while perceptual quality improves, the objective source separation metric suffers, suggesting the model may be over-smoothing or removing non-speech components that contribute to the separation score. The evaluation includes ablations on the smoothing coefficient and conditioning location, which strengthen the claims. However, the lack of subjective listening tests (MOS) limits the validation of the perceptual gains claimed by the DNSMOS and PESQ improvements. The comparison is primarily against StoRM and a few other diffusion baselines, which is appropriate but could be broader to include recent SSL-enhanced discriminative models.
The paper provides sufficient detail regarding the model architecture (U-Net configuration, wav2vec 2.0 base model), training hyperparameters (learning rate, epochs, optimizer), and datasets. The theoretical derivation of the EMA smoothing is clearly explained. However, the code is not publicly linked in the text provided, and specific details on the noise types and mixing conditions for LibriMix are somewhat generic ("min" mixing mode). The claim of "minimal overhead" is supported by FLOPs analysis, but exact inference latency comparisons would be more useful for reproducibility in real-time applications.
The primary limitation is the trade-off between perceptual quality and source separation fidelity (SI-SDR). The model improves PESQ but reduces SI-SDR, which may be undesirable for applications requiring strict source isolation. Additionally, the reliance on wav2vec 2.0, while effective, ties the method to a specific SSL model; the authors mention future work with WavLM/HuBERT, but the current work does not explore the sensitivity to the choice of SSL encoder. The theoretical derivation assumes a linear-Gaussian state-space model, which is a simplification of the complex, non-linear dynamics of speech and noise. Finally, the evaluation lacks subjective human listening tests, which are the gold standard for speech enhancement quality.
This work contributes to the field of audio processing by bridging self-supervised learning and generative modeling. It demonstrates that linguistic/phonetic information can effectively guide diffusion processes, potentially leading to more robust speech enhancement systems that preserve speech content even in low-SNR conditions. This has implications for telecommunications, hearing aids, and speech recognition preprocessing. The theoretical connection between Kalman filtering and exponential smoothing for feature aggregation is a generalizable insight that could apply to other temporal sequence modeling tasks.
We propose AugCodec, a low-bitrate disentangled neural speech codec that leverages data augmentation to decompose speech into three distinct components: semantic, speaker, and prosody tokens. Specifically, we employ tailored augmentation strategies to transform speech into distinct variants, each serving as input for extracting tokens that preserve the target attribute while suppressing others. This disentanglement strategy enables substantial reduction in token rate. Furthermore, we introduce an augmentation loss that aligns semantic encoder outputs between source and voice-converted speech, encouraging speaker-agnostic embeddings while mitigating the acoustic mismatch induced by voice conversion. Experiments on LibriSpeech test-clean demonstrate that AugCodec significantly outperforms state-of-the-art methods in both reconstruction quality and disentanglement, while operating at only 12.5 Hz with three token streams.
Primary: Amazon
All Institutions: Amazon
AugCodec introduces a data-augmentation-driven disentanglement strategy for neural speech codecs, achieving state-of-the-art reconstruction and content preservation at low bitrates by isolating semantic, speaker, and prosody features through tailored input transformations and alignment losses.
The paper proposes AugCodec, a novel approach to disentangled neural speech coding that leverages data augmentation as a primary mechanism for feature separation. The core innovation lies in the input preprocessing strategies: using a voice-converted speech variant for the semantic encoder to enforce speaker invariance, a different utterance from the same speaker for the speaker encoder to isolate identity, and low-frequency STFT components for the prosody encoder to capture F0 and lower harmonics while discarding timbral content. This is complemented by an "augmentation loss" that aligns semantic embeddings between the original and voice-converted speech, explicitly penalizing speaker information leakage. The architecture employs separate streams for semantic, speaker, and prosody tokens, which are quantized independently (using VQ for semantic, FSQ for speaker/prosody) and fused via a learned expansion and modulation mechanism (FiLM) before decoding. The semantic encoder uses a learned compression/expansion scheme to reduce frame rate while preserving dynamics, addressing a key limitation of average pooling.
The authors evaluate AugCodec on LibriSpeech test-clean, comparing against strong baselines including BiCodec, Mimi, Qwen-TTS-Tokenizer-12Hz, and FACodec. The results demonstrate that AugCodec achieves superior reconstruction quality (PESQ, UTMOS) and significantly lower Word Error Rates (WER) compared to baselines at comparable or lower bitrates (e.g., 387.5 bps for AugCodec-2 vs 412.5 bps for Mimi). The disentanglement capability is validated through voice conversion experiments, where AugCodec shows better content preservation (lower WER) than FACodec and BiCodec. The ablation study confirms the importance of the augmentation loss, showing performance degradation when it is removed. The results are robust across different codebook sizes and frame rates.
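The bitrate figures quoted above follow from simple arithmetic: bitrate equals frame rate times bits per frame, where a codebook of size N contributes log2(N) bits. The sketch below assumes that all token streams share a single 12.5 Hz frame rate, which this review does not confirm for AugCodec-2, and the three 2048-entry codebooks are a hypothetical configuration chosen only because it reproduces 412.5 bps.

```python
import math


def bitrate_bps(frame_rate_hz, codebook_sizes):
    """Bits per second for streams that all emit one token per frame."""
    bits = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits


def bits_per_frame(bitrate, frame_rate_hz):
    """Invert the relation: how many bits each frame carries."""
    return bitrate / frame_rate_hz


# Bits per frame implied by the quoted bitrates IF everything runs at 12.5 Hz.
augcodec2_bits = bits_per_frame(387.5, 12.5)
mimi_bits = bits_per_frame(412.5, 12.5)

# Hypothetical configuration: three codebooks with 2048 entries each.
example_bps = bitrate_bps(12.5, [2048, 2048, 2048])
```

Under that assumption the two systems differ by two bits per frame, so the comparison is close to bitrate-matched.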
The paper provides detailed architectural specifications, including layer dimensions, kernel sizes, and loss weights. It specifies the use of open-source components like wav2vec 2.0, ECAPA-TDNN, and Seed-VC. The training setup (optimizer, learning rate, batch size, iterations) is clearly described. However, the specific random seeds and exact data splitting procedures for the 3000 hours of LibriLight-medium data are not fully detailed, which might introduce minor variability. The use of off-the-shelf voice conversion models introduces a dependency that is well-documented but adds complexity to the training pipeline.
The reliance on an external voice conversion model (Seed-VC) for generating semantic training data is a potential limitation, as errors or biases in the VC model could propagate to the codec. The prosody encoder's reliance on low-frequency STFT components might struggle with complex harmonic structures or high-frequency prosodic cues, although the authors argue this minimizes correlation with semantic/speaker features. The method operates at a maximum frame rate of 12.5 Hz for semantic tokens, so each token spans 80 ms; this keeps the bitrate low but may be too coarse for real-time applications requiring ultra-low latency without further optimization. The paper does not report results on out-of-domain datasets or diverse languages, limiting the generalizability assessment.
AugCodec contributes to the development of efficient, disentangled speech representations, which are foundational for speech language models, voice conversion, and text-to-speech systems. By enabling low-bitrate transmission with high quality and disentanglement, it facilitates more scalable and privacy-preserving speech applications. The emphasis on speaker invariance in semantic tokens has implications for reducing bias and protecting speaker identity in downstream tasks.
Understanding speaker attributes is crucial for voice-related applications, yet conventional approaches rely on fixed categorical labels, lacking semantic richness and zero-shot generalizability. We propose a novel framework for open-set speaker attribute prediction leveraging Large Language Model (LLM) embeddings to represent attributes in a continuous semantic space. To bridge the cross-modal gap, we introduce a keyword-appending strategy that structures broad semantic representations into a compact, discriminative manifold. Furthermore, we employ a top-k negative loss to establish robust decision boundaries in crowded semantic regions. Experimental results on LibriTTS-P demonstrate that our method outperforms closed-set benchmarks and generalizes effectively to unseen synonyms. Geometric analysis suggests that our strategies regularize the embedding manifold, balancing semantic cohesion with predictive clarity.
Primary: Seoul National University
All Institutions: Seoul National University, Artificial Intelligence Institute, Interdisciplinary Program in Artificial Intelligence
This paper presents a well-executed extension of speaker attribute prediction into the open-set domain using LLM embeddings, demonstrating that structural regularization of the semantic manifold can significantly improve both performance and generalizability.
The paper proposes a novel framework for open-set speaker attribute prediction by leveraging LLM embeddings as continuous targets rather than discrete categorical labels. The core methodological contributions are the "keyword-appending strategy" to ground semantic ambiguity in a specific domain (e.g., appending "speech" to "cute") and a "top-k negative penalization" loss to manage semantic crowding in the embedding manifold. The approach effectively bridges the cross-modal gap between acoustic features (ECAPA-TDNN) and textual semantics (GPT-OSS-20B). The geometric analysis providing evidence for manifold regularization is a strong theoretical component. However, the novelty is slightly tempered by the fact that using LLM embeddings for zero-shot/open-set classification is an established paradigm (e.g., CLIP-style alignment), and the specific application here is a logical extension rather than a fundamental algorithmic breakthrough. The "apple" keyword finding is intriguing but suggests the mechanism might be more about structural regularization than semantic grounding, which is an interesting insight but limits the semantic interpretability claim.
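To make the two ideas above concrete, here is a minimal sketch of keyword appending and a top-k negative penalty. The paper's exact formulation is not given in this review, so the text template, the cosine-similarity objective, the hinge form, and the values of k and margin are all assumptions for illustration.

```python
import math


def append_keyword(attribute, keyword="speech"):
    """Build the text whose LLM embedding serves as the target (hypothetical template)."""
    return f"{attribute} {keyword}"


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def topk_negative_loss(pred, positive, negatives, k=2, margin=0.2):
    """Pull `pred` toward the positive target and push it away from the
    k most similar (hardest) negative targets, using a hinge with margin."""
    pos_sim = cosine(pred, positive)
    hardest = sorted((cosine(pred, n) for n in negatives), reverse=True)[:k]
    penalty = sum(max(0.0, s - pos_sim + margin) for s in hardest)
    penalty /= max(len(hardest), 1)
    return (1.0 - pos_sim) + penalty


# Toy 2-D embeddings: a perfect prediction with well-separated negatives.
easy = topk_negative_loss([1.0, 0.0], [1.0, 0.0],
                          [[0.0, 1.0], [-1.0, 0.0], [0.6, 0.8]])
# A prediction that sits closer to a negative than to its positive target.
hard = topk_negative_loss([0.6, 0.8], [1.0, 0.0], [[0.0, 1.0]])
```

Restricting the penalty to the top-k negatives concentrates the gradient on the nearest competing attributes, which is where crowding in the embedding space would cause confusion.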
The experimental evaluation is robust within the scope of the dataset. The use of LibriTTS-P is appropriate as it is the standard for this specific task. The comparison against the closed-set baseline (Vove) demonstrates clear performance gains in both closed-set F1 scores and zero-shot synonym generalization. The inclusion of a geometric analysis (Center Sim, Total Variance, PCA Log-determinant) adds significant value, providing empirical backing for the claim that the proposed methods structure the embedding space effectively. The ablation studies on different keywords (speech, voice, face, man, apple) are particularly valuable, revealing that the benefit of keyword appending is not strictly semantic but structural. The results are statistically sound and well-presented.
The paper provides sufficient implementation details for reproduction. It specifies the backbone (ECAPA-TDNN), the LLM used (GPT-OSS-20B), the loss functions, hyperparameters (margin, k, weights), and the dataset splits. The reference to the baseline code (https://github.com/jaejunL/vove) is helpful. One minor ambiguity is the exact version or specific prompt used to generate the GPT-OSS-20B embeddings, though this is a minor detail. The weight assignment for intensity levels (1.5, 1.0, 0.5) is clearly defined.
The primary limitation is the reliance on a single dataset (LibriTTS-P), which restricts the generalizability of the findings to other speaker attribute corpora or styles. The authors acknowledge this. Additionally, the target embeddings come from a single LLM (GPT-OSS-20B, an open-weight model released by OpenAI); the open weights help reproducibility, but the findings may be specific to that model's embedding space. The "apple" keyword result, while insightful, highlights that the model might be sensitive to arbitrary structural shifts in the embedding space rather than pure semantic meaning, which could be a vulnerability in noisy real-world applications.
This work contributes to the broader field of interpretable AI in speech processing. By moving from black-box embeddings to semantically rich, explainable attributes, it enables more controllable and transparent voice technologies (e.g., TTS, voice conversion). The open-set capability allows for more flexible user interfaces where users can describe speakers in natural language. The geometric insights into embedding manifolds are also relevant to broader multimodal learning research.
Neural speech codecs efficiently compress speech and have become a foundation for speech generation, but they are typically learned as holistic representations that intertwine linguistic content, speaker identity, and prosody. While this design is effective for zero-shot voice cloning, it hinders downstream tasks that require prosody preservation or transfer, such as voice conversion. To address this, we introduce ProsoCodec, a prosody-oriented speech codec that models prosody as a conditional residual rather than as a disentangled stream. Specifically, by conditioning both the encoder and decoder on text and speaker embeddings as prefix tokens, the discrete bottleneck is encouraged to capture prosodic variation not explained by content and speaker. To further preserve prosody, we use the low-frequency mel band and train the model on paired same-speaker utterances. Experiments on voice conversion show improved prosody preservation and reduced source-timbre leakage.
Primary: Chung-Ang University
All Institutions: KAIST, Chung-Ang University, The Chinese University of Hong Kong
ProsoCodec presents a novel and effective approach to voice conversion by treating prosody as a conditional residual in a discrete speech codec, achieving state-of-the-art performance in prosody preservation and content fidelity through clever architectural conditioning and training strategies.
The paper proposes ProsoCodec, a neural speech codec specifically designed for voice conversion (VC) by modeling prosody as a conditional residual. The core innovation lies in the architectural conditioning: by providing explicit text and speaker embeddings as prefix tokens to both the encoder and decoder, the discrete bottleneck is forced to encode only the "residual" information, which the authors argue corresponds to prosody. This is a clever application of residual learning in the latent space. The use of Binary Spherical Quantization (BSQ) provides a compact discrete representation. The decoder utilizes a Diffusion Transformer (DiT) with flow-matching, which is a modern and effective approach for high-fidelity waveform/spectrogram reconstruction. The "dual-utterance" training strategy is a pragmatic and effective heuristic to prevent prompt-style leakage, where the model might otherwise copy the reference speaker's prosody instead of the source's. The methodology is sound, well-motivated, and builds logically on recent trends in discrete speech representation and diffusion-based generation.
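For readers unfamiliar with Binary Spherical Quantization, the sketch below shows the forward pass of the general technique: normalize the latent onto the unit sphere, then snap each coordinate to plus or minus 1/sqrt(d), so that a d-dimensional latent becomes a d-bit code. How ProsoCodec configures BSQ (dimension, grouping, training-time straight-through gradients) is not stated in this review; the 4-dimensional example and the bit ordering of the index are assumptions.

```python
import math


def bsq_quantize(latent):
    """Return (quantized unit vector, integer code) for one latent vector.

    Assumes `latent` is non-zero; zeros in single coordinates map to +1.
    """
    d = len(latent)
    norm = math.sqrt(sum(x * x for x in latent))
    unit = [x / norm for x in latent]         # project onto the unit sphere
    scale = 1.0 / math.sqrt(d)                # keeps the code on the sphere
    code = [scale if x >= 0 else -scale for x in unit]
    bits = [1 if x >= 0 else 0 for x in unit]
    index = sum(bit << i for i, bit in enumerate(bits))  # bit i = sign of dim i
    return code, index


code, index = bsq_quantize([0.3, -1.2, 0.5, -0.1])
```

Because the codebook is implicit (all 2^d sign patterns), no embedding table needs to be stored or searched, which is one reason the representation is compact.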
The experimental setup is comprehensive, comparing ProsoCodec against several strong baselines including DDDM-VC, UniAudio, HierSpeech++, FACodec, Seed-VC, and Vevo. The evaluation covers both objective metrics (WER, Speaker Similarity, RMSE of F0, UTMOS) and subjective metrics (MOS for similarity, prosody, and naturalness). The results show that ProsoCodec achieves state-of-the-art performance in terms of content preservation (lowest WER) and prosody preservation (lowest F0 RMSE), while maintaining competitive speaker similarity and naturalness. The ablation studies are particularly valuable, demonstrating the individual contributions of the low-frequency mel input, the dual-utterance training, and the explicit conditioning. The analysis of the trade-off between bitrate and timbre leakage adds depth to the understanding of discrete codecs for VC.
The paper provides detailed implementation information, including dataset (LibriTTS, VCTK), model architecture (Transformer encoder, DiT decoder), hyperparameters (learning rate, batch size, optimizer), and pre-trained models used for ASR and speaker embedding (Qwen3-ASR, CAM++). The use of standard datasets and widely available pre-trained models enhances reproducibility. However, the specific initialization from "TaDiCodec" and the exact details of the binary spherical quantization implementation could benefit from more precise citation or code availability, though the description is generally sufficient for a competent researcher to replicate.
The paper acknowledges that strong reconstruction does not always imply strong conversion, a known issue in discrete speech models. The reliance on external ASR and speaker verification models introduces potential error propagation; if the ASR transcript is incorrect, the text conditioning will be wrong, potentially affecting prosody encoding. The dual-utterance training strategy, while effective, requires paired data from the same speaker, which might limit its applicability in scenarios where such pairs are scarce, although LibriTTS provides ample such data. The focus on prosody preservation might come at the cost of some fine-grained spectral details, as evidenced by the use of low-frequency mel bands, though the diffusion decoder helps mitigate this.
ProsoCodec contributes to the field of speech processing by offering a robust solution for voice conversion, a task with significant applications in entertainment, accessibility, and privacy. By decoupling prosody from timbre and content more effectively than previous holistic codecs, it enables more natural and controllable speech synthesis. The approach of using residual learning with explicit conditioning is a generalizable technique that could be applied to other speech generation tasks requiring fine-grained control over specific acoustic attributes.
Recent speech research involves increasingly large datasets, complex models, and diverse experimental workflows. However, existing frameworks require substantial engineering effort to support such experiments. We present ESPnet3, a speech and audio research framework built on a modular system architecture with configuration-driven dataset composition and unified Python-based workflows. ESPnet3 introduces a DataOrganizer abstraction for flexible dataset integration and dataset sharding for memory-efficient large-scale training, while allowing recipe-specific logic through lightweight stage overrides. In OWSM pre-training experiments, ESPnet3 reduces per-epoch training time by 21.1 minutes compared to ESPnet2 and achieves >80% GPU utilization in multi-node training. Fine-tuning experiments show that new models and datasets can be integrated with around 46 lines of additional code. ESPnet3 will be publicly released with model checkpoints and training logs.
Primary: Carnegie Mellon University
All Institutions: Carnegie Mellon University, Brno University of Technology, Instituto Superior Técnico, Hanyang University, Hitachi Astemo, Shanghai Jiao Tong University
ESPnet3 represents a crucial evolution in speech research infrastructure, transforming ESPnet from a collection of recipes into a scalable, modular framework capable of supporting the data- and compute-intensive demands of modern foundation models.
The paper presents ESPnet3, a significant architectural refactoring of the ESPnet speech processing framework. The core methodological contribution lies in the decoupling of experiment logic from core infrastructure through a modular system architecture. Key technical innovations include the `DataOrganizer` abstraction for configuration-driven dataset composition using Hydra, and a shard-based iteration mechanism for memory-efficient large-scale training. The shift from shell/Perl-based orchestration in ESPnet2 to a unified Python-based workflow using PyTorch Lightning and HuggingFace Datasets represents a substantial engineering improvement for scalability and maintainability. While the individual components (Hydra, PyTorch Lightning, HuggingFace Datasets) are not novel, their specific integration and abstraction design for the speech domain, particularly the `BaseSystem` pattern, constitute a valid and impactful engineering contribution.
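The memory argument behind shard-based iteration can be illustrated generically. The sketch below is not ESPnet3's API: `iter_shards` and `make_loader` are hypothetical names, and each loader simply returns a list standing in for one shard read from disk. The point is only that a single shard is resident at a time while shuffling still happens at both the shard and the example level.

```python
import random


def iter_shards(shard_loaders, shuffle=True, seed=0):
    """Yield examples shard by shard, holding one shard in memory at a time."""
    rng = random.Random(seed)
    order = list(range(len(shard_loaders)))
    if shuffle:
        rng.shuffle(order)                    # shuffle the order of shards
    for shard_id in order:
        examples = shard_loaders[shard_id]()  # load exactly one shard
        if shuffle:
            rng.shuffle(examples)             # shuffle within the shard
        for example in examples:
            yield example
        del examples                          # release before loading the next


def make_loader(items):
    """Hypothetical stand-in for reading one shard from storage."""
    return lambda: list(items)


loaders = [make_loader([0, 1, 2]), make_loader([3, 4]), make_loader([5])]
in_order = list(iter_shards(loaders, shuffle=False))
shuffled = list(iter_shards(loaders, shuffle=True, seed=1))
```

Peak memory then scales with the largest shard rather than with the full dataset, at the cost of shuffling that is only approximately global.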
The evaluation focuses on system-level metrics rather than model performance improvements. The authors demonstrate a 21.1-minute reduction in per-epoch training time and >80% GPU utilization in multi-node OWSM pre-training. They provide concrete evidence of reduced engineering effort, citing a reduction from 2,289 lines to 70 lines for the OWSM recipe and a 46-line integration for new datasets. The fine-tuning experiments on Whisper with LoRA and the FalAR dataset serve to validate the framework's extensibility. However, the absence of new SOTA model results means the paper does not demonstrate that ESPnet3 enables *better* models, only *more efficient* development and training. The system-level gains are well-documented and convincing.
The paper promises public release of code, checkpoints, and training logs. The use of standard libraries (PyTorch, Hydra, HuggingFace) enhances reproducibility. The detailed comparison of code lines and memory usage provides clear benchmarks for other developers. The reliance on external large-scale datasets (OWSM, FalAR) and specific hardware (Bridges2, Delta) means exact replication of timing results may vary, but the architectural reproducibility is high.
The primary limitation is that this is an infrastructure paper, not a model architecture paper. It does not propose new neural network architectures or learning algorithms. The novelty is largely in software engineering and system design. The evaluation lacks comparison with other modern speech frameworks like NVIDIA NeMo or HuggingFace Transformers in terms of raw throughput or ease of use for specific tasks, focusing instead on internal comparisons with ESPnet2. The claim of "foundation model era" readiness is supported by scale but not by demonstrating capabilities beyond ASR.
ESPnet3 addresses a critical bottleneck in speech research: the high engineering overhead for large-scale experiments. By lowering the barrier to entry for multi-node, large-dataset training and simplifying the integration of new models and datasets, it has the potential to accelerate research in low-resource languages and multimodal speech systems. The modular design encourages community contributions and long-term maintainability of the ESPnet ecosystem. ESPnet3 represents a crucial evolution in speech research infrastructure, transforming ESPnet from a collection of recipes into a scalable, modular framework capable of supporting the data- and compute-intensive demands of modern foundation models.
Neural networks outperform classical GCC-PHAT for Time-Difference-of-Arrival (TDOA) estimation in noise and reverberation, yet their internal strategy remains unexplored. To uncover it, we turn GCC-PHAT's mathematical steps into diagnostic targets, probing hidden layers of three architectures (MLP, CNN, Transformer) and complementing with gradient attribution and causal frequency masking. We find that cross-power computation consistently emerges across all architectures and conditions, while PHAT whitening, the defining step of GCC-PHAT, fails to emerge. Instead, networks learn a magnitude-aware frequency weighting that preserves per-frequency reliability information discarded by PHAT. This makes PHAT an information bottleneck: removing it from both classical and neural GCC pipelines improves performance under additive noise. On real-world reverberant data, PHAT remains the best classical weighting, but end-to-end networks achieve lower error by learning data-adaptive weighting.
Primary: Institute of Science Tokyo
All Institutions: Institute of Science Tokyo
This paper provides a compelling interpretability study of neural networks for TDOA estimation, demonstrating that networks bypass the classical PHAT whitening step in favor of magnitude-aware weighting, thereby offering actionable insights for improving hybrid audio processing pipelines.
The paper employs a rigorous interpretability framework, specifically representation probing, to analyze the internal representations of neural networks trained for Time-Difference-of-Arrival (TDOA) estimation. By mapping hidden layer activations to specific mathematical steps of the classical GCC-PHAT algorithm (cross-power spectrum, PHAT whitening, phase), the authors create a precise diagnostic tool. The methodology is sound, utilizing linear and nonlinear probes, gradient attribution, and causal frequency masking to disentangle what information is represented versus how it is weighted. The choice of architectures (MLP, CNN, Transformer) with varying frequency connectivity allows for a controlled study of how different inductive biases affect the emergence of these algorithmic components.
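The GCC-PHAT steps used as probing targets above (cross-power spectrum, PHAT whitening, lag-domain peak) can be sketched as follows. This is a minimal textbook NumPy version, not the authors' code; the function name `gcc_tdoa` and the `use_phat` switch are illustrative choices. Setting `use_phat=False` gives the magnitude-preserving variant that the abstract reports as better under additive noise.

```python
import numpy as np

def gcc_tdoa(x1, x2, use_phat=True, eps=1e-12):
    """Estimate the delay of x2 relative to x1 in samples (positive: x2 lags x1)."""
    n = len(x1) + len(x2)                 # zero-pad so the circular correlation does not wrap
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross_power = X2 * np.conj(X1)        # step 1: cross-power spectrum
    if use_phat:
        # step 2: PHAT whitening keeps only the phase and discards per-frequency magnitude
        cross_power = cross_power / (np.abs(cross_power) + eps)
    cc = np.fft.irfft(cross_power, n=n)   # step 3: back to the lag domain
    max_lag = n // 2
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))  # reorder to lags -max_lag..+max_lag
    return int(np.argmax(cc)) - max_lag

# Toy check: white noise delayed by a known number of samples (synthetic, noise-free).
rng = np.random.default_rng(0)
x1 = rng.standard_normal(1024)
true_delay = 5
x2 = np.concatenate((np.zeros(true_delay), x1[:-true_delay]))
print(gcc_tdoa(x1, x2, use_phat=True), gcc_tdoa(x1, x2, use_phat=False))
```

On clean data both variants recover the same delay; the difference the paper studies only appears once noise makes some frequency bins less reliable than others.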
The experimental design is comprehensive, covering synthetic data with controlled noise and reverberation, simulated speech data, and real-world recordings from the LOCATA challenge. The results are robust, consistently showing that while cross-power computation emerges reliably across all models and conditions, PHAT whitening does not. Instead, networks learn a magnitude-aware weighting scheme that correlates with signal reliability. The causal masking experiments strongly support the gradient attribution findings, demonstrating that high-energy bins are causally important for performance. The finding that removing PHAT whitening improves performance in additive noise is a significant empirical contribution, challenging a long-standing assumption in the audio signal processing community.
The paper provides a clear description of the network architectures, training procedures (AdamW, Huber loss, etc.), and data generation processes. The code is publicly available on GitHub, and the use of standard datasets and tools (LibriSpeech, pyroomacoustics, LOCATA) ensures that the experimental setup can be replicated. The inclusion of specific hyperparameters and dataset sizes enhances reproducibility.
The study focuses primarily on pairwise TDOA estimation using short-time Fourier transform (STFT) representations. It does not explore waveform-domain models or end-to-end systems that might learn different internal representations. Additionally, the analysis is limited to single-source scenarios; the behavior in multi-source environments with overlapping speech or noise is noted as an open question. The reliance on linear probes, while standard, might miss complex nonlinear interactions, although the authors mitigate this with nonlinear probe controls.
This work has significant implications for the design of hybrid signal processing and neural network systems. By identifying PHAT whitening as an information bottleneck in noisy conditions, it suggests that classical pipelines can be improved by removing this step or replacing it with learned, data-adaptive weighting. It also provides a template for using probing to understand and improve other audio processing algorithms, bridging the gap between classical signal processing theory and modern deep learning practices. The findings encourage a re-evaluation of fixed preprocessing steps in favor of learnable, interpretable alternatives.
Current text-guided audio editing methods rely on paired training data, predefined operation templates, and separate processing pipelines across speech, music, and sound. We present Bagpiper-Edit to enable open-ended audio editing via free-form natural language instructions. We reformulate audio editing as a rich-caption rewriting task by treating a rich caption as the semantic representation of an audio clip. The user request is translated into an edited caption, which then guides Bagpiper-Edit to generate the target edited audio with the original audio as contextual acoustic anchor. This unlocks the potential of free-form editing, and circumvents the need for paired audio-editing training data, enabling powerful zero-shot editing capabilities. Evaluations across speech, audio, and free-form editing show Bagpiper-Edit maintains good consistency to the original audio and achieves similar performance to other expert models in most cases. Demo: https://bagpiper-edit.github.io, Codes: https://github.com/espnet/espnet/pull/6417 & https://github.com/HsunGong/espnet
Primary: Shanghai Jiao Tong University
All Institutions: Auditory Cognition and Computational Acoustics Lab, Shanghai Jiao Tong University, Carnegie Mellon University, Language Technologies Institute
Bagpiper-Edit introduces a novel self-supervised framework for zero-shot open-ended audio editing by reformulating the task as rich-caption rewriting, effectively eliminating the need for paired editing datasets while maintaining high acoustic consistency.
The paper proposes Bagpiper-Edit, a framework that reformulates open-ended audio editing as a rich-caption rewriting task. The core innovation lies in bypassing the need for paired audio-editing training data by introducing a self-supervised training paradigm. This involves two strategies: audio repetition (to teach timbre/background preservation) and audio segmentation (to teach continuity across adjacent clips). The method leverages a pre-trained audio foundation model (Bagpiper-Base) and uses an external LLM to rewrite the rich caption based on user instructions, which then guides the audio generation conditioned on the original audio as an "acoustic anchor." The approach is theoretically sound and addresses a significant bottleneck in the field: the scarcity of high-quality paired editing datasets. The use of contiguous audio segments for self-supervision is a clever and resource-efficient technique for learning acoustic consistency.
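The two self-supervised strategies described above can be illustrated with a small data-construction sketch. This is only a guess at the general shape of such a pipeline, assuming (context, target) training pairs; the function names, the fixed segment length, and the pair format are hypothetical and are not taken from the paper.

```python
import numpy as np

def repetition_pair(clip):
    """Audio repetition: the target equals the context, so the model learns to preserve timbre and background."""
    return clip, clip.copy()

def segmentation_pairs(waveform, segment_len):
    """Audio segmentation: each segment is the context for the segment that immediately follows it."""
    num_segments = len(waveform) // segment_len
    segments = [waveform[i * segment_len:(i + 1) * segment_len] for i in range(num_segments)]
    return [(segments[i], segments[i + 1]) for i in range(num_segments - 1)]

# Toy usage on a synthetic one-second "recording" at 16 kHz (hypothetical sampling rate).
audio = np.arange(16000, dtype=np.float32)
pairs = segmentation_pairs(audio, segment_len=4000)
print(len(pairs))
```

Neither strategy needs an edited version of the audio, which is why no paired editing data is required.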
The evaluation covers three domains: speech editing, audio-event editing, and free-form editing. The authors compare Bagpiper-Edit against specialized expert models (CosyVoice-3, Ming-UniAudio-Edit, Step-Audio-EditX) and generative models (AudioLDM2). Results show that Bagpiper-Edit (specifically the Multi-Turn variant) achieves competitive performance, often surpassing expert models in speaker similarity preservation and matching them in naturalness and editing accuracy. The metrics used (WER, SpkSIM, DNSMOS, FAD, CLAP scores) are standard and appropriate. The inclusion of LLM-based human preference scoring adds a layer of subjective evaluation, though reliance on LLMs for judgment can be noisy. The ablation between Single-Turn and Multi-Turn patterns provides useful insight into the training dynamics.
The paper provides links to a demo and a GitHub PR for the code. The training data sources (YODAS, LAION-Audio, etc.) are publicly available. The self-supervised training strategy is clearly described, allowing for potential reproduction. However, the reliance on specific large LLMs (Qwen3-235B, Qwen3-8B) for caption extraction and rewriting introduces external dependencies that might vary in performance depending on the version or API access. The exact hyperparameters for the self-supervised phase are partially detailed, but the full pipeline complexity might make exact replication challenging without the complete codebase.
The paper acknowledges that as a zero-shot model, it may exhibit less stability than expert models trained on massive paired data. It also notes limitations in handling highly complex acoustic environments like multi-speaker separation. The reliance on an external LLM for caption rewriting means that errors in the LLM's understanding or rewriting can propagate to the audio generation. Additionally, the "rich caption" representation, while powerful, may lose some fine-grained acoustic details present in the original audio if the captioning process is not perfectly lossless.
This work significantly lowers the barrier for open-ended audio editing by removing the dependency on expensive paired datasets. It enables a more natural, free-form interface for audio manipulation, which has broad applications in content creation, accessibility, and audio post-production. The self-supervised approach could inspire similar techniques in other modalities where paired data is scarce. However, the ease of editing audio also raises concerns about deepfakes and misinformation, necessitating robust watermarking or detection mechanisms, which are not discussed in the paper.
Speaker-decoupled speech codecs can reduce bitrate by separating global speaker attributes from local content and prosody, while supporting voice conversion. Existing speaker-decoupled codecs face a trade-off: methods that explicitly suppress speaker leakage often rely on multi-stage or auxiliary training, whereas simpler designs can leave residual speaker information in local tokens. We propose SDP-Codec, a speaker-decoupled, pitch-injected codec trained with a single-stage optimization pipeline. SDP-Codec derives local tokens from continuous pre-quantization features of a pretrained self-supervised encoder and injects normalized F0 via a pitch encoder-decoder with global-conditioned denormalization and soft-label pitch reconstruction objective. Across 16 kHz and 24 kHz settings, SDP-Codec achieves competitive reconstruction and strong zero-shot voice conversion at comparable bitrates, with the lowest speaker-probing accuracy among compared systems, suggesting reduced speaker leakage.
Primary: Graduate School of Culture Technology, KAIST
All Institutions: Graduate School of Culture Technology, KAIST
SDP-Codec presents a compelling single-stage approach to speaker-decoupled speech coding, effectively balancing reconstruction quality and voice conversion capability through innovative pitch injection and soft-label loss mechanisms. The rigorous evaluation, including speaker probing and comprehensive ablation, strengthens the claim that the method successfully reduces speaker leakage in local tokens while preserving content and prosody, offering a practical solution for low-bitrate neural speech codecs.
The paper proposes SDP-Codec, a speaker-decoupled speech codec designed to minimize bitrate while supporting zero-shot voice conversion. The core methodological contribution lies in the decoupling strategy: it uses a single-stage optimization pipeline where local tokens are derived from continuous pre-quantization features of a pretrained vq-wav2vec encoder, rather than the quantized units themselves. To address the loss of prosodic information inherent in this compression, the authors inject normalized F0 via a dedicated pitch encoder-decoder. A key innovation is the use of a soft-label pitch reconstruction objective (Gaussian-blurred one-hot bins) and global-conditioned denormalization, which allows the global branch to handle speaker-dependent pitch range while the local branch handles content and normalized prosody. The global branch utilizes WavLM features compressed via a perceiver resampler to provide time-invariant speaker embeddings. This design attempts to resolve the trade-off between explicit speaker suppression (complex training) and residual leakage (poor VC).
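To make the soft-label pitch objective concrete, the sketch below builds a Gaussian-blurred target over pitch bins and scores predictions with cross-entropy. It is a generic illustration rather than the paper's implementation: the number of bins, the blur width `sigma`, and both function names are assumptions.

```python
import numpy as np

def soft_pitch_target(bin_index, num_bins, sigma=1.0):
    """Gaussian-blurred one-hot: probability mass centred on bin_index and spread to neighbouring bins."""
    bins = np.arange(num_bins)
    weights = np.exp(-0.5 * ((bins - bin_index) / sigma) ** 2)
    return weights / weights.sum()

def soft_label_cross_entropy(logits, target):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    shifted = logits - logits.max()                       # numerically stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())

num_bins, true_bin = 64, 20                 # hypothetical bin count and ground-truth bin
soft = soft_pitch_target(true_bin, num_bins)
hard = np.eye(num_bins)[true_bin]

near = np.zeros(num_bins); near[true_bin + 1] = 5.0   # prediction one bin off
far = np.zeros(num_bins); far[50] = 5.0               # prediction far off

print(soft_label_cross_entropy(near, soft), soft_label_cross_entropy(far, soft))
print(soft_label_cross_entropy(near, hard), soft_label_cross_entropy(far, hard))
```

With the soft target, a prediction one bin away is penalised less than a distant one, whereas a hard one-hot target treats both misses the same. That graded penalty is the usual motivation for blurring labels over an ordered quantity such as F0.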
The evaluation covers reconstruction quality (UTMOS, WER, STOI) and zero-shot voice conversion (SECS, F0 correlation, NMOS, SMOS) at 16 kHz and 24 kHz. The paper compares SDP-Codec against strong baselines including LSCodec, BiCodec, MSRCodec, and EZ-VC. Results indicate that SDP-Codec achieves competitive reconstruction quality, particularly in STOI and F0 correlation, while demonstrating superior speaker similarity in VC tasks compared to other codec-based systems. The speaker-probing accuracy experiment is a strong addition, quantitatively demonstrating reduced speaker leakage in the local tokens compared to competitors. The ablation studies effectively validate the components, showing that the soft-label pitch loss and the use of continuous pre-quantization features are critical for performance.
The paper provides source code and a demo link. Training details are provided, including dataset splits (LibriSpeech, LibriTTS, MLS), training steps, and hardware. However, the use of "internal" data for some baselines (Vevo) and the specific configuration of baselines (e.g., merging thresholds) to match bitrates introduces some variability. The description of the architecture is sufficiently detailed for replication, and the frozen pretrained components (vq-wav2vec, WavLM, FCPE) are standard.
The paper acknowledges that content fidelity (WER) remains a limitation, trading some reconstruction quality for VC performance. The reliance on a pretrained pitch extractor (FCPE) introduces a dependency on external tools for feature extraction, which might not be perfectly aligned with the codec's internal representations. The evaluation is primarily on English datasets (LibriSpeech, LibriTTS, MLS), which limits the assessment of multilingual robustness; the authors mention Indic languages in a table header, but the main text focuses on English. The single-stage training, while simpler, might not achieve the same level of disentanglement as multi-stage methods that explicitly optimize for speaker suppression.
SDP-Codec contributes to the field of efficient speech representation learning, enabling low-bitrate transmission and flexible voice conversion. This has implications for telecommunication, speech language models (SLMs) by providing compact tokens, and creative applications in voice cloning and conversion. The reduced speaker leakage is a positive step towards more ethical and controllable voice conversion systems. However, the potential for misuse in deepfake generation remains a concern, though the low bitrate and specific use case mitigate this slightly compared to high-fidelity generative models.
As expressive text-to-speech (TTS) and voice conversion (VC) systems increasingly generate non-verbal vocalizations (NVVs) to enhance naturalness, reliable speaker verification (SV) becomes essential to objectively assess identity consistency across both verbal and non-verbal segments. Yet current SV systems generalize poorly to NVVs, and fine-tuning on NVV data causes catastrophic forgetting of speech performance. We present the first systematic study across 10 NVV types and propose a framework combining frozen Data2Vec self-supervised features with ECAPA-TDNN, enhanced by a Mixture of Experts (MoE) module with learned domain-aware routing. A conditional distillation loss on speech inputs via a pretrained teacher retains speech-to-speech accuracy, while a contrastive loss bridges the speech-NVV domain gap. Our method reduces speech-NVV EER from 38.93% to 22.66% over a pretrained baseline, and improves speech EER from 13.17% to 9.24% via distillation.
Primary: National Taiwan University
All Institutions: University of Michigan, National Taiwan University, Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, National Taiwan University Artificial Intelligence Center of Research Excellence
This paper presents a novel MoE-based speaker verification framework that effectively bridges the domain gap between verbal and non-verbal vocalizations, significantly improving cross-domain verification performance while mitigating catastrophic forgetting through conditional distillation.
The paper proposes a Mixture of Experts (MoE) framework integrated with a frozen Data2Vec self-supervised learning (SSL) front-end and an ECAPA-TDNN backend to address speaker verification (SV) across verbal and non-verbal vocalizations (NVVs). The core methodological innovation lies in the combination of three components: (1) an Inter-Layer Residual MoE (IR-MoE) architecture that routes features to specialized experts for speech vs. NVV, governed by learned domain-aware constraints (entropy maximization, KL divergence for intra-event consistency, and cosine-margin for inter-event separation); (2) a conditional knowledge distillation loss using a frozen WavLM teacher to prevent catastrophic forgetting of speech capabilities during NVV adaptation; and (3) a supervised contrastive loss that bridges the domain gap by enforcing shared speaker manifolds across speech and NVV pairs. The approach is technically sound, leveraging established SSL representations while introducing architectural and loss-function modifications to handle the heterogeneity of non-phonemic acoustic signals. The use of MoE for domain separation in SV is a novel application, moving beyond standard fine-tuning paradigms.
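The "conditional" part of the distillation loss, meaning the teacher constraint is applied only to speech inputs, can be sketched as a masked loss. This is an illustrative NumPy version under stated assumptions: cosine distance as the distillation criterion and a per-sample `is_speech` flag are both choices made here for the example, and the paper's exact formulation may differ.

```python
import numpy as np

def conditional_distillation_loss(student, teacher, is_speech, eps=1e-8):
    """Mean cosine distance between student and teacher embeddings, averaged over speech inputs only."""
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    cos_dist = 1.0 - np.sum(s * t, axis=1)        # one distance per sample in the batch
    mask = is_speech.astype(float)                # NVV samples contribute nothing
    return float((cos_dist * mask).sum() / max(mask.sum(), 1.0))

# Toy batch: 4 embeddings of width 8 (hypothetical sizes); the first two are speech.
rng = np.random.default_rng(0)
teacher_emb = rng.standard_normal((4, 8))
student_emb = teacher_emb.copy()
student_emb[2:] = rng.standard_normal((2, 8))     # student drifts only on NVV inputs
is_speech = np.array([True, True, False, False])
print(conditional_distillation_loss(student_emb, teacher_emb, is_speech))
```

Because the mask removes NVV samples from the average, the student can move freely on NVV inputs while being held close to the teacher on speech, which is the mechanism the review credits for avoiding catastrophic forgetting.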
The experimental evaluation is comprehensive, utilizing the NonverbalTTS dataset with 10 distinct NVV categories. The authors provide a systematic analysis of domain mismatch, demonstrating the significant performance drop of zero-shot SV models on NVVs (EER ~39%). They compare against multiple baselines, including standard SSL models (WavLM, Data2Vec, Voc2Vec) and fusion strategies. The proposed IR-MoE model reduces cross-domain EER (NvS) from 38.93% to 22.66% relative to a pretrained baseline, and improves speech EER (SvS) from 13.17% to 9.24% via distillation. The ablation studies effectively isolate the contributions of the MoE architecture and the conditional distillation loss. The results are statistically significant and demonstrate that the proposed method successfully mitigates catastrophic forgetting while improving NVV verification. The inclusion of mDCF metrics adds robustness to the evaluation.
The paper provides detailed implementation specifications, including the optimizer (Adam), learning rate schedule (cosine annealing), batch composition strategy (speaker-balanced with cross-domain prioritization), and specific loss weights. The use of standard, open-source components (Data2Vec, ECAPA-TDNN, WavLM) enhances reproducibility. The description of the progressive training schedule and the specific constraints for the MoE routing (entropy, KL, cosine-margin) is sufficiently detailed for replication. However, the exact codebase for the MoE integration and the specific preprocessing steps for the NonverbalTTS dataset (beyond MFA alignment) might require careful interpretation, though the overall pipeline is clear.
A primary limitation is the reliance on the NonverbalTTS dataset, which, while diverse, may not capture the full spectrum of NVVs found in real-world expressive TTS/VC systems (e.g., complex emotional laughter, gasps, or synthesized artifacts). The MoE architecture introduces additional parameters and complexity compared to standard fine-tuning, which could impact inference latency, although this is not explicitly measured. The "conditional" distillation relies on the model's ability to correctly route inputs during inference; if the router misclassifies an input (e.g., a speech-like NVV), the distillation benefit might not be fully realized or could even be detrimental if the wrong expert is activated. Furthermore, the EER on NVV-NVV (NvN) verification remains relatively high (~27-28%), indicating that while cross-domain generalization is improved, intra-NVV verification is still challenging.
This work has significant implications for the evaluation and development of expressive TTS and Voice Conversion systems, where NVVs are crucial for naturalness. By providing a reliable SV metric for NVVs, it enables better objective evaluation of these systems, potentially accelerating their adoption in virtual assistants, gaming, and accessibility tools. It also contributes to the broader field of multi-modal and multi-domain speaker recognition, offering a template for handling heterogeneous acoustic inputs. The mitigation of catastrophic forgetting is a generalizable technique for continual learning in speech processing.
Speech deepfake countermeasures (CMs) are compared almost exclusively by equal error rate (EER), a metric computed at an oracle threshold chosen on the labeled test set. Deployed CMs enjoy no such oracle: a threshold must be fixed in advance and applied to unlabeled target data. We audit this gap with a frozen state-of-the-art SSL-AASIST detector trained on ASVspoof 2019 LA. While its in-domain EER is 0.21%, transferring its LA-calibrated threshold to the In-the-Wild corpus yields a half total error rate (HTER) of 39.5%, with 78.7% of bona fide speech rejected, even though the In-the-Wild EER (11.2%) appears moderate. We then test whether popular unlabeled test-time corrections close this gap, and first prove a simple proposition: any strictly increasing score transform, including z-norm, temperature/shift calibration, and embedding mean alignment under a frozen linear head, cannot change EER. An audit of seven corrections on In-the-Wild and ASVspoof 2021 DF confirms the proposition empirically and exposes two further failure modes: AS-norm with an unlabeled target cohort collapses (EER 11.2% to 60.2%), and pseudo-label calibration that reduces HTER by 38% relative on In-the-Wild degenerates to 50% HTER on DF21, whose spoof prior is 96%. No audited correction reduces EER by more than 1% relative. We recommend reporting HTER at a transferred threshold alongside EER.
Primary: Xidian University
All Institutions: Xidian University
This paper provides a crucial theoretical and empirical audit of speech deepfake detection evaluation, proving that common unlabeled score calibrations cannot improve EER and demonstrating the severe deployment failures hidden by standard metrics, thereby advocating for more realistic evaluation protocols in the field.
The paper presents a critical methodological audit of speech deepfake countermeasures (CMs), specifically focusing on the discrepancy between evaluation metrics (EER) and deployment reality (HTER at fixed thresholds). The core theoretical contribution is a rigorous proof of monotone invariance: any strictly increasing transformation of scores (including z-norm, temperature scaling, and embedding mean alignment) cannot alter the Equal Error Rate (EER). This is a significant theoretical insight that debunks the efficacy of many popular "unlabeled test-time corrections" often cited in adjacent fields like speaker verification. The methodology involves freezing a state-of-the-art SSL-AASIST model and systematically applying seven different correction strategies (C1-C7) to evaluate their impact on both in-domain and out-of-domain performance. The approach is clean, logically sound, and effectively isolates the operating-point shift from the score distribution shift.
The experimental evaluation is robust and well-designed. The authors use a frozen SSL-AASIST model trained on ASVspoof 2019 LA and test it on two distinct out-of-domain datasets: In-the-Wild (ITW) and ASVspoof 2021 DF. The results starkly illustrate the paper's thesis: while the in-domain EER is near-zero (0.21%), the transferred threshold leads to catastrophic failure on ITW (HTER 39.5%, 78.7% FRR). The audit of seven corrections confirms the theoretical proposition: monotone methods (C1-C3) fail to improve EER and often worsen HTER, while non-monotone methods (C4-C7) show limited improvement or collapse (e.g., AS-norm with unlabeled cohort collapses to 60.2% EER due to cohort contamination). The analysis of Failure Mode III (prior sensitivity) on DF21, where pseudo-label calibration fails due to a 96% spoof prior, provides deep practical insight into the fragility of current calibration techniques.
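The monotone-invariance argument is easy to check numerically. The sketch below uses synthetic Gaussian scores (not the paper's data) and helper names chosen for this example. EER depends only on the rank order of scores, so a strictly increasing transform leaves it unchanged, while HTER at a threshold fixed in advance does change.

```python
import numpy as np

def error_rates(scores, labels, threshold):
    """FAR and FRR when scores >= threshold are accepted as bona fide (label 1)."""
    accept = scores >= threshold
    far = np.mean(accept[labels == 0])       # spoof accepted
    frr = np.mean(~accept[labels == 1])      # bona fide rejected
    return far, frr

def eer(scores, labels):
    """EER at the oracle threshold: sweep observed scores, keep the point where FAR and FRR are closest."""
    best_gap, best_eer = np.inf, None
    for t in np.unique(scores):
        far, frr = error_rates(scores, labels, t)
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def hter(scores, labels, threshold):
    """Half total error rate at a threshold fixed in advance."""
    far, frr = error_rates(scores, labels, threshold)
    return (far + frr) / 2

rng = np.random.default_rng(0)
labels = np.concatenate((np.ones(500, dtype=int), np.zeros(500, dtype=int)))
scores = np.concatenate((rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)))

z_norm = (scores - scores.mean()) / scores.std()   # strictly increasing transform
shifted = scores + 2.0                             # strictly increasing transform

print(eer(scores, labels), eer(z_norm, labels), eer(shifted, labels))   # same ranking, same EER
print(hter(scores, labels, 0.0), hter(shifted, labels, 0.0))            # the fixed threshold no longer fits
```

The second print line mirrors the deployment gap the paper audits: a score shift that is invisible to EER can still move the operating point far from where the transferred threshold was calibrated.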
The paper provides sufficient detail for reproduction. It specifies the model architecture (SSL-AASIST with wav2vec 2.0 front-end), the specific checkpoint used, the data subsets (verified against official protocols), and the exact procedures for each correction method (e.g., L-BFGS settings for temperature scaling, leave-one-out for AS-norm). No code for the audit protocol is linked in the text provided, so its availability is unconfirmed. The verification of labels against official protocol files adds a layer of trust to the experimental setup.
The primary limitation is the scope of the audit: it focuses on a single model architecture (SSL-AASIST). While the monotone invariance proposition is model-agnostic, the magnitude of errors for non-monotone methods may vary across different architectures (e.g., CNN-based vs. Transformer-based detectors). Additionally, the audit uses specific subsets of the ASVspoof datasets (4/9 shards for LA, 4/80 for DF21), which may not fully represent the entire dataset's variability. The paper also notes that HTER is a single operating point; applications requiring different cost functions (e.g., t-DCF) might see different absolute numbers, although the qualitative conclusions likely hold.
This paper has significant implications for the speech anti-spoofing community. By highlighting that EER is a misleading metric for deployment readiness, it challenges the status quo of benchmarking in the field. The recommendation to report HTER at transferred thresholds alongside EER could lead to more realistic evaluation standards. The proof that many common calibration techniques are theoretically incapable of improving EER saves researchers from pursuing dead ends and directs attention to non-monotone corrections or fundamental model improvements. This work promotes more rigorous and deployment-aware evaluation practices in audio security and deepfake detection.
Recent flow-matching text-to-speech (TTS) models, such as F5-TTS, rely on a reference transcript at inference time, obtained from an external ASR system. This dependency makes zero-shot TTS brittle for accented or dysarthric speakers, precisely the scenarios where it is most needed. Moreover, we find that text-based reference conditioning can propagate atypical acoustic patterns from atypical speech into synthesis, even when ground-truth transcripts are available. To address this, we propose RTFree-F5, which replaces the reference transcript with continuous self-supervised speech representations mapped into F5-TTS's text-conditioning space via a lightweight adapter, while reusing the pretrained checkpoint. On dysarthric speech, RTFree-F5 reduces WER from 24.6% to 10.4%, surpassing even the ground-truth reference transcript baselines, while improving naturalness and remaining competitive on standard benchmarks without requiring any reference transcript.
Primary: Korea Advanced Institute of Science and Technology
All Institutions: Korea Advanced Institute of Science and Technology, University of Illinois Urbana-Champaign
RTFree-F5 significantly advances zero-shot TTS by introducing a transcript-free conditioning mechanism using SSL features, demonstrating superior intelligibility for atypical speakers and revealing critical limitations in text-based reference conditioning. The paper presents a robust, well-evaluated solution to a persistent problem in speech synthesis, with clear implications for both technical research and real-world accessibility applications.
The paper proposes RTFree-F5, a novel conditioning mechanism for flow-matching TTS models (specifically F5-TTS) that eliminates the need for reference transcripts. The core innovation is replacing the text-encoder output for the reference audio with a projection of continuous self-supervised learning (SSL) features (from WavLM) into the text-conditioning space. This is a clever architectural modification that leverages the robustness of SSL representations to handle atypical speech (dysarthria, accents) where ASR fails. The two-stage training strategy (aligning the projector, then joint fine-tuning) is methodologically sound and addresses the distribution shift between the original within-utterance infilling objective and the new cross-utterance conditioning setup. The approach is technically elegant as it reuses the heavy pretrained backbone, requiring only a lightweight adapter.
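The adapter idea can be sketched as a small projector that maps frame-level SSL features into the width of the text-conditioning sequence. The review describes the projector as a 2-layer MLP on WavLM-Large features; everything else below is an assumption made for illustration, including the hidden and conditioning widths, the ReLU activation, the initialisation, and the way projected reference frames are concatenated with target text embeddings.

```python
import numpy as np

SSL_DIM = 1024      # WavLM-Large hidden size
HIDDEN_DIM = 1024   # hypothetical projector width
COND_DIM = 512      # hypothetical width of the text-conditioning space

def init_projector(rng, ssl_dim=SSL_DIM, hidden_dim=HIDDEN_DIM, cond_dim=COND_DIM):
    """Randomly initialised 2-layer MLP parameters (illustrative initialisation)."""
    return {
        "w1": rng.normal(0.0, ssl_dim ** -0.5, (ssl_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, hidden_dim ** -0.5, (hidden_dim, cond_dim)),
        "b2": np.zeros(cond_dim),
    }

def project(params, ssl_features):
    """Map (frames, SSL_DIM) features to (frames, COND_DIM) conditioning vectors."""
    hidden = np.maximum(ssl_features @ params["w1"] + params["b1"], 0.0)   # ReLU (assumed)
    return hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
params = init_projector(rng)
reference_ssl = rng.standard_normal((150, SSL_DIM))     # stand-in for reference-audio SSL frames
target_text_emb = rng.standard_normal((40, COND_DIM))   # stand-in for target text embeddings

# The projected reference takes the place the reference transcript embedding would occupy.
conditioning = np.concatenate((project(params, reference_ssl), target_text_emb), axis=0)
print(conditioning.shape)
```

Only the projector parameters are new in this picture, which matches the review's point that the heavy pretrained backbone is reused.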
The evaluation is comprehensive, covering both standard zero-shot TTS benchmarks (LibriSpeech-PC, SeedTTS) and challenging atypical speech datasets (SAP for dysarthria, L2-ARCTIC for accents). The results demonstrate significant improvements in intelligibility (WER) for atypical speakers, outperforming even oracle transcript baselines. This is a critical finding that highlights a fundamental flaw in text-based reference conditioning for non-standard speech. The trade-off between intelligibility and speaker similarity (SIM) is honestly reported and analyzed. The inclusion of standard metrics (UTMOS, SIM) alongside WER provides a well-rounded view of performance. The comparison against ASR-based baselines further strengthens the claim that the method is robust to transcription errors.
The paper provides sufficient implementation details for reproduction, including the specific SSL model (WavLM-Large), the architecture of the projector (2-layer MLP), training hyperparameters (learning rates, epochs, optimizer), and the data construction process (cross-utterance pairs from LibriTTS). The use of publicly available models (F5-TTS, WavLM, Vocos) and datasets ensures that the work is reproducible. The code for the projector and training script would likely need to be released, but the architectural description is clear enough to implement.
The primary limitation is the trade-off between intelligibility and speaker similarity. While intelligibility improves, speaker similarity scores drop, suggesting that some unique voice characteristics are lost or altered in the process of normalizing the speech. The authors acknowledge this but suggest it may be acceptable for accessibility applications. Additionally, the method relies on the quality of the SSL encoder; if the SSL features do not capture the necessary speaker identity information as effectively as the original text+acoustic context, performance might suffer for typical speakers (though results show it remains competitive). The method also introduces a slight inference overhead due to the SSL feature extraction and projection, although this is likely negligible compared to the TTS generation itself.
This work has significant positive broader impact, particularly for accessibility. By enabling high-quality, intelligible speech synthesis for individuals with dysarthria or strong accents without requiring perfect transcripts, it removes a major barrier to entry for assistive communication technologies. It democratizes access to zero-shot TTS for populations that are currently underserved by standard systems. The technical contribution also advances the field of multimodal conditioning, demonstrating the utility of SSL features as a robust alternative to text for reference conditioning in generative models.
Streaming speech enhancement requires balancing algorithmic latency against quality, yet existing approaches largely treat this as a binary causal versus non-causal choice. LaCo-SENet addresses this issue with two mechanisms parameterized by a single training-time hyperparameter. First, asymmetric temporal padding redistributes past and future context in convolutions, enabling systematic latency configuration. Second, dual-buffer streaming combines state buffers for past context with lookahead buffers that supply future context at both the input and feature levels. Selective state updates also prevent future-frame leakage into the streaming state, ensuring training-inference consistency. On VoiceBank+DEMAND, a fixed-budget (1.37M parameters) backbone yields a family of models spanning 12.5-75.0 ms, with PESQ rising from 3.35 to 3.43. At just 12.5 ms (fully causal), a PESQ of 3.35 matches or exceeds the prior causal state-of-the-art (3.27 at 46.5 ms).
Primary: Pohang University of Science and Technology (POSTECH)
All Institutions: Pohang University of Science and Technology (POSTECH), Intus Co. Ltd.
LaCo-SENet introduces a latency-configurable streaming speech enhancement framework using asymmetric temporal padding and selective state updates to prevent future-frame leakage, achieving state-of-the-art causal performance at ultra-low latency. The paper presents a robust solution to the latency-quality trade-off in streaming speech enhancement. By decoupling these factors through asymmetric padding and a novel dual-buffer mechanism with selective state updates, the authors enable a single fixed-parameter model to span a wide range of algorithmic latencies. The technical contribution is significant because it addresses a fundamental implementation challenge (state corruption) that has previously hindered flexible streaming designs. The results demonstrate that careful architectural management of context buffers can yield superior causal performance compared to models specifically designed for higher latencies. This approach offers a practical path for deploying high-quality speech enhancement on resource-constrained devices where latency is a hard constraint.
The paper proposes LaCo-SENet, a streaming speech enhancement architecture that decouples algorithmic latency from model quality by introducing asymmetric temporal padding. The core innovation lies in the "dual-buffer" streaming framework, which manages state buffers for past context and lookahead buffers for future context. Crucially, the authors identify and solve the "state corruption" problem inherent in naive asymmetric padding implementations, where future frames recorded in state buffers would leak into subsequent chunks. They resolve this via "selective state updates," ensuring that only current-chunk frames update the recurrent state, thereby maintaining training-inference consistency. The approach is technically sound, leveraging standard convolutional operations (Dense Dilated Depthwise Blocks) but applying them in a novel streaming configuration. The method is elegant in its simplicity: a single hyperparameter (padding ratio) controls the latency-quality trade-off without altering the parameter count or receptive field size.
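The interplay of asymmetric padding, the two buffers, and selective state updates can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: it assumes a single 1-D convolution with dilation 1, and the names `split_padding`, `future_ratio`, and `streaming_conv` are hypothetical. It compares chunked inference against the offline (training-time) output, once with selective state updates and once with a naive update that lets lookahead frames enter the state.

```python
def split_padding(kernel_size, future_ratio):
    """Split the (kernel_size - 1) padding frames into (past, future) parts.

    future_ratio is a hypothetical stand-in for the padding-ratio hyperparameter.
    """
    total = kernel_size - 1
    future = round(total * future_ratio)
    return total - future, future


def valid_conv(frames, weights):
    """Plain 1-D correlation with no implicit padding."""
    k = len(weights)
    return [sum(weights[j] * frames[i + j] for j in range(k))
            for i in range(len(frames) - k + 1)]


def offline_conv(signal, weights, past, future):
    """Training-time view: pad the whole utterance once, asymmetrically."""
    return valid_conv([0.0] * past + list(signal) + [0.0] * future, weights)


def streaming_conv(signal, weights, past, future, chunk, selective=True):
    """Chunked inference with a state buffer (past) and a lookahead buffer (future)."""
    state = [0.0] * past
    outputs = []
    for start in range(0, len(signal), chunk):
        current = list(signal[start:start + chunk])
        lookahead = list(signal[start + chunk:start + chunk + future])
        lookahead += [0.0] * (future - len(lookahead))  # zero-pad at stream end
        window = state + current + lookahead
        outputs.extend(valid_conv(window, weights))
        # Selective update: only frames up to the end of the current chunk
        # may enter the state; the naive variant also stores lookahead frames.
        history = state + current if selective else window
        state = history[len(history) - past:]
    return outputs


signal = [float(v) for v in range(1, 11)]
weights = [0.1, 0.2, 0.3, 0.2, 0.2]
past, future = split_padding(len(weights), future_ratio=0.5)
reference = offline_conv(signal, weights, past, future)
selective_out = streaming_conv(signal, weights, past, future, chunk=3)
naive_out = streaming_conv(signal, weights, past, future, chunk=3, selective=False)
```

In this toy setting the selective variant reproduces the offline output, while the naive variant diverges starting with the second chunk, because its state holds lookahead frames where past frames are expected.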
The evaluation is conducted on the standard VoiceBank+DEMAND dataset. The results are compelling: at 12.5 ms latency (fully causal), LaCo-SENet achieves a PESQ of 3.35, outperforming the previous causal state-of-the-art (aTENNuate at 3.27 PESQ, but with 46.5 ms latency). The model scales gracefully up to 75.0 ms (PESQ 3.43) and 200.0 ms (PESQ 3.47). The ablation study effectively demonstrates the necessity of the selective state update mechanism, showing catastrophic performance drops (below noisy baseline) when it is disabled. The throughput analysis (RTF) is also included, showing real-time capability. However, the evaluation is limited to a single dataset (VoiceBank+DEMAND) and a single metric suite (PESQ, STOI, CSIG, etc.). While the results are strong for the benchmark, the lack of evaluation on more recent, larger-scale datasets (e.g., DNS Challenge, CHiME) limits the generalizability claims.
The paper provides sufficient detail for reproduction. The architecture (PrimeK-Net backbone), hyperparameters (channel dimensions, kernel sizes, STFT parameters), and training loss components are explicitly listed. The mathematical formulation of the asymmetric padding and selective state updates is clear. The code is not provided, but the method relies on standard PyTorch/TensorFlow operations (convolutions, buffering, masking), making implementation feasible for researchers in the field.
The primary limitation is the scope of evaluation. VoiceBank+DEMAND is a relatively small, older dataset with specific noise types. Performance on real-world, non-stationary noise or diverse acoustic environments is not reported. Additionally, while the latency is configurable, the model does not adapt latency dynamically based on input SNR or content; it is fixed per model instance. The PESQ gains at higher latencies (75-200 ms) are marginal compared to the latency cost, suggesting diminishing returns for very high lookahead in this specific architecture.
This work has significant implications for real-time audio applications such as teleconferencing, hearing aids, and voice assistants, where low latency is critical but quality cannot be sacrificed. By enabling a single model to serve a range of latency requirements, it simplifies deployment pipelines for edge devices with varying computational constraints. The technique of selective state updates for preventing future-frame leakage is a generalizable contribution to streaming deep learning architectures beyond just speech enhancement.
Mean opinion score (MOS) prediction models are widely used as proxy metrics in text-to-speech (TTS) research, yet their ability to capture quality differences beyond acoustic fidelity remains unclear. We investigate this via controlled perturbations on speech: acoustic degradation, prosodic errors, and manipulation of speaker-specific characteristics such as pitch and speaking rate. We obtained MOS predictions for these speech samples from both human listeners and the model, and analyzed the differences in their perceptual characteristics. Results show that most models track acoustic degradation well, while all are insensitive to prosodic errors despite large subjective score drops. For speaker characteristics, models exhibit a double dissociation: strong mean fundamental frequency (F0) biases absent in human ratings, yet insensitivity to speaking rate and F0 variability that humans notice. These findings highlight limitations of scalar MOS prediction beyond acoustic fidelity.
Primary: Nagoya Institute of Technology
All Institutions: Nagoya Institute of Technology, LY Corporation
This paper provides a critical diagnostic analysis of automatic MOS prediction models, demonstrating their insensitivity to prosodic errors and misalignment with human perception regarding speaker characteristics, thereby highlighting the need for more perceptually aligned evaluation metrics in speech synthesis research.
The paper employs a rigorous diagnostic methodology to probe the internal representations and output sensitivities of existing automatic MOS prediction models (UTMOS, DNSMOS, NISQA, SHEET-MB, SHEET-BV). By applying controlled, independent perturbations to speech signals—specifically acoustic degradation, prosodic (accent) errors, and speaker characteristics (pitch/rate)—the authors isolate specific quality dimensions. This "ablation-style" evaluation of the *evaluator* rather than the *generator* is a sophisticated approach to understanding model failure modes. The use of Japanese pitch-accent language provides a critical test case for prosodic sensitivity that is often overlooked in English-centric benchmarks. The methodology is sound, well-controlled, and directly addresses the gap between scalar model outputs and multidimensional human perception.
The experimental design is robust. Group A confirms that models generally track acoustic fidelity well (with some exceptions like SHEET-MB on MP3 compression), validating the baseline utility of these models. Group B provides the most striking result: a complete dissociation between human sensitivity to prosodic errors (large MOS drops) and model insensitivity (negligible score changes). This is a significant finding. Group C reveals a "double dissociation" in speaker characteristics: models are biased by mean F0 (likely due to training data distributions) while ignoring F0 variability and speaking rate, which humans notice. The statistical reporting (SRCC, Pearson r, confidence intervals) is adequate. The comparison across the evaluated models allows for a comprehensive view of the field's current state.
The paper provides sufficient detail for reproduction. It specifies the datasets (JVS, NANSY-TTS internal data), the perturbation methods (SiFi-GAN for pitch, WORLD for rate), and the models evaluated (using the VERSA toolkit). The specific conditions (e.g., clipping levels, SNR values, accent swap probabilities) are clearly defined. However, the use of an "internal dataset" for the TTS generation in Group B limits full independent reproduction of that specific subset, though the methodology is clear. The VERSA toolkit ensures standard inference, enhancing reproducibility for the model evaluation part.
The primary limitation is the reliance on a single language (Japanese) for the prosodic analysis. While Japanese is a pitch-accent language, it is unclear if these findings generalize to tonal languages (e.g., Mandarin) or non-tonal languages with different prosodic structures (e.g., stress-accent languages like English). Additionally, the study focuses on *naturalness* and *quality* as perceived by humans, but does not explore whether these model biases correlate with other metrics like intelligibility or speaker similarity. The sample size for the subjective evaluation (15 listeners) is on the lower end for robust statistical power, though acceptable for a diagnostic study.
This work has significant implications for the TTS and speech processing communities. It challenges the uncritical use of automatic MOS predictors as proxies for human evaluation, particularly in contexts where prosody and speaker identity are crucial (e.g., expressive TTS, voice conversion). It highlights that current SSL-based models may encode spurious correlations (like mean F0) rather than perceptually relevant features. This could guide future model training, suggesting the need for explicit prosodic supervision or multi-objective loss functions that account for speaker variability. It serves as a cautionary tale for relying solely on scalar metrics in AI research.
Existing mean opinion score (MOS) prediction models typically predict utterance-level naturalness MOS and can be insensitive to localized pitch-accent errors. We propose Pitch-Accent-focused Speech Quality Assessment (PASQA), which explicitly targets pitch-accent correctness. To train our model, we construct a controlled Japanese accent-error dataset by changing accent patterns using an accent-controllable text-to-speech system, and compute a pseudo accent-quality score from the accent-error rate. PASQA builds on self-supervised representations and employs mora-conditioned fusion, ranking loss, an auxiliary accent-error localization task, and speaker-invariant training. Experiments show that conventional models fail to preserve the ordering by accent-error severity, whereas PASQA achieves high ordering accuracy on both seen and unseen speakers. Further, PASQA shows stronger agreement with human accent-correctness judgments. The code is available at https://github.com/lycorp-jp/PASQA.
Primary: LY Corporation
All Institutions: LY Corporation
PASQA introduces a specialized speech quality assessment model for Japanese pitch accents, leveraging synthetic data and multi-task learning to achieve superior sensitivity to prosodic errors compared to general MOS predictors. This work addresses a critical niche in TTS evaluation, providing a robust, reproducible, and human-aligned metric for accent correctness that can drive improvements in synthetic speech naturalness.
The paper proposes PASQA, a model specifically designed to assess pitch-accent correctness in Japanese speech, addressing a gap where standard MOS predictors are insensitive to localized prosodic errors. The methodology is sound and well-structured. Key innovations include the use of mora-conditioned fusion (incorporating linguistic structure into SSL features), a ranking loss (Bradley-Terry) to enforce ordinal relationships between error severities, an auxiliary frame-level error detection head for localized supervision, and a Gradient Reversal Layer (GRL) for speaker-invariant training. The use of self-supervised representations (wav2vec 2.0) as a backbone is appropriate for this task. The integration of linguistic tokens via cross-attention is a strong design choice for a language-specific prosodic task.
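To make the ranking objective concrete, here is a minimal sketch of a pairwise Bradley-Terry loss over predicted quality scores. It assumes the standard logistic form, in which the probability that sample i outranks sample j is the sigmoid of their score difference. The pairing rule (every pair with different accent-error rates) and all function names are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bradley_terry_loss(score_better, score_worse):
    """Negative log-probability that the less erroneous sample is ranked higher."""
    return softplus(-(score_better - score_worse))


def batch_ranking_loss(scores, error_rates):
    """Average pairwise loss over all pairs with different accent-error rates."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if error_rates[i] < error_rates[j]:  # sample i should score higher
                losses.append(bradley_terry_loss(scores[i], scores[j]))
    return sum(losses) / len(losses) if losses else 0.0


# Scores that follow the severity ordering incur a lower loss than reversed ones.
aligned = batch_ranking_loss([3.0, 2.0, 1.0], [0.0, 0.5, 1.0])
reversed_order = batch_ranking_loss([1.0, 2.0, 3.0], [0.0, 0.5, 1.0])
```

Because the loss depends only on score differences, it supervises the ordering by error severity rather than absolute score values, which matches the ordinal goal described above.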
The experimental setup is rigorous. The authors construct a large-scale synthetic dataset with controlled accent errors, which is a significant contribution in itself, allowing for precise evaluation of sensitivity to error severity. The evaluation metrics (Order Accuracy, SRCC, KTAU) are well-chosen to test the specific hypothesis that PASQA preserves severity ordering better than conventional models. The results clearly demonstrate that PASQA outperforms both traditional acoustic feature-based models and general SSL-based MOS predictors (like UTMOS, DNSMOS, NISQA) on this specific task. The ablation study effectively isolates the contribution of each component (mora fusion, ranking loss, frame head, GRL). The inclusion of an out-of-domain (OOD) evaluation on GPT-4o-mini-TTS speech adds robustness to the claims.
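One plausible reading of the Order Accuracy metric is sketched below: the fraction of sample pairs with different error severity for which the model assigns the strictly higher score to the less erroneous sample. This definition is an assumption made for illustration, since the exact formula is not restated here, and the scores in the example are invented.

```python
def order_accuracy(scores, severities):
    """Fraction of comparable pairs whose predicted order matches severity order."""
    correct = 0
    total = 0
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            if severities[i] == severities[j]:
                continue  # pairs with equal severity carry no ordering information
            total += 1
            if severities[i] > severities[j]:
                more_severe, less_severe = i, j
            else:
                more_severe, less_severe = j, i
            if scores[less_severe] > scores[more_severe]:
                correct += 1
    return correct / total if total else float("nan")


severities = [0.0, 0.25, 0.5, 1.0]          # illustrative accent-error rates
sensitive = order_accuracy([4.5, 3.9, 3.0, 2.1], severities)
insensitive = order_accuracy([3.0, 3.0, 3.0, 3.0], severities)
```

Under this definition, a predictor that returns the same score regardless of accent errors gets no credit, which is the failure mode the evaluation is designed to expose in general-purpose MOS predictors.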
The paper provides sufficient detail for reproduction. It specifies the backbone (wav2vec 2.0), the TTS system used for data generation (NANSY-TTS), the loss functions, hyperparameters (learning rate, batch size, loss weights), and the dataset construction pipeline. The code is made available on GitHub, which significantly enhances reproducibility. The synthetic nature of the training data allows other researchers to replicate the data generation process exactly.
The primary limitation is the reliance on synthetic data for training. While the controlled errors are well-defined, there may be a domain gap between synthetic accent errors and natural speech variations or errors produced by other TTS systems. The model is specifically tailored for Japanese pitch accent; while the methodology could be extended, the paper does not demonstrate cross-lingual applicability. Additionally, the "pseudo" accent-quality scores are derived from error rates, which assumes a linear or monotonic relationship between error count and perceived quality, which might not perfectly capture human perceptual nuances (e.g., context-dependent tolerance).
This work has significant implications for the development and evaluation of Text-to-Speech systems, particularly for languages with complex prosodic systems like Japanese. By providing a tool that aligns better with human judgments of accent correctness, it enables more efficient iteration cycles for TTS developers. It also contributes to the broader field of speech quality assessment by highlighting the need for task-specific metrics beyond general naturalness. The open-source release promotes further research in prosodic quality evaluation.
Streaming zero-shot voice conversion struggles to disentangle timbre from linguistic content without degrading utility or inflating latency. Current methods rely on information bottleneck (IB) or speaker perturbation. While IB filters out timbre, it discards prosody, forcing models to explicitly inject features like fundamental frequency. This often requires buffering future frames, creating algorithmic lookahead latency. On the other hand, existing perturbation methods largely overlook the crucial trade-off between timbre leakage and utility preservation. Recognizing this neglected trade-off, we find that the inherent objective of Speaker Anonymization (SA) aligns well with balancing these factors. Thus, we introduce SA as a novel perturbation mechanism to explicitly mitigate timbre leakage while retaining prosodic utility. Crucially, SA's robust representations significantly alleviate the generator's reliance on future context, enabling our strictly causal, zero-lookahead network. Audio samples are available at https://amphionteam.github.io/Zero-VC-demo/.
Primary: The Chinese University of Hong Kong, Shenzhen
All Institutions: The Chinese University of Hong Kong, Shenzhen, Shenzhen Loop Area Institute, Shenzhen Transsion Holdings Co., Ltd., Amphion Technology Co., Ltd.
This paper presents a practical and effective solution for low-latency streaming voice conversion by repurposing Speaker Anonymization techniques to optimize the timbre-utility trade-off, enabling a strictly causal architecture that outperforms existing methods in both quality and latency.
The paper proposes Zero-VC, a streaming zero-shot voice conversion system that leverages Speaker Anonymization (SA) as a perturbation mechanism to disentangle timbre from linguistic content. The core methodological insight is that SA, which aims to conceal identity while preserving prosody, provides a robust feature representation that reduces the generator's reliance on future context. This allows for a strictly causal, zero-lookahead architecture (20 ms latency) using causal convolutions in a HiFi-GAN-based decoder. The approach combines an off-the-shelf SA module with a streaming encoder (distilled w2v-bert-2.0) and a WavLM-based timbre extractor. While the application of SA to VC is a logical and clever adaptation of existing privacy techniques, the architectural novelty is incremental, relying heavily on standard components (HiFi-GAN, WavLM) and a specific preprocessing step. The claim of "novelty" rests more on the systematic analysis of the leakage-utility trade-off in perturbation methods than on a fundamentally new neural architecture.
The evaluation is comprehensive, comparing Zero-VC against non-streaming SOTA models (LSCodec, CosyVoice, Seed-VC) and discussing latency relative to streaming baselines. The authors use a robust set of metrics including Speaker Similarity (SS-S, SS-R), WER, F0 Pearson Coefficients, OVRL (DNSMOS), and subjective NMOS/SMOS. The results show that Zero-VC achieves superior timbre conversion (lowest SS-S, highest SS-R) and competitive quality scores while maintaining ultra-low algorithmic latency (20 ms). The ablation studies effectively demonstrate that SA outperforms other perturbation methods in balancing leakage and utility, and that the SA-perturbed features saturate in performance with minimal lookahead, validating the zero-lookahead design. The use of an open-source evaluation dataset (seed-tts-eval) and clear reporting of RTF adds credibility. However, the comparison against closed-source streaming models (StreamVC, RT-VC) is limited to latency claims, which is a common but necessary limitation in this field.
The paper provides sufficient implementation details, including the dataset (LibriTTS), model architectures (w2v-bert-2.0, WavLM, HiFi-GAN), training hyperparameters (optimizer, learning rate, loss weights), and evaluation metrics. The use of off-the-shelf models for SA and timbre extraction aids reproducibility, although the specific version of the SA module is linked to a GitHub repository. The code for the Zero-VC model itself is not explicitly linked in the text provided (only the demo page), which slightly hinders immediate reproduction, but the architectural description is clear.
The primary limitation is the reliance on an external, off-the-shelf SA module for preprocessing during training. This introduces an additional dependency and potential error source, and the authors acknowledge that this pre-processing step may introduce training overhead. The paper suggests future work to integrate SA end-to-end, which implies the current system is not fully unified. Additionally, the evaluation is limited to English, and the cross-lingual capability is noted as future work. The reliance on WavLM for timbre extraction, while effective, adds computational cost compared to lighter-weight speaker embedding extractors.
Zero-VC has significant implications for real-time voice conversion applications, such as privacy-preserving communication, virtual avatars, and assistive technologies. By achieving hard-real-time latency without sacrificing conversion quality, it addresses a critical bottleneck in deploying VC systems. The use of SA also aligns with growing concerns about voice privacy and deepfake mitigation, offering a technical pathway to anonymize voices while maintaining communicative utility. The open-source demo and clear methodology contribute to the broader ML community's understanding of streaming audio generation constraints.
AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during pretraining. Existing benchmarks cannot answer this question because they evaluate transcription under fixed prompting conditions and rarely include explicit contextual inputs. We introduce IndicContextEval, a 56-hour multilingual benchmark of natural speech from 555 speakers across 8 Indian languages and 23 professional domains. We design a 7-level prompting framework that progressively introduces contextual signals, including metadata, natural-language descriptions, entity lists in English and native script, and adversarial prompts with incorrect entities. Evaluating five models reveals substantial differences in context utilisation behaviour, highlighting the need for explicit evaluation of contextual grounding in AudioLLMs.
Primary: Indian Institute of Technology Madras
All Institutions: AI4Bharat, Indian Institute of Technology Madras, Sarvam AI
IndicContextEval introduces a rigorous, multilingual benchmark that reveals critical gaps in how AudioLLMs utilize contextual information, demonstrating that while some models effectively leverage native-script entity biasing, others suffer from blind reliance or contextual blindness, thereby establishing a new standard for evaluating contextual grounding in speech recognition systems.
The paper proposes a novel evaluation framework, IndicContextEval, designed to probe the contextual grounding capabilities of Audio Large Language Models (AudioLLMs). The core methodological contribution is the design of a 7-level prompting taxonomy (L0-L6) that systematically varies the type and quality of textual context provided to the model (from no context to adversarial incorrect entities). This controlled experimental design allows for the isolation of specific contextual signals (metadata, natural language descriptions, entity lists) and the measurement of their impact on transcription accuracy (WER) and entity recognition (NEER). The approach is rigorous in its control of variables, aiming to distinguish between parametric memorization and genuine contextual utilization.
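The per-level comparison can be pictured with a small word error rate (WER) computation. The sketch below uses the standard word-level Levenshtein definition of WER; the transcripts are invented English toy strings rather than benchmark data, and the two prompt conditions are only loosely modelled on the no-context and entity-list levels. NEER is not reproduced here because its exact definition is not given in this summary.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Invented example: the same utterance transcribed under two prompt conditions.
reference = "the patient was prescribed metformin today"
hypotheses = {
    "no_context": "the patient was prescribed met forming today",
    "entity_list": "the patient was prescribed metformin today",
}
wer_by_condition = {name: wer(reference, hyp) for name, hyp in hypotheses.items()}
```

Comparing such per-condition scores, including under deliberately incorrect entity lists, is what lets the benchmark separate genuine context use from reliance on parametric knowledge.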
The authors evaluate five models (GPT-4o Transcribe, Gemini 3 Flash, Sarvam Audio, Gemma-3N, and a standalone IndicConformer baseline) on a newly collected dataset of 56 hours of natural speech across 8 Indian languages and 23 professional domains. The results are insightful, revealing significant disparities in how models handle context. For instance, GPT-4o Transcribe shows robust contextual reasoning, while Gemma-3N exhibits "blind reliance" on entity prompts, even when they are adversarial. The finding that natural-language descriptions often outperform structured metadata, and that native-script entity biasing yields the largest gains, provides concrete empirical evidence for the field. The use of NEER as a primary metric for entity biasing is well-chosen and adds depth to the standard WER evaluation.
The paper provides a clear description of the dataset creation process, including speaker demographics, recording styles (read vs. extempore), and quality control measures. The prompt taxonomy is explicitly defined, allowing other researchers to replicate the evaluation protocol. The code and benchmark data are made publicly available via GitHub, which significantly enhances reproducibility. The inclusion of specific model versions and the detailed breakdown of results by language and context level further support reproducibility efforts.
The dataset, while diverse, is limited to 8 Indian languages and 23 domains, which may not capture the full spectrum of global linguistic diversity or domain-specific challenges. The reliance on commercial models (GPT-4o, Gemini) limits the ability to fully inspect internal mechanisms, although the black-box evaluation is appropriate for the benchmark's goals. The "adversarial" prompts in L6 are limited to incorrect domain entities; more sophisticated adversarial attacks (e.g., semantically similar but incorrect entities) could provide deeper insights into model robustness. Additionally, the dataset size (56 hours) is relatively small compared to large-scale ASR benchmarks, which may limit the statistical power of some analyses, particularly for lower-resource languages within the set.
This work has significant implications for the development and deployment of multilingual AudioLLMs, particularly in low-resource and high-context domains like healthcare, legal, and technical support in India. By highlighting the risks of blind reliance on context or failure to utilize it, the benchmark encourages the development of more robust and interpretable models. It also underscores the importance of native-script support and the challenges of cross-lingual entity biasing. The public release of the benchmark will facilitate fairer comparisons and drive progress in contextual ASR for Indic languages.
Controllable text-to-speech (TTS) has become a key research focus. However, methods based on either reference speech or text descriptions lack flexibility and precise control, and recent joint approaches remain loosely coupled, with speech modeling timbre and text controlling global style. We propose FineCombo-TTS, a unified framework for speech synthesis grounded in reference speech and guided by text descriptions, enabling flexible and precise control over acoustic attributes. Instead of explicit attribute disentanglement, we learn a unified acoustic representation and introduce a Conditional Flow Matching (CFM)-based Speech Variance Predictor to model fine-grained reference-to-target transformations guided by text descriptions. To support relative attribute control, we construct FineEdit, a structured paired dataset that explicitly encodes source-to-target attribute variations. Experiments demonstrate that our approach achieves flexible, precise, and expressive controllable TTS.
Primary: Tsinghua University
All Institutions: Shenzhen International Graduate School, Tsinghua University, Inner Mongolia University, Tencent
FineCombo-TTS introduces a novel CFM-based framework for precise, text-guided, reference-grounded speech synthesis, addressing key limitations in existing joint-control methods through a unified acoustic representation and a new paired dataset for relative attribute control.
The paper proposes FineCombo-TTS, a unified framework for controllable Text-to-Speech (TTS) that integrates reference speech and text descriptions. The core methodological contribution is a Conditional Flow Matching (CFM)-based Speech Variance Predictor. Unlike previous methods, which often treat timbre and style as separate, loosely coupled modules, this approach learns a unified acoustic attribute representation. It uses a pre-trained FACodec timbre extractor and a residual style encoder to form a source embedding, which CFM then transforms into a target embedding under the guidance of the text description. This allows for fine-grained, relative attribute control (e.g., "make it faster" relative to the reference) rather than absolute control. Applying CFM to this reference-to-target transformation is novel in the TTS domain and exploits the efficiency and stability of flow matching relative to diffusion or autoregressive alternatives. The integration of Classifier-Free Guidance (CFG) for both text and description conditions further enhances control precision.
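To make the flow-matching step concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a straight-line (rectified-flow style) path that starts at the source embedding and ends at the target embedding, and it uses one common form of classifier-free guidance. The embedding size, the guidance scale `w`, the function names, and the `oracle_velocity` stand-in for a trained network are all hypothetical; the paper's actual path, backbone, and two-condition guidance rule are not specified in this review.

```python
import numpy as np

def cfm_pair(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t, plus its velocity target."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant velocity along a straight path
    return x_t, v_target

def cfm_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocities."""
    return float(np.mean((v_pred - v_target) ** 2))

def guided_velocity(v_cond, v_uncond, w):
    """One common CFG form: move from the unconditional toward the conditional estimate."""
    return v_uncond + w * (v_cond - v_uncond)

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

rng = np.random.default_rng(0)
source = rng.normal(size=8)  # hypothetical source embedding
target = rng.normal(size=8)  # hypothetical target embedding

x_t, v_target = cfm_pair(source, target, t=0.3)

def oracle_velocity(x, t):
    """Stand-in for a trained, description-conditioned network (assumption)."""
    return target - source

sampled = euler_sample(source, oracle_velocity, n_steps=10)
```

With a perfect velocity predictor the Euler integration lands on the target embedding; a trained network would only approximate this field, and the guidance scale would trade off fidelity to the description against closeness to the reference.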
The experimental evaluation is comprehensive, covering prosody, emotion, and timbre control. The authors construct a new dataset, FineEdit, consisting of triplets that pair a source utterance and a target utterance with an explicit control description, addressing a gap in existing datasets that lack relative control annotations. They compare FineCombo-TTS against a re-implemented baseline, VoxInstruct-Joint, which adapts an LLM-based TTS model for joint control. The results show significant improvements in Instruction Following (MOS-I) and Controlled Accuracy, particularly in prosody and emotion tasks. The model also maintains high speaker similarity (MOS-S, SECS), indicating effective timbre preservation. The ablation studies on CFG strategies and the residual style encoder provide additional validation of the design choices. The use of both subjective (MOS) and objective (WER, SECS, FPC, Emotion-A) metrics strengthens the evaluation.
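Two of the objective metrics named above have simple standard definitions, sketched here for reference. This is not the paper's evaluation code: it assumes WER is the word-level edit distance divided by the reference length, and that SECS is the cosine similarity between two speaker embeddings. The function names are hypothetical, and the ASR system and speaker encoder that would produce the inputs are outside the sketch.

```python
import math

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

example_wer = wer("the cat sat on the mat", "the cat sat on a mat")
example_secs = cosine_similarity([1.0, 0.0], [1.0, 0.0])
```

Lower WER indicates better intelligibility, while SECS closer to 1 indicates that the synthesized voice stays closer to the reference speaker.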
The paper provides detailed descriptions of the architecture, including the FACodec timbre extractor, T5 text encoder, and the CFM UNet backbone. The training strategy is clearly outlined in two stages. The dataset construction process for FineEdit is described in detail, including the sources (LibriTTS-R, ESD) and the methods for generating paired data. The availability of the demo and dataset at the provided URL enhances reproducibility. However, the specific hyperparameters for the CFM training (e.g., number of steps, specific loss weights beyond MSE) could be more explicitly stated, though the general framework is clear.
The paper acknowledges that existing description-based datasets are limited, which motivated the creation of FineEdit. However, the reliance on synthetic or manually annotated relative pairs may introduce biases or narrow the diversity of control expressions compared to large-scale unsupervised data. The model's performance on out-of-distribution text descriptions, or on complex descriptions that control several attributes at once, is not extensively explored. Additionally, the computational cost of training the CFM predictor, while potentially lower than diffusion, is still a factor compared to simpler autoregressive models. The evaluation is primarily on English data, so generalization to other languages remains untested.
FineCombo-TTS contributes to the field of controllable speech synthesis by providing a more flexible and precise method for generating speech with specific attributes guided by natural language. This has implications for accessible technology, creative content generation, and personalized voice assistants. The construction of the FineEdit dataset also provides a valuable resource for future research in relative attribute control. The work aligns with the broader trend of integrating multiple conditioning modalities in generative models to achieve finer control over outputs.
Reliable sound source localization is fundamental to robot audition, enabling autonomous robots to perceive spatial cues and operate effectively in dynamic environments. Classical methods such as Multiple Signal Classification (MUSIC) offer strong theoretical foundations but degrade under low signal-to-noise ratios. While deep learning-based approaches achieve promising performance, they often struggle with limited generalization across conditions. To address these challenges, we propose NeuralMUSIC, a hybrid neural-subspace framework for robotic sound source localization. Specifically, a neural network first estimates the spatial covariance matrix from multichannel microphone observations. The predicted covariance is then integrated into a classical MUSIC pipeline with eigenvalue decomposition (EVD) and pseudo-spectrum computation, followed by a Frequency Attention Fusion (FAF) module to produce the final DOA estimates. To improve data efficiency, we further introduce a Self-supervised Spatial Correlation Learning (SSCL) strategy that leverages unlabeled acoustic data to capture spatial structure. Extensive experiments across different robotic tasks demonstrate that NeuralMUSIC achieves competitive localization accuracy while exhibiting improved robustness and cross-domain generalization.
Primary: Nanyang Technological University
All Institutions: Nanyang Technological University
NeuralMUSIC presents a robust hybrid framework for robot sound source localization by combining neural covariance estimation with classical subspace methods, achieving superior accuracy and generalization across diverse acoustic environments. The integration of self-supervised learning for spatial correlation and adaptive frequency fusion addresses critical challenges in low-SNR and broadband scenarios, offering a significant advancement in reliable robotic audition systems.
The paper proposes NeuralMUSIC, a hybrid framework that integrates a neural network for spatial covariance matrix estimation into the classical Multiple Signal Classification (MUSIC) algorithm. The approach addresses key limitations of classical MUSIC (noise sensitivity, broadband processing) and pure deep learning methods (black-box nature, poor generalization). Key innovations include: 1) A neural encoder to predict the spatial covariance matrix, which is then used in the standard MUSIC eigen-decomposition pipeline. 2) A Frequency Attention Fusion (FAF) module to adaptively weight frequency bins for broadband DOA estimation. 3) A self-supervised Spatial Correlation Learning (SSCL) strategy using masked channel reconstruction to leverage unlabeled data. 4) An adaptive source-number prediction module. The methodology is well-grounded in signal processing theory and effectively bridges the gap between model-based and data-driven approaches. The integration of SSCL is a particularly strong design choice for robotic applications where labeled data is scarce.
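The classical half of the pipeline described above (covariance, eigen-decomposition, pseudo-spectrum) can be sketched as follows. This is an illustration under stated assumptions, not the NeuralMUSIC code: it assumes a narrowband signal, a uniform linear array with half-wavelength spacing, and a known source count, and it uses a sample covariance where the paper uses a neural estimate. `fuse_over_frequency` is only a softmax-weighted stand-in for the FAF module, whose actual design is not detailed in this review; all function names are hypothetical.

```python
import numpy as np

def steering_vector(theta_deg, n_mics, spacing_wavelengths=0.5):
    """Far-field steering vector of a uniform linear array (assumed geometry)."""
    m = np.arange(n_mics)
    phase = 2j * np.pi * spacing_wavelengths * m * np.sin(np.deg2rad(theta_deg))
    return np.exp(phase)

def music_spectrum(cov, n_sources, grid_deg, spacing_wavelengths=0.5):
    """MUSIC pseudo-spectrum over a grid of candidate angles."""
    n_mics = cov.shape[0]
    # eigh returns eigenvalues in ascending order for a Hermitian matrix,
    # so the first columns span the noise subspace.
    _, eigvecs = np.linalg.eigh(cov)
    noise_subspace = eigvecs[:, : n_mics - n_sources]
    spectrum = np.empty(len(grid_deg))
    for i, theta in enumerate(grid_deg):
        a = steering_vector(theta, n_mics, spacing_wavelengths)
        proj = noise_subspace.conj().T @ a
        # Peaks occur where the steering vector is orthogonal to the noise subspace.
        spectrum[i] = 1.0 / max(np.real(np.vdot(proj, proj)), 1e-12)
    return spectrum

def fuse_over_frequency(spectra, scores):
    """Softmax-weighted sum of per-frequency spectra (stand-in for FAF)."""
    scores = np.asarray(scores, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ np.asarray(spectra)

# Simulate one source at 20 degrees observed by 6 microphones.
rng = np.random.default_rng(0)
n_mics, n_snapshots, true_doa = 6, 400, 20.0
a_true = steering_vector(true_doa, n_mics)
signal = rng.normal(size=n_snapshots) + 1j * rng.normal(size=n_snapshots)
sensor_noise = 0.1 * (rng.normal(size=(n_mics, n_snapshots))
                      + 1j * rng.normal(size=(n_mics, n_snapshots)))
x = np.outer(a_true, signal) + sensor_noise

# Sample covariance; in the paper this matrix is predicted by the network.
cov = x @ x.conj().T / n_snapshots

grid = np.arange(-90.0, 90.5, 0.5)
spec = music_spectrum(cov, n_sources=1, grid_deg=grid)
estimated_doa = grid[np.argmax(spec)]
```

Because only the covariance estimate is learned, the eigen-decomposition and peak search stay interpretable, which is the main appeal of the hybrid design.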
The authors conduct extensive experiments on four datasets: GSC (simulated), AV16.3 (real-world speaker), SLoClas (acoustic events), and AFPILD (pedestrian footsteps). They compare against classical methods (MUSIC, NormMUSIC, Beamforming, TOPS, FRIDA) and deep learning baselines (CRNN, Transformer, DOANet, DeepDAE, DeepMusic, DA-Music). Results show consistent improvements in Mean Absolute Angular Error (MAAE) across all datasets and configurations (single-source, multi-source, unknown source number). Ablation studies validate the contributions of FAF and SSCL. Additional experiments on data efficiency, SNR robustness, and cross-domain generalization (cross-room, cross-array) further demonstrate the method's robustness. The evaluation is comprehensive and convincing.
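For reference, a mean absolute angular error can be computed as in the short sketch below. It assumes azimuth angles in degrees with wrap-around at 360 and a one-to-one pairing between predictions and ground truth; the paper's exact definition, including how it matches predictions to sources in the multi-source case, is not given in this review. The function names are hypothetical.

```python
def angular_error_deg(pred, true):
    """Smallest absolute difference between two angles in degrees."""
    diff = abs(pred - true) % 360.0
    return min(diff, 360.0 - diff)

def maae(preds, trues):
    """Mean absolute angular error over paired predictions and ground truth."""
    if len(preds) != len(trues) or not preds:
        raise ValueError("preds and trues must be non-empty and of equal length")
    return sum(angular_error_deg(p, t) for p, t in zip(preds, trues)) / len(preds)

example_maae = maae([350.0, 90.0], [10.0, 100.0])
```

The wrap-around matters near the boundary: a prediction of 350 degrees for a source at 10 degrees counts as a 20 degree error, not 340.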
The paper provides detailed descriptions of the network architecture, loss functions, and experimental settings (STFT parameters, optimizer, hyperparameters). The code is made available on GitHub. The use of standard datasets (GSC, AV16.3, SLoClas, AFPILD) facilitates reproduction. The description of the SSCL masking strategies and the hybrid pipeline is clear.
The method relies on a neural network to estimate the covariance matrix, so it may still suffer from distribution shift when the acoustic environment or array geometry differs significantly from the training data, although the paper shows some resilience. On AFPILD, the error is the lowest among the compared methods but remains relatively high (10.24 degrees), indicating that footstep sounds are still challenging. Cross-array generalization, while better than that of pure DL methods, is imperfect and degrades under large geometry mismatches. The paper acknowledges these issues in its limitations section.
This work contributes to the field of robot audition and spatial audio processing. By providing a robust, data-efficient, and interpretable solution for sound source localization, it enables more reliable autonomous robots in dynamic environments. The hybrid approach offers a template for integrating physical priors with deep learning in other signal processing tasks. The self-supervised learning strategy is broadly applicable to other domains with limited labeled data.