We introduce MMAE, a Massive Multitask Audio Editing benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. Spurred by the shift toward intelligent creation, interactive editing has rapidly expanded from visual domains, pioneered by models like Nano-banana 2 for images and Gemini-Omni for video, into audio. However, the current evaluation infrastructure lags severely, remaining highly fragmented and restricted to specific subdomains or basic operations. Unlike existing benchmarks that are limited in scope, MMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. Furthermore, we establish a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. Meticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. Our extensive evaluation of leading models reveals that current systems remain far from achieving reliable edits. Strikingly, the Exact Match Rate (EMR) consistently falls below 5% and plummets to an absolute 0% in complex, mixed-modality tasks, exposing critical bottlenecks in precise execution and structural robustness. We hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems.
Primary: Shanghai Jiao Tong University
All Institutions: Shanghai Jiao Tong University, Shanghai Innovation Institute, Nanyang Technological University, Hunyuan Team, Tencent, Tianjin University, ZODA, Peking University, Fudan University
The paper benchmarks five leading audio editing models (Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, SmartDJ) along with two baselines (Identity, Noise). The inclusion of baselines provides essential context for interpreting model performance. A notable detail is the acknowledgment of input length constraints for some models, leading to evaluation on a subset of samples (801 out of 2,000) for MMEdit, Audio-Omni, and SmartDJ, while others are evaluated on the full set. This pragmatic approach ensures fair comparison given current model limitations. The results are striking and provide clear diagnostic insights. The consistently low Exact Match Rate (EMR) across all models (below 5%, and 0% for complex mixed-modality tasks) unequivocally demonstrates that current systems are far from achieving reliable, flawless edits. This highlights a significant challenge for the field. The detailed analysis by task complexity and modality reveals universal performance degradation for more complex and mixed-modality tasks, exposing a lack of structural robustness and cross-domain synchronization. The observed trade-off between Instruction Following Rate (IF
MMAE introduces a meticulously designed benchmark for instruction-based audio editing, addressing a critical gap in the field. The core methodology revolves around a comprehensive, multi-dimensional taxonomy and a novel rubric-based evaluation paradigm. The taxonomy is a strong point, categorizing tasks along three orthogonal dimensions: Modality (7 types, including mixed modalities like sound-music-speech), Complexity (6 levels from 'Single' to 'Multi-hop' and 'Multi-round'), and Operation (8 types, local and global edits). This systematic classification allows for fine-grained analysis of model capabilities across a vast spectrum of real-world scenarios, moving beyond the fragmented and restricted scope of previous benchmarks. The distributions across these dimensions are presented, indicating a diverse dataset. The rubric-based evaluation paradigm is particularly innovative and robust for open-ended generative tasks. It decomposes free-form editing tasks into 17,741 atomic, verifiable criteria, assessed along two crucial dimensions: Instruction Following (IF) and Consistency (CR). The four principles guiding rubric design—Completeness, Atomicity, Orthogonality, and Objectivity—are well-articulated and essential for ensuring the reliability and interpretability of the evaluation. Using an external, high-performance audio language model (Qwen3-Omni) as a judger, with multiple queries and majority voting, adds a layer of objectivity and scalability that traditional human MOS ratings often lack for such complex tasks. The Exact Match Rate (EMR) serves as a stringent, holistic metric for perfect execution. The data curation pipeline is rigorous, involving five stages from expert brainstorming to iterative quality inspection. The human-agent collaborative annotation workflow, leveraging an agentic pipeline (Omni-Detective) for detailed audio captions and LLMs for initial rubric drafts, followed by human refinement, is a pragmatic approach to balance efficiency and quality for a large-scale, complex dataset. This hybrid approach is crucial for generating high-fidelity samples and rubrics.
The paper benchmarks five leading audio editing models (Step-Audio-EditX, Ming-UniAudio, MMEdit, Audio-Omni, SmartDJ) along with two baselines (Identity, Noise). The inclusion of baselines provides essential context for interpreting model performance. A notable detail is the acknowledgment of input length constraints for some models, leading to evaluation on a subset of samples (801 out of 2,000) for MMEdit, Audio-Omni, and SmartDJ, while others are evaluated on the full set. This pragmatic approach ensures fair comparison given current model limitations. The results are striking and provide clear diagnostic insights. The consistently low Exact Match Rate (EMR) across all models (below 5%, and 0% for complex mixed-modality tasks) unequivocally demonstrates that current systems are far from achieving reliable, flawless edits. This highlights a significant challenge for the field. The detailed analysis by task complexity and modality reveals universal performance degradation for more complex and mixed-modality tasks, exposing a lack of structural robustness and cross-domain synchronization. The observed trade-off between Instruction Following Rate (IF
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.
Primary: Johannes Kepler University
All Institutions: Johannes Kepler University, Sony Computer Science Laboratories
The experimental evaluation is comprehensive and well-structured, providing both objective and subjective evidence for LiveBand's superiority. **Datasets and Metrics**: Training and evaluation on Slakh2100, a standard multi-instrument dataset, ensures comparability. The use of an internal corpus for one variant (LiveBand$_int$) adds a dimension of real-world data. The chosen objective metrics (FAD$_vgg/CLAP$, BA$_F1$, COCOLA) are highly relevant for assessing audio quality, rhythmic accuracy, and mix adherence in music generation. Evaluating over two 10-second segments to measure drift is particularly insightful for streaming models. **Baselines**: Comparing against StreamMusicGen (SMG), a recent and relevant causal autoregressive model, provides a strong benchmark. The inclusion of a bidirectional upper bound (LiveBand$_bid$) and ground truth offers valuable context for potential performance ceilings. **Results**: LiveBand consistently outperforms SMG across all objective metrics (FAD, BA, COCOLA) and anticipation settings (0s, 0.1s, 1s). This is a significant finding, especially the ability to maintain strong performance even with 1-second anticipation, demonstrating that the model can generate meaningfully ahead of the
LiveBand proposes a robust and well-thought-out methodology for real-time, causal music accompaniment generation. The core innovation lies in training a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision. This design choice directly addresses the critical challenges of real-time systems: strict causality, low latency, and maintaining musical coherence without future lookahead. A key strength is the elimination of teacher forcing and its associated exposure bias. By feeding the generator only independent per-step Gaussian noise and causally available mix context, the training and inference conditions are matched by design. This allows for parallel computation during training while ensuring that the model's internal state accurately reflects its own generated history during autoregressive inference, avoiding the drift common in teacher-forced models. This is a significant improvement over prior autoregressive approaches that struggle with error accumulation. The use of a pre-trained *causal* audio autoencoder (a retrained CoDiCodec variant) is crucial, as it ensures that the latent representations themselves respect causality, which is fundamental to the real-time constraint. Operating in a continuous latent space, as opposed to discrete tokens, likely contributes to the high-fidelity claim and potentially smoother generation. Adversarial sequence-level supervision from a non-causal 1D convolutional discriminator is another strong methodological choice. This encourages the generator to produce holistically realistic and coherent sequences, rather than focusing on frame-level accuracy which can be overly rigid for musical tasks requiring continuous realignment. The discriminator's non-causal nature is justified by its training-only role, allowing it to provide a richer signal. Finally, the Adaptive Gradient Penalty (AdaGP) is a practical and elegant solution to stabilize adversarial training. By dynamically adjusting the gradient penalty weight to maintain a fixed discriminator advantage, it reduces the burden of hyperparameter tuning and improves training robustness, a common pain point in GANs. The generator's architecture, a pre-norm transformer with adaptive layer normalization, query-key normalization, Rotary Position Embeddings, and FlexAttention with scalar attention sink, incorporates modern best practices for stable and efficient transformer training.
The experimental evaluation is comprehensive and well-structured, providing both objective and subjective evidence for LiveBand's superiority. **Datasets and Metrics**: Training and evaluation on Slakh2100, a standard multi-instrument dataset, ensures comparability. The use of an internal corpus for one variant (LiveBand$_int$) adds a dimension of real-world data. The chosen objective metrics (FAD$_vgg/CLAP$, BA$_F1$, COCOLA) are highly relevant for assessing audio quality, rhythmic accuracy, and mix adherence in music generation. Evaluating over two 10-second segments to measure drift is particularly insightful for streaming models. **Baselines**: Comparing against StreamMusicGen (SMG), a recent and relevant causal autoregressive model, provides a strong benchmark. The inclusion of a bidirectional upper bound (LiveBand$_bid$) and ground truth offers valuable context for potential performance ceilings. **Results**: LiveBand consistently outperforms SMG across all objective metrics (FAD, BA, COCOLA) and anticipation settings (0s, 0.1s, 1s). This is a significant finding, especially the ability to maintain strong performance even with 1-second anticipation, demonstrating that the model can generate meaningfully ahead of the
Language models increasingly serve as the backbone of text-to-speech (TTS) systems, yet we understand little about the representations they build when text and generated speech tokens share a single residual stream. We train BatchTopK sparse autoencoders on the LM backbone of CosyVoice3 and introduce a modality-aware auto-interp pipeline that labels each feature from where it fires-text-prefix context, 1-second speech clips, or both. The recovered features are interpretable, spanning phonemes, laughter, accent prompts and speaker gender. Steering through the SAE latent space shows these features are causal rather than merely descriptive: targeted interventions raise laughter probability from 0.02 to 0.79, flip perceived speaker gender, and control speech rate while preserving spoken content. SAE features thus serve both as interpretability objects and as control directions for TTS synthesis.
Primary: T-Tech
All Institutions: T-Tech, AI Foundation and Algorithm Lab
The experimental evaluation is comprehensive and well-structured, providing both quantitative and qualitative insights. 1. **Layer Sweep Analysis**: Figure 1 effectively illustrates the shift in feature modality composition and explained variance across layers. This provides valuable insights into the LM's internal processing, showing a transition from mixed/text-heavy to audio-heavy features in middle layers, and a surprising reversion to text-modal in the final hidden state. This analysis is novel for TTS LMs. 2. **Auto-Interp Quality**: Figure 2 provides quantitative evaluation of the auto-interpretation quality, showing text-modal labels are most verifiable (AUROC 0.921), followed by audio-modal (0.653), and mixed (0.558). While the mixed feature scores are lower, they are still above chance, and the qualitative examples (Table 1) show that even some mixed features are clean. 3. **Qualitative Examples**: Table 1 presents compelling examples of interpretable features, spanning phonemes, laughter, accent prompts, speaker gender, words, and even sub-lexical patterns. These examples strongly support the claim of interpretability. 4. **Feature Steering**: This is the highlight of the results. The paper demonstrates *causal control* over three distinct speech properties: * Laughter probability (0.015 to 0.791). * Perceived speaker gender (wav2vec2 P(male) from 0.629 to 0.944 or 0.063). * Speech rate (voiced duration from 3.96s to 10.57s or 2.75s). These results are highly impactful, showing that the identified features are not just descriptive but functional. The separation of gender steering by prompt speaker's original gender (Figure 4) adds further robustness. 5. **Concept Probing Experiments**: The appendix details supervised probes for laughter, emotion, and accent, showing that these concepts are linearly decodable early in the network and that SAE-latent probes closely track raw
The methodology is a well-executed adaptation of established mechanistic interpretability techniques (Sparse Autoencoders, LLM-based auto-interpretation) to a novel and challenging domain: the residual stream of a generative Text-to-Speech (TTS) Language Model. The core innovation lies in the "modality-aware auto-interp pipeline." This involves: 1. **SAE Training**: Training BatchTopK SAEs on the Qwen2.5-0.5B backbone of CosyVoice3, a standard and robust approach for decomposing activations. The choice of dictionary size and sparsity (d=16384, k=50) is reasonable for a model of this size. 2. **Evidence Extraction**: A crucial step that correctly identifies whether an activation occurs in the text-prefix or speech-token segment. This allows for modality-specific evidence (marked text window vs. 1-second audio clip). 3. **Feature Modality Tagging**: A simple yet effective heuristic (speech fraction > 0.8 for audio-modal, < 0.2 for text-modal, otherwise mixed) to categorize features. This is essential for guiding the auto-interpretation process. 4. **Automatic Labeling**: Using Gemini 3.0 Pro with modality-aware prompts is a clever way to leverage powerful LLMs for interpretation. The prompts are carefully designed to elicit specific descriptions based on the evidence type. 5. **Detection-Style Evaluation**: Adapting the protocol from text-only to mixed text/audio evidence, using rank-held-out examples and AUROC/balanced accuracy, provides a quantitative measure of label quality. 6. **SAE Feature Steering**: This is the most impactful methodological contribution. Instead of directly adding residual vectors, the intervention occurs in the SAE latent space, modifying selected feature activations and then decoding back to the residual stream. This demonstrates the *causal* role of the identified features, moving beyond mere correlation. The intervention mechanism is clearly described, ensuring locality to the SAE feature subspace.
The experimental evaluation is comprehensive and well-structured, providing both quantitative and qualitative insights. 1. **Layer Sweep Analysis**: Figure 1 effectively illustrates the shift in feature modality composition and explained variance across layers. This provides valuable insights into the LM's internal processing, showing a transition from mixed/text-heavy to audio-heavy features in middle layers, and a surprising reversion to text-modal in the final hidden state. This analysis is novel for TTS LMs. 2. **Auto-Interp Quality**: Figure 2 provides quantitative evaluation of the auto-interpretation quality, showing text-modal labels are most verifiable (AUROC 0.921), followed by audio-modal (0.653), and mixed (0.558). While the mixed feature scores are lower, they are still above chance, and the qualitative examples (Table 1) show that even some mixed features are clean. 3. **Qualitative Examples**: Table 1 presents compelling examples of interpretable features, spanning phonemes, laughter, accent prompts, speaker gender, words, and even sub-lexical patterns. These examples strongly support the claim of interpretability. 4. **Feature Steering**: This is the highlight of the results. The paper demonstrates *causal control* over three distinct speech properties: * Laughter probability (0.015 to 0.791). * Perceived speaker gender (wav2vec2 P(male) from 0.629 to 0.944 or 0.063). * Speech rate (voiced duration from 3.96s to 10.57s or 2.75s). These results are highly impactful, showing that the identified features are not just descriptive but functional. The separation of gender steering by prompt speaker's original gender (Figure 4) adds further robustness. 5. **Concept Probing Experiments**: The appendix details supervised probes for laughter, emotion, and accent, showing that these concepts are linearly decodable early in the network and that SAE-latent probes closely track raw