ECCV 2026

Don't Settle at the Mode!

Mitigating diversity collapse in pretrained flow models
via feature self-guidance.

Pradhaan S Bhat1*, Rishubh Parihar1*, Abhijnya Bhat2, R. Venkatesh Babu1

1 Indian Institute of Science  ·  2 Stanford University

* Equal contribution

Fig. 1. I.I.D. sampling collapses to similar samples; Group Inference adds diversity at high cost (4.6s); our feature self-guidance recovers diversity in style, color, layout, and identity at near-I.I.D. latency (1.7s vs 1.59s).

Abstract

State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning.

Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead.

In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions.

Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.

Motivation

Diversity collapse is a feature-space problem.

We hypothesize that FLUX.1 generates near-identical samples for the same prompt because its internal DiT features ht collapse. We test this hypothesis by perturbing the DiT features with Gaussian noise and find that sample diversity increases: feature distance drives sample diversity! However, naive perturbation comes at the cost of noisy artifacts, because it drives features into low-density regions.

Fig. 2a. Perturbing MMDiT features (ht) by injecting Gaussian noise increases their pairwise distance.
Fig. 2b. Perturbed features (ht) lead to more diverse samples, whereas perturbing latents (xt) corrupts the image.
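The trend in Fig. 2a can be illustrated on toy data. The sketch below is not the authors' experiment: the feature dimension, batch size, noise scale, and the way "collapsed" features are simulated are all assumptions chosen for illustration. It only shows that adding Gaussian noise to nearly identical vectors raises their mean pairwise distance.

```python
import math
import random


def mean_pairwise_distance(features):
    """Average Euclidean distance over all unordered pairs of feature vectors."""
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(features[i], features[j])
    return total / (n * (n - 1) / 2)


def perturb(features, sigma, rng):
    """Add independent Gaussian noise with standard deviation sigma to every entry."""
    return [[x + rng.gauss(0.0, sigma) for x in h] for h in features]


rng = random.Random(0)
dim, batch = 64, 4  # toy sizes (assumption), not the real DiT feature shape

# Stand-in for collapsed features: one shared vector plus tiny per-sample jitter.
shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
collapsed = [[x + rng.gauss(0.0, 0.01) for x in shared] for _ in range(batch)]

noisy = perturb(collapsed, sigma=0.5, rng=rng)

d_before = mean_pairwise_distance(collapsed)
d_after = mean_pairwise_distance(noisy)
print(f"before: {d_before:.3f}  after: {d_after:.3f}")
```

The sketch says nothing about image quality. Unstructured noise of this kind is what pushes features into low-density regions, which is the failure mode the method is designed to avoid.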
Method

Disperse, refine, blend.

Three key operations run inside a single MMDiT block during the initial denoising steps. The rest of the model remains unchanged.

Fig. 3. Overview of the Disperse-and-Refine module. We disperse the MMDiT features of block B2 during the early denoising window t ∈ [1.0, 0.8]. To ensure that the dispersed features remain on the manifold, we incorporate a refinement step that regularizes the features, preventing them from drifting into low-probability regions. Finally, we linearly interpolate between dispersed and refined features to obtain a good diversity-faithfulness tradeoff.
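As a rough illustration of the early window t ∈ [1.0, 0.8], the sketch below lists which sampler steps would apply the module. The uniform timestep schedule, the step count of 20, and the helper name `in_dispersion_window` are assumptions made for this example; an actual sampler may space its timesteps differently.

```python
def in_dispersion_window(t, t_start=1.0, t_end=0.8, eps=1e-9):
    """True when timestep t lies in the early window [t_end, t_start]."""
    return t_end - eps <= t <= t_start + eps


num_steps = 20  # hypothetical step count
# Assumed uniform schedule running from t = 1.0 (noise) toward t = 0.0 (data).
timesteps = [1.0 - k / num_steps for k in range(num_steps)]

# Indices of the steps in which disperse, refine, and blend would be applied.
active = [k for k, t in enumerate(timesteps) if in_dispersion_window(t)]
print(active)  # [0, 1, 2, 3, 4]
```

Under these assumptions only the first quarter of the steps is touched, which is consistent with the small latency overhead reported for the method.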
01

Disperse

For a batch of N latents, take their intermediate features at MMDiT block B2. Push the features apart using iterative self-guidance to expand sample variety.

Operates on features. No backprop through blocks.
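One way to picture the dispersion step is gradient ascent on a repulsion objective computed directly on the batch of features. The objective used below (the sum of squared pairwise distances), the step size, and the iteration count are assumptions for illustration, not the loss from the paper. For this objective the gradient with respect to each feature is proportional to its offset from the batch mean, so no backpropagation through the model is needed.

```python
import math


def mean_pairwise_distance(features):
    """Average Euclidean distance over all unordered pairs of feature vectors."""
    n = len(features)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(features[i], features[j])
    return total / (n * (n - 1) / 2)


def disperse(features, step_size=0.1, num_iters=5):
    """Iteratively push batch features apart (hypothetical repulsion objective).

    The gradient of sum_{i<j} ||h_i - h_j||^2 with respect to h_i is
    2 * N * (h_i - mean); the constant 2 * N is absorbed into step_size.
    """
    feats = [list(h) for h in features]
    n, dim = len(feats), len(feats[0])
    for _ in range(num_iters):
        mean = [sum(h[d] for h in feats) / n for d in range(dim)]
        feats = [
            [h[d] + step_size * (h[d] - mean[d]) for d in range(dim)]
            for h in feats
        ]
    return feats


# Toy batch of N = 4 two-dimensional "features" that sit close together.
batch = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [0.0, 0.0]]
dispersed = disperse(batch)

d_before = mean_pairwise_distance(batch)
d_after = mean_pairwise_distance(dispersed)
print(f"before: {d_before:.4f}  after: {d_after:.4f}")
```

With this particular objective each iteration scales every offset from the batch mean by (1 + step_size), so the batch mean stays fixed while the spread grows.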

02

Refine

Re-process the dispersed features through the same MMDiT block. This projects them back onto the conditional feature manifold, undoing the off-manifold drift introduced by raw dispersion.

Reuses existing block weights. No extra parameters.

03

Blend

Linearly interpolate the dispersed (ht) and refined (h̃t) features with a single knob β. Low β favors raw diversity; high β promotes strong regularization and increases prompt adherence. We default to β = 0.4.

One user-facing hyperparameter.
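The refine and blend steps can be sketched together. The interpolation direction follows the description above (β = 0 keeps the dispersed features, β = 1 keeps the refined ones). The `toy_block` function is a hypothetical stand-in for an MMDiT block; a real block would also consume the conditioning tokens and timestep.

```python
def refine(dispersed, block):
    """Re-run the dispersed features through the same block (no new parameters)."""
    return [block(h) for h in dispersed]


def blend(dispersed, refined, beta=0.4):
    """Linear interpolation: beta = 0 keeps dispersed features, beta = 1 keeps refined ones."""
    return [
        [(1.0 - beta) * d + beta * r for d, r in zip(h_d, h_r)]
        for h_d, h_r in zip(dispersed, refined)
    ]


def toy_block(h):
    """Hypothetical stand-in for an MMDiT block: shrinks features toward the origin."""
    return [0.5 * x for x in h]


dispersed = [[2.0, -4.0], [-2.0, 4.0]]
refined = refine(dispersed, toy_block)
mixed = blend(dispersed, refined, beta=0.4)
print(mixed)
```

Because the blend is a convex combination, each output feature lies on the segment between its dispersed and refined versions, which is why a single β is enough to trade diversity against prompt adherence.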

Results

One Framework, Many Models & Conditions!

Comparison

Side by side against baselines.


Two prompts (“a photo of a bear” and “a white dog and a blue potted plant”) have full coverage across all baselines. The remaining prompts are compared against the strongest reward-based method, Group Inference, which selects the best N samples out of M candidates.

Quantitative

High Diversity at Near-I.I.D. Cost.

We match or exceed baselines on diversity with I.I.D.-comparable prompt adherence, at just 1.07× I.I.D. latency.
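The 1.07× figure is consistent with the latencies quoted in Fig. 1. The short check below uses only those three numbers; the variable names are chosen for illustration.

```python
# Latencies quoted in Fig. 1, in seconds.
iid_s, ours_s, group_s = 1.59, 1.7, 4.6

ours_vs_iid = round(ours_s / iid_s, 2)
group_vs_iid = round(group_s / iid_s, 2)
group_vs_ours = round(group_s / ours_s, 2)

print(ours_vs_iid)    # 1.07
print(group_vs_iid)   # 2.89
print(group_vs_ours)  # 2.71
```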

Fig. 4. Diversity vs. sample throughput across methods. Bubble size is peak VRAM. Our method sits closest to the ideal top-right corner in terms of diversity and sample throughput, with I.I.D.-level VRAM consumption.
Fig. 5. Diversity vs. prompt adherence tradeoff across baselines. Our method achieves much higher diversity with good prompt adherence compared to most of the baselines. Notably, Group Inference is several times slower than our method.
BibTeX

Cite this work.

If you found our work useful, please consider citing us.

@misc{bhat2026dontsettlemodemitigating,
      title={Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance}, 
      author={Pradhaan S Bhat and Rishubh Parihar and Abhijnya Bhat and R. Venkatesh Babu},
      year={2026},
      eprint={2606.27371},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.27371}, 
}