Don't Settle at the Mode! — Feature Self-Guidance for Diverse Flow Models

Fig. 1. I.I.D. sampling collapses to similar samples; Group Inference adds diversity at high cost (4.6s); our feature self-guidance recovers diversity in style, color, layout, and identity at near-I.I.D. latency (1.7s vs 1.59s).

Abstract

State-of-the-art flow models generate stunning images from text or image prompts. However, they suffer from diversity collapse when generating multiple samples under the same conditioning.

Existing methods address this issue via either latent guidance, which has limited effectiveness, or sample selection, which relies on external reward models that incur significant inference-time overhead.

In this work, we introduce an efficient, training-free self-guidance mechanism to mitigate diversity collapse without requiring additional reward models. Specifically, we disperse the internal features of the flow model during batch generation with feature self-guidance. Further, to keep the features close to the manifold, we introduce a manifold regularization step that projects these dispersed features back onto the data manifold, ensuring diverse generation without sacrificing alignment with the input conditions.

Our method integrates seamlessly as a plug-and-play module into pretrained flow models, adding only a marginal inference cost. Experiments demonstrate significant improvements in diversity while preserving fidelity across several conditional flow models, including multi-step and few-step text-to-image, depth-to-image, and reference image generation.

Motivation

Diversity collapse is a feature-space problem.

FLUX.1 generates near-identical samples for the same prompt because its internal DiT features h_t collapse. We test this hypothesis by perturbing the DiT features with Gaussian noise and find that doing so increases sample diversity: feature distance drives sample diversity. However, naive perturbation introduces noisy artifacts, because it drives the features into low-density regions.

**Fig. 2a.** Perturbing MMDiT features (h_t) by injecting Gaussian noise increases their pairwise distance.

**Fig. 2b.** Perturbed features (h_t) lead to more diverse samples; perturbing latents (x_t) causes image corruption.
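The measurement behind Fig. 2a can be illustrated with a small toy sketch. It is not the authors' evaluation code: the feature vectors are random stand-ins rather than real MMDiT activations, and the helper names (`mean_pairwise_distance`, `perturb`), the feature dimension, and the noise scales are all assumptions chosen for illustration.

```python
import math
import random


def mean_pairwise_distance(feats):
    """Average Euclidean distance over all unordered pairs of feature vectors."""
    n = len(feats)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += math.dist(feats[i], feats[j])
    return total / (n * (n - 1) / 2)


def perturb(feats, sigma, rng):
    """Add i.i.d. Gaussian noise with standard deviation sigma to every coordinate."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in feats]


rng = random.Random(0)

# Toy "collapsed" batch: 4 samples whose 64-dim features sit close to one point.
center = [rng.gauss(0.0, 1.0) for _ in range(64)]
collapsed = [[c + rng.gauss(0.0, 0.01) for c in center] for _ in range(4)]

# Hypothetical noise scale; the page does not state the value used.
noisy = perturb(collapsed, sigma=0.5, rng=rng)

d_before = mean_pairwise_distance(collapsed)
d_after = mean_pairwise_distance(noisy)
print(f"mean pairwise distance: {d_before:.3f} -> {d_after:.3f}")
```

The sketch only shows that noise injection raises pairwise feature distance. It says nothing about image quality, which is where naive perturbation falls short according to the text above.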

Method

Disperse, refine, blend.

Three operations inside a single MMDiT block, applied only in the initial denoising steps. The rest of the model remains unchanged.

**Fig. 3.** Overview of the Disperse-and-Refine module. We disperse the MMDiT features of block B₂ during the early denoising window `t ∈ [1.0, 0.8]`. To ensure that the dispersed features remain on the manifold, we incorporate a refinement step that regularizes the features, preventing them from drifting into low-probability regions. Finally, we linearly interpolate between dispersed and refined features to obtain a good diversity-faithfulness tradeoff.

01

Disperse

For a batch of N latents, take their intermediate features at MMDiT block B₂. Push the features apart using iterative self-guidance to expand sample variety.

Operates on features. No backprop through blocks.
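As a rough illustration of the dispersion step, the sketch below pushes a batch of toy feature vectors apart by gradient ascent on the sum of pairwise squared distances. This objective is an assumption made for the example; the page does not specify the guidance objective, and the function name `disperse`, the step size, and the iteration count are hypothetical. The gradient is written in closed form on the features themselves, which is consistent with the note that nothing is backpropagated through the blocks.

```python
import math
import random


def mean_pairwise_distance(feats):
    """Average Euclidean distance over all unordered pairs of rows."""
    n = len(feats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(math.dist(feats[i], feats[j]) for i, j in pairs) / len(pairs)


def disperse(feats, step=0.1, iters=3):
    """Gradient ascent on the sum of pairwise squared distances (assumed objective).

    The gradient with respect to h_i is 2 * sum_j (h_i - h_j). It is computed
    directly from the features, so no network block is differentiated.
    """
    n, dim = len(feats), len(feats[0])
    h = [row[:] for row in feats]  # work on a copy, leave the input untouched
    for _ in range(iters):
        grads = []
        for i in range(n):
            g = [0.0] * dim
            for j in range(n):
                if j != i:
                    for k in range(dim):
                        g[k] += 2.0 * (h[i][k] - h[j][k])
            grads.append(g)
        # Divide by n so the effective step does not depend on the batch size.
        h = [[h[i][k] + step * grads[i][k] / n for k in range(dim)] for i in range(n)]
    return h


rng = random.Random(0)
batch = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]  # N = 4 toy features
dispersed = disperse(batch)
ratio = mean_pairwise_distance(dispersed) / mean_pairwise_distance(batch)
print(f"pairwise distance grew by a factor of {ratio:.3f}")
```

With this particular objective, each iteration moves every feature away from the batch mean and scales all pairwise distances by (1 + 2·step), so three iterations at step 0.1 give a factor of 1.2³ = 1.728. A kernel-weighted repulsion would behave differently; the quadratic form is used here only because it is easy to verify.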

02

Refine

Re-process the dispersed features through the same MMDiT block. This projects them back onto the conditional feature manifold, undoing the off-manifold drift introduced by raw dispersion.

Reuses existing block weights. No extra parameters.

03

Blend

Linearly interpolate the dispersed (h_t) and refined (h̃_t) features with a single knob β. Low β favors raw diversity; high β promotes strong regularization and increases prompt adherence. We default to β = 0.4.

One user-facing hyperparameter.
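The refine and blend steps can be sketched together on toy data. Everything here is illustrative: `toy_block` is a hypothetical stand-in for an MMDiT block (it rescales a vector to unit norm, so the unit sphere plays the role of the feature manifold), and the interpolation direction is read off the description above, with β = 0 keeping the dispersed features and β = 1 keeping the refined ones.

```python
import math


def toy_block(h):
    """Hypothetical stand-in for an MMDiT block: rescale a vector to unit norm."""
    norm = math.sqrt(sum(x * x for x in h))
    return [x / norm for x in h]


def refine(dispersed, block):
    """Re-process each dispersed feature through the same block (no new weights)."""
    return [block(h) for h in dispersed]


def blend(dispersed, refined, beta=0.4):
    """Linear interpolation: beta=0 keeps dispersed features, beta=1 keeps refined ones."""
    return [
        [(1.0 - beta) * d + beta * r for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(dispersed, refined)
    ]


dispersed = [[2.0, 0.0], [0.0, -3.0]]   # toy features pushed off the unit sphere
refined = refine(dispersed, toy_block)   # [[1.0, 0.0], [0.0, -1.0]]
mixed = blend(dispersed, refined, beta=0.4)
print(mixed)
```

In this toy run the blended features land between the dispersed and refined points, at approximately [[1.6, 0.0], [0.0, -2.2]]. Raising β pulls them closer to the sphere, mirroring the stated trade-off between raw diversity and regularization.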

Results

One Framework, Many Models & Conditions!

Select a model and a prompt to compare our samples with i.i.d. samples. Hover over images to enlarge.

Comparison

Side by side against baselines.

Select a prompt at the top, then a baseline at the bottom, and compare our samples across baselines.

Two prompts (“a photo of a bear” and “a white dog and a blue potted plant”) have full coverage across all baselines. The remaining prompts are compared only against the strongest reward-based method, Group Inference, which selects the best N samples out of M candidates.

Quantitative

High Diversity at Near-I.I.D. Cost.

Our method matches or exceeds the baselines on diversity while keeping prompt adherence comparable to I.I.D. sampling, at just 1.07× I.I.D. latency.
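The 1.07× figure is consistent with the timings quoted in the Fig. 1 caption, as the short check below shows. The variable names are ours, and the caption does not state what the seconds are measured over (per image or per batch), so the ratios should be read only as relative costs.

```python
# Timings in seconds, copied from the Fig. 1 caption.
iid_s = 1.59      # I.I.D. sampling
ours_s = 1.7      # feature self-guidance
group_s = 4.6     # Group Inference

ours_vs_iid = round(ours_s / iid_s, 2)
group_vs_iid = round(group_s / iid_s, 2)
group_vs_ours = round(group_s / ours_s, 2)

print(ours_vs_iid)    # 1.07
print(group_vs_iid)   # 2.89
print(group_vs_ours)  # 2.71
```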

**Fig. 4.** Diversity vs. sample throughput across methods. Bubble size indicates peak VRAM. Our method sits closest to the ideal top-right corner (high diversity, high throughput) while using I.I.D.-level VRAM.

**Fig. 5.** Diversity vs. prompt adherence trade-off across baselines. Our method achieves much higher diversity than most baselines while maintaining good prompt adherence. Notably, Group Inference is several times slower than our method.

BibTeX

Cite this work.

If you found our work useful, please consider citing us.

@misc{bhat2026dontsettlemodemitigating,
      title={Don't Settle at the Mode! Mitigating Diversity Collapse in Pretrained Flow Models via Feature Self-Guidance}, 
      author={Pradhaan S Bhat and Rishubh Parihar and Abhijnya Bhat and R. Venkatesh Babu},
      year={2026},
      eprint={2606.27371},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.27371}, 
}