Scaling Group Inference for Diverse and High-Quality Generation

Carnegie Mellon University
Snap Research
Teaser for diverse generation.

TL;DR: Treat outputs as a set; formulate group inference as a Quadratic Integer Programming (QIP) problem; scale efficiently with progressive pruning.


Abstract: Generative models typically sample outputs independently, and recent inference-time guidance and scaling algorithms focus on improving the quality of individual samples. However, in real-world applications, users are often presented with a set of multiple images (e.g., 4-8) for each prompt, where independent sampling tends to lead to redundant results, limiting user choices and hindering idea exploration. In this work, we introduce a scalable group inference method that improves both the diversity and quality of a group of samples. We formulate group inference as a quadratic integer assignment problem: candidate outputs are modeled as graph nodes, and a subset is selected to optimize sample quality (unary term) while maximizing group diversity (binary term). To substantially improve runtime efficiency, we use intermediate predictions of the final sample at each step to progressively prune the candidate set, allowing our method to scale up efficiently to large input candidate sets. Extensive experiments show that our method significantly improves group diversity and quality compared to independent sampling baselines and recent inference algorithms. Our framework generalizes across a wide range of tasks, including text-to-image, image-to-image, image prompting, and video generation, enabling generative models to treat multiple outputs as cohesive groups rather than independent samples.
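One way to write the group-selection objective described above, in our own notation (the trade-off weight λ and the exact form are our assumptions, not quoted from the paper): introduce binary variables x_i indicating whether candidate i is kept, let q_i be its quality score and d_ij the pairwise distance between candidates i and j. The selection then reads

\max_{x \in \{0,1\}^{M}} \; \sum_{i=1}^{M} q_i\, x_i \;+\; \lambda \sum_{i<j} d_{ij}\, x_i x_j
\qquad \text{subject to} \qquad \sum_{i=1}^{M} x_i = K.

The linear (unary) term rewards per-sample quality, while the quadratic (binary) term rewards spreading the selected samples apart; the binary variables and the quadratic interaction are what make this a quadratic integer program.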


Method Overview

Group inference overview. Starting from a large pool of M candidate noises, we gradually shrink the candidate set through iterative denoising and pruning. At each step, we use the diffusion model to denoise the remaining candidates, compute a quality score for each candidate (unary term) and pairwise distances between candidates (binary term), and solve a quadratic integer programming (QIP) problem to prune the set. This ultimately yields a small final group of K diverse, high-quality outputs.
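A minimal sketch of the selection and pruning steps in Python, assuming precomputed quality scores and a pairwise distance matrix. The paper poses the selection as a QIP; here a greedy marginal-gain heuristic stands in for the actual solver so the example stays self-contained, and the function names, the trade-off weight lam, and the pruning schedule are all our assumptions.

import numpy as np

def select_group(quality, dist, k, lam=1.0):
    # quality: (M,) per-candidate quality scores (unary term).
    # dist:    (M, M) pairwise distances (binary term); larger = more diverse.
    # Greedily add the candidate with the best marginal gain: its quality
    # plus lam times its total distance to everything already chosen.
    chosen = [int(np.argmax(quality))]
    remaining = set(range(len(quality))) - set(chosen)
    while len(chosen) < k:
        best = max(remaining,
                   key=lambda j: quality[j] + lam * dist[j, chosen].sum())
        chosen.append(best)
        remaining.remove(best)
    return np.array(chosen)

def progressive_prune(quality, dist, sizes):
    # Shrink the candidate pool through a schedule of sizes, re-solving
    # the selection at each stage. Scores are held fixed here for brevity;
    # the actual method recomputes them from fresh intermediate previews
    # after every denoising step.
    keep = np.arange(len(quality))
    for m in sizes:
        sel = select_group(quality[keep], dist[np.ix_(keep, keep)], m)
        keep = keep[sel]
    return keep

For example, progressive_prune(quality, dist, sizes=[32, 16, 8, 4]) shrinks a pool of M = 64 candidates to a final group of K = 4.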

Reliability of Intermediate Previews

Correlation between intermediate and final generation scores. The left side shows the FLUX.1 Dev model generating an image from the prompt "A photo of a horse". We visualize how the image evolves during the reverse diffusion process, showing intermediate predictions at different steps. Notice how these intermediate previews closely resemble the final output.
The right side quantifies this relationship by plotting Spearman correlations between scores computed at intermediate steps and scores of the final outputs. Both the unary and binary metrics show strong correlations that quickly approach 1.0, even early in the generation process. This demonstrates that we can effectively use intermediate predictions to identify promising samples early on.
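As a concrete sketch of the right-hand analysis (the array shapes and the function name are our assumptions), the per-step rank correlations can be computed with SciPy:

from scipy.stats import spearmanr

def preview_reliability(step_scores, final_scores):
    # step_scores : list of (M,) arrays, one per denoising step, scoring
    #               each candidate's intermediate preview of the final image.
    # final_scores: (M,) array scoring the fully denoised outputs.
    # Returns one Spearman rho per step; values near 1.0 mean the
    # intermediate ranking already predicts the final ranking.
    rhos = []
    for scores in step_scores:
        rho, _ = spearmanr(scores, final_scores)
        rhos.append(rho)
    return rhos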

Gallery of Results

Gallery. Qualitative results showing the advantage of our group inference method over I.I.D. sampling for text-to-image and depth-to-image generation. The top row shows results with FLUX.1 Schnell, the second row uses FLUX.1 Dev, and the last two rows use FLUX.1 Depth as the base model.

Depth-to-image results. Additional qualitative results using FLUX.1 Depth as the base model.
Using different diversity metrics. Our method allows for targeted diversity by defining different pairwise objectives. The second and third rows show results where the unary quality term is identical but the pairwise binary term varies: the middle row uses a color-based binary term, while the bottom row uses a DINO-based binary term to achieve semantic and structural diversity.
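As a sketch of how such pairwise objectives might be implemented (function names and details are our assumptions, not the paper's code), a color-based binary term can compare coarse RGB histograms, while a semantic/structural term can compare deep features such as DINO embeddings:

import numpy as np

def color_histogram(img, bins=8):
    # img: (H, W, 3) uint8 array. Coarse joint RGB histogram,
    # normalized to sum to 1.
    hist, _ = np.histogramdd(img.reshape(-1, 3).astype(float),
                             bins=(bins,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / hist.sum()

def color_pairwise_distances(images):
    # Palette-level binary term: L1 distance between color histograms.
    h = np.stack([color_histogram(im) for im in images])
    return np.abs(h[:, None, :] - h[None, :, :]).sum(axis=-1)

def feature_pairwise_distances(features):
    # Semantic/structural binary term: cosine distance between per-image
    # embeddings, e.g. DINO features (the feature extractor itself is
    # assumed and not shown here). features: (M, D) array.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return 1.0 - f @ f.T

Either distance matrix can be plugged in as the binary term of the selection problem above, leaving the unary quality term unchanged.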

BibTeX

@article{Parmar2025group,
  title={Scaling Group Inference for Diverse and High-Quality Generation},
  author={Gaurav Parmar and Or Patashnik and Daniil Ostashev and Kuan-Chieh (Jackson) Wang and Kfir Aberman and Srinivasa Narasimhan and Jun-Yan Zhu},
  year={2025},
  journal={arXiv preprint arXiv:2508.15773},
}