Comment by maalber 10 hours ago

Efficiently evaluating the performance of text-to-image models is difficult: it inherently requires subjective judgment and human preference, which makes it hard to compare different models and ultimately quantify progress. We developed a system to source annotations efficiently at scale and, based on this, present a framework for rigorous evaluation of image generation models. Using it, we present a ranking of four popular image generation models: MidJourney, DALLE-3, Stable Diffusion, and the latest star, Flux.1. The ranking is based on more than 2 million annotations across 4,512 images and three criteria: style, coherence, and text-to-image alignment. Through integration with mobile apps, we reach a diverse set of annotators, and we show that their regional distribution closely resembles that of the world population, lowering the risk of bias.
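The comment doesn't say how the pairwise annotations are aggregated into a ranking, but one common approach is an Elo-style update over the comparison outcomes. A minimal sketch (the Elo aggregation, the `k`/`base` parameters, and the toy vote data are all illustrative assumptions, not the authors' actual method):

```python
from collections import defaultdict

def elo_ranking(comparisons, k=32.0, base=1000.0):
    """Aggregate pairwise preference votes into per-model scores.

    `comparisons` is a list of (winner, loser) tuples. This is a
    standard Elo update, used here only to illustrate how pairwise
    annotations can yield a model ranking.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in comparisons:
        # Expected win probability of `winner` under the current ratings.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Toy votes, purely for illustration (not real annotation data).
votes = [
    ("Flux.1", "Stable Diffusion"),
    ("MidJourney", "DALLE-3"),
    ("Flux.1", "DALLE-3"),
]
scores = elo_ranking(votes)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In practice, per-criterion rankings (style, coherence, alignment) would be produced by running an aggregation like this over each criterion's votes separately.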

If you want to get a quick feel for our system, check out our free Compare Tool (https://www.rapidata.ai/compare), which presents a single one of the 27k comparisons created for the ranking.