EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation

Teaser

Overview of EvalMuse-40K, consisting of (a) data collection, (b) data annotation, and (c) evaluation methods. (a) We collected 2K real prompts and 2K synthetic prompts, using the MILP sampling strategy to ensure category balance and diversity. Prompts are further divided into elements, and corresponding questions are generated. We also use various T2I models to generate images. (b) During data annotation, we labeled the fine-grained alignment levels of image-text pairs, structural problems in the generated images, and some extra information. For annotated data with large differences in overall alignment scores, we performed re-annotation to ensure the reliability of the benchmark. (c) We propose two effective image-text alignment evaluation methods: one is FGA-BLIP2, which uses a fine-tuned vision-language model for direct fine-grained scoring of image-text pairs; the other is PN-VQA, which adopts a positive-negative VQA manner for zero-shot evaluation.
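As a rough illustration of the positive-negative VQA idea, the sketch below scores a single prompt element by asking a VQA model about it in both a positive and a negative form and averaging the answer probabilities. The question templates, the score aggregation, and the `vqa_model.answer_prob` API are assumptions made for this sketch, not our released implementation.

```python
# A minimal sketch of the positive-negative VQA idea behind PN-VQA.
# `vqa_model` is a placeholder for any VQA model that exposes yes/no answer
# probabilities; its answer_prob method is hypothetical.
def pn_vqa_element_score(vqa_model, image, element):
    # Ask about the element in a positive and a negative form, e.g.
    # "Does the image show a red car?" vs. "Is the red car missing from the image?"
    positive_q = f"Does the image show {element}?"
    negative_q = f"Is {element} missing from the image?"
    p_yes_pos = vqa_model.answer_prob(image, positive_q, answer="yes")  # hypothetical API
    p_no_neg = vqa_model.answer_prob(image, negative_q, answer="no")    # hypothetical API
    # Average the two probabilities into a fine-grained alignment score for this element.
    return 0.5 * (p_yes_pos + p_no_neg)
```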

Video

Abstract

Recently, Text-to-Image (T2I) generation models have achieved significant advancements. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, performance comparison among these automated metrics is limited by existing small datasets. Additionally, these datasets lack the capacity to assess the performance of automated metrics at a fine-grained level. In this study, we contribute EvalMuse-40K, a benchmark gathering 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ various strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our benchmark. This allows us to comprehensively evaluate the effectiveness of image-text alignment metrics for T2I models. Meanwhile, we introduce two new methods to evaluate the image-text alignment capabilities of T2I models: FGA-BLIP2, which involves end-to-end fine-tuning of a vision-language model to produce fine-grained image-text alignment scores, and PN-VQA, which adopts a novel positive-negative VQA manner in VQA models for zero-shot fine-grained evaluation. Both methods achieve impressive performance in image-text alignment evaluations. We also use our methods to rank current AIGC models; the results can serve as a reference for future studies and promote the development of T2I generation.


Data Statistics in EvalMuse-40K

Word Cloud

We use a word cloud to visualize the distribution of words among the prompt elements in EvalMuse-40K.

Real Prompt Sampling

Distribution of data before and after sampling across four dimensions using the MILP strategy. We randomly sampled 100K prompts from DiffusionDB and then categorized them along four dimensions. It can be observed that MILP sampling yields a noticeably more balanced prompt distribution across these dimensions.
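As a rough illustration of how such balanced sampling can be formulated as a mixed-integer linear program, the sketch below selects a fixed number of prompts while minimizing the deviation of each category count from a uniform target. The dimension labels, variable names, and the choice of the PuLP/CBC solver are assumptions made for this sketch, not our released code.

```python
# A minimal sketch of MILP-based balanced prompt sampling, assuming each prompt
# carries a category label along each of several dimensions.
import pulp

def sample_balanced(prompts, labels, k=2000):
    """Select k prompts so that category counts in each dimension stay close to uniform.

    prompts: list of prompt strings
    labels:  list of dicts, labels[i][dim] = category of prompt i in dimension dim
    """
    dims = {dim: sorted({l[dim] for l in labels}) for dim in labels[0]}
    prob = pulp.LpProblem("balanced_prompt_sampling", pulp.LpMinimize)

    # x[i] = 1 if prompt i is selected.
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(len(prompts))]
    prob += pulp.lpSum(x) == k  # select exactly k prompts

    # For each (dimension, category), penalize deviation from the uniform target count.
    devs = []
    for dim, cats in dims.items():
        target = k / len(cats)
        for c in cats:
            count = pulp.lpSum(x[i] for i, l in enumerate(labels) if l[dim] == c)
            d = pulp.LpVariable(f"dev_{dim}_{c}", lowBound=0)
            prob += count - target <= d
            prob += target - count <= d
            devs.append(d)

    prob += pulp.lpSum(devs)  # objective: total deviation from balance
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [p for p, xi in zip(prompts, x) if xi.value() == 1]
```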

Alignment Score Distribution

We visualize the distribution of alignment scores in EvalMuse-40K. The alignment scores are widely distributed, providing rich samples for evaluating how consistently existing image-text alignment metrics agree with human preferences.

Differences in Human Preferences

We visualize the differences in alignment scores across human annotations. We found that 75% of the alignment scores differ by less than 1 point, indicating high annotation consistency. Pairs with larger differences were re-annotated to reduce bias.
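A minimal sketch of how such disagreement-based flagging could be implemented is shown below; the data layout (a dict of per-pair score lists) and the threshold of one point are assumptions for illustration.

```python
# Flag image-text pairs for re-annotation when annotators' overall alignment
# scores disagree by more than a given threshold.
def pairs_to_reannotate(annotations, threshold=1.0):
    """annotations: {pair_id: [score_from_annotator_1, score_from_annotator_2, ...]}"""
    flagged = []
    for pair_id, scores in annotations.items():
        if max(scores) - min(scores) > threshold:
            flagged.append(pair_id)
    return flagged
```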

Fine-Grained Annotation Quantity and Scores

We visualize the number and scores of the fine-grained annotations. Most categories have alignment scores around 50%, ensuring a balance of positive and negative samples. We also found that AIGC models align more weakly with prompts on counting, spatial relationships, and activities.


Evaluation Results in EvalMuse-40K

Overall Alignment Evaluation on Multiple Benchmarks

Fine-Grained Alignment Evaluation on EvalMuse-40K

Alignment Evaluation for T2I Models


Examples for Alignment Evaluation

Overall Alignment Score Evaluation

Fine-Grained Alignment Score Evaluation



Submission Guideline

EvalMuse-40K can be used for the following three tasks:

  • Evaluating the correlation of the overall image-text alignment scores with human preferences.
  • Evaluating the correlation of the fine-grained image-text alignment scores with human preferences.
  • Evaluating the performance of T2I models on the image-text alignment task.

Detailed information can be found in the leaderboard.


For evaluating model correlation with human preferences, you can download our dataset from Huggingface. You can train on our training set (with human-annotated scores) and output your model's results on the test set. Since the test set does not include human-annotated scores (they will be available later), you can email fanhaotian@bytedance.com to submit your results in JSON format and receive the correlation with human preferences.
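For reference, the sketch below shows a standard way to compute correlation between predicted scores and human annotations with SciPy (Spearman, Pearson, and Kendall correlations). It assumes the two score lists are already aligned by image-text pair; the exact leaderboard protocol may differ.

```python
# A minimal sketch of correlation computation against human preference scores.
from scipy.stats import spearmanr, pearsonr, kendalltau

def correlation_with_human(pred_scores, human_scores):
    srcc, _ = spearmanr(pred_scores, human_scores)   # rank correlation
    plcc, _ = pearsonr(pred_scores, human_scores)    # linear correlation
    krcc, _ = kendalltau(pred_scores, human_scores)  # pairwise-ordering correlation
    return {"SRCC": srcc, "PLCC": plcc, "KRCC": krcc}
```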


For evaluating the image-text alignment performance of T2I models, we recommend using FGA-BLIP2, which performs well on both overall and fine-grained alignment.