
(clip_benchmark)=

# CLIP Benchmark

This framework supports CLIP Benchmark, which aims to provide a unified framework and benchmark for evaluating and analyzing CLIP (Contrastive Language-Image Pretraining) and its variants. It currently supports 43 evaluation datasets, covering zero-shot retrieval tasks evaluated with recall@k and zero-shot classification tasks evaluated with acc@k.
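For reference, recall@k counts a retrieval as correct when the ground-truth item appears among the top-k results. The snippet below is a minimal, self-contained sketch of that computation; it is not the benchmark's internal implementation, and the similarity matrix and index names are made up for illustration.

```python
# Illustrative only: recall@k for image->text retrieval given a
# (num_images x num_texts) similarity matrix and ground-truth text indices.
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_text_idx: np.ndarray, k: int = 5) -> float:
    # For each image, take the indices of the k most similar texts.
    topk = np.argsort(-similarity, axis=1)[:, :k]
    # A hit is counted when the ground-truth text appears in the top-k.
    hits = (topk == gt_text_idx[:, None]).any(axis=1)
    return float(hits.mean())

# Toy example: 3 images, 4 candidate texts.
sim = np.array([[0.9, 0.1, 0.2, 0.3],
                [0.2, 0.8, 0.1, 0.4],
                [0.1, 0.3, 0.2, 0.7]])
print(recall_at_k(sim, np.array([0, 1, 3]), k=1))  # -> 1.0
```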

## Supported Datasets

| Dataset Name | Task Type | Notes |
|--------------|-----------|-------|
| `muge` | `zeroshot_retrieval` | Chinese Multimodal Dataset |
| `flickr30k` | `zeroshot_retrieval` | |
| `flickr8k` | `zeroshot_retrieval` | |
| `mscoco_captions` | `zeroshot_retrieval` | |
| `mscoco_captions2017` | `zeroshot_retrieval` | |
| `imagenet1k` | `zeroshot_classification` | |
| `imagenetv2` | `zeroshot_classification` | |
| `imagenet_sketch` | `zeroshot_classification` | |
| `imagenet-a` | `zeroshot_classification` | |
| `imagenet-r` | `zeroshot_classification` | |
| `imagenet-o` | `zeroshot_classification` | |
| `objectnet` | `zeroshot_classification` | |
| `fer2013` | `zeroshot_classification` | |
| `voc2007` | `zeroshot_classification` | |
| `voc2007_multilabel` | `zeroshot_classification` | |
| `sun397` | `zeroshot_classification` | |
| `cars` | `zeroshot_classification` | |
| `fgvc_aircraft` | `zeroshot_classification` | |
| `mnist` | `zeroshot_classification` | |
| `stl10` | `zeroshot_classification` | |
| `gtsrb` | `zeroshot_classification` | |
| `country211` | `zeroshot_classification` | |
| `renderedsst2` | `zeroshot_classification` | |
| `vtab_caltech101` | `zeroshot_classification` | |
| `vtab_cifar10` | `zeroshot_classification` | |
| `vtab_cifar100` | `zeroshot_classification` | |
| `vtab_clevr_count_all` | `zeroshot_classification` | |
| `vtab_clevr_closest_object_distance` | `zeroshot_classification` | |
| `vtab_diabetic_retinopathy` | `zeroshot_classification` | |
| `vtab_dmlab` | `zeroshot_classification` | |
| `vtab_dsprites_label_orientation` | `zeroshot_classification` | |
| `vtab_dsprites_label_x_position` | `zeroshot_classification` | |
| `vtab_dsprites_label_y_position` | `zeroshot_classification` | |
| `vtab_dtd` | `zeroshot_classification` | |
| `vtab_eurosat` | `zeroshot_classification` | |
| `vtab_kitti_closest_vehicle_distance` | `zeroshot_classification` | |
| `vtab_flowers` | `zeroshot_classification` | |
| `vtab_pets` | `zeroshot_classification` | |
| `vtab_pcam` | `zeroshot_classification` | |
| `vtab_resisc45` | `zeroshot_classification` | |
| `vtab_smallnorb_label_azimuth` | `zeroshot_classification` | |
| `vtab_smallnorb_label_elevation` | `zeroshot_classification` | |
| `vtab_svhn` | `zeroshot_classification` | |

## Environment Preparation

Install the required packages:

```bash
pip install evalscope[rag] -U
```

## Configure Evaluation Parameters

```python
task_cfg = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {
                    "model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
                }
            ],
            "dataset_name": ["muge", "flickr8k"],
            "split": "test",
            "batch_size": 128,
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "cache_dir": "cache",
            "limit": 1000,
        },
    },
}
```

### Parameter Description

- `eval_backend`: Default value is `RAGEval`, indicating the use of the RAGEval evaluation backend.
- `eval_config`: A dictionary containing the following fields:
  - `tool`: The evaluation tool; use `clip_benchmark`.
  - `eval`: A dictionary containing the following fields:
    - `models`: A list of model configurations, each with the following fields:
      - `model_name`: `str` The model name or path, e.g., `AI-ModelScope/chinese-clip-vit-large-patch14-336px`. Supports automatic downloading from the ModelScope repository.
    - `dataset_name`: `List[str]` A list of dataset names, e.g., `["muge", "flickr8k", "mnist"]`. See the Supported Datasets table above; a classification-only configuration is sketched after this list.
    - `split`: `str` The dataset split to use; default is `test`.
    - `batch_size`: `int` Batch size for data loading; default is `128`.
    - `num_workers`: `int` Number of workers for data loading; default is `1`.
    - `verbose`: `bool` Whether to enable detailed logging; default is `True`.
    - `skip_existing`: `bool` Whether to skip processing if the output already exists; default is `False`.
    - `cache_dir`: `str` Dataset cache directory; default is `cache`.
    - `limit`: `Optional[int]` Limit on the number of samples to process; default is `None`, e.g., `1000`.
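As an illustration of these fields, the following is a sketch of a classification-oriented configuration. The dataset names come from the table above; the remaining values (batch size, limit, and so on) are arbitrary example choices, not required settings.

```python
# Illustrative variant of task_cfg: zero-shot classification on two datasets
# from the supported list. Values below are example choices, not defaults.
task_cfg_classification = {
    "work_dir": "outputs",
    "eval_backend": "RAGEval",
    "eval_config": {
        "tool": "clip_benchmark",
        "eval": {
            "models": [
                {"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px"},
            ],
            "dataset_name": ["mnist", "vtab_cifar10"],  # zeroshot_classification tasks
            "split": "test",
            "batch_size": 64,    # smaller batch, e.g., for limited GPU memory
            "num_workers": 1,
            "verbose": True,
            "skip_existing": False,
            "cache_dir": "cache",
            "limit": 500,        # evaluate only the first 500 samples
        },
    },
}
```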

## Run Evaluation Task

```python
from evalscope.run import run_task
from evalscope.utils.logger import get_logger

logger = get_logger()

# Run task
run_task(task_cfg=task_cfg)
```

## Output Evaluation Results

```{code-block} json
:caption: outputs/chinese-clip-vit-large-patch14-336px/muge_zeroshot_retrieval.json

{"dataset": "muge", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 0.8935546875, "text_retrieval_recall@5": 0.876953125}}
```

## Custom Evaluation Dataset

To evaluate on your own data, refer to [Custom Image-Text Dataset](../../../advanced_guides/custom_dataset/clip.md).