evalscope/docs/en/advanced_guides/custom_dataset/clip.md

196 lines
8.8 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CLIP Model
## Custom Image-Text Retrieval Dataset
### 1. Prepare the Dataset
Prepare the `image_queries.jsonl` dataset for image-text retrieval in the following format (file name must be fixed):
```{code-block} json
:caption: custom_eval/multimodal/text-image-retrieval/image_queries.jsonl
{"image_path": "custom_eval/multimodal/images/dog.jpg", "query": ["dog"]}
{"image_path": "custom_eval/multimodal/images/AMNH.jpg", "query": ["building"]}
{"image_path": "custom_eval/multimodal/images/tokyo.jpg", "query": ["city", "tokyo"]}
{"image_path": "custom_eval/multimodal/images/tesla.jpg", "query": ["car", "tesla"]}
{"image_path": "custom_eval/multimodal/images/running.jpg", "query": ["man", "running"]}
```
Where:
- `image_path`: Path to the image, supporting local paths.
- `query`: Text descriptions for image-text retrieval, supporting multiple descriptions, such as `["dog", "cat"]`.
### 2. Configure Evaluation Parameters
```python
task_cfg = {
"eval_backend": "RAGEval",
"eval_config": {
"tool": "clip_benchmark",
"eval": {
"models": [
{
"model_name": "AI-ModelScope/chinese-clip-vit-large-patch14-336px",
}
],
"dataset_name": ["custom"],
"data_dir": "custom_eval/multimodal/text-image-retrieval",
"split": "test",
"batch_size": 128,
"num_workers": 1,
"verbose": True,
"skip_existing": False,
"limit": 1000,
},
},
}
```
```{seealso}
[Full Parameter Explanation](../../user_guides/backend/rageval_backend/clip_benchmark.md#configure-evaluation-parameters)
```
Where:
- `dataset_name`: Dataset name, must be specified as `custom`.
- `data_dir`: Dataset directory, containing the `image_queries.jsonl` file.
### 3. Run Evaluation Task
```python
from evalscope.run import run_task
run_task(task_cfg=task_cfg)
```
The evaluation output is as follows:
```json
{"dataset": "custom", "model": "AI-ModelScope/chinese-clip-vit-large-patch14-336px", "task": "zeroshot_retrieval", "metrics": {"image_retrieval_recall@5": 1.0, "text_retrieval_recall@5": 1.0}}
```
## Convert Image-Text Retrieval Data to Text Retrieval Data
To facilitate the evaluation of different multimodal retrieval methods, this framework supports converting image-text retrieval problems into text retrieval problems using a multimodal large model, followed by text retrieval evaluation.
### 1. Prepare the Dataset
Supported input datasets include [image-text retrieval datasets](../../user_guides/backend/rageval_backend/clip_benchmark.md#supported-datasets) and the custom image-text retrieval dataset mentioned above.
### 2. Configure Evaluation Parameters
```python
task_cfg = {
"eval_backend": "RAGEval",
"eval_config": {
"tool": "clip_benchmark",
"eval": {
"models": [
{
"model_name": "internvl2-8b",
"api_base": "http://localhost:8008/v1",
"api_key": "xxx",
"prompt": "用中文描述这张图片",
}
],
"dataset_name": ["muge"],
"split": "test",
"task": "image_caption",
"batch_size": 2,
"num_workers": 1,
"verbose": True,
"skip_existing": False,
"limit": 10,
},
},
}
```
Parameter Explanation:
- The `models` list must include a multimodal large model configuration:
- `model_name`: Name of the multimodal large model, e.g., `internvl2-8b`.
- `api_base`: API address of the multimodal large model, e.g., `http://localhost:8008/v1`.
- `api_key`: API key for the multimodal large model, e.g., `xxx`.
- `prompt`: Prompt for the multimodal large model input, e.g., `"用中文描述这张图片"`.
- `task`: Evaluation task, must be specified as `image_caption`.
### 3. Run the Conversion Task
Run the following code to start the conversion:
```python
from evalscope.run import run_task
run_task(task_cfg=task_cfg)
```
The output is as follows:
```
2024-10-22 19:56:09,832 - evalscope - INFO - Write files to outputs/internvl2-8b/muge/retrieval_data
2024-10-22 19:56:10,543 - evalscope - INFO - Evaluation results: {'dataset': 'muge', 'model': 'internvl2-8b', 'task': 'image_caption', 'metrics': {'convertion_successful': True, 'save_path': 'outputs/internvl2-8b/muge/retrieval_data'}}
2024-10-22 19:56:10,544 - evalscope - INFO - Dump results to: outputs/internvl2-8b/muge_image_caption.json
```
The output file directory structure is as follows:
```
muge
├── retrieval_data
│ ├── corpus.jsonl
│ ├── queries.jsonl
│ └── qrels
│ └── test.tsv
└── muge_image_caption.json
```
The specific contents of the files are as follows:
```{code-block} json
:caption: outputs/internvl2-8b/muge/retrieval_data/corpus.jsonl
{"_id":0,"text":"这是一张展示澳亚奇品牌的产品广告图片,图片中包含了六罐澳亚奇品牌的饮料,饮料罐上印有品牌的名称和图案。饮料罐排列在纸箱上,纸箱上也有品牌名称和图案。整个包装以红色和黄色为主基调,给人以醒目和吸引人的感觉。"}
{"_id":1,"text":"这是一副时尚的眼镜镜框是金属材质的颜色为玫瑰金色镜腿部分是黑色的。镜腿的内侧有品牌标志看起来像是“The Row”。这款眼镜的设计比较现代适合日常佩戴。"}
{"_id":2,"text":"这张图片展示了一位女性她正在用手机拍摄自己的侧脸自拍。她有长长的棕色头发并佩戴着一对精美的耳环。耳环的设计有点像是字母“A”。背景是室内环境可以看到淡蓝色墙壁和浅色的柜子。"}
{"_id":3,"text":"这是一张黑色塑料瓶的图片,瓶身上贴有红色标签,标签上有白色和黄色的文字。标签上内容包括产品名称、品牌和一些图案。瓶口是红色和灰色的盖子。"}
{"_id":4,"text":"这是一张客厅的照片,里面有一把单人沙发椅。沙发的靠背和坐垫上有黑白相间的斑马纹图案,椅子的框架是黑色的木制结构,带有卷曲的扶手。沙发的腿部是黑色的,造型优雅。沙发布置在一个铺有地毯的地板上,背景中可以看到部分沙发和装饰画,整个房间的装饰风格显得温馨且现代。"}
{"_id":5,"text":"这是一张一次性纸杯的图片。纸杯呈圆柱形,杯壁较为光滑,没有明显的装饰或花纹。杯口部分略微向外扩展,便于抓握。杯子整体呈浅灰色或乳白色,质地看起来较为轻薄。这种纸杯常用于盛装饮料或冷食,适合一次性使用。"}
{"_id":6,"text":"这张图片展示的是四个卡通人物,背景有五彩斑斓的光芒。从左到右,这四个角色分别是:\n\n1. 一个穿着蓝色服装、戴着紫色头巾和发饰的角色。\n2. 一个穿着蓝绿色服装、戴着蓝色发饰和翅膀的角色。\n3. 一个穿着粉红色服装、带着红色头饰和翅膀的角色。\n4. 一个穿着红色和白色服装、戴着红色头饰的角色。\n\n背景中有“新格林童话”和“NEW GREEN”的字样。"}
{"_id":7,"text":"这是一张展示手中握着蓝色葡萄的照片。手的主人穿着绿色的毛衣,手指修长。葡萄颜色深蓝,表面光滑,每颗葡萄看起来都十分饱满多汁。旁边有一些绿色叶子和干燥的枝条做装饰。背景是一张木质的桌子,整体画面给人一种自然清新的感觉。"}
{"_id":8,"text":"这张图片展示了一个可爱的小马克杯,杯身是浅绿色,配有圆弧形的手柄。杯子上绘有可爱的卡通图案,包括一只戴着耳机的小兔子,并配有“热爱学习”字样,旁边还有两只小耳朵和几颗星星。整个马克杯的设计简洁可爱,适合用作日常饮品盛器。"}
{"_id":9,"text":"这是一张展示塑料包装中大量线状物体的图片。这些线状物体堆叠在一起,看起来像是一些纤维或麻线,可能是用于编织或加工的。"}
```
```{code-block} json
:caption: outputs/internvl2-8b/muge/retrieval_data/queries.jsonl
{"_id":0,"text":"酸角汁饮料 整箱 云南"}
{"_id":1,"text":"达芬奇眼镜"}
{"_id":2,"text":"水钻蝴蝶结耳钉"}
{"_id":3,"text":"邓州黄酒"}
{"_id":4,"text":"斑马纹老虎椅"}
{"_id":5,"text":"布丁杯模具"}
{"_id":6,"text":"光之美少女盒蛋"}
{"_id":7,"text":"蓝莓模型"}
{"_id":8,"text":"少女心喝水杯"}
{"_id":9,"text":"炸面"}
```
```{code-block}
:caption: outputs/internvl2-8b/muge/retrieval_data/qrels/test.tsv
query-id corpus-id score
0 0 1
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
```
### 4. Execute Text Retrieval Task
Once the dataset is ready, you can perform text retrieval tasks as per the CMTEB tutorial.
```{seealso}
Refer to [Custom Text Retrieval Evaluation](./embedding.md)
```