Multimodal Large Model
This framework supports two predefined dataset formats for custom evaluation: multiple-choice questions (MCQ) and question-answering (QA). The usage process is described below.
Custom dataset evaluation relies on the `VLMEvalKit` backend, which requires additional dependencies:
```shell
pip install evalscope[vlmeval]
```
Reference: [Evaluation Backend with VLMEvalKit](../../user_guides/backend/vlmevalkit_backend.md)
Multiple-Choice Question Format (MCQ)
1. Data Preparation
The evaluation metric is accuracy. You need to prepare a TSV file in the following format (using `\t` as the separator):
```text
index category answer question A B C D image_path
1 Animals A What animal is this? Dog Cat Tiger Elephant /root/LMUData/images/custom_mcq/dog.jpg
2 Buildings D What building is this? School Hospital Park Museum /root/LMUData/images/custom_mcq/AMNH.jpg
3 Cities B Which city's skyline is this? New York Tokyo Shanghai Paris /root/LMUData/images/custom_mcq/tokyo.jpg
4 Vehicles C What is the brand of this car? BMW Audi Tesla Mercedes /root/LMUData/images/custom_mcq/tesla.jpg
5 Activities A What is the person in the picture doing? Running Swimming Reading Singing /root/LMUData/images/custom_mcq/running.jpg
```
Where:
- `index`: the question number
- `question`: the question text
- `answer`: the answer option (the letter of the correct choice)
- `A`, `B`, `C`, `D`: the options; at least two options are required
- `image_path`: the image path (absolute paths are recommended); it can also be replaced with an `image` field containing the base64-encoded image
- `category`: the category (optional field)
Place this file in the `~/LMUData` directory, and you can then use the filename for evaluation. For example, if the filename is `custom_mcq.tsv`, you can use `custom_mcq` for evaluation.
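If you prefer to generate the TSV programmatically, or want to embed images directly instead of referencing paths, the following sketch shows one way to do it with `pandas` (the rows and file locations are placeholders taken from the example above, and the `image` column is only needed if you drop `image_path`):

```python
import base64
import pandas as pd

# Example rows; replace with your own questions, options, and images.
rows = [
    {'index': 1, 'category': 'Animals', 'answer': 'A',
     'question': 'What animal is this?',
     'A': 'Dog', 'B': 'Cat', 'C': 'Tiger', 'D': 'Elephant',
     'image_path': '/root/LMUData/images/custom_mcq/dog.jpg'},
]
df = pd.DataFrame(rows)

# Optional: replace `image_path` with a base64-encoded `image` column.
df['image'] = [
    base64.b64encode(open(p, 'rb').read()).decode('utf-8')
    for p in df.pop('image_path')
]

# Write the tab-separated file that VLMEvalKit expects.
df.to_csv('/root/LMUData/custom_mcq.tsv', sep='\t', index=False)
```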
2. Task Configuration
The configuration file can be in Python dict, YAML, or JSON format. For example, the following `config.yaml` file:
```yaml
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat   # Name of the deployed model
      name: CustomAPIModel # Fixed value
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_mcq # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
```
See the VLMEvalKit [Parameter Description](../../user_guides/backend/vlmevalkit_backend.md#parameter-explanation) for details on each field.
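Since the configuration can also be a Python dict, you can skip the YAML file and pass the same settings to `run_task` directly; a minimal sketch mirroring the YAML above:

```python
from evalscope.run import run_task

# Same configuration as config.yaml, expressed as a Python dict.
task_cfg = {
    'eval_backend': 'VLMEvalKit',
    'eval_config': {
        'model': [{
            'type': 'qwen-vl-chat',    # Name of the deployed model
            'name': 'CustomAPIModel',  # Fixed value
            'api_base': 'http://localhost:8000/v1/chat/completions',
            'key': 'EMPTY',
            'temperature': 0.0,
            'img_size': -1,
        }],
        'data': ['custom_mcq'],  # Name of the custom dataset, placed in `~/LMUData`
        'mode': 'all',
        'limit': 10,
        'reuse': False,
        'work_dir': 'outputs',
        'nproc': 1,
    },
}

run_task(task_cfg=task_cfg)
```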
3. Running Evaluation
Run the following code to start the evaluation:
```python
from evalscope.run import run_task

run_task(task_cfg='config.yaml')
```
The evaluation results are as follows:
```text
----------  ----
split       none
Overall     1.0
Activities  1.0
Animals     1.0
Buildings   1.0
Cities      1.0
Vehicles    1.0
----------  ----
```
Custom QA Question Format (VQA)
1. Data Preparation
Prepare a TSV file in the QA format as follows:
```text
index answer question image_path
1 Dog What animal is this? /root/LMUData/images/custom_mcq/dog.jpg
2 Museum What building is this? /root/LMUData/images/custom_mcq/AMNH.jpg
3 Tokyo Which city's skyline is this? /root/LMUData/images/custom_mcq/tokyo.jpg
4 Tesla What is the brand of this car? /root/LMUData/images/custom_mcq/tesla.jpg
5 Running What is the person in the picture doing? /root/LMUData/images/custom_mcq/running.jpg
```
This file is similar to the MCQ format, where:
- `index`: the question number
- `question`: the question text
- `answer`: the answer
- `image_path`: the image path (absolute paths are recommended); it can also be replaced with an `image` field containing the base64-encoded image
Place this file in the `~/LMUData` directory, and you can then use the filename for evaluation. For example, if the filename is `custom_vqa.tsv`, you can use `custom_vqa` for evaluation.
2. Custom Evaluation Script
Below is an example of a custom evaluation script for the QA format. The script automatically loads the dataset, uses default prompts for question answering, and computes accuracy as the evaluation metric. Save it as `custom_dataset.py` so that it can be imported in the run script below.
```python
import os

import numpy as np
from vlmeval.dataset.image_base import ImageBaseDataset
from vlmeval.dataset.image_vqa import CustomVQADataset
from vlmeval.smp import load, dump, d2df


class CustomDataset:

    def load_data(self, dataset):
        # Load the custom dataset from ~/LMUData/<dataset>.tsv
        data_path = os.path.join(os.path.expanduser("~/LMUData"), f'{dataset}.tsv')
        return load(data_path)

    def build_prompt(self, line):
        msgs = ImageBaseDataset.build_prompt(self, line)
        # Add prompts or custom instructions here
        msgs[-1]['value'] += '\nAnswer the question in one word or phrase.'
        return msgs

    def evaluate(self, eval_file, **judge_kwargs):
        data = load(eval_file)
        assert 'answer' in data and 'prediction' in data
        data['prediction'] = [str(x) for x in data['prediction']]
        data['answer'] = [str(x) for x in data['answer']]
        print(data)

        # ======== Compute the evaluation metric as needed ========
        # Exact match
        result = np.mean(data['answer'] == data['prediction'])
        ret = {'Overall': result}
        ret = d2df(ret).round(2)
        # Save the result
        suffix = eval_file.split('.')[-1]
        result_file = eval_file.replace(f'.{suffix}', '_acc.csv')
        dump(ret, result_file)
        return ret
        # ==========================================================


# Keep the following code to override the default dataset class
CustomVQADataset.load_data = CustomDataset.load_data
CustomVQADataset.build_prompt = CustomDataset.build_prompt
CustomVQADataset.evaluate = CustomDataset.evaluate
```
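The exact-match metric above is only a starting point. As a sketch of how the `evaluate` method could be adapted, the helper below (an illustrative assumption, not part of VLMEvalKit) scores a prediction as correct if the reference answer appears in the model output, case-insensitively:

```python
import numpy as np


def relaxed_match(answers, predictions):
    # A prediction counts as correct if the reference answer appears
    # (case-insensitively) anywhere in the model output.
    hits = [
        str(ans).strip().lower() in str(pred).strip().lower()
        for ans, pred in zip(answers, predictions)
    ]
    return float(np.mean(hits))


# Inside CustomDataset.evaluate you could then replace the exact match with:
# result = relaxed_match(data['answer'], data['prediction'])
```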
3. Configuration File
The configuration file can be in Python dict, YAML, or JSON format. For example, the following `config.yaml` file:
```yaml
eval_backend: VLMEvalKit
eval_config:
  model:
    - type: qwen-vl-chat
      name: CustomAPIModel
      api_base: http://localhost:8000/v1/chat/completions
      key: EMPTY
      temperature: 0.0
      img_size: -1
  data:
    - custom_vqa # Name of the custom dataset, placed in `~/LMUData`
  mode: all
  limit: 10
  reuse: false
  work_dir: outputs
  nproc: 1
```
4. Running Evaluation
The complete evaluation script is as follows:
```python
from custom_dataset import CustomDataset  # Import the custom dataset

from evalscope.run import run_task

run_task(task_cfg='config.yaml')
```
The evaluation results are as follows:
```text
{'qwen-vl-chat_custom_vqa_acc': {'Overall': '1.0'}}
```
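The `evaluate` method above also dumps the metric to a `*_acc.csv` file next to the prediction file under the configured `work_dir`. If you want to inspect it afterwards, a minimal sketch (the exact filename depends on the model and dataset names, so the path below is only an assumption):

```python
import pandas as pd

# Hypothetical path: the actual file sits under `outputs/` and is named
# after the model and dataset, ending in `_acc.csv`.
print(pd.read_csv('outputs/qwen-vl-chat_custom_vqa_acc.csv'))
```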