## Evaluation for MLVU
We provide detailed evaluation methods for MLVU, covering both multiple-choice and generation tasks.
### Benchmark MLVU on your model
If you want to benchmark MLVU on your own model, you can start from our template test code as follows:
#### Multiple-Choice testing
```
python multiple_choice_evaluation/choice_bench.py
```
Load your model into this template; it runs inference and reports the multiple-choice performance online.
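As a rough illustration (not the actual code in choice_bench.py), an online multiple-choice loop looks like the sketch below. The annotation field names (`video`, `question`, `candidates`, `answer`) and the two helper functions are assumptions; replace them with the real ones from the template and your model's API.
```
# Minimal sketch of an online multiple-choice evaluation loop.
# Field names and model helpers are placeholders; adapt them to
# choice_bench.py and your own model before use.
import json
import os

def load_model():
    # Replace with your model's loading code.
    raise NotImplementedError

def answer_choice(model, video_path, question, candidates):
    # Replace with your model's inference; return the predicted option index.
    raise NotImplementedError

def evaluate(anno_path, video_dir):
    model = load_model()
    correct, total = 0, 0
    with open(anno_path) as f:
        samples = json.load(f)
    for s in samples:
        pred = answer_choice(model, os.path.join(video_dir, s["video"]),
                             s["question"], s["candidates"])
        correct += int(pred == s["answer"])
        total += 1
        # "Online" evaluation: print the running accuracy after every sample.
        print(f"accuracy so far: {correct}/{total} = {correct / total:.4f}")
    return correct / total
```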
#### Generation testing
- Step 1: Get the inference results for Sub-Scene Captioning and Video Summarization (a sketch of the prediction file this step produces appears at the end of this subsection).
```
python generation_evaluation/open_bench.py
```
- Step 2: Run the evaluation for the generation tasks.
For Sub-Scene Captioning, set pred_path to the file produced in Step 1 and choose an output_dir, then run
```
python evaluate_ssc.py --pred_path /your_path/subplot_all.json --output_dir /eval_subplot --output_json /eval_subplot.json
python calculate.py --path /eval_subplot
```
For Video Summarization, set pred_path to the Step 1 output and choose an output_dir, then run
```
python evaluate_summary.py --pred_path /your_path/summary_all.json --output_dir /eval_summary --output_json /eval_summary.json
```
Then compute the final score, pointing --path at the same output_dir:
```
python calculate_sum.py --path /eval_summary
```
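For orientation, Step 1 writes one prediction file per task (e.g. subplot_all.json for Sub-Scene Captioning) and Step 2 scores it. The sketch below shows how such a file might be assembled; the record fields and the `caption_video` helper are assumptions, so check open_bench.py for the exact format it actually writes.
```
# Hypothetical sketch of producing a Step 1 prediction file for
# Sub-Scene Captioning. The record layout is an assumption; open_bench.py
# defines the real format expected by evaluate_ssc.py.
import json
import os

def caption_video(model, video_path, question):
    # Replace with your model's free-form generation for this prompt.
    raise NotImplementedError

def run_inference(model, anno_path, video_dir, out_path):
    results = []
    with open(anno_path) as f:
        samples = json.load(f)
    for s in samples:
        pred = caption_video(model, os.path.join(video_dir, s["video"]),
                             s["question"])
        results.append({
            "video_name": s["video"],   # assumed field names
            "Q": s["question"],
            "A": s["answer"],           # reference answer used during scoring
            "pred": pred,
        })
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)  # pass this file as --pred_path in Step 2
```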
### Benchmark MLVU on existing models
(taking VideoChat2 as an example)
- Step 1: Download the original model code and weights from [VideoChat2](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2).
- Step 2: Put choice_bench.py and open_bench.py in the same folder as demo.py.
- Step 3: Set the path to your local MLVU data in choice_bench.py and open_bench.py (an illustrative snippet follows this list).
- Step 4: Run the inference and online evaluation for the multiple-choice tasks.
- Step 5: Run the inference and evaluation for the generation tasks.
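As an illustration of Step 3, the required edit is simply pointing the scripts at your local MLVU copy. The variable and directory names below are placeholders, not the actual ones in the scripts; edit whatever path settings choice_bench.py and open_bench.py define.
```
# Illustrative only: variable names and subdirectories are placeholders;
# edit the actual path settings inside choice_bench.py and open_bench.py.
MLVU_ROOT = "/path/to/MLVU"           # local dataset root
VIDEO_DIR = f"{MLVU_ROOT}/video"      # directory containing the videos
ANNO_DIR = f"{MLVU_ROOT}/json"        # directory containing the task annotations
```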