# Evaluation for MLVU

We provide detailed evaluation methods for MLVU, covering both the multiple-choice tasks and the generation tasks.

## Benchmark MLVU on your Model

First, if you want to benchmark MLVU on your own model, you can refer to our template test code below.

### Multiple-Choice testing

```bash
python multiple_choice_evaluation/choice_bench.py
```

You need to load your model into this template; it then runs inference and evaluates the multiple-choice performance online. A minimal sketch of what such an evaluation loop might look like is shown below.
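
The following is a hypothetical sketch of an online multiple-choice evaluation loop, not the actual `choice_bench.py` code. The function and field names (`load_questions`, `infer`, `candidates`, `answer`, `task_type`) are assumptions for illustration; adapt them to the template and your model's interface.

```python
# Hypothetical sketch of an online multiple-choice evaluation loop.
# Names such as load_questions, infer, and the JSON fields are assumptions,
# not the real choice_bench.py API.
import json
import re
from collections import defaultdict

def load_questions(json_path):
    """Load multiple-choice items: each is assumed to have a video file,
    a question, a list of candidate answers, and the ground-truth option."""
    with open(json_path, "r") as f:
        return json.load(f)

def infer(video_path, prompt):
    """Placeholder for your model's video-QA inference call."""
    raise NotImplementedError("plug your model in here")

def extract_choice(reply):
    """Pull the first option letter (A-D) out of the model reply."""
    match = re.search(r"\b([A-D])\b", reply.upper())
    return match.group(1) if match else None

def evaluate(json_path, video_root):
    correct, total = defaultdict(int), defaultdict(int)
    for item in load_questions(json_path):
        options = "\n".join(
            f"({chr(65 + i)}) {c}" for i, c in enumerate(item["candidates"])
        )
        prompt = f"{item['question']}\n{options}\nAnswer with the option letter."
        reply = infer(f"{video_root}/{item['video']}", prompt)
        task = item.get("task_type", "all")
        total[task] += 1
        if extract_choice(reply) == item["answer"]:
            correct[task] += 1
    # Accuracy is reported online, i.e. as soon as inference finishes.
    for task in total:
        print(f"{task}: {correct[task] / total[task]:.3f}")
```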

### Generation testing

- Step 1: Get the inference results for Sub-Scene Captioning and Video Summarization:

  ```bash
  python generation_evaluation/open_bench.py
  ```

- Step 2: Run the evaluation for the generation tasks. For Sub-Scene Captioning, set `--pred_path` to the prediction file produced in Step 1 and `--output_dir` to your evaluation output folder, then run (a sketch of the aggregation step is shown after the commands):

  ```bash
  python evaluate_ssc.py --pred_path /your_path/subplot_all.json --output_dir /eval_subplot --output_json /eval_subplot.json
  python calculate.py --path /eval_subplot
  ```
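
The following is a hedged sketch of the aggregation step. It assumes `evaluate_ssc.py` writes one JSON result file per sample into `--output_dir` with numeric score fields; the actual file layout and field names used by `calculate.py` may differ.

```python
# Hedged sketch of the Sub-Scene Captioning aggregation step.
# Assumption: evaluate_ssc.py writes one JSON file per sample into
# --output_dir, each containing numeric score fields; the real
# calculate.py may read a different layout or field names.
import json
import os
from collections import defaultdict

def average_scores(eval_dir):
    sums, counts = defaultdict(float), defaultdict(int)
    for name in os.listdir(eval_dir):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(eval_dir, name), "r") as f:
            result = json.load(f)
        for key, value in result.items():
            if isinstance(value, (int, float)):  # keep only numeric scores
                sums[key] += value
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

if __name__ == "__main__":
    print(average_scores("/eval_subplot"))
```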

For Video Summarization, likewise set `--pred_path` to the prediction file from Step 1 and `--output_dir` to your evaluation output folder, then run:

```bash
python evaluate_summary.py --pred_path /your_path/summary_all.json --output_dir /eval_summary --output_json /eval_summary.json
```

Then run the following, setting `--path` to the same output directory used above (a sketch chaining the two commands is shown below):

```bash
python calculate_sum.py --path /eval_summary
```
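
If you prefer to drive both summary-evaluation commands from one script so the directories always match, a minimal sketch (assuming it is run from the directory that contains `evaluate_summary.py` and `calculate_sum.py`) could look like this:

```python
# Minimal sketch chaining the two Video Summarization evaluation commands.
# Only the flags documented above are used; the paths are placeholders.
import subprocess

PRED_PATH = "/your_path/summary_all.json"  # produced by open_bench.py in Step 1
OUTPUT_DIR = "/eval_summary"               # per-sample evaluation results
OUTPUT_JSON = "/eval_summary.json"         # merged evaluation file

subprocess.run(
    ["python", "evaluate_summary.py",
     "--pred_path", PRED_PATH,
     "--output_dir", OUTPUT_DIR,
     "--output_json", OUTPUT_JSON],
    check=True,
)
subprocess.run(
    ["python", "calculate_sum.py", "--path", OUTPUT_DIR],
    check=True,
)
```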

## Benchmark MLVU on existing models

Take VideoChat2 as an example:

- Step 1: Download the original model and weights from VideoChat2.
- Step 2: Put `choice_bench.py` and `open_bench.py` into the same folder as `demo.py`.
- Step 3: Modify the MLVU dataset paths in `choice_bench.py` and `open_bench.py` (see the sketch after this list).
- Step 4: Run the inference and online evaluation for the multiple-choice tasks.
- Step 5: Run the inference and evaluation for the generation tasks.
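
As an illustration of the Step 3 edit, the snippet below shows the kind of path constants you would point at your local MLVU copy. The variable names are placeholders for illustration only; the actual names inside `choice_bench.py` and `open_bench.py` may differ.

```python
# Hypothetical example of the Step 3 edit; adjust whatever path constants
# the scripts actually define to point at your local MLVU copy.
MLVU_ROOT = "/path/to/MLVU"           # dataset root (placeholder)
QUESTION_DIR = f"{MLVU_ROOT}/json"    # task annotation files (placeholder)
VIDEO_DIR = f"{MLVU_ROOT}/video"      # video files (placeholder)
OUTPUT_DIR = "./mlvu_outputs"         # where predictions are written (placeholder)
```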