# Supported Datasets

## 1. Native Supported Datasets

The framework currently supports the following datasets natively. If the dataset you need is not in the list, you may submit an [issue](https://github.com/modelscope/evalscope/issues) and we will support it as soon as possible. Alternatively, you can follow the [Benchmark Addition Guide](../advanced_guides/add_benchmark.md) to add a dataset yourself and submit a [PR](https://github.com/modelscope/evalscope/pulls); contributions are welcome.

You can also use other tools supported by this framework for evaluation, such as [OpenCompass](../user_guides/backend/opencompass_backend.md) for language model evaluation, or [VLMEvalKit](../user_guides/backend/vlmevalkit_backend.md) for multimodal model evaluation.
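
The names in the first column of the tables below are the values passed to the `datasets` argument. The following is a minimal sketch of a native evaluation using evalscope's Python API; the model ID, dataset choices, and `limit` value are illustrative placeholders, not recommendations.

```python
from evalscope import TaskConfig, run_task

# Minimal native evaluation sketch: model ID and limit are placeholders.
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',  # any ModelScope / Hugging Face model ID
    datasets=['gsm8k', 'arc'],           # dataset names from the table below
    limit=5,                             # evaluate only a few samples per dataset for a quick check
)

run_task(task_cfg=task_cfg)
```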

### LLM Evaluation Datasets

| Name | Dataset ID | Task Category | Remarks |
|------|------------|---------------|---------|
| `aime24` | `HuggingFaceH4/aime_2024` | Math Competition | |
| `aime25` | `opencompass/AIME2025` | Math Competition | Part 1, 2 |
| `alpaca_eval`<sup>3</sup> | `AI-ModelScope/alpaca_eval` | Instruction Following | Length-controlled winrate is not currently supported; the official judge model is `gpt-4-1106-preview` and the baseline model is `gpt-4-turbo` |
| `arc` | `modelscope/ai2_arc` | Exam | |
| `arena_hard`<sup>3</sup> | `AI-ModelScope/arena-hard-auto-v0.1` | Comprehensive Reasoning | Style control is not currently supported; the official judge model is `gpt-4-1106-preview` and the baseline model is `gpt-4-0314` |
| `bbh` | `modelscope/bbh` | Comprehensive Reasoning | |
| `ceval` | `modelscope/ceval-exam` | Chinese Comprehensive Exam | |
| `chinese_simpleqa`<sup>3</sup> | `AI-ModelScope/Chinese-SimpleQA` | Chinese Knowledge Q&A | Uses the `primary_category` field as the sub-dataset |
| `cmmlu` | `modelscope/cmmlu` | Chinese Comprehensive Exam | |
| `competition_math` | `modelscope/competition_math` | Math Competition | Uses the `level` field as the sub-dataset |
| `drop` | `AI-ModelScope/DROP` | Reading Comprehension, Reasoning | |
| `gpqa` | `modelscope/gpqa` | Expert-Level Examination | |
| `gsm8k` | `modelscope/gsm8k` | Math Problems | |
| `hellaswag` | `modelscope/hellaswag` | Common Sense Reasoning | |
| `humaneval`<sup>2</sup> | `modelscope/humaneval` | Code Generation | |
| `ifeval`<sup>4</sup> | `modelscope/ifeval` | Instruction Following | |
| `iquiz` | `modelscope/iquiz` | IQ and EQ | |
| `live_code_bench`<sup>2,4</sup> | `AI-ModelScope/code_generation_lite` | Code Generation | Sub-datasets support the `release_v1`, `release_v5`, `v1`, and `v4_v5` version tags; `dataset-args` supports setting `{'extra_params': {'start_date': '2024-12-01', 'end_date': '2025-01-01'}}` to filter questions within a specific time range |
| `math_500` | `AI-ModelScope/MATH-500` | Math Competition | Uses the `level` field as the sub-dataset |
| `maritime_bench` | `HiDolphin/MaritimeBench` | Maritime Knowledge | |
| `mmlu` | `modelscope/mmlu` | Comprehensive Exam | |
| `mmlu_pro` | `modelscope/mmlu-pro` | Comprehensive Exam | Uses the `category` field as the sub-dataset |
| `mmlu_redux` | `AI-ModelScope/mmlu-redux-2.0` | Comprehensive Exam | |
| `musr` | `AI-ModelScope/MuSR` | Multi-step Soft Reasoning | |
| `process_bench` | `Qwen/ProcessBench` | Mathematical Process Reasoning | |
| `race` | `modelscope/race` | Reading Comprehension | |
| `simple_qa`<sup>3</sup> | `AI-ModelScope/SimpleQA` | Knowledge Q&A | |
| `super_gpqa` | `m-a-p/SuperGPQA` | Expert-Level Examination | Uses the `field` field as the sub-dataset |
| `tool_bench` | `AI-ModelScope/ToolBench-Static` | Tool Calling | Refer to the usage documentation |
| `trivia_qa` | `modelscope/trivia_qa` | Knowledge Q&A | |
| `truthful_qa`<sup>1</sup> | `modelscope/truthful_qa` | Safety | |
| `winogrande` | `AI-ModelScope/winogrande_val` | Reasoning | |
**1.** Evaluation requires computing logits, so it is not currently supported for API service evaluation (requires `eval-type != server`).

**2.** Due to operations involving code execution, it is recommended to run in a sandbox environment (e.g., Docker) to prevent impact on the local environment.

**3.** This dataset requires specifying a Judge Model for evaluation. Refer to [Judge Parameters](./parameters.md#judge-parameters).

**4.** For better evaluation results with reasoning models, it is recommended to configure post-processing for the dataset, such as `{"filters": {"remove_until": "</think>"}}`.
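
The remarks above (sub-dataset version tags and `extra_params` date filters for `live_code_bench`) and the `filters` post-processing from footnote 4 are all supplied through `dataset_args`. Below is a minimal sketch assuming the standard `subset_list` key for selecting sub-datasets; the model ID is a placeholder.

```python
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-7B-Instruct',  # placeholder model ID
    datasets=['live_code_bench'],
    dataset_args={
        'live_code_bench': {
            'subset_list': ['release_v5'],            # version-tag sub-dataset (assumed key)
            'extra_params': {                          # filter questions by date range
                'start_date': '2024-12-01',
                'end_date': '2025-01-01',
            },
            'filters': {'remove_until': '</think>'},   # strip reasoning traces before scoring (footnote 4)
        }
    },
)

run_task(task_cfg=task_cfg)
```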

### AIGC Evaluation Datasets

This framework also supports evaluation datasets related to text-to-image and other AIGC tasks. The specific datasets are as follows:

| Name | Dataset ID | Task Type | Remarks |
|------|------------|-----------|---------|
| `general_t2i` | | General Text-to-Image | Refer to the tutorial |
| `evalmuse` | `AI-ModelScope/T2V-Eval-Prompts` | Text-Image Consistency | EvalMuse subset; default metric is `FGA_BLIP2Score` |
| `genai_bench` | `AI-ModelScope/T2V-Eval-Prompts` | Text-Image Consistency | GenAI-Bench-1600 subset; default metric is `VQAScore` |
| `hpdv2` | `AI-ModelScope/T2V-Eval-Prompts` | Text-Image Consistency | HPDv2 subset; default metric is `HPSv2.1Score` |
| `tifa160` | `AI-ModelScope/T2V-Eval-Prompts` | Text-Image Consistency | TIFA160 subset; default metric is `PickScore` |

## 2. OpenCompass Backend

Refer to the [detailed explanation](../user_guides/backend/opencompass_backend.md).

**Language / Knowledge / Reasoning / Examination**

- Word Definition: WiC, SummEdits
- Idiom Learning: CHID
- Semantic Similarity: AFQMC, BUSTM
- Coreference Resolution: CLUEWSC, WSC, WinoGrande
- Translation: Flores, IWSLT2017
- Multi-language Question Answering: TyDi-QA, XCOPA
- Multi-language Summary: XLSum
- Knowledge Question Answering: BoolQ, CommonSenseQA, NaturalQuestions, TriviaQA
- Textual Entailment: CMNLI, OCNLI, OCNLI_FC, AX-b, AX-g, CB, RTE, ANLI
- Commonsense Reasoning: StoryCloze, COPA, ReCoRD, HellaSwag, PIQA, SIQA
- Mathematical Reasoning: MATH, GSM8K
- Theorem Application: TheoremQA, StrategyQA, SciBench
- Comprehensive Reasoning: BBH
- Junior High, High School, University, and Professional Examinations: C-Eval, AGIEval, MMLU, GAOKAO-Bench, CMMLU, ARC, Xiezhi
- Medical Examinations: CMB

**Understanding / Long Context / Safety / Code**

- Reading Comprehension: C3, CMRC, DRCD, MultiRC, RACE, DROP, OpenBookQA, SQuAD2.0
- Content Summary: CSL, LCSTS, XSum, SummScreen
- Content Analysis: EPRSTMT, LAMBADA, TNEWS
- Long Context Understanding: LEval, LongBench, GovReports, NarrativeQA, Qasper
- Safety: CivilComments, CrowsPairs, CValues, JigsawMultilingual, TruthfulQA
- Robustness: AdvGLUE
- Code: HumanEval, HumanEvalX, MBPP, APPs, DS1000

## 3. VLMEvalKit Backend

For more comprehensive instructions and an up-to-date list of datasets, please refer to [detailed instructions](https://aicarrier.feishu.cn/wiki/Qp7wwSzQ9iK1Y6kNUJVcr6zTnPe?table=tblsdEpLieDoCxtb).
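
The dataset names in the tables below are what you pass to the VLMEvalKit backend. The following is a minimal, illustrative sketch; the `eval_config` fields (`data`, `model`, `limit`) and the model entry are assumptions based on the VLMEvalKit backend guide linked above, so treat them as placeholders rather than a definitive configuration.

```python
from evalscope import TaskConfig, run_task

# Illustrative sketch only: dataset names come from the tables below,
# and the eval_config fields are assumptions based on the VLMEvalKit backend guide.
task_cfg = TaskConfig(
    eval_backend='VLMEvalKit',
    eval_config={
        'data': ['MMBench_DEV_EN', 'SEEDBench_IMG'],  # dataset names from the tables below
        'limit': 20,                                   # small sample for a quick check
        'model': [
            {
                'type': 'qwen-vl-chat',                # placeholder model identifier
                'name': 'CustomAPIModel',              # evaluate via an OpenAI-compatible API
                'api_base': 'http://localhost:8000/v1/chat/completions',
                'key': 'EMPTY',
            }
        ],
    },
)

run_task(task_cfg=task_cfg)
```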

### Image Understanding Dataset

Abbreviations used:

- MCQ: Multiple-Choice Questions
- Y/N: Yes/No Questions
- MTT: Multi-Turn Dialogue Evaluation
- MTI: Multi-Image Input Evaluation
| Dataset | Dataset Names | Task |
|---------|---------------|------|
| MMBench Series: MMBench, MMBench-CN, CCBench | MMBench_DEV_[EN/CN]<br>MMBench_TEST_[EN/CN]<br>MMBench_DEV_[EN/CN]_V11<br>MMBench_TEST_[EN/CN]_V11<br>CCBench | MCQ |
| MMStar | MMStar | MCQ |
| MME | MME | Y/N |
| SEEDBench Series | SEEDBench_IMG<br>SEEDBench2<br>SEEDBench2_Plus | MCQ |
| MM-Vet | MMVet | VQA |
| MMMU | MMMU_[DEV_VAL/TEST] | MCQ |
| MathVista | MathVista_MINI | VQA |
| ScienceQA_IMG | ScienceQA_[VAL/TEST] | MCQ |
| COCO Caption | COCO_VAL | Caption |
| HallusionBench | HallusionBench | Y/N |
| OCRVQA* | OCRVQA_[TESTCORE/TEST] | VQA |
| TextVQA* | TextVQA_VAL | VQA |
| ChartQA* | ChartQA_TEST | VQA |
| AI2D | AI2D_[TEST/TEST_NO_MASK] | MCQ |
| LLaVABench | LLaVABench | VQA |
| DocVQA+ | DocVQA_[VAL/TEST] | VQA |
| InfoVQA+ | InfoVQA_[VAL/TEST] | VQA |
| OCRBench | OCRBench | VQA |
| RealWorldQA | RealWorldQA | MCQ |
| POPE | POPE | Y/N |
| Core-MM- | CORE_MM (MTI) | VQA |
| MMT-Bench | MMT-Bench_[VAL/ALL]<br>MMT-Bench_[VAL/ALL]_MI | MCQ (MTI) |
| MLLMGuard- | MLLMGuard_DS | VQA |
| AesBench+ | AesBench_[VAL/TEST] | MCQ |
| VCR-wiki+ | VCR_[EN/ZH]_[EASY/HARD]_[ALL/500/100] | VQA |
| MMLongBench-Doc+ | MMLongBench_DOC | VQA (MTI) |
| BLINK | BLINK | MCQ (MTI) |
| MathVision+ | MathVision<br>MathVision_MINI | VQA |
| MT-VQA+ | MTVQA_TEST | VQA |
| MMDU+ | MMDU | VQA (MTT, MTI) |
| Q-Bench1+ | Q-Bench1_[VAL/TEST] | MCQ |
| A-Bench+ | A-Bench_[VAL/TEST] | MCQ |
| DUDE+ | DUDE | VQA (MTI) |
| SlideVQA+ | SLIDEVQA<br>SLIDEVQA_MINI | VQA (MTI) |
| TaskMeAnything ImageQA Random+ | TaskMeAnything_v1_imageqa_random | MCQ |
| MMMB and Multilingual MMBench+ | MMMB_[ar/cn/en/pt/ru/tr]<br>MMBench_dev_[ar/cn/en/pt/ru/tr]<br>MMMB<br>MTL_MMBench_DEV<br>(MMMB and MTL_MMBench_DEV are all-in-one names covering the 6 languages) | MCQ |
| A-OKVQA+ | A-OKVQA | MCQ |
| MuirBench | MUIRBench | MCQ |
| GMAI-MMBench+ | GMAI-MMBench_VAL | MCQ |
| TableVQABench+ | TableVQABench | VQA |
**\*** Testing results are provided for only some models [here](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard); the remaining models cannot achieve reasonable accuracy under zero-shot conditions.

**\+** Testing results for this evaluation set have not yet been provided.

**\-** VLMEvalKit only supports inference for this evaluation set and cannot output final accuracy.

### Video Understanding Dataset

| Dataset | Dataset Name | Task |
|---------|--------------|------|
| MMBench-Video | MMBench-Video | VQA |
| MVBench | MVBench_MP4 | MCQ |
| MLVU | MLVU | MCQ & VQA |
| TempCompass | TempCompass | MCQ & Y/N & Caption |
| LongVideoBench | LongVideoBench | MCQ |
| Video-MME | Video-MME | MCQ |

## 4. RAGEval Backend

### CMTEB Evaluation Dataset

| Name | Hub Link | Description | Type | Category | Number of Test Samples |
|------|----------|-------------|------|----------|------------------------|
| T2Retrieval | C-MTEB/T2Retrieval | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Retrieval | s2p | 24,832 |
| MMarcoRetrieval | C-MTEB/MMarcoRetrieval | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Retrieval | s2p | 7,437 |
| DuRetrieval | C-MTEB/DuRetrieval | A large-scale Chinese web search engine paragraph retrieval benchmark | Retrieval | s2p | 4,000 |
| CovidRetrieval | C-MTEB/CovidRetrieval | COVID-19 news articles | Retrieval | s2p | 949 |
| CmedqaRetrieval | C-MTEB/CmedqaRetrieval | Online medical consultation texts | Retrieval | s2p | 3,999 |
| EcomRetrieval | C-MTEB/EcomRetrieval | Paragraph retrieval dataset collected from Alibaba e-commerce search engine systems | Retrieval | s2p | 1,000 |
| MedicalRetrieval | C-MTEB/MedicalRetrieval | Paragraph retrieval dataset collected from Alibaba medical search engine systems | Retrieval | s2p | 1,000 |
| VideoRetrieval | C-MTEB/VideoRetrieval | Paragraph retrieval dataset collected from Alibaba video search engine systems | Retrieval | s2p | 1,000 |
| T2Reranking | C-MTEB/T2Reranking | T2Ranking: a large-scale Chinese paragraph ranking benchmark | Re-ranking | s2p | 24,382 |
| MMarcoReranking | C-MTEB/MMarco-reranking | mMARCO is the multilingual version of the MS MARCO paragraph ranking dataset | Re-ranking | s2p | 7,437 |
| CMedQAv1 | C-MTEB/CMedQAv1-reranking | Chinese community medical Q&A | Re-ranking | s2p | 2,000 |
| CMedQAv2 | C-MTEB/CMedQAv2-reranking | Chinese community medical Q&A | Re-ranking | s2p | 4,000 |
| Ocnli | C-MTEB/OCNLI | Original Chinese natural language inference dataset | Pair Classification | s2s | 3,000 |
| Cmnli | C-MTEB/CMNLI | Chinese multi-genre natural language inference | Pair Classification | s2s | 139,000 |
| CLSClusteringS2S | C-MTEB/CLSClusteringS2S | Clustering of titles from the CLS dataset, based on the 13 main categories | Clustering | s2s | 10,000 |
| CLSClusteringP2P | C-MTEB/CLSClusteringP2P | Clustering of titles + abstracts from the CLS dataset, based on the 13 main categories | Clustering | p2p | 10,000 |
| ThuNewsClusteringS2S | C-MTEB/ThuNewsClusteringS2S | Clustering of titles from the THUCNews dataset | Clustering | s2s | 10,000 |
| ThuNewsClusteringP2P | C-MTEB/ThuNewsClusteringP2P | Clustering of titles + abstracts from the THUCNews dataset | Clustering | p2p | 10,000 |
| ATEC | C-MTEB/ATEC | ATEC NLP sentence pair similarity competition | STS | s2s | 20,000 |
| BQ | C-MTEB/BQ | Banking question semantic similarity | STS | s2s | 10,000 |
| LCQMC | C-MTEB/LCQMC | Large-scale Chinese question matching corpus | STS | s2s | 12,500 |
| PAWSX | C-MTEB/PAWSX | Translated PAWS evaluation pairs | STS | s2s | 2,000 |
| STSB | C-MTEB/STSB | STS-B translated into Chinese | STS | s2s | 1,360 |
| AFQMC | C-MTEB/AFQMC | Ant Financial question matching corpus | STS | s2s | 3,861 |
| QBQTC | C-MTEB/QBQTC | QQ Browser query-title corpus | STS | s2s | 5,000 |
| TNews | C-MTEB/TNews-classification | Short news text classification | Classification | s2s | 10,000 |
| IFlyTek | C-MTEB/IFlyTek-classification | Long text classification of application descriptions | Classification | s2s | 2,600 |
| Waimai | C-MTEB/waimai-classification | Sentiment analysis of user reviews on food delivery platforms | Classification | s2s | 1,000 |
| OnlineShopping | C-MTEB/OnlineShopping-classification | Sentiment analysis of user reviews on online shopping websites | Classification | s2s | 1,000 |
| MultilingualSentiment | C-MTEB/MultilingualSentiment-classification | A set of multilingual sentiment datasets grouped into three categories: positive, neutral, negative | Classification | s2s | 3,000 |
| JDReview | C-MTEB/JDReview-classification | iPhone reviews | Classification | s2s | 533 |

For retrieval tasks, a sample of 100,000 candidates (including the ground truth) is drawn from the entire corpus to reduce inference costs.

### MTEB Evaluation Dataset

See also: [MTEB Related Tasks](https://github.com/embeddings-benchmark/mteb/blob/main/docs/tasks.md)

### CLIP-Benchmark

| Dataset Name | Task Type | Notes |
|--------------|-----------|-------|
| muge | zeroshot_retrieval | Chinese multimodal dataset |
| flickr30k | zeroshot_retrieval | |
| flickr8k | zeroshot_retrieval | |
| mscoco_captions | zeroshot_retrieval | |
| mscoco_captions2017 | zeroshot_retrieval | |
| imagenet1k | zeroshot_classification | |
| imagenetv2 | zeroshot_classification | |
| imagenet_sketch | zeroshot_classification | |
| imagenet-a | zeroshot_classification | |
| imagenet-r | zeroshot_classification | |
| imagenet-o | zeroshot_classification | |
| objectnet | zeroshot_classification | |
| fer2013 | zeroshot_classification | |
| voc2007 | zeroshot_classification | |
| voc2007_multilabel | zeroshot_classification | |
| sun397 | zeroshot_classification | |
| cars | zeroshot_classification | |
| fgvc_aircraft | zeroshot_classification | |
| mnist | zeroshot_classification | |
| stl10 | zeroshot_classification | |
| gtsrb | zeroshot_classification | |
| country211 | zeroshot_classification | |
| renderedsst2 | zeroshot_classification | |
| vtab_caltech101 | zeroshot_classification | |
| vtab_cifar10 | zeroshot_classification | |
| vtab_cifar100 | zeroshot_classification | |
| vtab_clevr_count_all | zeroshot_classification | |
| vtab_clevr_closest_object_distance | zeroshot_classification | |
| vtab_diabetic_retinopathy | zeroshot_classification | |
| vtab_dmlab | zeroshot_classification | |
| vtab_dsprites_label_orientation | zeroshot_classification | |
| vtab_dsprites_label_x_position | zeroshot_classification | |
| vtab_dsprites_label_y_position | zeroshot_classification | |
| vtab_dtd | zeroshot_classification | |
| vtab_eurosat | zeroshot_classification | |
| vtab_kitti_closest_vehicle_distance | zeroshot_classification | |
| vtab_flowers | zeroshot_classification | |
| vtab_pets | zeroshot_classification | |
| vtab_pcam | zeroshot_classification | |
| vtab_resisc45 | zeroshot_classification | |
| vtab_smallnorb_label_azimuth | zeroshot_classification | |
| vtab_smallnorb_label_elevation | zeroshot_classification | |
| vtab_svhn | zeroshot_classification | |