# 👍 Contribute Benchmark EvalScope, the official evaluation tool of [ModelScope](https://modelscope.cn), is continuously optimizing its benchmark evaluation features! We invite you to refer to this tutorial to easily add your own benchmark evaluation and share your contribution with the community. Let's work together to improve EvalScope and make our tool even better! Below, using `MMLU-Pro` as an example, we will introduce how to add a benchmark evaluation, primarily including three steps: uploading the dataset, registering the dataset, and writing the evaluation task. ## Upload Benchmark Evaluation Dataset Upload the benchmark evaluation dataset to ModelScope, which allows users to load the dataset with one click, benefiting more users. Of course, if the dataset already exists, you can skip this step. ```{seealso} For example: [modelscope/MMLU-Pro](https://modelscope.cn/datasets/modelscope/MMLU-Pro/summary), refer to the [dataset upload tutorial](https://www.modelscope.cn/docs/datasets/create). ``` Ensure that the data can be loaded by ModelScope, test the code as follows: ```python from modelscope import MsDataset dataset = MsDataset.load("modelscope/MMLU-Pro") # Replace with your dataset ``` ## Register Benchmark Evaluation Add the benchmark evaluation in EvalScope. ### Create File Structure First, [Fork EvalScope](https://github.com/modelscope/evalscope/fork) repository, which creates a personal copy of the EvalScope repository, and then clone it locally. Then, add the benchmark evaluation under the `evalscope/benchmarks/` directory, with the following structure: ```text evalscope/benchmarks/ ├── benchmark_name │ ├── __init__.py │ ├── benchmark_name_adapter.py │ └── ... ``` For `MMLU-Pro`, the structure is as follows: ```text evalscope/benchmarks/ ├── mmlu_pro │ ├── __init__.py │ ├── mmlu_pro_adapter.py │ └── ... ``` ### Register `Benchmark` We need to register the `Benchmark` in `benchmark_name_adapter.py`, enabling EvalScope to load the benchmark test we added. Using `MMLU-Pro` as an example, it mainly includes the following: - Import `Benchmark` and `DataAdapter` - Register `Benchmark`, specifying: - `name`: Name of the benchmark test - `dataset_id`: Benchmark test dataset ID for loading the dataset - `model_adapter`: Default model adapter for the benchmark test, supporting two types: - `OutputType.GENERATION`: General text generation model evaluation, returns text generated by the model via input prompt - `OutputType.MULTIPLE_CHOICE`: Multiple-choice evaluation, calculates option probability through logits, returns the option with the highest probability - `output_types`: Output types of the benchmark test, supporting multiple selections: - `OutputType.GENERATION`: General text generation model evaluation - `OutputType.MULTIPLE_CHOICE`: Multiple-choice evaluation output logits - `subset_list`: Sub-datasets of the benchmark test dataset - `metric_list`: Evaluation metrics for the benchmark test - `few_shot_num`: Number of In Context Learning samples for evaluation - `train_split`: Training set of the benchmark test, used for sampling ICL examples - `eval_split`: Evaluation set of the benchmark test - `prompt_template`: Prompt template for the benchmark test - Create `MMLUProAdapter` class, inherited from `DataAdapter`. ```{tip} `subset_list`, `train_split`, `eval_split` can be obtained from the dataset preview, for example [MMLU-Pro preview](https://modelscope.cn/datasets/modelscope/MMLU-Pro/dataPeview) ![MMLU-Pro preview](./images/mmlu_pro_preview.png) ``` Example code is as follows: ```python from evalscope.benchmarks import Benchmark, DataAdapter from evalscope.constants import EvalType, OutputType SUBSET_LIST = [ 'computer science', 'math', 'chemistry', 'engineering', 'law', 'biology', 'health', 'physics', 'business', 'philosophy', 'economics', 'other', 'psychology', 'history' ] # customize your subset list @Benchmark.register( name='mmlu_pro', pretty_name='MMLU-Pro', dataset_id='modelscope/MMLU-Pro', model_adapter=OutputType.GENERATION, output_types=[OutputType.MULTIPLE_CHOICE, OutputType.GENERATION], subset_list=SUBSET_LIST, metric_list=['AverageAccuracy'], few_shot_num=5, train_split='validation', eval_split='test', prompt_template= 'The following are multiple choice questions (with answers) about {subset_name}. Think step by step and then finish your answer with \"the answer is (X)\" where X is the correct letter choice.\n{query}', # noqa: E501 ) class MMLUProAdapter(DataAdapter): def __init__(self, **kwargs): super().__init__(**kwargs) ``` ## Write Evaluation Logic After completing the `DataAdapter`, you can add evaluation tasks in EvalScope. The following methods need to be implemented: - `gen_prompt`: Generate model input prompt. - `get_gold_answer`: Parse the standard answer from the dataset. - `parse_pred_result`: Parse the model output, return different answer parsing methods based on different eval_types. - `match`: Match the model output with the dataset standard answer and score. ```{note} If the default `load` logic does not meet the requirements, you can override the `load` method, for example: implement classification of the dataset based on specified fields. ``` Complete example code is as follows: ```python class MMLUProAdapter(DataAdapter): def __init__(self, **kwargs): super().__init__(**kwargs) self.choices = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'] def load(self, **kwargs): # default load all data kwargs['subset_list'] = ['default'] data_dict = super().load(**kwargs) # use `category` as subset key return self.reformat_subset(data_dict, subset_key='category') def gen_prompt(self, input_d: Dict, subset_name: str, few_shot_list: list, **kwargs) -> Any: if self.few_shot_num > 0: prefix = self.format_fewshot_examples(few_shot_list) else: prefix = '' query = prefix + 'Q: ' + input_d['question'] + '\n' + \ self.__form_options(input_d['options']) + '\n' full_prompt = self.prompt_template.format(subset_name=subset_name, query=query) return self.gen_prompt_data(full_prompt) def format_fewshot_examples(self, few_shot_list): # load few-shot prompts for each category prompts = '' for index, d in enumerate(few_shot_list): prompts += 'Q: ' + d['question'] + '\n' + \ self.__form_options(d['options']) + '\n' + \ d['cot_content'] + '\n\n' return prompts def __form_options(self, options: list): option_str = 'Options are:\n' for opt, choice in zip(options, self.choices): option_str += f'({choice}): {opt}' + '\n' return option_str def get_gold_answer(self, input_d: dict) -> str: """ Parse the raw input labels (gold). Args: input_d: input raw data. Depending on the dataset. Returns: The parsed input. e.g. gold answer ... Depending on the dataset. """ return input_d['answer'] def parse_pred_result(self, result: str, raw_input_d: dict = None, eval_type: str = EvalType.CHECKPOINT) -> str: """ Parse the predicted result and extract proper answer. Args: result: Predicted answer from the model. Usually a string for chat. raw_input_d: The raw input. Depending on the dataset. eval_type: 'checkpoint' or 'service' or `custom`, default: 'checkpoint' Returns: The parsed answer. Depending on the dataset. Usually a string for chat. """ if self.model_adapter == OutputType.MULTIPLE_CHOICE: return result else: return ResponseParser.parse_first_option(result) def match(self, gold: str, pred: str) -> float: """ Match the gold answer and the predicted answer. Args: gold (Any): The golden answer. Usually a string for chat/multiple-choice-questions. e.g. 'A', extracted from get_gold_answer method. pred (Any): The predicted answer. Usually a string for chat/multiple-choice-questions. e.g. 'B', extracted from parse_pred_result method. Returns: The match result. Usually a score (float) for chat/multiple-choice-questions. """ return exact_match(gold=gold, pred=pred) ``` ## Run Evaluation Debug the code to see if it can run normally. ```python from evalscope import run_task, TaskConfig task_cfg = TaskConfig( model='Qwen/Qwen2.5-0.5B-Instruct', datasets=['mmlu_pro'], limit=10, dataset_args={'mmlu_pro': {'subset_list': ['computer science', 'math']}}, debug=True ) run_task(task_cfg=task_cfg) ``` Output is as follows: ```text +-----------------------+-----------+-----------------+------------------+-------+---------+---------+ | Model | Dataset | Metric | Subset | Num | Score | Cat.0 | +=======================+===========+=================+==================+=======+=========+=========+ | Qwen2.5-0.5B-Instruct | mmlu_pro | AverageAccuracy | computer science | 10 | 0.1 | default | +-----------------------+-----------+-----------------+------------------+-------+---------+---------+ | Qwen2.5-0.5B-Instruct | mmlu_pro | AverageAccuracy | math | 10 | 0.1 | default | +-----------------------+-----------+-----------------+------------------+-------+---------+---------+ ``` If everything runs smoothly, you can submit a [PR](https://github.com/modelscope/evalscope/pulls). We will review and merge your contribution as soon as possible, allowing more users to benefit from the benchmark evaluation you've contributed. If you're unsure how to submit a PR, you can check out our [guide](https://github.com/modelscope/evalscope/blob/main/CONTRIBUTING.md). Give it a try! 🚀