Parameters
Run `evalscope eval --help` to get a complete list of parameter descriptions.
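For orientation, a minimal run usually only needs a model and one or more datasets; everything else falls back to the defaults described below. The snippet is a sketch (the dataset choice is illustrative):

```shell
# Minimal example: the model is downloaded from ModelScope automatically,
# and gsm8k is evaluated with default settings.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k
```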
Model Parameters
- `--model`: The name of the model being evaluated.
  - Specify the model's `id` in ModelScope, and it will automatically download the model, for example `Qwen/Qwen2.5-0.5B-Instruct`;
  - Specify the local path to the model, for example `/path/to/model`, to load the model from the local environment;
  - When the evaluation target is a model API endpoint, it needs to be specified as the `model_id`, for example `Qwen2.5-0.5B-Instruct`.
- `--model-id`: An alias for the model being evaluated. Defaults to the last part of `model`; for example, the `model-id` for `Qwen/Qwen2.5-0.5B-Instruct` is `Qwen2.5-0.5B-Instruct`.
- `--model-task`: The task type of the model, defaults to `text_generation`; options are `text_generation` and `image_generation`.
- `--model-args`: Model loading parameters, separated by commas in `key=value` format, with default parameters:
  - `revision`: Model version, defaults to `master`
  - `precision`: Model precision, defaults to `torch.float16`
  - `device_map`: Device allocation for the model, defaults to `auto`
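As a sketch of the local-path case described above, the command below loads a model from disk and overrides the default alias; the path and alias are placeholders:

```shell
# Load a model from a local directory and give it an explicit alias for reports.
evalscope eval \
  --model /path/to/model \
  --model-id my-local-model \
  --model-args revision=master,precision=torch.float16,device_map=auto \
  --datasets gsm8k
```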
- `--generation-config`: Generation parameters, separated by commas in `key=value` form or passed in as a JSON string, which will be parsed into a dictionary:
  - If using local model inference (based on Transformers), the following parameters are included (Full parameter guide):
    - `do_sample`: Whether to use sampling, default is `false`
    - `max_length`: Maximum length, default is 2048
    - `max_new_tokens`: Maximum length of generated text, default is 512
    - `num_return_sequences`: Number of sequences to generate, default is 1; when set greater than 1, multiple sequences will be generated, which requires setting `do_sample=true`
    - `temperature`: Generation temperature
    - `top_k`: Top-k for generation
    - `top_p`: Top-p for generation
  - If using a model API service for inference (`eval-type` set to `service`), the following parameters are included (please refer to the deployed model service for specifics):
    - `max_tokens`: Maximum length of generated text, default is 2048
    - `temperature`: Generation temperature, default is 0.0
    - `n`: Number of generated sequences, default is 1 (Note: currently, lmdeploy only supports `n=1`)

```shell
# For example, pass arguments in the form of key=value
--model-args revision=master,precision=torch.float16,device_map=auto
--generation-config do_sample=true,temperature=0.5
# Or pass more complex parameters using a JSON string
--model-args '{"revision": "master", "precision": "torch.float16", "device_map": "auto"}'
--generation-config '{"do_sample":true,"temperature":0.5,"chat_template_kwargs":{"enable_thinking": false}}'
```
- `--chat-template`: Model inference template, defaults to `None`, indicating the use of transformers' `apply_chat_template`; supports passing in a Jinja template string to customize the inference template.
- `--template-type`: Model inference template, deprecated; refer to `--chat-template`.
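If a custom inference template is needed, one shell-level sketch (not the only option) is to keep the Jinja template in a file and pass its contents to `--chat-template`; the file path here is hypothetical:

```shell
# Pass a custom Jinja chat template string read from a local file.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --chat-template "$(cat /path/to/custom_chat_template.jinja)"
```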
The following parameters are only valid when `eval-type=service`:
- `--api-url`: Model API endpoint, default is `None`; supports local or remote OpenAI API format endpoints, for example `http://127.0.0.1:8000/v1`.
- `--api-key`: Model API endpoint key, default is `EMPTY`
- `--timeout`: Model API request timeout, default is `None`
- `--stream`: Whether to use streaming transmission, default is `False`
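A sketch of evaluating a deployed OpenAI-compatible endpoint with these parameters; the URL, key, and model name are placeholders for your own service:

```shell
# Evaluate a model served behind an OpenAI-compatible API.
evalscope eval \
  --model Qwen2.5-0.5B-Instruct \
  --eval-type service \
  --api-url http://127.0.0.1:8000/v1 \
  --api-key EMPTY \
  --timeout 600 \
  --datasets gsm8k
```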
Dataset Parameters
- `--datasets`: Dataset names; multiple datasets can be passed, separated by spaces, and will be downloaded automatically from ModelScope. For supported datasets, refer to the Dataset List.
- `--dataset-args`: Configuration parameters for the evaluation datasets, passed in `json` format, where the key is the dataset name and the value is the parameters; note that the keys need to correspond one-to-one with the values of the `--datasets` parameter:
  - `dataset_id` (or `local_path`): Local path for the dataset; once specified, it will attempt to load local data.
  - `prompt_template`: The prompt template for the evaluation dataset. When specified, it will be used to generate prompts. For example, the template for the `gsm8k` dataset is `Question: {query}\nLet's think step by step\nAnswer:`. The question from the dataset will be filled into the `query` field of the template.
  - `query_template`: The query template for the evaluation dataset. When specified, it will be used to generate queries. For example, the template for `general_mcq` is `Question: {question}\n{choices}\nAnswer: {answer}\n\n`. The questions from the dataset will be inserted into the `question` field, options into the `choices` field, and answers into the `answer` field (answer insertion is only effective for few-shot scenarios).
  - `system_prompt`: System prompt for the evaluation dataset.
  - `model_adapter`: The model adapter for the evaluation dataset. Once specified, the given model adapter will be used for evaluation. Currently supported values are `generation`, `multiple_choice_logits`, and `continuous_logits`. For service evaluation, only `generation` is supported at the moment; some multiple-choice datasets support `logits` output.
  - `subset_list`: List of subsets for the evaluation dataset; once specified, only the subset data will be used.
  - `few_shot_num`: Number of few-shot examples.
  - `few_shot_random`: Whether to randomly sample few-shot data, defaults to `False`.
  - `metric_list`: A list of metrics for evaluating the dataset. When specified, the evaluation will use the given metrics. Currently supported metrics include `AverageAccuracy`, `AveragePass@1`, and `Pass@[1-16]`. For example, for the `humaneval` dataset you can specify `["Pass@1", "Pass@5"]`; note that in this case you need to set `n=5` so that the model returns 5 results.
  - `filters`: Filters for the evaluation dataset. When specified, they will be used to process the evaluation results; this is useful for handling the output of reasoning models. Currently supported filters are:
    - `remove_until {string}`: Removes the part of the model's output before the specified string.
    - `extract {regex}`: Extracts the part of the model's output that matches the specified regular expression.

    For example, the `ifeval` dataset can specify `{"remove_until": "</think>"}`, which filters out the part of the model's output before `</think>`, avoiding interference with scoring.
```shell
# For example
--datasets gsm8k arc ifeval
--dataset-args '{"gsm8k": {"few_shot_num": 4, "few_shot_random": false}, "arc": {"dataset_id": "/path/to/arc"}, "ifeval": {"filters": {"remove_until": "</think>"}}}'
```

- `--dataset-dir`: Dataset download path, defaults to `~/.cache/modelscope/datasets`.
- `--dataset-hub`: Dataset download source, defaults to `modelscope`; the alternative is `huggingface`.
- `--limit`: The maximum amount of evaluation data for each dataset. If not specified, the entire dataset is evaluated by default; this parameter is useful for quick validation. It supports both `int` and `float` values: an `int` value means evaluating the first `N` entries of the dataset, while a `float` value means evaluating the first `N%` of the dataset. For example, `0.1` means evaluating the first 10% of the dataset, and `100` means evaluating the first 100 entries.
Evaluation Parameters
- `--eval-batch-size`: Evaluation batch size, default is `1`; when `eval-type=service`, it indicates the number of concurrent evaluation requests, default is `8`.
- `--eval-stage`: (Deprecated, refer to `--use-cache`) Evaluation stage; options are `all`, `infer`, `review`; default is `all`.
- `--eval-type`: Evaluation type; options are `checkpoint`, `custom`, `service`; defaults to `checkpoint`.
- `--eval-backend`: Evaluation backend; options are `Native`, `OpenCompass`, `VLMEvalKit`, `RAGEval`, `ThirdParty`; defaults to `Native`. For the other evaluation backends, see the [User Guide](../user_guides/backend/index.md).
  - `OpenCompass` is used for evaluating large language models.
  - `VLMEvalKit` is used for evaluating multimodal models.
  - `RAGEval` is used for evaluating RAG pipelines, embedding models, re-ranking models, and CLIP models.
  - `ThirdParty` is used for other special task evaluations, such as ToolBench and LongBench.
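Note that `--eval-batch-size` changes meaning with the evaluation type, as in this sketch:

```shell
# Local checkpoint evaluation: batch size 4 for inference.
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --eval-batch-size 4

# Service evaluation: 16 concurrent requests against the API endpoint.
evalscope eval --model Qwen2.5-0.5B-Instruct --eval-type service \
  --api-url http://127.0.0.1:8000/v1 --datasets gsm8k --eval-batch-size 16
```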
- `--eval-config`: This parameter needs to be passed when using a non-`Native` evaluation backend.
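An illustrative (not authoritative) sketch of selecting a non-Native backend; the config file path is a placeholder, and the exact contents expected by `--eval-config` depend on the chosen backend (see the backend User Guide linked above):

```shell
# Run evaluation through the VLMEvalKit backend with a backend-specific config file.
evalscope eval \
  --eval-backend VLMEvalKit \
  --eval-config /path/to/vlmevalkit_task_config.yaml
```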
Judge Parameters
LLM-as-a-Judge evaluation uses a judge model to determine whether answers are correct, and is controlled by the following parameters:
- `--judge-strategy`: The strategy for using the judge model; options include:
  - `auto`: The default strategy, which decides whether to use the judge model based on the dataset requirements
  - `llm`: Always use the judge model
  - `rule`: Do not use the judge model; use rule-based judgment instead
  - `llm_recall`: First use rule-based judgment, and if it fails, use the judge model
- `--judge-worker-num`: The concurrency for the judge model, default is `1`
- `--judge-model-args`: Parameters for the judge model, passed in as a `json` string and parsed into a dictionary, supporting the following fields:
  - `api_key`: The API endpoint key for the model. If not set, it will be retrieved from the environment variable `MODELSCOPE_SDK_TOKEN`, with a default value of `EMPTY`.
  - `api_url`: The API endpoint for the model. If not set, it will be retrieved from the environment variable `MODELSCOPE_API_BASE`, with a default value of `https://api-inference.modelscope.cn/v1/`.
  - `model_id`: The model ID. If not set, it will be retrieved from the environment variable `MODELSCOPE_JUDGE_LLM`, with a default value of `Qwen/Qwen3-235B-A22B`. For more information on ModelScope's model inference services, please refer to [ModelScope API Inference Services](https://modelscope.cn/docs/model-service/API-Inference/intro).
  - `system_prompt`: System prompt for the evaluation dataset
  - `prompt_template`: Prompt template for the evaluation dataset
  - `generation_config`: Model generation parameters, same as the `--generation-config` parameter.
  - `score_type`: Preset model scoring method; options include:
    - `pattern`: (Default option) Directly judge whether the model output matches the reference answer; suitable for evaluations with reference answers. Default `prompt_template`:

      ```text
      Your job is to look at a question, a gold target, and a predicted answer, and return a letter "A" or "B" to indicate whether the predicted answer is correct or incorrect.

      [Question]
      {question}

      [Reference Answer]
      {gold}

      [Predicted Answer]
      {pred}

      Evaluate the model's answer based on correctness compared to the reference answer. Grade the predicted answer of this new question as one of:
      A: CORRECT
      B: INCORRECT

      Just return the letters "A" or "B", with no text around it.
      ```

    - `numeric`: Judge the score of the model output under the prompt conditions; suitable for evaluations without reference answers. Default `prompt_template`:

      ```text
      Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 0 (worst) to 1 (best) by strictly following this format: "[[rating]]", for example: "Rating: [[0.5]]"

      [Question]
      {question}

      [Response]
      {pred}
      ```

  - `score_pattern`: Regular expression for parsing the model output; the default for `pattern` mode is `(A|B)`, and the default for `numeric` mode is `\[\[(\d+(?:\.\d+)?)\]\]`, used to extract the model scoring result.
  - `score_mapping`: Score mapping dictionary for `pattern` mode, default is `{'A': 1.0, 'B': 0.0}`
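Putting the judge parameters together, a sketch of forcing LLM-based judging with an explicit judge model; the API key is a placeholder, and the other values are the documented defaults:

```shell
# Always use the judge model, with 4 concurrent judge requests.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --judge-strategy llm \
  --judge-worker-num 4 \
  --judge-model-args '{"model_id": "Qwen/Qwen3-235B-A22B", "api_url": "https://api-inference.modelscope.cn/v1/", "api_key": "YOUR_MODELSCOPE_TOKEN", "score_type": "pattern"}'
```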
- `--analysis-report`: Whether to generate an analysis report, default is `false`; if this parameter is set, an analysis report will be generated using the judge model, including analysis, interpretation, and suggestions for the model evaluation results. The report language is determined automatically based on `locale.getlocale()`.
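Since the analysis report is generated by the judge model, that model must be reachable; a sketch relying on the environment-variable fallbacks documented under `--judge-model-args`:

```shell
# The judge model defaults are read from MODELSCOPE_JUDGE_LLM / MODELSCOPE_API_BASE /
# MODELSCOPE_SDK_TOKEN when --judge-model-args is not set explicitly.
export MODELSCOPE_SDK_TOKEN=YOUR_MODELSCOPE_TOKEN
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --analysis-report
```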
Other Parameters
- `--work-dir`: Model evaluation output path, default is `./outputs/{timestamp}`; an example of the folder structure:

  ```text
  .
  ├── configs
  │   └── task_config_b6f42c.yaml
  ├── logs
  │   └── eval_log.log
  ├── predictions
  │   └── Qwen2.5-0.5B-Instruct
  │       └── general_qa_example.jsonl
  ├── reports
  │   └── Qwen2.5-0.5B-Instruct
  │       └── general_qa.json
  └── reviews
      └── Qwen2.5-0.5B-Instruct
          └── general_qa_example.jsonl
  ```

- `--use-cache`: Use a local cache path, default is `None`; if a path is specified, such as `outputs/20241210_194434`, the model inference results from that path will be reused. If inference is not complete, it will continue inference and then proceed to evaluation.
- `--seed`: Random seed, default is `42`.
- `--debug`: Whether to enable debug mode, default is `false`.
- `--ignore-errors`: Whether to ignore errors during model generation, default is `false`.
- `--dry-run`: Pre-check parameters without performing inference; only prints parameters. Default is `false`.
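As a sketch of reusing previous results, the run below writes to a custom output directory while reusing cached predictions from an earlier run (the cache path shown is the example from above):

```shell
# Reuse inference results from a previous run; only missing inference is redone,
# then the review and report stages are executed.
evalscope eval \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --datasets gsm8k \
  --work-dir ./outputs/my_eval_run \
  --use-cache outputs/20241210_194434 \
  --seed 42
```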