Large Language Model
This framework supports two predefined custom dataset formats: multiple-choice questions (MCQ) and question answering (QA). The usage process is as follows:
Multiple-Choice Question Format (MCQ)
Suitable for scenarios where the evaluation data consists of multiple-choice questions. The evaluation metric is accuracy.
1. Data Preparation
Prepare files in multiple-choice question format, supporting both CSV and JSONL formats. The directory structure is as follows:
CSV Format
mcq/
├── example_dev.csv # (Optional) File name composed of `{subset_name}_dev.csv`, used for few-shot evaluation
└── example_val.csv # File name composed of `{subset_name}_val.csv`, used for actual evaluation data
CSV files should be in the following format:
id,question,A,B,C,D,answer
1,Generally speaking, the amino acids that make up animal proteins are ____,4 types,22 types,20 types,19 types,C
2,Among the substances present in the blood, which one is not a metabolic end product?____,Urea,Uric acid,Pyruvic acid,Carbon dioxide,C
JSONL Format
mcq/
├── example_dev.jsonl # (Optional) File name composed of `{subset_name}_dev.jsonl`, used for few-shot evaluation
└── example_val.jsonl # File name composed of `{subset_name}_val.jsonl`, used for actual evaluation data
JSONL files should be in the following format:
{"id": "1", "question": "Generally speaking, the amino acids that make up animal proteins are ____", "A": "4 types", "B": "22 types", "C": "20 types", "D": "19 types", "answer": "C"}
{"id": "2", "question": "Among the substances present in the blood, which one is not a metabolic end product?____", "A": "Urea", "B": "Uric acid", "C": "Pyruvic acid", "D": "Carbon dioxide", "answer": "C"}
Where:
- `id` is the serial number (optional field)
- `question` is the query
- `A`, `B`, `C`, `D`, etc. are the options, supporting up to 10 choices
- `answer` is the correct option
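If you are converting an existing question bank, a short script is enough to produce files in this layout. Below is a minimal sketch; the question data and output path are illustrative (the path matches the local_path used in the configuration below):

import csv
import json
import os

# Illustrative question data; replace with your own question bank
questions = [
    {"id": "1", "question": "Generally speaking, the amino acids that make up animal proteins are ____",
     "A": "4 types", "B": "22 types", "C": "20 types", "D": "19 types", "answer": "C"},
]

out_dir = "custom_eval/text/mcq"
os.makedirs(out_dir, exist_ok=True)

# CSV variant: {subset_name}_val.csv
with open(os.path.join(out_dir, "example_val.csv"), "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "question", "A", "B", "C", "D", "answer"])
    writer.writeheader()
    writer.writerows(questions)

# JSONL variant: {subset_name}_val.jsonl
with open(os.path.join(out_dir, "example_val.jsonl"), "w", encoding="utf-8") as f:
    for q in questions:
        f.write(json.dumps(q, ensure_ascii=False) + "\n")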
2. Configure the Task
Run the following code to start the evaluation:
from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
    model='Qwen/Qwen2-0.5B-Instruct',
    datasets=['general_mcq'],  # Data format, fixed as 'general_mcq' for the multiple-choice format
    dataset_args={
        'general_mcq': {
            "local_path": "custom_eval/text/mcq",  # Custom dataset path
            "subset_list": [
                "example"  # Evaluation dataset name, i.e. the {subset_name} mentioned above
            ]
        }
    },
)
run_task(task_cfg=task_cfg)
Results:
+---------------------+-------------+-----------------+---------+-----+--------+---------+
| Model               | Dataset     | Metric          | Subset  | Num | Score  | Cat.0   |
+=====================+=============+=================+=========+=====+========+=========+
| Qwen2-0.5B-Instruct | general_mcq | AverageAccuracy | example | 12  | 0.5833 | default |
+---------------------+-------------+-----------------+---------+-----+--------+---------+
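AverageAccuracy here is simply the fraction of questions whose predicted option matches answer; with Num = 12, a score of 0.5833 corresponds to 7 of the 12 questions being answered correctly (7 / 12 ≈ 0.5833).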
Question-Answering Format (QA)
This framework accommodates two formats for question-and-answer tasks: those with reference answers and those without.
- Reference Answer Q&A: Suitable for questions with clear correct answers. The default evaluation metrics are ROUGE and BLEU, and an LLM judge can also be configured for semantic correctness assessment.
- Reference-free Answer Q&A: Suitable for questions without definitive correct answers, such as open-ended questions. By default, no evaluation metrics are provided, but an LLM judge can be configured to score the generated answers.
Here's how to use it:
Data Preparation
Prepare a JSONL file in the Q&A format, for example a directory containing the following file:
qa/
└── example.jsonl
The JSONL file should be formatted as follows:
{"system": "You are a geographer", "query": "What is the capital of China?", "response": "The capital of China is Beijing"}
{"query": "What is the highest mountain in the world?", "response": "It is Mount Everest"}
{"query": "Why are there no penguins in the Arctic?", "response": "Because penguins mostly live in Antarctica"}
Where:
- `system` is the system prompt (optional field)
- `query` is the question (mandatory)
- `response` is the correct answer. For reference-answer Q&A tasks this field must be present; for reference-free Q&A tasks it can be empty.
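As with the MCQ format, an existing Q&A set can be dumped into this layout with a few lines of Python. A minimal sketch, with illustrative sample data and an output path matching the configuration below:

import json
import os

# Illustrative samples; "system" is optional, and "response" may be empty for reference-free tasks
samples = [
    {"system": "You are a geographer", "query": "What is the capital of China?",
     "response": "The capital of China is Beijing"},
    {"query": "What is the highest mountain in the world?", "response": "It is Mount Everest"},
]

out_dir = "custom_eval/text/qa"
os.makedirs(out_dir, exist_ok=True)
with open(os.path.join(out_dir, "example.jsonl"), "w", encoding="utf-8") as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + "\n")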
Reference Answer Q&A
Below is how to configure the evaluation of reference answer Q&A tasks using the Qwen2.5 model on example.jsonl.
Method 1: Evaluation based on ROUGE and BLEU
Simply run the following code:
from evalscope import TaskConfig, run_task
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=['general_qa'],  # Data format, fixed as 'general_qa' for Q&A tasks
    dataset_args={
        'general_qa': {
            "local_path": "custom_eval/text/qa",  # Custom dataset path
            "subset_list": [
                # Evaluation dataset name, i.e. the * in *.jsonl above; multiple subsets can be configured
                "example"
            ]
        }
    },
)
run_task(task_cfg=task_cfg)
Evaluation results:
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Model                 | Dataset    | Metric    | Subset  | Num | Score  | Cat.0   |
+=======================+============+===========+=========+=====+========+=========+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-1-R | example | 12  | 0.694  | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-1-P | example | 12  | 0.176  | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-1-F | example | 12  | 0.2276 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-2-R | example | 12  | 0.4667 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-2-P | example | 12  | 0.0939 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-2-F | example | 12  | 0.1226 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-L-R | example | 12  | 0.6528 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-L-P | example | 12  | 0.1628 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | Rouge-L-F | example | 12  | 0.2063 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | bleu-1    | example | 12  | 0.164  | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | bleu-2    | example | 12  | 0.0935 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | bleu-3    | example | 12  | 0.065  | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
| Qwen2.5-0.5B-Instruct | general_qa | bleu-4    | example | 12  | 0.0556 | default |
+-----------------------+------------+-----------+---------+-----+--------+---------+
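For interpretation: Rouge-1 and Rouge-2 measure unigram and bigram overlap between the generated answer and the reference, Rouge-L is based on their longest common subsequence, and the -R/-P/-F suffixes denote recall, precision, and F-score; bleu-n is n-gram precision. As a rough illustration of what Rouge-L-R measures, here is a minimal token-level sketch (not the tokenization or implementation the framework actually uses):

def rouge_l_recall(candidate: str, reference: str) -> float:
    """ROUGE-L recall: length of the longest common subsequence divided by the reference length."""
    cand, ref = candidate.split(), reference.split()
    # Dynamic-programming table for the longest common subsequence
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1] / len(ref) if ref else 0.0

print(rouge_l_recall("The capital is Beijing", "The capital of China is Beijing"))  # 4/6 ≈ 0.667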
Method 2: Evaluation based on LLM
LLM-based evaluation makes it easy to assess the correctness of model outputs (or other evaluation dimensions, which require custom prompt settings). Below is an example that configures the judge_model_args parameters and uses the preset pattern mode to determine whether the model output is correct.
For a complete explanation of the judge parameters, please refer to the documentation.
import os
from evalscope import TaskConfig, run_task
from evalscope.constants import JudgeStrategy
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=[
        'general_qa',
    ],
    dataset_args={
        'general_qa': {
            'dataset_id': 'custom_eval/text/qa',
            'subset_list': [
                'example'
            ],
        }
    },
    # Judge-related parameters
    judge_model_args={
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 4096
        },
        # Determine whether the model output is correct based on the reference answer and the model output
        'score_type': 'pattern',
    },
    # Number of concurrent judge requests
    judge_worker_num=5,
    # Use an LLM as the judge
    judge_strategy=JudgeStrategy.LLM,
)
run_task(task_cfg=task_cfg)
Evaluation results:
+-----------------------+------------+-----------------+---------+-----+--------+---------+
| Model                 | Dataset    | Metric          | Subset  | Num | Score  | Cat.0   |
+=======================+============+=================+=========+=====+========+=========+
| Qwen2.5-0.5B-Instruct | general_qa | AverageAccuracy | example | 12  | 0.583  | default |
+-----------------------+------------+-----------------+---------+-----+--------+---------+
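In pattern mode the judge compares each model output with the reference answer and returns a per-sample correct/incorrect verdict, so the reported score is again an average accuracy over the 12 samples.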
Reference-free Answer Q&A
If the dataset lacks reference answers, an LLM judge can be used to evaluate the model's answers. If no LLM judge is configured, no scores will be reported.
Below is an example that configures the judge_model_args parameters and uses the preset numeric mode to automatically score the model output along dimensions such as accuracy, relevance, and usefulness. Higher scores indicate better model output.
For a complete explanation of the judge parameters, please refer to the documentation.
import os
from evalscope import TaskConfig, run_task
from evalscope.constants import JudgeStrategy
task_cfg = TaskConfig(
    model='Qwen/Qwen2.5-0.5B-Instruct',
    datasets=[
        'general_qa',
    ],
    dataset_args={
        'general_qa': {
            'dataset_id': 'custom_eval/text/qa',
            'subset_list': [
                'example'
            ],
        }
    },
    # Judge-related parameters
    judge_model_args={
        'model_id': 'qwen2.5-72b-instruct',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 4096
        },
        # Score the output directly
        'score_type': 'numeric',
    },
    # Number of concurrent judge requests
    judge_worker_num=5,
    # Use an LLM as the judge
    judge_strategy=JudgeStrategy.LLM,
)
run_task(task_cfg=task_cfg)
Evaluation results:
+-----------------------+------------+-----------------+---------+-----+--------+---------+
| Model                 | Dataset    | Metric          | Subset  | Num | Score  | Cat.0   |
+=======================+============+=================+=========+=====+========+=========+
| Qwen2.5-0.5B-Instruct | general_qa | AverageAccuracy | example | 12  | 0.6375 | default |
+-----------------------+------------+-----------------+---------+-----+--------+---------+
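In numeric mode the judge assigns each answer a score along the configured dimensions (here, the preset prompt covers accuracy, relevance, and usefulness), and the value reported above is the average of those per-sample scores; higher is better.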