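# Defining a Data Mixing Schema

A data mixing schema defines which datasets are used for evaluation and how they are grouped. Defining it is the first step of a mixed-data evaluation.

## Creating a Schema

An example of a data mixing schema (`CollectionSchema`) is shown below:

**Simple example**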
```python
from evalscope.collections import CollectionSchema, DatasetInfo

simple_schema = CollectionSchema(name='reasoning', datasets=[
    DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
    DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']})
])
```
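Where:
- `name` is the name of the data mixing schema.
- `datasets` is the list of datasets. Each dataset (`DatasetInfo`) has the attributes `name`, `weight`, `task_type`, `tags`, and `args`, among others:
    - `name` is the name of the dataset. For supported names, see the [dataset list](../../get_started/supported_dataset.md#1-原生支持的数据集).
    - `weight` is the weight of the dataset, a float used for weighted sampling. It defaults to 1.0 and must be greater than 0; all weights are normalized during sampling.
    - `task_type` is the task type of the dataset; you can set it freely.
    - `tags` are the labels of the dataset; you can set them freely.
    - `args` are the dataset arguments. For the available arguments, see [dataset parameters](../../get_started/parameters.md#数据集参数).
    - `hierarchy` is the position of the dataset in the schema hierarchy; it is generated automatically by the schema.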
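
The normalization mentioned for `weight` can be sketched in plain Python. This is only an illustration of the arithmetic, not evalscope's implementation; the helper `normalize_weights` is a hypothetical name introduced here.

```python
def normalize_weights(weights: dict) -> dict:
    """Scale positive weights so that they sum to 1.0 (hypothetical helper)."""
    if any(w <= 0 for w in weights.values()):
        raise ValueError('every weight must be greater than 0')
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Weights taken from the simple example: both datasets have weight 1
normalized = normalize_weights({'arc': 1, 'ceval': 1})
print(normalized)  # {'arc': 0.5, 'ceval': 0.5}
```

Because only the ratios matter, `weight=1, weight=1` and `weight=5, weight=5` would describe the same mix under this model.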
**Complex example**
```python
from evalscope.collections import CollectionSchema, DatasetInfo

complex_schema = CollectionSchema(name='math&reasoning', datasets=[
    CollectionSchema(name='math', weight=3, datasets=[
        DatasetInfo(name='gsm8k', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='competition_math', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='cmmlu', weight=1, task_type='math_examination', tags=['zh'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']}),
        DatasetInfo(name='ceval', weight=1, task_type='math_examination', tags=['zh'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']}),
    ]),
    CollectionSchema(name='reasoning', weight=1, datasets=[
        DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
        DatasetInfo(name='ceval', weight=1, task_type='reasoning_examination', tags=['zh'], args={'subset_list': ['logic']}),
        DatasetInfo(name='race', weight=1, task_type='reasoning', tags=['en']),
    ]),
])
```
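- `weight` is the weight of the data mixing schema, a float used for weighted sampling. It defaults to 1.0 and must be greater than 0; all weights are normalized during sampling.
- `datasets` may contain `CollectionSchema` objects, which allows datasets to be nested. During evaluation, the name of each enclosing `CollectionSchema` is recursively added to the tags of every sample.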
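
To make the interaction between nested weights and tags concrete, here is a minimal model of a nested schema in plain Python. It assumes each group splits its share among its children in proportion to their weights and appends group names to the tags from the outermost group inward. The classes `Leaf` and `Group` and the function `flatten` are hypothetical stand-ins, not evalscope's `DatasetInfo`, `CollectionSchema`, or `flatten()`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """Hypothetical stand-in for a dataset entry."""
    name: str
    weight: float = 1.0
    tags: List[str] = field(default_factory=list)


@dataclass
class Group:
    """Hypothetical stand-in for a (possibly nested) schema."""
    name: str
    weight: float = 1.0
    children: List[Union['Group', Leaf]] = field(default_factory=list)


def flatten(group: Group, share: float = 1.0, parents: Tuple[str, ...] = ()) -> List[Leaf]:
    """Split `share` among children by weight and collect group names as tags."""
    parents = parents + (group.name,)
    total = sum(child.weight for child in group.children)
    leaves: List[Leaf] = []
    for child in group.children:
        child_share = share * child.weight / total
        if isinstance(child, Group):
            leaves.extend(flatten(child, child_share, parents))
        else:
            leaves.append(Leaf(child.name, child_share, child.tags + list(parents)))
    return leaves


# Same shape as the complex example: 'math' (weight 3) and 'reasoning' (weight 1)
tree = Group('math&reasoning', children=[
    Group('math', weight=3, children=[
        Leaf('gsm8k', tags=['en']),
        Leaf('competition_math', tags=['en']),
        Leaf('cmmlu', tags=['zh']),
        Leaf('ceval', tags=['zh']),
    ]),
    Group('reasoning', weight=1, children=[
        Leaf('arc', tags=['en']),
        Leaf('ceval', tags=['zh']),
        Leaf('race', tags=['en']),
    ]),
])

flat = flatten(tree)
for leaf in flat:
    print(leaf.name, leaf.weight, leaf.tags)
```

Under this model the `math` group receives 3/4 of the total, so each of its four datasets gets 0.75 / 4 = 0.1875, while each of the three `reasoning` datasets gets 0.25 / 3, roughly 0.0833.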
## Using a Schema

- View the created schema:
```python
print(simple_schema)
```
```json
{
    "name": "reasoning",
    "datasets": [
        {
            "name": "arc",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "en",
                "reasoning"
            ],
            "args": {}
        },
        {
            "name": "ceval",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "zh",
                "reasoning"
            ],
            "args": {
                "subset_list": [
                    "logic"
                ]
            }
        }
    ]
}
```
- View the flattened schema (weights are normalized automatically):
```python
print(complex_schema.flatten())
```
```text
DatasetInfo(name='gsm8k', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='competition_math', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='cmmlu', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']})
DatasetInfo(name='ceval', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']})
DatasetInfo(name='arc', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
DatasetInfo(name='ceval', weight=0.08333333333333333, task_type='reasoning', tags=['zh', 'math&reasoning', 'reasoning'], args={'subset_list': ['logic']})
DatasetInfo(name='race', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
```
- Save the schema:
```python
# Any CollectionSchema instance can be saved; complex_schema is used here
complex_schema.dump_json('outputs/schema.json')
```
- Load a schema from a JSON file:
```python
schema = CollectionSchema.from_json('outputs/schema.json')
```
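
The normalized weights of a flattened schema are intended for weighted sampling. As a rough illustration of what that can mean, the sketch below converts a set of weights into per-dataset sample counts using largest-remainder rounding. This is an assumption made for illustration only: evalscope's own sampler may work differently, and `allocate_samples` as well as the dictionary keys are hypothetical names.

```python
import math


def allocate_samples(weights: dict, total: int) -> dict:
    """Turn weights into integer sample counts that sum to `total`."""
    norm = sum(weights.values())
    exact = {name: total * w / norm for name, w in weights.items()}
    counts = {name: math.floor(x) for name, x in exact.items()}
    leftover = total - sum(counts.values())
    # Hand the remaining samples to the entries with the largest fractional parts
    by_fraction = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


# Weights mirroring the flattened complex schema shown earlier
weights = {
    'gsm8k': 0.1875,
    'competition_math': 0.1875,
    'cmmlu': 0.1875,
    'ceval (math)': 0.1875,
    'arc': 1 / 12,
    'ceval (logic)': 1 / 12,
    'race': 1 / 12,
}
counts = allocate_samples(weights, 100)
print(counts)
```

With 100 samples, each `math` dataset ends up with 19 samples and each `reasoning` dataset with 8 under this rounding rule.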