Defining the Data Mixing Schema
The data mixing schema defines which datasets are used for evaluation and how the data is grouped. This is the first step in the mixed data evaluation process.
Creating the Schema
An example of a data mixing schema (CollectionSchema) is shown below:
Simple Example
from evalscope.collections import CollectionSchema, DatasetInfo
simple_schema = CollectionSchema(name='reasoning', datasets=[
    DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
    DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']})
])
Where:
- name is the name of the data mixing schema.
- datasets is a list of datasets, where each dataset (DatasetInfo) includes attributes such as name, weight, task_type, tags, and args.
  - name is the name of the dataset. Supported dataset names can be found in the dataset list.
  - weight is the weight of the dataset, used for weighted sampling. The default is 1.0, and all weights are normalized during sampling. The value must be greater than 0.
  - task_type is the task type of the dataset and can be filled in as needed.
  - tags are labels for the dataset, which can also be filled in as needed.
  - args are parameters for the dataset; the configurable parameters can be found in the dataset parameters.
  - hierarchy is the hierarchy of the dataset, which is automatically generated by the schema.
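To make the normalization of weight concrete, here is a minimal sketch in plain Python that does not depend on evalscope. The helper normalize_weights is hypothetical and only illustrates the arithmetic; the sampler inside evalscope may be implemented differently.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale positive weights so that they sum to 1 (hypothetical helper)."""
    for name, weight in weights.items():
        # Mirror the documented constraint that every weight is greater than 0.
        if weight <= 0:
            raise ValueError(f"weight for {name!r} must be greater than 0")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Two datasets with equal weight each receive half of the sampling mass.
print(normalize_weights({'arc': 1, 'ceval': 1}))

# A 3:1 ratio becomes 0.75 and 0.25.
print(normalize_weights({'math': 3, 'reasoning': 1}))
```

Only the ratio between weights matters: weights of 1 and 1 give the same mix as weights of 10 and 10.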
Complex Example
complex_schema = CollectionSchema(name='math&reasoning', datasets=[
    CollectionSchema(name='math', weight=3, datasets=[
        DatasetInfo(name='gsm8k', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='competition_math', weight=1, task_type='math', tags=['en']),
        DatasetInfo(name='cmmlu', weight=1, task_type='math', tags=['zh'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']}),
        DatasetInfo(name='ceval', weight=1, task_type='math', tags=['zh'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']}),
    ]),
    CollectionSchema(name='reasoning', weight=1, datasets=[
        DatasetInfo(name='arc', weight=1, task_type='reasoning', tags=['en']),
        DatasetInfo(name='ceval', weight=1, task_type='reasoning', tags=['zh'], args={'subset_list': ['logic']}),
        DatasetInfo(name='race', weight=1, task_type='reasoning', tags=['en']),
    ]),
])
Where:
- weight is the weight of the data mixing schema, used for weighted sampling. The default is 1.0, and all weights are normalized during sampling. The value must be greater than 0.
- datasets can contain CollectionSchema objects, which allows datasets to be nested. During evaluation, the name of each enclosing CollectionSchema is recursively added to the tags of each sample.
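The way nested weights and tags combine can be sketched in plain Python. The classes Leaf and Group and the function flatten below are hypothetical stand-ins written for illustration; they are not the evalscope classes, and the sketch assumes that weights are normalized among siblings at each level and then multiplied down the tree.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """Stand-in for DatasetInfo (hypothetical, not the evalscope class)."""
    name: str
    weight: float = 1.0
    tags: List[str] = field(default_factory=list)


@dataclass
class Group:
    """Stand-in for CollectionSchema (hypothetical, not the evalscope class)."""
    name: str
    weight: float = 1.0
    datasets: List[Union['Group', Leaf]] = field(default_factory=list)


def flatten(group: Group, share: float = 1.0,
            parents: Tuple[str, ...] = ()) -> List[Leaf]:
    """Return leaves with absolute weights and enclosing group names as tags."""
    parents = parents + (group.name,)
    # Normalize among siblings, then scale by the share of the parent group.
    total = sum(child.weight for child in group.datasets)
    leaves: List[Leaf] = []
    for child in group.datasets:
        child_share = share * child.weight / total
        if isinstance(child, Group):
            leaves.extend(flatten(child, child_share, parents))
        else:
            leaves.append(Leaf(child.name, child_share, child.tags + list(parents)))
    return leaves


schema = Group('math&reasoning', datasets=[
    Group('math', weight=3, datasets=[
        Leaf('gsm8k', tags=['en']),
        Leaf('competition_math', tags=['en']),
        Leaf('cmmlu', tags=['zh']),
        Leaf('ceval', tags=['zh']),
    ]),
    Group('reasoning', weight=1, datasets=[
        Leaf('arc', tags=['en']),
        Leaf('ceval', tags=['zh']),
        Leaf('race', tags=['en']),
    ]),
])

flat = flatten(schema)
for leaf in flat:
    print(leaf.name, leaf.weight, leaf.tags)
```

Under these assumptions the math group receives 3/4 of the total, so each of its four datasets gets 0.75 / 4 = 0.1875, while each of the three reasoning datasets gets 0.25 / 3, roughly 0.0833. All leaf weights sum to 1.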
Using the Schema
- To view the created schema:
print(simple_schema)
{
    "name": "reasoning",
    "datasets": [
        {
            "name": "arc",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "en",
                "reasoning"
            ],
            "args": {}
        },
        {
            "name": "ceval",
            "weight": 1,
            "task_type": "reasoning",
            "tags": [
                "zh",
                "reasoning"
            ],
            "args": {
                "subset_list": [
                    "logic"
                ]
            }
        }
    ]
}
- To view the flattened result of the schema (weights are normalized automatically):
print(complex_schema.flatten())
DatasetInfo(name='gsm8k', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='competition_math', weight=0.1875, task_type='math', tags=['en', 'math&reasoning', 'math'], args={})
DatasetInfo(name='cmmlu', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['college_mathematics', 'high_school_mathematics']})
DatasetInfo(name='ceval', weight=0.1875, task_type='math', tags=['zh', 'math&reasoning', 'math'], args={'subset_list': ['advanced_mathematics', 'high_school_mathematics', 'discrete_mathematics', 'middle_school_mathematics']})
DatasetInfo(name='arc', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
DatasetInfo(name='ceval', weight=0.08333333333333333, task_type='reasoning', tags=['zh', 'math&reasoning', 'reasoning'], args={'subset_list': ['logic']})
DatasetInfo(name='race', weight=0.08333333333333333, task_type='reasoning', tags=['en', 'math&reasoning', 'reasoning'], args={})
- To save the schema:
schema.dump_json('outputs/schema.json')
- To load the schema from a JSON file:
schema = CollectionSchema.from_json('outputs/schema.json')