
# Arena Mode

Arena mode lets you configure multiple candidate models and designate one of them as the baseline. Each candidate is evaluated through pairwise battles against the baseline, and the final output is each model's win rate and ranking. This approach is well suited to comparative evaluation across several models and makes their relative strengths easy to see.

## Data Preparation

To support arena mode, all candidate models must run inference on the same dataset. The dataset can be a general question-answering dataset or a domain-specific one. The example below uses the custom `general_qa` dataset; see the dedicated documentation for details on how to use this dataset.

The JSON Lines file for the `general_qa` dataset must follow the format below. Only the `query` field is required; an optional `response` field supplies a reference answer. Two example files are shown:

- Example contents of `arena.jsonl` (queries only):

  ```json
  {"query": "How can I improve my time management skills?"}
  {"query": "What are the most effective ways to deal with stress?"}
  {"query": "What are the main differences between Python and JavaScript programming languages?"}
  {"query": "How can I increase my productivity while working from home?"}
  {"query": "Can you explain the basics of quantum computing?"}
  ```
- Example contents of `example.jsonl` (with reference answers):

  ```json
  {"query": "What is the capital of France?", "response": "The capital of France is Paris."}
  {"query": "What is the largest mammal in the world?", "response": "The largest mammal in the world is the blue whale."}
  {"query": "How does photosynthesis work?", "response": "Photosynthesis is the process by which green plants use sunlight to synthesize foods with the help of chlorophyll."}
  {"query": "What is the theory of relativity?", "response": "The theory of relativity, developed by Albert Einstein, describes the laws of physics in relation to observers in different frames of reference."}
  {"query": "Who wrote 'To Kill a Mockingbird'?", "response": "Harper Lee wrote 'To Kill a Mockingbird'."}
  ```
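If you build these files programmatically, a minimal sketch like the following (the file name and queries are only illustrative) produces valid JSON Lines:

```python
import json

queries = [
    "How can I improve my time management skills?",
    "What are the most effective ways to deal with stress?",
]

# Write one JSON object per line (JSON Lines format).
# Add a "response" field to each object if you have reference answers.
with open("arena.jsonl", "w", encoding="utf-8") as f:
    for q in queries:
        f.write(json.dumps({"query": q}, ensure_ascii=False) + "\n")
```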

## Candidate Model Inference

Once the dataset is ready, use EvalScope's `run_task` method to run inference with each candidate model. The resulting outputs are used in the subsequent battles.

The example below configures the inference tasks for three candidate models: Qwen2.5-0.5B-Instruct, Qwen2.5-7B-Instruct, and Qwen2.5-72B-Instruct, all using the same configuration.

Run the following code, which builds one task per model and passes the whole list to `run_task`:

```python
import os
from evalscope import TaskConfig, run_task
from evalscope.constants import EvalType

models = ['qwen2.5-72b-instruct', 'qwen2.5-7b-instruct', 'qwen2.5-0.5b-instruct']

# One inference task per candidate model, all sharing the same configuration.
task_list = [TaskConfig(
    model=model,
    api_url='https://dashscope.aliyuncs.com/compatible-mode/v1',
    api_key=os.getenv('DASHSCOPE_API_KEY'),
    eval_type=EvalType.SERVICE,  # evaluate an OpenAI-compatible API service
    datasets=[
        'general_qa',
    ],
    dataset_args={
        'general_qa': {
            'dataset_id': 'custom_eval/text/qa',
            'subset_list': [
                'arena',
                'example'
            ],
        }
    },
    eval_batch_size=10,
    generation_config={
        'temperature': 0,
        'n': 1,
        'max_tokens': 4096,
    }) for model in models]

run_task(task_cfg=task_list)
```
Inference results:

Because the `arena` subset has no reference answers, its results carry no meaningful evaluation metric (the score is reported as -1). The `example` subset has reference answers, so ROUGE and BLEU metrics are reported.

```text
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| Model                 | Dataset    | Metric          | Subset   |   Num |   Score | Cat.0   |
+=======================+============+=================+==========+=======+=========+=========+
| qwen2.5-0.5b-instruct | general_qa | AverageAccuracy | arena    |    10 | -1      | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-1-R       | example  |    12 |  0.8611 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-1-P       | example  |    12 |  0.1341 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-1-F       | example  |    12 |  0.1983 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-2-R       | example  |    12 |  0.55   | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-2-P       | example  |    12 |  0.0404 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-2-F       | example  |    12 |  0.0716 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-L-R       | example  |    12 |  0.8611 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-L-P       | example  |    12 |  0.1193 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | Rouge-L-F       | example  |    12 |  0.1754 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | bleu-1          | example  |    12 |  0.1192 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | bleu-2          | example  |    12 |  0.0403 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | bleu-3          | example  |    12 |  0.0135 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-0.5b-instruct | general_qa | bleu-4          | example  |    12 |  0.0079 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | AverageAccuracy | arena    |    10 | -1      | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-1-R       | example  |    12 |  0.9722 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-1-P       | example  |    12 |  0.1149 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-1-F       | example  |    12 |  0.1612 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-2-R       | example  |    12 |  0.6833 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-2-P       | example  |    12 |  0.0813 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-2-F       | example  |    12 |  0.1027 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-L-R       | example  |    12 |  0.9722 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-L-P       | example  |    12 |  0.101  | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | Rouge-L-F       | example  |    12 |  0.1361 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | bleu-1          | example  |    12 |  0.1009 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | bleu-2          | example  |    12 |  0.0807 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | bleu-3          | example  |    12 |  0.0625 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-72b-instruct  | general_qa | bleu-4          | example  |    12 |  0.0556 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | AverageAccuracy | arena    |    10 | -1      | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-1-R       | example  |    12 |  0.9722 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-1-P       | example  |    12 |  0.104  | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-1-F       | example  |    12 |  0.1418 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-2-R       | example  |    12 |  0.7    | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-2-P       | example  |    12 |  0.078  | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-2-F       | example  |    12 |  0.0964 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-L-R       | example  |    12 |  0.9722 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-L-P       | example  |    12 |  0.0942 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | Rouge-L-F       | example  |    12 |  0.1235 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | bleu-1          | example  |    12 |  0.0939 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | bleu-2          | example  |    12 |  0.0777 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | bleu-3          | example  |    12 |  0.0625 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
| qwen2.5-7b-instruct   | general_qa | bleu-4          | example  |    12 |  0.0556 | default |
+-----------------------+------------+-----------------+----------+-------+---------+---------+
```
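Each run writes per-model reports under `outputs/<timestamp>/reports/<model_id>`; these directories are the `report_path` values needed in the next step. A small sketch (assuming the default `outputs/` layout shown in the paths below) to list the report paths of the most recent run:

```python
import glob
import os

# Assumes the default layout: outputs/<timestamp>/reports/<model_id>.
latest_run = max(glob.glob('outputs/*'), key=os.path.getmtime)
for report_path in sorted(glob.glob(os.path.join(latest_run, 'reports', '*'))):
    print(report_path)
```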

## Candidate Model Battles

Next, use EvalScope's `general_arena` benchmark to run battles among the candidate models and obtain each model's win rate and ranking on every subset. For good automatic battle results, you must configure an LLM as the judge, which compares two model outputs and decides which is better.

During evaluation, EvalScope automatically resolves the evaluation sets shared by all candidate models, then uses the judge model to compare each candidate's output with the baseline's output and rule which is better. To reduce position bias, every comparison is run twice with the two answers in swapped order. The judge's verdicts are parsed into win, tie, or loss, from which each candidate's Elo score (win rate) is computed.
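As a toy illustration of the scoring idea only (not EvalScope's internal implementation): each sample yields two verdicts (one per answer order), and verdicts are averaged with a tie counting as half a win:

```python
# Toy illustration of win-rate scoring; not EvalScope's actual code.
def win_rate(verdicts):
    """verdicts: list of 'win' / 'tie' / 'loss' from the candidate's perspective."""
    points = {'win': 1.0, 'tie': 0.5, 'loss': 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

# e.g. 10 samples judged twice each (order swapped) -> 20 verdicts.
print(win_rate(['win'] * 12 + ['tie'] * 4 + ['loss'] * 4))  # 0.7
```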

Run the following code:

```python
import os
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model_id='Arena',  # placeholder ID for the arena report; may be omitted
    datasets=[
        'general_arena',  # must be 'general_arena' to enable arena mode
    ],
    dataset_args={
        'general_arena': {
            # 'system_prompt': 'xxx',  # optional: custom system prompt for the judge model
            # 'prompt_template': 'xxx',  # optional: custom prompt template for the judge model
            'extra_params': {
                # Candidate model names and their report paths.
                # Each report path is the model's output directory from the
                # previous step and is used to parse its inference results.
                'models': [
                    {
                        'name': 'qwen2.5-0.5b',
                        'report_path': 'outputs/20250702_204346/reports/qwen2.5-0.5b-instruct'
                    },
                    {
                        'name': 'qwen2.5-7b',
                        'report_path': 'outputs/20250702_204346/reports/qwen2.5-7b-instruct'
                    },
                    {
                        'name': 'qwen2.5-72b',
                        'report_path': 'outputs/20250702_204346/reports/qwen2.5-72b-instruct'
                    }
                ],
                # The baseline model; must be one of the candidates above.
                'baseline': 'qwen2.5-7b'
            }
        }
    },
    # Judge model configuration
    judge_model_args={
        'model_id': 'qwen-plus',
        'api_url': 'https://dashscope.aliyuncs.com/compatible-mode/v1',
        'api_key': os.getenv('DASHSCOPE_API_KEY'),
        'generation_config': {
            'temperature': 0.0,
            'max_tokens': 8000
        },
    },
    judge_worker_num=5,
    # use_cache='outputs/xxx'  # optional: path of an existing evaluation run, to add new candidates on top of previous results
)

run_task(task_cfg=task_cfg)
```
Evaluation results (each subset is named `<dataset>&<subset>@<candidate>&<baseline>`):
```text
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Model   | Dataset       | Metric        | Subset                                     |   Num |   Score | Cat.0   |
+=========+===============+===============+============================================+=======+=========+=========+
| Arena   | general_arena | winrate       | general_qa&example@qwen2.5-0.5b&qwen2.5-7b |    12 |  0.0185 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate       | general_qa&example@qwen2.5-72b&qwen2.5-7b  |    12 |  0.5469 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate       | general_qa&arena@qwen2.5-0.5b&qwen2.5-7b   |    10 |  0.075  | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate       | general_qa&arena@qwen2.5-72b&qwen2.5-7b    |    10 |  0.8382 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate       | OVERALL                                    |    44 |  0.3617 | -       |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_lower | general_qa&example@qwen2.5-0.5b&qwen2.5-7b |    12 |  0.0185 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_lower | general_qa&example@qwen2.5-72b&qwen2.5-7b  |    12 |  0.3906 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_lower | general_qa&arena@qwen2.5-0.5b&qwen2.5-7b   |    10 |  0.025  | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_lower | general_qa&arena@qwen2.5-72b&qwen2.5-7b    |    10 |  0.7276 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_lower | OVERALL                                    |    44 |  0.2826 | -       |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_upper | general_qa&example@qwen2.5-0.5b&qwen2.5-7b |    12 |  0.0909 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_upper | general_qa&example@qwen2.5-72b&qwen2.5-7b  |    12 |  0.6875 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_upper | general_qa&arena@qwen2.5-0.5b&qwen2.5-7b   |    10 |  0.0909 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_upper | general_qa&arena@qwen2.5-72b&qwen2.5-7b    |    10 |  0.9412 | default |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
| Arena   | general_arena | winrate_upper | OVERALL                                    |    44 |  0.4469 | -       |
+---------+---------------+---------------+--------------------------------------------+-------+---------+---------+
```

The automatically generated leaderboard is shown below (written to the file `outputs/xxx/reports/Arena/leaderboard.txt`):

The leaderboard is sorted by win rate in descending order; the CI column gives the confidence interval of the win rate (the `winrate_lower` and `winrate_upper` metrics above). The qwen2.5-72b model performs best across all subsets with the highest win rate, while qwen2.5-0.5b performs worst.

```text
=== OVERALL LEADERBOARD ===
Model           WinRate (%)  CI (%)
------------  -------------  ---------------
qwen2.5-72b            69.3  (-13.3 / +12.2)
qwen2.5-7b             50    (+0.0 / +0.0)
qwen2.5-0.5b            4.7  (-2.5 / +4.4)

=== DATASET LEADERBOARD: general_qa ===
Model           WinRate (%)  CI (%)
------------  -------------  ---------------
qwen2.5-72b            69.3  (-13.3 / +12.2)
qwen2.5-7b             50    (+0.0 / +0.0)
qwen2.5-0.5b            4.7  (-2.5 / +4.4)

=== SUBSET LEADERBOARD: general_qa - example ===
Model           WinRate (%)  CI (%)
------------  -------------  ---------------
qwen2.5-72b            54.7  (-15.6 / +14.1)
qwen2.5-7b             50    (+0.0 / +0.0)
qwen2.5-0.5b            1.8  (+0.0 / +7.2)

=== SUBSET LEADERBOARD: general_qa - arena ===
Model           WinRate (%)  CI (%)
------------  -------------  ---------------
qwen2.5-72b            83.8  (-11.1 / +10.3)
qwen2.5-7b             50    (+0.0 / +0.0)
qwen2.5-0.5b            7.5  (-5.0 / +1.6)
```

## Visualizing Battle Results

To present the battles between candidate models and the baseline intuitively, EvalScope provides a visualization tool that compares each candidate against the baseline on every individual sample.

Run the following command to launch the visualization interface:

```bash
evalscope app
```

Open http://localhost:7860 in your browser to access the interface.

The workflow is:

1. Select the most recent `general_arena` evaluation report and click the "Load and View" button.
2. Open the dataset details and choose a candidate-vs-baseline battle result.
3. Adjust the threshold to filter battle results: scores are normalized to the range 0-1, where 0.5 means a tie; the higher the score, the better the candidate performs relative to the baseline, and vice versa.

The example below shows one battle between qwen2.5-72b and qwen2.5-7b, where the judge ruled that the 72B model's answer was better:

*(Screenshot: per-sample battle comparison in the visualization interface)*