# Best Practices for Evaluating the Reasoning Ability of R1-Style Models
With the widespread adoption of the DeepSeek-R1 model, more and more developers are attempting to reproduce similar models in order to improve their reasoning ability, and many impressive results have already emerged. But has the reasoning ability of these new models actually improved? EvalScope, an open-source evaluation framework from the ModelScope community, provides the tooling to benchmark the reasoning performance of R1-style models.
In this best-practice guide, we demonstrate the process on 728 reasoning problems (the same set used in the R1 technical report). The evaluation data consists of:
- [MATH-500](https://modelscope.cn/datasets/AI-ModelScope/MATH-500/summary): a challenging dataset of high-school math competition problems covering seven subjects (e.g., Prealgebra, Algebra, Number Theory), 500 problems in total.
- [GPQA-Diamond](https://modelscope.cn/datasets/AI-ModelScope/gpqa_diamond/summary): graduate-level multiple-choice questions in subdomains of physics, chemistry, and biology, 198 questions in total.
- [AIME-2024](https://modelscope.cn/datasets/AI-ModelScope/AIME_2024): problems from the 2024 American Invitational Mathematics Examination, 30 math problems in total.
This guide walks through installing dependencies, preparing the model, running the evaluation, and visualizing the results. Let's get started.
## Installing Dependencies
First, install the [EvalScope](https://github.com/modelscope/evalscope) model evaluation framework:
```bash
pip install 'evalscope[app,perf]' -U
```
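To confirm the installation succeeded, you can query the installed package version (a minimal sanity check; any recent version works):

```shell
# Print the installed evalscope version, or a hint if the install failed.
python3 -c "import importlib.metadata as m; print(m.version('evalscope'))" \
  2>/dev/null || echo "evalscope not installed -- re-run the pip command above"
```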
## Model Preparation
Next, we use the DeepSeek-R1-Distill-Qwen-1.5B model as an example to walk through the evaluation process. We first expose the model through an OpenAI-API-compatible inference service and run the evaluation against that endpoint. EvalScope also supports evaluating models directly via transformers inference; see the EvalScope documentation for details.
Besides deploying the model to a cloud service that exposes an OpenAI-compatible API, you can also serve it locally with frameworks such as vLLM or Ollama. Here we cover the [vLLM](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) inference frameworks: both handle many concurrent requests well, which speeds up evaluation. Moreover, R1-style models emit long chains of thought, often exceeding 10,000 output tokens, so serving the model with an efficient inference framework significantly improves inference speed.
**Using vLLM**:
```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served-model-name DeepSeek-R1-Distill-Qwen-1.5B --trust-remote-code --port 8801
```
Or **using lmdeploy**:
```bash
LMDEPLOY_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --model-name DeepSeek-R1-Distill-Qwen-1.5B --server-port 8801
```
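Once either server is up, you can sanity-check the OpenAI-compatible endpoint with a plain `curl` request (a quick smoke test, not part of the evaluation itself; the port and model name match the serve commands above, so adjust them if you changed anything):

```shell
# Request payload for the OpenAI-compatible /v1/chat/completions endpoint.
PAYLOAD='{
  "model": "DeepSeek-R1-Distill-Qwen-1.5B",
  "messages": [{"role": "user", "content": "Which is larger, 9.11 or 9.8?"}],
  "max_tokens": 512
}'

# Send the request; print a hint instead of failing if the server is not up yet.
curl -s http://127.0.0.1:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" \
  || echo "server not reachable -- start vLLM or lmdeploy first"
```

A successful response is a JSON object whose `choices[0].message.content` field contains the model's (typically long, chain-of-thought) answer.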
**(Optional) Benchmark the inference service**
Before starting the formal evaluation, you can benchmark the inference service in order to choose the better-performing engine, using the `perf` subcommand of `evalscope`:
```bash
evalscope perf \
--parallel 10 \
--url http://127.0.0.1:8801/v1/chat/completions \
--model DeepSeek-R1-Distill-Qwen-1.5B \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--api openai \
--prompt '写一个科幻小说,不少于2000字,请开始你的表演' \
-n 100
```
For a detailed description of these parameters, see the [Stress Test Quick Start](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/quick_start.html).
Inference service benchmark results:
```text
Benchmarking summary:
+-----------------------------------+-------------------------------------------------------------------------+
| Key | Value |
+===================================+=========================================================================+
| Time taken for tests (s) | 92.66 |
+-----------------------------------+-------------------------------------------------------------------------+
| Number of concurrency | 10 |
+-----------------------------------+-------------------------------------------------------------------------+
| Total requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Succeed requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Failed requests | 0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Throughput(average tokens/s) | 1727.453 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average QPS | 1.079 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average latency (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average time to first token (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average time per output token (s) | 0.00058 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average input tokens per request | 20.0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average output tokens per request | 1600.66 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average package latency (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average package per request | 1.0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Expected number of requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Result DB path | outputs/20250213_103632/DeepSeek-R1-Distill-Qwen-1.5B/benchmark_data.db |
+-----------------------------------+-------------------------------------------------------------------------+
Percentile results:
+------------+----------+----------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
| 10% | 5.4506 | nan | 5.4506 | 20 | 1011 | 183.7254 |
| 25% | 6.1689 | nan | 6.1689 | 20 | 1145 | 184.9222 |
| 50% | 9.385 | nan | 9.385 | 20 | 1741 | 185.5081 |
| 66% | 11.0023 | nan | 11.0023 | 20 | 2048 | 185.8063 |
| 75% | 11.0374 | nan | 11.0374 | 20 | 2048 | 186.1429 |
| 80% | 11.047 | nan | 11.047 | 20 | 2048 | 186.3683 |
| 90% | 11.075 | nan | 11.075 | 20 | 2048 | 186.5962 |
| 95% | 11.147 | nan | 11.147 | 20 | 2048 | 186.7836 |
| 98% | 11.1574 | nan | 11.1574 | 20 | 2048 | 187.4917 |
| 99% | 11.1688 | nan | 11.1688 | 20 | 2048 | 197.4991 |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
```
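As a quick sanity check on the summary above (this is our own arithmetic, not part of the `evalscope perf` output), the reported average throughput should equal the total number of output tokens divided by the wall-clock time of the run:

```shell
# Cross-check: Throughput(average tokens/s) ~= total output tokens / total time.
# All numbers are taken from the summary table above.
awk 'BEGIN {
  requests = 100          # Total requests
  avg_out  = 1600.66      # Average output tokens per request
  seconds  = 92.66        # Time taken for tests (s)
  printf "%.3f tokens/s\n", requests * avg_out / seconds
}'
# -> 1727.455 tokens/s
```

This lands within rounding of the reported 1727.453 tokens/s, confirming how the throughput figure is derived.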