# Evaluating the Reasoning Capability of R1 Models
With the widespread adoption of the DeepSeek-R1 model, an increasing number of developers are attempting to replicate similar models to enhance their reasoning capabilities. Many impressive results have emerged; however, do these new models actually demonstrate improved reasoning capabilities? The EvalScope framework, an open-source model evaluation tool from the ModelScope community, can be used to assess the reasoning performance of R1-style models.
In this best-practice guide, we will demonstrate the evaluation process using 728 reasoning questions, consistent with the benchmarks reported in the R1 technical report. The evaluation data comprises:
- [MATH-500](https://www.modelscope.cn/datasets/HuggingFaceH4/aime_2024): Challenging high school mathematics competition problems spanning seven subjects (such as Prealgebra, Algebra, and Number Theory), 500 questions in total.
- [GPQA-Diamond](https://modelscope.cn/datasets/AI-ModelScope/gpqa_diamond/summary): Graduate-level multiple-choice questions in subfields of physics, chemistry, and biology, 198 questions in total.
- [AIME-2024](https://modelscope.cn/datasets/AI-ModelScope/AIME_2024): Problems from the 2024 American Invitational Mathematics Examination, 30 questions in total.
The process outlined in this best practice includes installing the necessary dependencies, preparing the model, evaluating the model, and visualizing the evaluation results. Let’s get started.
## Installing Dependencies
First, install the [EvalScope](https://github.com/modelscope/evalscope) model evaluation framework:
```bash
pip install 'evalscope[app,perf]' -U
```
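The `app` and `perf` extras cover the result-visualization interface and the performance-testing subcommand used later in this guide. You can confirm that the installation succeeded and check the installed version with pip:
```bash
# Verify that EvalScope is installed and show its version
pip show evalscope
```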
## Model Preparation
Next, we will walk through the evaluation process using the DeepSeek-R1-Distill-Qwen-1.5B model as an example. The model is accessed through an OpenAI API-compatible inference service during evaluation. EvalScope also supports evaluating models directly with transformers inference; for details, please refer to the EvalScope documentation.
Besides deploying the model on a cloud service that exposes an OpenAI-compatible API, you can also run it locally with frameworks such as vLLM or Ollama. Here we use the [vLLM](https://github.com/vllm-project/vllm) and [lmdeploy](https://github.com/InternLM/lmdeploy) inference frameworks, as they handle many concurrent requests efficiently and thereby speed up evaluation. Since R1-style models often produce lengthy reasoning chains, with output token counts frequently exceeding 10,000, an efficient inference framework noticeably improves inference speed.
**Using vLLM**:
```bash
VLLM_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --served-model-name DeepSeek-R1-Distill-Qwen-1.5B --trust_remote_code --port 8801
```
or **Using lmdeploy**:
```bash
LMDEPLOY_USE_MODELSCOPE=True CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B --model-name DeepSeek-R1-Distill-Qwen-1.5B --server-port 8801
```
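Once either server is running, you can send a quick request to confirm that the OpenAI-compatible endpoint responds (a minimal sanity check; the port and served model name match the serve commands above, so adjust them if you changed either):
```bash
# Minimal sanity check against the OpenAI-compatible chat endpoint
curl http://127.0.0.1:8801/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-Distill-Qwen-1.5B",
        "messages": [{"role": "user", "content": "How many prime numbers are less than 20?"}],
        "max_tokens": 512
      }'
```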
**(Optional) Test Inference Service Performance**
Before running the full evaluation, you can benchmark the inference service with the `evalscope perf` subcommand to pick the better-performing inference engine:
```bash
evalscope perf \
--parallel 10 \
--url http://127.0.0.1:8801/v1/chat/completions \
--model DeepSeek-R1-Distill-Qwen-1.5B \
--log-every-n-query 5 \
--connect-timeout 6000 \
--read-timeout 6000 \
--api openai \
--prompt 'Write a science fiction novel, no less than 2000 words, please start your performance' \
-n 100
```
For parameter explanations, please refer to the [Performance Evaluation Quick Start](https://evalscope.readthedocs.io/zh-cn/latest/user_guides/stress_test/quick_start.html).
**Inference Service Performance Test Results**
```text
Benchmarking summary:
+-----------------------------------+-------------------------------------------------------------------------+
| Key | Value |
+===================================+=========================================================================+
| Time taken for tests (s) | 92.66 |
+-----------------------------------+-------------------------------------------------------------------------+
| Number of concurrency | 10 |
+-----------------------------------+-------------------------------------------------------------------------+
| Total requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Succeed requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Failed requests | 0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Throughput(average tokens/s) | 1727.453 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average QPS | 1.079 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average latency (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average time to first token (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average time per output token (s) | 0.00058 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average input tokens per request | 20.0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average output tokens per request | 1600.66 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average package latency (s) | 8.636 |
+-----------------------------------+-------------------------------------------------------------------------+
| Average package per request | 1.0 |
+-----------------------------------+-------------------------------------------------------------------------+
| Expected number of requests | 100 |
+-----------------------------------+-------------------------------------------------------------------------+
| Result DB path | outputs/20250213_103632/DeepSeek-R1-Distill-Qwen-1.5B/benchmark_data.db |
+-----------------------------------+-------------------------------------------------------------------------+
Percentile results:
+------------+----------+----------+-------------+--------------+---------------+----------------------+
| Percentile | TTFT (s) | TPOT (s) | Latency (s) | Input tokens | Output tokens | Throughput(tokens/s) |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
| 10% | 5.4506 | nan | 5.4506 | 20 | 1011 | 183.7254 |
| 25% | 6.1689 | nan | 6.1689 | 20 | 1145 | 184.9222 |
| 50% | 9.385 | nan | 9.385 | 20 | 1741 | 185.5081 |
| 66% | 11.0023 | nan | 11.0023 | 20 | 2048 | 185.8063 |
| 75% | 11.0374 | nan | 11.0374 | 20 | 2048 | 186.1429 |
| 80% | 11.047 | nan | 11.047 | 20 | 2048 | 186.3683 |
| 90% | 11.075 | nan | 11.075 | 20 | 2048 | 186.5962 |
| 95% | 11.147 | nan | 11.147 | 20 | 2048 | 186.7836 |
| 98% | 11.1574 | nan | 11.1574 | 20 | 2048 | 187.4917 |
| 99% | 11.1688 | nan | 11.1688 | 20 | 2048 | 197.4991 |
+------------+----------+----------+-------------+--------------+---------------+----------------------+
```
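The benchmark data behind these summaries is saved to the SQLite file listed under `Result DB path`. If you want to inspect it beyond the printed tables, you can open it with the `sqlite3` CLI (a minimal sketch; substitute the path printed by your own run):
```bash
# Point at the result database reported by your own run
DB=outputs/20250213_103632/DeepSeek-R1-Distill-Qwen-1.5B/benchmark_data.db
sqlite3 "$DB" ".tables"             # list the tables recorded by the benchmark
sqlite3 "$DB" ".schema" | head -40  # show their columns
```
Rerunning the same `evalscope perf` command against the lmdeploy endpoint and comparing throughput and time to first token is a straightforward way to decide which engine to use for the evaluation.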