# Speed Benchmark Testing

To conduct speed tests and obtain a speed benchmark report similar to the [official Qwen](https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html) report, shown below:

![image](./images/qwen_speed_benchmark.png)

specify the dataset for speed testing with `--dataset [speed_benchmark|speed_benchmark_long]`:

- `speed_benchmark`: tests prompts of lengths [1, 6144, 14336, 30720], each with a fixed output of 2048 tokens. A total of 8 requests are made, 2 per prompt length.
- `speed_benchmark_long`: tests prompts of lengths [63488, 129024], each with a fixed output of 2048 tokens. A total of 4 requests are made, 2 per prompt length.

## Online API Inference

```{note}
For speed testing, `--url` should use the `/v1/completions` endpoint rather than `/v1/chat/completions`, so that the extra processing applied by the chat template does not affect the input length.
```

```bash
evalscope perf \
  --parallel 1 \
  --url http://127.0.0.1:8000/v1/completions \
  --model qwen2.5 \
  --log-every-n-query 5 \
  --connect-timeout 6000 \
  --read-timeout 6000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api openai \
  --dataset speed_benchmark \
  --debug
```

## Local Transformer Inference

```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
  --parallel 1 \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --attn-implementation flash_attention_2 \
  --log-every-n-query 5 \
  --connect-timeout 6000 \
  --read-timeout 6000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local \
  --dataset speed_benchmark \
  --debug
```

Example Output:

```text
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|       1       |      50.69      |      0.97      |
|     6144      |      51.36      |      1.23      |
|     14336     |      49.93      |      1.59      |
|     30720     |      49.56      |      2.34      |
+---------------+-----------------+----------------+
```

## Local vLLM Inference

```bash
CUDA_VISIBLE_DEVICES=0 evalscope perf \
  --parallel 1 \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --log-every-n-query 5 \
  --connect-timeout 6000 \
  --read-timeout 6000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local_vllm \
  --dataset speed_benchmark
```

```{tip}
GPU memory usage is measured with the `torch.cuda.max_memory_allocated` function, which does not capture memory allocated by the vLLM engine, so the GPU Memory column reads 0.0 here.
```

Example Output:

```text
Speed Benchmark Results:
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
|       1       |     343.08      |      0.0       |
|     6144      |     334.71      |      0.0       |
|     14336     |     318.88      |      0.0       |
|     30720     |     292.86      |      0.0       |
+---------------+-----------------+----------------+
```
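To make the tip above concrete, here is a minimal sketch of how peak GPU memory can be read with the PyTorch API it names. The helper `peak_gpu_memory_gb` and the measurement window are illustrative assumptions, not EvalScope internals; only `torch.cuda.max_memory_allocated` and `torch.cuda.reset_peak_memory_stats` are real PyTorch calls.

```python
import torch

def peak_gpu_memory_gb(device: int = 0) -> float:
    """Peak memory allocated by PyTorch in this process, in GB.

    The counter only tracks PyTorch's own caching allocator, so memory
    held by an external engine such as vLLM is invisible to it -- which
    is why the vLLM table above reports 0.0.
    """
    return torch.cuda.max_memory_allocated(device) / (1024 ** 3)

# Illustrative measurement window around a generation call:
torch.cuda.reset_peak_memory_stats(0)  # clear the peak counter
# ... run model.generate(...) here ...
print(f"GPU Memory(GB): {peak_gpu_memory_gb(0):.2f}")
```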
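For a rough sanity check of the Speed(tokens/s) column, a throughput figure of this kind can be reproduced as decoded tokens divided by generation wall time. The snippet below is a back-of-the-envelope sketch under that assumption (the elapsed time is hypothetical, chosen to match the first table row), not EvalScope's actual accounting.

```python
# With a fixed output of 2048 tokens, throughput is simply
# output tokens divided by the wall-clock generation time.
output_tokens = 2048
elapsed_seconds = 40.4          # hypothetical measured time for one request
speed = output_tokens / elapsed_seconds
print(f"{speed:.2f} tokens/s")  # ~50.69 tokens/s, as in the first table row
```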