# QwQ-32B-Preview
> QwQ-32B-Preview is an experimental research model developed by the Qwen team, aimed at enhancing the reasoning capabilities of artificial intelligence. [Model Link](https://modelscope.cn/models/Qwen/QwQ-32B-Preview/summary)

The EvalScope Speed Benchmark tool was used to measure the GPU memory usage and inference speed of the QwQ-32B-Preview model under different configurations. Each test generates 2048 output tokens with prompt lengths of 1, 6144, 14336, and 30720 tokens:

## Local Transformers Inference Speed

### Test Environment

- NVIDIA A100 80GB * 1
- CUDA 12.1
- PyTorch 2.3.1
- Flash Attention 2.5.8
- Transformers 4.46.0
- EvalScope 0.7.0

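
For a quick sanity check outside the EvalScope harness, the stack above can also be exercised directly from Python. The following is only a minimal sketch under assumed settings (bfloat16 weights, an illustrative prompt); it is not part of the benchmark itself:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # assumed; the benchmark does not state the dtype
    attn_implementation="flash_attention_2",  # matches --attn-implementation in the command below
    device_map="auto",                        # single A100 80GB in this environment
)

prompt = "How many r's are in the word 'strawberry'?"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
# Strip the prompt tokens and print only the generated continuation.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```
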
### Stress Testing Command

First install EvalScope with the performance-testing extra:

```shell
pip install evalscope[perf] -U
```

Then run the speed benchmark against the local Transformers backend:

```shell
CUDA_VISIBLE_DEVICES=0 evalscope perf \
  --parallel 1 \
  --model Qwen/QwQ-32B-Preview \
  --attn-implementation flash_attention_2 \
  --log-every-n-query 1 \
  --connect-timeout 60000 \
  --read-timeout 60000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local \
  --dataset speed_benchmark
```

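Here `--min-tokens 2048` and `--max-tokens 2048` pin every response at exactly 2048 generated tokens, and the `speed_benchmark` dataset supplies prompts at the four lengths listed in the introduction, so throughput and memory can be compared directly across prompt lengths.
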
### Test Results

```text
+---------------+-----------------+----------------+
| Prompt Tokens | Speed(tokens/s) | GPU Memory(GB) |
+---------------+-----------------+----------------+
| 1             | 17.92           | 61.58          |
| 6144          | 12.61           | 63.72          |
| 14336         | 9.01            | 67.31          |
| 30720         | 5.61            | 74.47          |
+---------------+-----------------+----------------+
```

## vLLM Inference Speed

### Test Environment

- NVIDIA A100 80GB * 2
- CUDA 12.1
- vLLM 0.6.3
- PyTorch 2.4.0
- Flash Attention 2.6.3
- Transformers 4.46.0

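
For reference, the same two-GPU setup can also be driven through vLLM's offline API outside of EvalScope. This is a minimal sketch with an illustrative prompt and assumed sampling settings, not the benchmark command itself:

```python
from vllm import LLM, SamplingParams

# Shard the 32B model across the two A100 80GB GPUs listed above.
llm = LLM(model="Qwen/QwQ-32B-Preview", tensor_parallel_size=2)

# 2048 output tokens mirrors the benchmark; the temperature is an assumption.
sampling_params = SamplingParams(max_tokens=2048, temperature=0.7)

outputs = llm.generate(["How many r's are in the word 'strawberry'?"], sampling_params)
print(outputs[0].outputs[0].text)
```
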
### Test Command

Run the speed benchmark against the local vLLM backend:

```shell
CUDA_VISIBLE_DEVICES=0,1 evalscope perf \
  --parallel 1 \
  --model Qwen/QwQ-32B-Preview \
  --log-every-n-query 1 \
  --connect-timeout 60000 \
  --read-timeout 60000 \
  --max-tokens 2048 \
  --min-tokens 2048 \
  --api local_vllm \
  --dataset speed_benchmark
```

### Test Results

```text
+---------------+-----------------+
| Prompt Tokens | Speed(tokens/s) |
+---------------+-----------------+
| 1             | 38.17           |
| 6144          | 36.63           |
| 14336         | 35.01           |
| 30720         | 31.68           |
+---------------+-----------------+
```

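The two setups are not directly comparable (one A100 for the Transformers test versus two for vLLM), but vLLM's generation speed is both higher and far less sensitive to prompt length, falling only from about 38 to 32 tokens/s between the 1-token and 30720-token prompts, compared with roughly 18 to 6 tokens/s under Transformers.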