# Examples

## Using Local Model Inference

This project supports inference with local transformers and vLLM (vLLM must be installed first). The `--model` argument accepts a ModelScope model name, such as `Qwen/Qwen2.5-0.5B-Instruct`, or a direct path to the model weights, such as `/path/to/model_weights`. In either case, the `--url` parameter is not needed.

**Inference using transformers**

```bash
# --attn-implementation is optional; choose from [flash_attention_2|eager|sdpa]
evalscope perf \
    --model 'Qwen/Qwen2.5-0.5B-Instruct' \
    --attn-implementation flash_attention_2 \
    --number 20 \
    --parallel 2 \
    --api local \
    --dataset openqa
```

**Inference using vLLM**

```bash
evalscope perf \
    --model 'Qwen/Qwen2.5-0.5B-Instruct' \
    --number 20 \
    --parallel 2 \
    --api local_vllm \
    --dataset openqa
```

## Using `prompt`

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --log-every-n-query 10 \
    --number 20 \
    --api openai \
    --temperature 0.9 \
    --max-tokens 1024 \
    --prompt 'Write a science fiction story, please begin your performance'
```

You can also use a local file as the prompt by prefixing its path with `@`:

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --log-every-n-query 10 \
    --number 20 \
    --api openai \
    --temperature 0.9 \
    --max-tokens 1024 \
    --prompt @prompt.txt
```
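The file's contents are used as the prompt text. As a quick sketch, such a file can be created like this (the file name `prompt.txt` matches the command above; the content is illustrative):

```bash
# Create a plain-text prompt file; its full contents become the prompt
cat > prompt.txt <<'EOF'
Write a science fiction story, please begin your performance
EOF
```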
## Complex Requests

Using `stop`, `stream`, `temperature`, etc.:

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --log-every-n-query 10 \
    --read-timeout 120 \
    --connect-timeout 120 \
    --number 20 \
    --max-prompt-length 128000 \
    --min-prompt-length 128 \
    --api openai \
    --temperature 0.7 \
    --max-tokens 1024 \
    --stop '<|im_end|>' \
    --dataset openqa \
    --stream
```

## Using `query-template`

You can set request parameters in the `query-template`:

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --log-every-n-query 10 \
    --read-timeout 120 \
    --connect-timeout 120 \
    --number 20 \
    --max-prompt-length 128000 \
    --min-prompt-length 128 \
    --api openai \
    --query-template '{"model": "%m", "messages": [{"role": "user","content": "%p"}], "stream": true, "skip_special_tokens": false, "stop": ["<|im_end|>"], "temperature": 0.7, "max_tokens": 1024}' \
    --dataset openqa
```

Here `%m` and `%p` are replaced by the model name and the prompt, respectively; in this example, `%m` becomes `qwen2.5` and `%p` becomes each prompt drawn from the `openqa` dataset.

You can also store the template in a JSON file and pass it with `@`:

```{code-block} json
:caption: template.json

{
    "model": "%m",
    "messages": [
        {
            "role": "user",
            "content": "%p"
        }
    ],
    "stream": true,
    "skip_special_tokens": false,
    "stop": [
        "<|im_end|>"
    ],
    "temperature": 0.7,
    "max_tokens": 1024
}
```

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --log-every-n-query 10 \
    --read-timeout 120 \
    --connect-timeout 120 \
    --number 20 \
    --max-prompt-length 128000 \
    --min-prompt-length 128 \
    --api openai \
    --query-template @template.json \
    --dataset openqa
```

## Using the Random Dataset

Prompts are generated randomly based on `prefix-length`, `min-prompt-length`, and `max-prompt-length`; `tokenizer-path` must be specified. The number of tokens in each generated prompt is uniformly distributed between `prefix-length + min-prompt-length` and `prefix-length + max-prompt-length`, and within a single test all requests share the same prefix. For example, with the settings below (`--prefix-length 64`, `--min-prompt-length 1024`, `--max-prompt-length 2048`), each prompt contains roughly 64 + 1024 = 1088 to 64 + 2048 = 2112 tokens.

```{note}
Due to the influence of the chat_template and the tokenization algorithm, the actual number of tokens in the generated prompts may deviate somewhat from these values; it is not an exact token count.
```

Execute the following command:

```bash
evalscope perf \
    --parallel 20 \
    --model Qwen2.5-0.5B-Instruct \
    --url http://127.0.0.1:8801/v1/chat/completions \
    --api openai \
    --dataset random \
    --min-tokens 128 \
    --max-tokens 128 \
    --prefix-length 64 \
    --min-prompt-length 1024 \
    --max-prompt-length 2048 \
    --number 100 \
    --tokenizer-path Qwen/Qwen2.5-0.5B-Instruct \
    --debug
```

## Using wandb to Record Test Results

Please install wandb first:

```bash
pip install wandb
```

When starting the test, add the following parameters:

```bash
--wandb-api-key 'wandb_api_key'
--name 'name_of_wandb_log'
```
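For example, combined with one of the earlier OpenAI-API commands (the URL, model, and key values are the same placeholders used throughout this page):

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --number 20 \
    --api openai \
    --dataset openqa \
    --wandb-api-key 'wandb_api_key' \
    --name 'name_of_wandb_log'
```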
![wandb sample](https://modelscope.oss-cn-beijing.aliyuncs.com/resource/wandb_sample.png)

## Using swanlab to Record Test Results

Please install swanlab first:

```bash
pip install swanlab
```

When starting the test, add the following parameters:

```bash
--swanlab-api-key 'swanlab_api_key'
--name 'name_of_swanlab_log'
```

![swanlab sample](https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/swanlab.png)

## Debugging Requests

Use the `--debug` option to output the requests and responses.

**Non-`stream` Mode Output Example**

```text
2024-11-27 11:25:34,161 - evalscope - http_client.py - on_request_start - 116 - DEBUG - Starting request: ...
2024-11-27 11:25:34,163 - evalscope - http_client.py - on_request_chunk_sent - 128 - DEBUG - Request sent: ...
2024-11-27 11:25:38,172 - evalscope - http_client.py - on_response_chunk_received - 140 - DEBUG - Request received: ...
```

**`stream` Mode Output Example**

```text
2024-11-27 20:02:24,760 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"重要的"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,803 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,847 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":",以便"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,890 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"及时"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,933 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"得到"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:24,976 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"帮助"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,023 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"和支持"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,066 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,109 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":""},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,111 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"。<|im_end|>"},"finish_reason":null}],"usage":null}
2024-11-27 20:02:25,113 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: {"model":"Qwen2.5-0.5B-Instruct","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":50,"completion_tokens":260,"total_tokens":310}}
2024-11-27 20:02:25,113 - evalscope - http_client.py - _handle_stream - 57 - DEBUG - Response recevied: data: [DONE]
```
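Streaming debug output like the above comes from a run with both `--stream` and `--debug` enabled, for example (a sketch reusing the placeholder server settings from the earlier sections):

```bash
evalscope perf \
    --url 'http://127.0.0.1:8000/v1/chat/completions' \
    --parallel 2 \
    --model 'qwen2.5' \
    --number 20 \
    --api openai \
    --dataset openqa \
    --stream \
    --debug
```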