# LongBench-Write
## Description
LongWriter enables long-context LLMs to generate 10,000+ words in a single output.
We can use the **LongBench-Write** benchmark, which focuses on measuring both the quality and the length of long outputs.
GitHub: [LongWriter](https://github.com/THUDM/LongWriter)
Technical Report: [Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key](https://arxiv.org/abs/2410.10210)
## Usage
### Installation
```bash
pip install evalscope[framework] -U
pip install vllm -U
```
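To verify the installation, a quick import check is usually enough; a minimal sketch, assuming both packages expose a `__version__` attribute (common, but worth confirming for your versions):
```python
# Quick sanity check after installation; the __version__ attributes
# are an assumption and may differ across releases.
import evalscope
import vllm

print(evalscope.__version__, vllm.__version__)
```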
### Task configuration
There are three ways to configure the task: a Python dict, a JSON file, or a YAML file.
1. Configuration with dict:
```python
task_cfg = dict(
    stage=['infer', 'eval_l', 'eval_q'],
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'verbose': False,
        'generation_kwargs': {
            'max_new_tokens': 32768,
            'temperature': 0.5,
            'repetition_penalty': 1.0
        },
        'proc_num': 16,
    },
    eval_config={
        # No need to set the OpenAI info if skipping the stage `eval_q`
        'openai_api_key': None,
        'openai_api_base': 'https://api.openai.com/v1/chat/completions',
        'openai_gpt_model': 'gpt-4o-2024-05-13',
        'generation_kwargs': {
            'max_new_tokens': 1024,
            'temperature': 0.5,
            'stop': None
        },
        'proc_num': 8
    }
)
```
- Arguments:
  - `stage`: The stages to run. `infer`: run the inference process; `eval_l`: run the length evaluation; `eval_q`: run the quality evaluation with model-as-judge.
  - `model`: A model id on the ModelScope hub, or a local model directory. Refer to [LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b/summary) for more details.
  - `input_data_path`: The input data path. Defaults to `None`, which means using [longbench_write](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/resources/longbench_write.jsonl).
  - `output_dir`: The output root directory.
  - `openai_api_key`: The OpenAI API key, required when enabling the stage `eval_q` to use model-as-judge. Defaults to `None` if not needed.
  - `openai_gpt_model`: The judge model name from OpenAI. Defaults to `gpt-4o-2024-05-13`.
  - `generation_kwargs`: The generation configs.
  - `proc_num`: The number of processes for inference and evaluation.
2. Configuration with JSON (Optional):
```json
{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}
}
```
Refer to [default_task.json](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.json) for more details.
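If you keep the configuration in a JSON file, you can load it into a dict and pass it to `run_task` (shown in the Run Evaluation section below); a minimal sketch, assuming the file is saved as `default_task.json` in the working directory:
```python
import json

# Load the JSON task configuration into a plain dict; the file path
# is an assumption, so adjust it to wherever you saved the config.
with open('default_task.json', 'r', encoding='utf-8') as f:
    task_cfg = json.load(f)
```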
3. Configuration with YAML (Optional):
```yaml
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
```
Refer to [default_task.yaml](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.yaml) for more details.
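Similarly, a YAML configuration can be loaded into a dict before being passed to `run_task`; a minimal sketch, assuming PyYAML is installed and the file is saved as `default_task.yaml`:
```python
import yaml  # provided by the PyYAML package

# Load the YAML task configuration into a plain dict; the file path
# is an assumption, so adjust it to your own setup.
with open('default_task.yaml', 'r', encoding='utf-8') as f:
    task_cfg = yaml.safe_load(f)
```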
### Run Model Inference
We recommend using [vLLM](https://github.com/vllm-project/vllm) to deploy the model.
Environment:
* 1 x A100 (80GB)
To start vLLM server, run the following command:
```shell
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code ZhipuAI/LongWriter-glm4-9b
```
- Arguments:
  - `max-model-len`: The maximum context length of the model (prompt plus generated tokens).
  - `gpu_memory_utilization`: The fraction of GPU memory to use for the model.
  - `trust-remote-code`: Whether to trust remote code from the model repository.
  - `model`: Can be a model id on the ModelScope/HuggingFace hub, or a local model directory.
* Note: Alternatively, you can use multiple GPUs by setting `CUDA_VISIBLE_DEVICES=0,1,2,3`.
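Once the server is up, you can sanity-check the OpenAI-compatible endpoint before launching the full evaluation; a minimal sketch using `requests`, assuming the server was started with the command above (the `model` field must match the served model name):
```python
import requests

# Send a trivial chat request to the locally deployed vLLM server.
resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',
    json={
        'model': 'ZhipuAI/LongWriter-glm4-9b',
        'messages': [{'role': 'user', 'content': 'Hello!'}],
        'max_tokens': 16,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])
```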
### Run Evaluation
```python
from evalscope.third_party.longbench_write import run_task
run_task(task_cfg=task_cfg)
```
### Results and metrics
See `eval_length.jsonl` and `eval_quality.jsonl` in the output directory.
- Metrics:
  - `score_l`: The average score of the length evaluation.
  - `score_q`: The average score of the quality evaluation.
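To inspect per-sample results, the JSONL files can be read line by line; a minimal sketch, assuming `eval_length.jsonl` sits directly under the output directory (the exact subpath and record fields depend on your evalscope version, so verify against your own run):
```python
import json

# Print each per-sample record from the length evaluation output.
# The path below is an assumption; locate the file in your run's
# output directory first.
with open('./outputs/eval_length.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        print(json.loads(line))
```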