# LongBench-Write

## Description

LongWriter enables long-context LLMs to generate 10,000+ words in a single response.

The **LongBench-Write** benchmark focuses on measuring both the quality and the length of such long output.

GitHub: [LongWriter](https://github.com/THUDM/LongWriter)

Technical Report: [Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key](https://arxiv.org/abs/2410.10210)

## Usage

### Installation

```bash
pip install evalscope[framework] -U
pip install vllm -U
```

### Task configuration

There are three ways to configure the task: a Python dict, a JSON file, or a YAML file.

1. Configuration with dict:

```python
task_cfg = dict(
    stage=['infer', 'eval_l', 'eval_q'],
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'verbose': False,
        'generation_kwargs': {
            'max_new_tokens': 32768,
            'temperature': 0.5,
            'repetition_penalty': 1.0
        },
        'proc_num': 16,
    },
    eval_config={
        # No need to set the OpenAI info if skipping the stage `eval_q`
        'openai_api_key': None,
        'openai_api_base': 'https://api.openai.com/v1/chat/completions',
        'openai_gpt_model': 'gpt-4o-2024-05-13',
        'generation_kwargs': {
            'max_new_tokens': 1024,
            'temperature': 0.5,
            'stop': None
        },
        'proc_num': 8
    }
)
```

- Arguments:
  - `stage`: The stages to run: `infer` runs the inference process, `eval_l` runs the length evaluation, and `eval_q` runs the quality evaluation with a model-as-judge (see the sketch after this list for running only a subset of stages).
  - `model`: A model id on the ModelScope hub, or a local model directory. Refer to [LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b/summary) for more details.
  - `input_data_path`: The input data path. Defaults to `None`, which uses the built-in [longbench_write](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/resources/longbench_write.jsonl) dataset.
  - `output_dir`: The output root directory.
  - `openai_api_key`: The OpenAI API key, required when the `eval_q` stage uses `Model-as-Judge`. Defaults to `None` if not needed.
  - `openai_gpt_model`: The name of the OpenAI judge model. Defaults to `gpt-4o-2024-05-13`.
  - `generation_kwargs`: The generation configs.
  - `proc_num`: The number of parallel processes for inference and evaluation.
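
For example, if you have no OpenAI access you can skip the model-as-judge stage entirely; a minimal sketch that reuses `task_cfg` from above and runs only inference plus the length evaluation:

```python
# Run only inference and the length evaluation; `eval_q` is skipped,
# so no OpenAI key is needed (see the comment in `eval_config` above).
task_cfg['stage'] = ['infer', 'eval_l']
task_cfg['eval_config']['openai_api_key'] = None  # unused without `eval_q`
```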
2. Configuration with json (Optional):
```json
{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}
```
Refer to [default_task.json](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.json) for more details.
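
`run_task` takes a Python dict (see "Run Evaluation" below), so one simple way to use a JSON file is to load it yourself; a sketch assuming the file path `default_task.json`:

```python
import json

# Load the JSON task configuration into the dict that `run_task` expects.
with open('default_task.json', 'r', encoding='utf-8') as f:
    task_cfg = json.load(f)
```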

3. Configuration with yaml (Optional):
```yaml
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
```
Refer to [default_task.yaml](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.yaml) for more details.
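
A YAML file can be loaded the same way; a sketch using PyYAML (an assumed extra dependency, `pip install pyyaml`):

```python
import yaml

# Load the YAML task configuration into a plain dict.
with open('default_task.yaml', 'r', encoding='utf-8') as f:
    task_cfg = yaml.safe_load(f)
```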
### Run Model Inference

We recommend using [vLLM](https://github.com/vllm-project/vllm) to deploy the model.

Environment:

* A100 (80G) x 1

To start the vLLM server, run the following command:

```shell
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve ZhipuAI/LongWriter-glm4-9b --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code
```

- Arguments:
  - `max-model-len`: The maximum context length the server accepts (prompt plus generated tokens).
  - `gpu_memory_utilization`: The fraction of GPU memory vLLM is allowed to use.
  - `trust-remote-code`: Whether to trust and run custom model code from the hub.
  - `model`: The positional argument; a model id on the ModelScope/HuggingFace hub, or a local model directory.

* Note: To use multiple GPUs, set `CUDA_VISIBLE_DEVICES=0,1,2,3` and pass a matching `--tensor-parallel-size` to vLLM.
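
Before launching the `infer` stage, it can help to confirm the server responds on the endpoint configured in `infer_config`. A minimal smoke test with `requests`; the `model` field must match the served model id, and the prompt here is just an arbitrary example:

```python
import requests

# One short chat request against the locally deployed vLLM server.
resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',
    json={
        'model': 'ZhipuAI/LongWriter-glm4-9b',
        'messages': [{'role': 'user', 'content': 'Say hello in one sentence.'}],
        'max_tokens': 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])
```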
### Run Evaluation
```python
from evalscope.third_party.longbench_write import run_task

run_task(task_cfg=task_cfg)
```
### Results and metrics

See `eval_length.jsonl` and `eval_quality.jsonl` in the output directory.

- Metrics:
  - `score_l`: The average score of the length evaluation.
  - `score_q`: The average score of the quality evaluation.
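
If you want to aggregate the scores yourself, a sketch along these lines can help; note that the per-record field name (`score_l` below) and the output path are assumptions, so inspect the files to confirm what they actually contain:

```python
import json

# Hypothetical post-processing: average per-sample length scores from the
# results file. Adjust the path and field name to match your actual output.
scores = []
with open('./outputs/eval_length.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        scores.append(record['score_l'])

print(f'score_l averaged over {len(scores)} samples: {sum(scores) / len(scores):.2f}')
```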