# LongBench-Write

## Description

LongWriter enables long-context LLMs to generate 10,000+ words in a single response.

The **LongBench-Write** benchmark focuses on measuring both the quality and the length of such long output.

GitHub: [LongWriter](https://github.com/THUDM/LongWriter)

Technical Report: [Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key](https://arxiv.org/abs/2410.10210)

## Usage

### Installation

```bash
pip install evalscope[framework] -U
pip install vllm -U
```

### Task configuration

There are three ways to configure the task: a Python dict, a JSON file, or a YAML file.

1. Configuration with dict:

```python
task_cfg = dict(
    stage=['infer', 'eval_l', 'eval_q'],
    model='ZhipuAI/LongWriter-glm4-9b',
    input_data_path=None,
    output_dir='./outputs',
    infer_config={
        'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
        'is_chat': True,
        'verbose': False,
        'generation_kwargs': {
            'max_new_tokens': 32768,
            'temperature': 0.5,
            'repetition_penalty': 1.0
        },
        'proc_num': 16,
    },
    eval_config={
        # No need to set the OpenAI info if skipping the stage `eval_q`
        'openai_api_key': None,
        'openai_api_base': 'https://api.openai.com/v1/chat/completions',
        'openai_gpt_model': 'gpt-4o-2024-05-13',
        'generation_kwargs': {
            'max_new_tokens': 1024,
            'temperature': 0.5,
            'stop': None
        },
        'proc_num': 8
    }
)
```

- Arguments:
  - `stage`: The stages to run: `infer` runs the inference process, `eval_l` runs the length evaluation, and `eval_q` runs the quality evaluation with a model-as-judge (see the sketch after this list for running only a subset of stages).
  - `model`: A model id on the ModelScope hub, or a local model directory. Refer to [LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b/summary) for more details.
  - `input_data_path`: The input data path. Defaults to `None`, which uses the built-in [longbench_write](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/resources/longbench_write.jsonl) dataset.
  - `output_dir`: The output root directory.
  - `openai_api_key`: The OpenAI API key, required when the `eval_q` stage uses `Model-as-Judge`. Defaults to `None` if not needed.
  - `openai_gpt_model`: The name of the OpenAI judge model. Defaults to `gpt-4o-2024-05-13`.
  - `generation_kwargs`: The generation configs.
  - `proc_num`: The number of parallel processes for inference and evaluation.
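
For example, if you have no OpenAI access you can skip the model-as-judge stage entirely; a minimal sketch that reuses `task_cfg` from above and runs only inference plus the length evaluation:

```python
# Run only inference and the length evaluation; `eval_q` is skipped,
# so no OpenAI key is needed (see the comment in `eval_config` above).
task_cfg['stage'] = ['infer', 'eval_l']
task_cfg['eval_config']['openai_api_key'] = None  # unused without `eval_q`
```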
2. Configuration with json (Optional):
```json
{
    "stage": [
        "infer",
        "eval_l",
        "eval_q"
    ],
    "model": "ZhipuAI/LongWriter-glm4-9b",
    "input_data_path": null,
    "output_dir": "./outputs",
    "infer_config": {
        "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
        "is_chat": true,
        "verbose": false,
        "generation_kwargs": {
            "max_new_tokens": 32768,
            "temperature": 0.5,
            "repetition_penalty": 1.0
        },
        "proc_num": 16
    },
    "eval_config": {
        "openai_api_key": null,
        "openai_api_base": "https://api.openai.com/v1/chat/completions",
        "openai_gpt_model": "gpt-4o-2024-05-13",
        "generation_kwargs": {
            "max_new_tokens": 1024,
            "temperature": 0.5,
            "stop": null
        },
        "proc_num": 8
    }
}
```
Refer to [default_task.json](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.json) for more details.
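
`run_task` takes a Python dict (see "Run Evaluation" below), so one simple way to use a JSON file is to load it yourself; a sketch assuming the file path `default_task.json`:

```python
import json

# Load the JSON task configuration into the dict that `run_task` expects.
with open('default_task.json', 'r', encoding='utf-8') as f:
    task_cfg = json.load(f)
```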

3. Configuration with yaml (Optional):
```yaml
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
```
Refer to [default_task.yaml](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.yaml) for more details.
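
A YAML file can be loaded the same way; a sketch using PyYAML (an assumed extra dependency, `pip install pyyaml`):

```python
import yaml

# Load the YAML task configuration into a plain dict.
with open('default_task.yaml', 'r', encoding='utf-8') as f:
    task_cfg = yaml.safe_load(f)
```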
### Run Model Inference

We recommend using [vLLM](https://github.com/vllm-project/vllm) to deploy the model.

Environment:

* A100 (80G) x 1

To start the vLLM server, run the following command:

```shell
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve ZhipuAI/LongWriter-glm4-9b --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code
```

- Arguments:
  - `max-model-len`: The maximum context length the server accepts (prompt plus generated tokens).
  - `gpu_memory_utilization`: The fraction of GPU memory vLLM is allowed to use.
  - `trust-remote-code`: Whether to trust and run custom model code from the hub.
  - `model`: The positional argument; a model id on the ModelScope/HuggingFace hub, or a local model directory.

* Note: To use multiple GPUs, set `CUDA_VISIBLE_DEVICES=0,1,2,3` and pass a matching `--tensor-parallel-size` to vLLM.
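
Before launching the `infer` stage, it can help to confirm the server responds on the endpoint configured in `infer_config`. A minimal smoke test with `requests`; the `model` field must match the served model id, and the prompt here is just an arbitrary example:

```python
import requests

# One short chat request against the locally deployed vLLM server.
resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',
    json={
        'model': 'ZhipuAI/LongWriter-glm4-9b',
        'messages': [{'role': 'user', 'content': 'Say hello in one sentence.'}],
        'max_tokens': 64,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])
```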
### Run Evaluation
```python
from evalscope.third_party.longbench_write import run_task

run_task(task_cfg=task_cfg)
```
### Results and metrics

See `eval_length.jsonl` and `eval_quality.jsonl` in the output directory.

- Metrics:
  - `score_l`: The average score of the length evaluation.
  - `score_q`: The average score of the quality evaluation.
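
If you want to aggregate the scores yourself, a sketch along these lines can help; note that the per-record field name (`score_l` below) and the output path are assumptions, so inspect the files to confirm what they actually contain:

```python
import json

# Hypothetical post-processing: average per-sample length scores from the
# results file. Adjust the path and field name to match your actual output.
scores = []
with open('./outputs/eval_length.jsonl', 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        scores.append(record['score_l'])

print(f'score_l averaged over {len(scores)} samples: {sum(scores) / len(scores):.2f}')
```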