
# LongBench-Write
## Description
LongWriter enables LLMs to generate outputs of more than 10,000 words. The **LongBench-Write** benchmark evaluates both the quality of long-form output and the model's ability to produce text of the requested length.

GitHub: [LongWriter](https://github.com/THUDM/LongWriter)

Technical report: [Minimum Tuning to Unlock Long Output from LLMs with High Quality Data as the Key](https://arxiv.org/abs/2410.10210)
## Usage
### Installation
```bash
pip install evalscope[framework] -U
pip install vllm -U
```
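Optionally, you can confirm that both packages installed correctly. This is a quick sketch using only the standard library's `importlib.metadata`, not an evalscope command:
```python
from importlib.metadata import version

# Print the installed versions of the two packages installed above.
for pkg in ('evalscope', 'vllm'):
    print(pkg, version(pkg))
```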
### Task configuration
Supported task configuration formats: Python dict, JSON, and YAML.
1. Configure with a Python dict:
```python
task_cfg = dict(stage=['infer', 'eval_l', 'eval_q'],
                model='ZhipuAI/LongWriter-glm4-9b',
                input_data_path=None,
                output_dir='./outputs',
                infer_config={
                    'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions',
                    'is_chat': True,
                    'verbose': False,
                    'generation_kwargs': {
                        'max_new_tokens': 32768,
                        'temperature': 0.5,
                        'repetition_penalty': 1.0
                    },
                    'proc_num': 16,
                },
                eval_config={
                    # No need to set the OpenAI info if skipping the `eval_q` stage
                    'openai_api_key': None,
                    'openai_api_base': 'https://api.openai.com/v1/chat/completions',
                    'openai_gpt_model': 'gpt-4o-2024-05-13',
                    'generation_kwargs': {
                        'max_new_tokens': 1024,
                        'temperature': 0.5,
                        'stop': None
                    },
                    'proc_num': 8
                })
```
- Arguments:
  - `stage`: The stages to run. `infer` runs model inference; `eval_l` evaluates output length; `eval_q` evaluates output quality with a model-as-judge.
  - `model`: A model id on the ModelScope hub, or a local model directory. Refer to [LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b/summary) for more details.
  - `input_data_path`: Path to the input data. Defaults to `None`, which means the built-in [longbench_write](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/resources/longbench_write.jsonl) dataset is used.
  - `output_dir`: The output root directory.
  - `openai_api_key`: The OpenAI API key, needed only when the `eval_q` stage uses model-as-judge. Defaults to `None` if not needed.
  - `openai_gpt_model`: The judge model name from OpenAI. Defaults to `gpt-4o-2024-05-13`.
  - `generation_kwargs`: The generation configuration.
  - `proc_num`: The number of processes for inference and evaluation.
2. Configure with JSON (optional):
```json
{
  "stage": [
    "infer",
    "eval_l",
    "eval_q"
  ],
  "model": "ZhipuAI/LongWriter-glm4-9b",
  "input_data_path": null,
  "output_dir": "./outputs",
  "infer_config": {
    "openai_api_base": "http://127.0.0.1:8000/v1/chat/completions",
    "is_chat": true,
    "verbose": false,
    "generation_kwargs": {
      "max_new_tokens": 32768,
      "temperature": 0.5,
      "repetition_penalty": 1.0
    },
    "proc_num": 16
  },
  "eval_config": {
    "openai_api_key": null,
    "openai_api_base": "https://api.openai.com/v1/chat/completions",
    "openai_gpt_model": "gpt-4o-2024-05-13",
    "generation_kwargs": {
      "max_new_tokens": 1024,
      "temperature": 0.5,
      "stop": null
    },
    "proc_num": 8
  }
}
```
Refer to [default_task.json](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.json) for more details.
3. Configure with YAML (optional):
```yaml
stage:
  - infer
  - eval_l
  - eval_q
model: "ZhipuAI/LongWriter-glm4-9b"
input_data_path: null
output_dir: "./outputs"
infer_config:
  openai_api_base: "http://127.0.0.1:8000/v1/chat/completions"
  is_chat: true
  verbose: false
  generation_kwargs:
    max_new_tokens: 32768
    temperature: 0.5
    repetition_penalty: 1.0
  proc_num: 16
eval_config:
  openai_api_key: null
  openai_api_base: "https://api.openai.com/v1/chat/completions"
  openai_gpt_model: "gpt-4o-2024-05-13"
  generation_kwargs:
    max_new_tokens: 1024
    temperature: 0.5
    stop: null
  proc_num: 8
```
Refer to [default_task.yaml](https://github.com/modelscope/evalscope/blob/main/evalscope/third_party/longbench_write/default_task.yaml) for more details; a sketch for loading such a file back into a Python dict follows below.
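If you keep the configuration in a JSON or YAML file, it can be loaded back into a Python dict and then passed to `run_task` (see "Run the evaluation task" below). This is a minimal sketch, assuming PyYAML is available and using a hypothetical local file name:
```python
import json
from pathlib import Path

import yaml  # PyYAML; install with `pip install pyyaml` if it is not already present

# Hypothetical file name -- replace with your own JSON or YAML config path.
cfg_path = Path('my_task.yaml')

if cfg_path.suffix in ('.yaml', '.yml'):
    task_cfg = yaml.safe_load(cfg_path.read_text(encoding='utf-8'))  # YAML -> dict
else:
    task_cfg = json.loads(cfg_path.read_text(encoding='utf-8'))      # JSON -> dict

print(task_cfg['stage'], task_cfg['model'])  # quick sanity check before running
```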
### Start the model inference service
We recommend using [vLLM](https://github.com/vllm-project/vllm) to deploy the model service.

Hardware environment:
* A100 (80G) x 1

Start the vLLM service:
```shell
CUDA_VISIBLE_DEVICES=0 VLLM_USE_MODELSCOPE=True vllm serve --max-model-len=65536 --gpu_memory_utilization=0.95 --trust-remote-code ZhipuAI/LongWriter-glm4-9b
```
- Arguments:
  - `max-model-len`: The maximum context length of the model (prompt plus generated tokens).
  - `gpu_memory_utilization`: The fraction of GPU memory to use.
  - `trust-remote-code`: Whether to trust remote code shipped with the model.
  - `model` (positional): A model id on the ModelScope/HuggingFace hub, or a local model directory.
* Tip: set `CUDA_VISIBLE_DEVICES=0,1,2,3` to use multiple GPUs.
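Before launching the evaluation, it can help to confirm that the service answers requests. The following is a minimal sanity-check sketch, not part of evalscope itself; it assumes the `requests` package is installed, vLLM's default OpenAI-compatible endpoint on port 8000 (matching the `openai_api_base` in the task configuration above), and that the served model name equals the id passed to `vllm serve`:
```python
import requests

# Assumes the vLLM OpenAI-compatible server started above is listening locally.
resp = requests.post(
    'http://127.0.0.1:8000/v1/chat/completions',
    json={
        'model': 'ZhipuAI/LongWriter-glm4-9b',  # typically the id passed to `vllm serve`
        'messages': [{'role': 'user', 'content': 'Write a 200-word introduction to tea.'}],
        'max_tokens': 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()['choices'][0]['message']['content'])
```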
### Run the evaluation task
```python
from evalscope.third_party.longbench_write import run_task
run_task(task_cfg=task_cfg)
```
### Results and metrics
See `eval_length.jsonl` and `eval_quality.jsonl` in the output directory.
- Metrics:
  - `score_l`: The average score of the length evaluation.
  - `score_q`: The average score of the quality evaluation.
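To inspect the per-sample records behind these metrics, the sketch below scans the output directory for the two result files and prints each JSONL line. The exact fields inside each record depend on the evalscope version, so no specific score key is assumed:
```python
import json
from pathlib import Path

output_dir = Path('./outputs')  # matches `output_dir` in the task configuration

for name in ('eval_length.jsonl', 'eval_quality.jsonl'):
    for path in output_dir.rglob(name):
        print(f'== {path} ==')
        with path.open(encoding='utf-8') as f:
            for line in f:
                record = json.loads(line)
                # Field names vary by version; print the whole record rather
                # than assuming a specific per-sample score key.
                print(record)
```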