## Description

We evaluate the effectiveness of tool learning on the [ToolBench](https://arxiv.org/pdf/2307.16789) benchmark (Qin et al., 2023b). The tasks involve integrating API calls to accomplish user requests, where the agent must accurately select the appropriate API and compose the necessary API requests. Moreover, we partition the ToolBench test set into in-domain and out-of-domain subsets based on whether the tools used in the test instances were seen during training. This division allows us to evaluate performance in both in-distribution and out-of-distribution scenarios. We refer to this dataset as `ToolBench-Static`. For more details, please refer to: [Small LLMs Are Weak Tool Learners: A Multi-LLM Agent](https://arxiv.org/abs/2401.07324)

## Dataset

- Dataset statistics:
  - Number of in-domain samples: 1588
  - Number of out-of-domain samples: 781

## Usage

### Installation

```bash
pip install evalscope -U
pip install ms-swift -U
pip install rouge -U
```

### Download the dataset

```bash
wget https://modelscope.oss-cn-beijing.aliyuncs.com/open_data/toolbench-static/data.zip
```

### Unzip the dataset

```bash
unzip data.zip
# The dataset will be unzipped to the `/path/to/data/toolbench_static` folder
```

### Task configuration

There are two ways to configure the task: with a Python dict or with a YAML file.

1. Configuration with a dict:

```python
your_task_config = {
    'infer_args': {
        'model_name_or_path': '/path/to/model_dir',
        'model_type': 'qwen2-7b-instruct',
        'data_path': 'data/toolbench_static',
        'output_dir': 'output_res',
        'deploy_type': 'swift',
        'max_new_tokens': 2048,
        'num_infer_samples': None
    },
    'eval_args': {
        'input_path': 'output_res',
        'output_path': 'output_res'
    }
}
```

- Arguments:
  - `model_name_or_path`: The path to the local model directory.
  - `model_type`: The model type; refer to the [model type list](https://github.com/modelscope/swift/blob/main/docs/source/LLM/%E6%94%AF%E6%8C%81%E7%9A%84%E6%A8%A1%E5%9E%8B%E5%92%8C%E6%95%B0%E6%8D%AE%E9%9B%86.md).
  - `data_path`: The path to the dataset directory containing the `in_domain.json` and `out_of_domain.json` files.
  - `output_dir`: The path to the output directory. Defaults to `output_res`.
  - `deploy_type`: The deployment type. Defaults to `swift`.
  - `max_new_tokens`: The maximum number of tokens to generate.
  - `num_infer_samples`: The number of samples to infer. Defaults to `None`, which means all samples are inferred.
  - `input_path`: The path to the input directory for evaluation; it should be the same as the `output_dir` of `infer_args`.
  - `output_path`: The path to the output directory for evaluation.

2. Configuration with YAML:

```yaml
infer_args:
  model_name_or_path: /path/to/model_dir  # absolute path is recommended
  model_type: qwen2-7b-instruct
  data_path: /path/to/data/toolbench_static  # absolute path is recommended
  deploy_type: swift
  max_new_tokens: 2048
  num_infer_samples: null
  output_dir: output_res
eval_args:
  input_path: output_res
  output_path: output_res
```

Refer to [config_default.yaml](config_default.yaml) for more details.

### Run the task

```python
from evalscope.third_party.toolbench_static import run_task

# Run the task with the dict configuration
run_task(task_cfg=your_task_config)

# Run the task with the yaml configuration
run_task(task_cfg='/path/to/your_task_config.yaml')
```
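Putting the steps above together, a minimal end-to-end script might look like the sketch below. The paths and `model_type` are placeholders to replace with your own, and `num_infer_samples` is set to a small value for a quick smoke test:

```python
from evalscope.third_party.toolbench_static import run_task

# Minimal end-to-end sketch: paths and model_type are placeholders.
task_cfg = {
    'infer_args': {
        'model_name_or_path': '/path/to/model_dir',     # local model directory
        'model_type': 'qwen2-7b-instruct',               # see the swift model type list
        'data_path': '/path/to/data/toolbench_static',   # contains in_domain.json / out_of_domain.json
        'output_dir': 'output_res',
        'deploy_type': 'swift',
        'max_new_tokens': 2048,
        'num_infer_samples': 10,    # quick smoke test; use None to infer all samples
    },
    'eval_args': {
        'input_path': 'output_res',    # must match output_dir above
        'output_path': 'output_res',
    },
}

if __name__ == '__main__':
    run_task(task_cfg=task_cfg)
```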
### Results and metrics

- Metrics:
  - `Plan.EM`: Exact-match score of the agent's planning decision at each step (invoke a tool, generate an answer, or give up).
  - `Act.EM`: Exact-match score of the agent's actions, including the tool name and arguments.
  - `HalluRate` (lower is better): The hallucination rate of the agent's answers at each step.
  - `Avg.F1`: The average F1 score of the agent's tool calls at each step.
  - `R-L`: The Rouge-L score of the agent's answers at each step.

Generally, we focus on the `Act.EM`, `HalluRate`, and `Avg.F1` metrics.
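For intuition on the answer-level `R-L` metric, a Rouge-L F-measure can be computed with the `rouge` package installed above. The sketch below is only an illustration: the example strings are made up, and the benchmark's evaluation script may apply its own preprocessing before scoring.

```python
from rouge import Rouge

# Hypothetical predicted and reference final answers (illustration only).
prediction = "The weather in Paris is sunny with a high of 25 degrees."
reference = "It is sunny in Paris today, with a high of 25 degrees."

# get_scores returns a list with one dict per pair, containing
# 'rouge-1', 'rouge-2', and 'rouge-l', each with recall/precision/F scores.
scores = Rouge().get_scores(prediction, reference)[0]
print(f"Rouge-L F1: {scores['rouge-l']['f']:.3f}")
```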