# Fine-tuning

## Environment

It is recommended that you create a new environment:

```bash
cd FlagEmbedding/llm_embedder

conda env create -f environment.yaml --name llm-embedder
conda activate llm-embedder
```

To use BM25, you must download **java11** and **anserini**, then add Java to your `PATH`:

```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz

cd /data
tar -xzvf java11.tar.gz
tar -xzvf anserini.tar.gz

# the lines below set JAVA_HOME only for the current session; it is RECOMMENDED that you add them to ~/.bashrc
export JAVA_HOME=/data/jdk-11.0.2
export PATH=$JAVA_HOME/bin:$PATH
```
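
You can verify the setup with `java -version`, which should report version 11.
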
## Data

Download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`, which results in a folder `/data/llm-embedder`:

```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz

cd /data
tar -xzvf llm-embedder.tar.gz
```

The QReCC corpus for conversational search is too large (54M passages) to ship with the data, so we upload it separately to Hugging Face datasets: [namespace-Pt/qrecc-corpus](https://huggingface.co/datasets/namespace-Pt/qrecc-corpus). To evaluate performance on conversational search, load it and save it as a json file in the `qrecc` folder:

```python
import datasets

# load dataset
qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train")

# save in jsonline format in YOUR data folder
qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records")
```

The data formats for training and evaluation are as follows:

```python
# training
{
    "query": str,
    "pos": List[str],
    "neg": List[str],
    "pos_index": Optional[List[int]],         # indices of the positives w.r.t. the corpus; ignore this field when a global corpus is not available (e.g. long conversation)
    "neg_index": Optional[List[int]],         # indices of the negatives w.r.t. the corpus; ignore this field when a global corpus is not available (e.g. long conversation)
    "teacher_scores": Optional[List[float]],  # scores from an LM or a reranker, used for distillation
    "answers": Optional[List[str]],           # list of answers for the query, used for LM scoring
}

# evaluation
{
    "query": str,
    "pos_index": Optional[List[int]],         # indices of the positives w.r.t. the corpus; ignore this field when there are no pre-defined positives (e.g. NQ)
    "answers": Optional[List[str]],           # list of answers for computing NQ metrics
    "key": Optional[List[str]],               # retrieval results of the query, usually used for RAG or reranking
    "key_index": Optional[List[int]],         # key indices w.r.t. the corpus
}
```
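
For concreteness, here is what a single training line might look like; this is an entirely made-up record, and the texts and indices are purely illustrative:

```python
# a hypothetical training instance (jsonline record); all values are illustrative
{
    "query": "who wrote the declaration of independence",
    "pos": ["Thomas Jefferson was the principal author of the Declaration of Independence."],
    "neg": ["The U.S. Constitution was signed in 1787."],
    "pos_index": [1024],
    "neg_index": [57],
    "answers": ["Thomas Jefferson"]
}
```
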
## Retriever

Below are several important arguments for training. The meaning and usage of the other arguments can be inspected in the [code](../src/retrieval/args.py) or by running `python run_dense.py --help` from the command line.

- `train_data`: required, one or a list of json files in the aforementioned format.
- `eval_data`: optional, one json file in the aforementioned format. If `eval_data` is specified, the trainer will automatically evaluate on it.
- `corpus`: optional, the global corpus against which `pos_index`/`neg_index` are resolved.

**IMPORTANT NOTE**

- For any path specified for `train_data`, `eval_data`, and `corpus`: if it is prefixed with `llm-embedder:`, it will be resolved relative to [`data_root`](../src/retrieval/args.py) (see the sketch at the end of this note). *Note that you can modify the default value of `data_root` so that you don't need to type it for each command.*
- During fine-tuning, we save the output model in the `huggingface transformers`🤗 format. To use it from `sentence_transformers`, you should convert it to a `sentence_transformers` checkpoint in advance:

```bash
python scripts/ours2st.py --encoder data/outputs/your-output-dir/encoder
```
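
After conversion, the checkpoint should load like any other `sentence_transformers` model (a minimal sketch; the exact path depends on your `--output_dir`):

```python
from sentence_transformers import SentenceTransformer

# load the converted checkpoint (path depends on your --output_dir)
model = SentenceTransformer("data/outputs/your-output-dir/encoder")
embeddings = model.encode(["an example query"])
```
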
Then everything is the same as described in [README](../README.md).
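
For reference, the `llm-embedder:` prefix resolution described above amounts to a simple prefix substitution. The sketch below only illustrates the behavior; the actual logic lives in [`args.py`](../src/retrieval/args.py):

```python
import os

# a minimal sketch of the path resolution rule, not the actual implementation
def resolve(path: str, data_root: str = "/data/llm-embedder") -> str:
    prefix = "llm-embedder:"
    if path.startswith(prefix):
        return os.path.join(data_root, path[len(prefix):])
    return path

# resolve("llm-embedder:qa/nq/train.json") -> "/data/llm-embedder/qa/nq/train.json"
```
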
### LLM-Embedder (Multi-Task Fine-Tune)

```bash
# Remember to modify the data_root in the script to your data root :)
bash scripts/llm-embedder.sh
```

### Single Task Fine-Tune

Below we provide commands to fine-tune a retriever on a single task.

#### QA

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/nq \
    --train_data llm-embedder:qa/nq/train.json \
    --eval_data llm-embedder:qa/nq/test.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics nq \
    --key_max_length 128 \
    --query_max_length 32 \
    --contrastive_weight 0 \
    --stable_distill \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

#### In-Context Learning

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/icl \
    --train_data llm-embedder:icl/icl/train.json \
    --select_positive random \
    --contrastive_weight 0 \
    --stable_distill \
    --save_steps 6000 \
    --max_steps 6000 \
    --data_root /data/llm-embedder
```

#### Long-Range Language Modeling

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/lrlm \
    --train_data llm-embedder:lrlm/books3/train.json llm-embedder:lrlm/arxiv/train.json llm-embedder:lrlm/codeparrot/train.json \
    --select_positive teacher \
    --teacher_scores_margin 0.1 \
    --contrastive_weight 0 \
    --teacher_temperature 0.1 \
    --save_steps 4000 \
    --max_steps 4000 \
    --data_root /data/llm-embedder
```

#### Long Chat

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/msc \
    --train_data llm-embedder:chat/msc/train.json \
    --select_positive teacher \
    --select_negative random \
    --contrastive_weight 0 \
    --teacher_temperature 0.1 \
    --save_steps 4000 \
    --max_steps 4000 \
    --data_root /data/llm-embedder
```

#### Tool

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/tool \
    --train_data llm-embedder:tool/toolbench/train.json \
    --eval_data llm-embedder:tool/toolbench/test.json \
    --corpus llm-embedder:tool/toolbench/corpus.json \
    --key_template '{text}' \
    --metrics ndcg \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

#### Conversational Search

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/qrecc \
    --train_data llm-embedder:conversation/qrecc/train.concat.json \
    --eval_data llm-embedder:conversation/qrecc/test.concat.json \
    --corpus llm-embedder:conversation/qrecc/corpus.json \
    --key_template '{text}' \
    --metrics mrr ndcg \
    --cutoffs 3 10 100 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

### Mine Negatives

```bash
# BGE (the result will be saved at llm-embedder:qa/nq/train.neg.bge.json)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
    --eval_data llm-embedder:qa/nq/train.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics mrr recall collate_neg \
    --save_name bge \
    --data_root /data/llm-embedder

# BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untarred anserini.tar.gz)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
    --anserini_dir /data/anserini \
    --retrieval_method bm25 \
    --eval_data llm-embedder:qa/nq/train.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics mrr recall collate_neg \
    --save_name bm25 \
    --data_root /data/llm-embedder
```
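
The mined files (e.g. `llm-embedder:qa/nq/train.neg.bge.json`) follow the training format above with the `neg` field populated by hard negatives, so they should be directly usable as `--train_data` in the fine-tuning commands.
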
## LM Scoring

Score the positives and negatives in `eval_data` with $p(o|q,k)$, where $o$ is the desired output (i.e. the `answers` field), $q$ is the query, and $k$ is a key (either a positive or a negative).

```bash
torchrun --nproc_per_node=8 run_lm_score.py \
    --eval_data llm-embedder:qa/msmarco/train.json \
    --data_root /data/llm-embedder \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --save_name llama2-7b-chat
```

Results will be saved at `/data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json`.
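
Conceptually, the score for each key is the average log-likelihood of the answer tokens conditioned on the key and the query. Below is a minimal sketch of that quantity, assuming a simple concatenation prompt; the actual prompt format and batching in `run_lm_score.py` may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def lm_score(query: str, key: str, answer: str) -> float:
    """Return the average log-likelihood of the answer given key + query."""
    prompt = f"{key}\n{query}\n"  # assumed prompt layout, not necessarily what run_lm_score.py uses
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    inputs = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    labels = inputs.clone()
    labels[:, :prompt_len] = -100  # mask the prompt so only answer tokens are scored
    with torch.no_grad():
        loss = model(inputs, labels=labels).loss  # mean NLL over the answer tokens
    return -loss.item()
```
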

## Known Issues

- `transformers==4.30.0` raises an error when using the deepspeed scheduler config
  - fix: modify line `1750` in `trainer.py` as follows:

```python
if use_accelerator_prepare:
    # NOTE: fix bug in transformers 4.30.0
    # model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
    self.model.train()
    if hasattr(self.lr_scheduler, "step"):
        if self.use_apex:
            model = self.accelerator.prepare(self.model)
        else:
            model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
    else:
        # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config
        model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
            self.model, self.optimizer, self.lr_scheduler
        )
```