# Fine-tuning

## Environment

It is recommended that you create a new environment:

```bash
cd FlagEmbedding/llm_embedder

conda env create -f environment.yaml --name llm-embedder
conda activate llm-embedder
```

To use BM25, you must download **java11** and **anserini**, then add Java to your `PATH`:

```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz

cd /data
tar -xzvf java11.tar.gz
tar -xzvf anserini.tar.gz

# the lines below set JAVA_HOME only for the current session; it is RECOMMENDED that you add them to ~/.bashrc
export JAVA_HOME=/data/jdk-11.0.2
export PATH=$JAVA_HOME/bin:$PATH
```
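
You can verify the setup with `java -version`, which should report version 11.
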
## Data

Download the data for fine-tuning & evaluation, then untar the file anywhere you prefer, e.g. `/data`, which results in a folder `/data/llm-embedder`:

```bash
# feel free to change /data to your preferred location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz

cd /data
tar -xzvf llm-embedder.tar.gz
```

The QReCC corpus for conversational search is too large (54M passages) to ship with the data, so we upload it separately to Hugging Face datasets: [namespace-Pt/qrecc-corpus](https://huggingface.co/datasets/namespace-Pt/qrecc-corpus). To evaluate performance on conversational search, load it and save it as a json file in the `qrecc` folder:

```python
import datasets

# load dataset
qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train")

# save in jsonline format in YOUR data folder
qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records")
```

The data formats for training and evaluation are as follows:

```python
# training
{
    "query": str,
    "pos": List[str],
    "neg": List[str],
    "pos_index": Optional[List[int]],         # indices of the positives w.r.t. the corpus; ignore this field when a global corpus is not available (e.g. long conversation)
    "neg_index": Optional[List[int]],         # indices of the negatives w.r.t. the corpus; ignore this field when a global corpus is not available (e.g. long conversation)
    "teacher_scores": Optional[List[float]],  # scores from an LM or a reranker, used for distillation
    "answers": Optional[List[str]],           # list of answers for the query, used for LM scoring
}

# evaluation
{
    "query": str,
    "pos_index": Optional[List[int]],         # indices of the positives w.r.t. the corpus; ignore this field when there are no pre-defined positives (e.g. NQ)
    "answers": Optional[List[str]],           # list of answers for computing NQ metrics
    "key": Optional[List[str]],               # retrieval results of the query, usually used for RAG or reranking
    "key_index": Optional[List[int]],         # key indices w.r.t. the corpus
}
```
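
For concreteness, here is what a single training line might look like; this is an entirely made-up record, and the texts and indices are purely illustrative:

```python
# a hypothetical training instance (jsonline record); all values are illustrative
{
    "query": "who wrote the declaration of independence",
    "pos": ["Thomas Jefferson was the principal author of the Declaration of Independence."],
    "neg": ["The U.S. Constitution was signed in 1787."],
    "pos_index": [1024],
    "neg_index": [57],
    "answers": ["Thomas Jefferson"]
}
```
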
## Retriever

Below are several important arguments for training. The meaning and usage of the other arguments can be inspected in the [code](../src/retrieval/args.py) or by running `python run_dense.py --help` from the command line.

- `train_data`: required, one or a list of json files in the aforementioned format.
- `eval_data`: optional, one json file in the aforementioned format. If `eval_data` is specified, the trainer will automatically evaluate on it.
- `corpus`: optional, the global corpus against which `pos_index`/`neg_index` are resolved.

**IMPORTANT NOTE**

- For any path specified for `train_data`, `eval_data`, and `corpus`: if it is prefixed with `llm-embedder:`, it will be resolved relative to [`data_root`](../src/retrieval/args.py) (see the sketch at the end of this note). *Note that you can modify the default value of `data_root` so that you don't need to type it for each command.*
- During fine-tuning, we save the output model in the `huggingface transformers`🤗 format. To use it from `sentence_transformers`, you should convert it to a `sentence_transformers` checkpoint in advance:

```bash
python scripts/ours2st.py --encoder data/outputs/your-output-dir/encoder
```
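
After conversion, the checkpoint should load like any other `sentence_transformers` model (a minimal sketch; the exact path depends on your `--output_dir`):

```python
from sentence_transformers import SentenceTransformer

# load the converted checkpoint (path depends on your --output_dir)
model = SentenceTransformer("data/outputs/your-output-dir/encoder")
embeddings = model.encode(["an example query"])
```
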
Then everything is the same as described in [README](../README.md).
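
For reference, the `llm-embedder:` prefix resolution described above amounts to a simple prefix substitution. The sketch below only illustrates the behavior; the actual logic lives in [`args.py`](../src/retrieval/args.py):

```python
import os

# a minimal sketch of the path resolution rule, not the actual implementation
def resolve(path: str, data_root: str = "/data/llm-embedder") -> str:
    prefix = "llm-embedder:"
    if path.startswith(prefix):
        return os.path.join(data_root, path[len(prefix):])
    return path

# resolve("llm-embedder:qa/nq/train.json") -> "/data/llm-embedder/qa/nq/train.json"
```
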
### LLM-Embedder (Multi-Task Fine-Tune)

```bash
# Remember to modify the data_root in the script to your data root :)
bash scripts/llm-embedder.sh
```

### Single Task Fine-Tune

Below we provide commands to fine-tune a retriever on a single task.

#### QA

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/nq \
    --train_data llm-embedder:qa/nq/train.json \
    --eval_data llm-embedder:qa/nq/test.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics nq \
    --key_max_length 128 \
    --query_max_length 32 \
    --contrastive_weight 0 \
    --stable_distill \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

#### In-Context Learning

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/icl \
    --train_data llm-embedder:icl/icl/train.json \
    --select_positive random \
    --contrastive_weight 0 \
    --stable_distill \
    --save_steps 6000 \
    --max_steps 6000 \
    --data_root /data/llm-embedder
```

#### Long-Range Language Modeling

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/lrlm \
    --train_data llm-embedder:lrlm/books3/train.json llm-embedder:lrlm/arxiv/train.json llm-embedder:lrlm/codeparrot/train.json \
    --select_positive teacher \
    --teacher_scores_margin 0.1 \
    --contrastive_weight 0 \
    --teacher_temperature 0.1 \
    --save_steps 4000 \
    --max_steps 4000 \
    --data_root /data/llm-embedder
```

#### Long Chat

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/msc \
    --train_data llm-embedder:chat/msc/train.json \
    --select_positive teacher \
    --select_negative random \
    --contrastive_weight 0 \
    --teacher_temperature 0.1 \
    --save_steps 4000 \
    --max_steps 4000 \
    --data_root /data/llm-embedder
```

#### Tool

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/tool \
    --train_data llm-embedder:tool/toolbench/train.json \
    --eval_data llm-embedder:tool/toolbench/test.json \
    --corpus llm-embedder:tool/toolbench/corpus.json \
    --key_template '{text}' \
    --metrics ndcg \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

#### Conversational Search

```bash
torchrun --nproc_per_node=8 run_dense.py \
    --output_dir data/outputs/qrecc \
    --train_data llm-embedder:conversation/qrecc/train.concat.json \
    --eval_data llm-embedder:conversation/qrecc/test.concat.json \
    --corpus llm-embedder:conversation/qrecc/corpus.json \
    --key_template '{text}' \
    --metrics mrr ndcg \
    --cutoffs 3 10 100 \
    --eval_steps 2000 \
    --save_steps 2000 \
    --max_steps 2000 \
    --data_root /data/llm-embedder
```

### Mine Negatives

```bash
# BGE (the result will be saved at llm-embedder:qa/nq/train.neg.bge.json)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
    --eval_data llm-embedder:qa/nq/train.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics mrr recall collate_neg \
    --save_name bge \
    --data_root /data/llm-embedder

# BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untarred anserini.tar.gz)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
    --anserini_dir /data/anserini \
    --retrieval_method bm25 \
    --eval_data llm-embedder:qa/nq/train.json \
    --corpus llm-embedder:qa/nq/corpus.json \
    --metrics mrr recall collate_neg \
    --save_name bm25 \
    --data_root /data/llm-embedder
```
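
The mined files (e.g. `llm-embedder:qa/nq/train.neg.bge.json`) follow the training format above with the `neg` field populated by hard negatives, so they should be directly usable as `--train_data` in the fine-tuning commands.
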
## LM Scoring

Score the positives and negatives in `eval_data` with $p(o|q,k)$, where $o$ is the desired output (i.e. the `answers` field), $q$ is the query, and $k$ is a key (either a positive or a negative).

```bash
torchrun --nproc_per_node=8 run_lm_score.py \
    --eval_data llm-embedder:qa/msmarco/train.json \
    --data_root /data/llm-embedder \
    --model_name_or_path meta-llama/Llama-2-7b-chat-hf \
    --save_name llama2-7b-chat
```

Results will be saved at `/data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json`.
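
Conceptually, the score for each key is the average log-likelihood of the answer tokens conditioned on the key and the query. Below is a minimal sketch of that quantity, assuming a simple concatenation prompt; the actual prompt format and batching in `run_lm_score.py` may differ:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

def lm_score(query: str, key: str, answer: str) -> float:
    """Return the average log-likelihood of the answer given key + query."""
    prompt = f"{key}\n{query}\n"  # assumed prompt layout, not necessarily what run_lm_score.py uses
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    inputs = tokenizer(prompt + answer, return_tensors="pt").input_ids.to(model.device)
    labels = inputs.clone()
    labels[:, :prompt_len] = -100  # mask the prompt so only answer tokens are scored
    with torch.no_grad():
        loss = model(inputs, labels=labels).loss  # mean NLL over the answer tokens
    return -loss.item()
```
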

## Known Issues

- `transformers==4.30.0` raises an error when using the deepspeed scheduler config
  - fix: modify line `1750` in `trainer.py` as follows:

```python
if use_accelerator_prepare:
    # NOTE: fix bug in transformers 4.30.0
    # model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
    self.model.train()
    if hasattr(self.lr_scheduler, "step"):
        if self.use_apex:
            model = self.accelerator.prepare(self.model)
        else:
            model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
    else:
        # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config
        model, self.optimizer, self.lr_scheduler = self.accelerator.prepare(
            self.model, self.optimizer, self.lr_scheduler
        )
```