8.8 KiB
Fine-tuning
Environment
It is recommended that you create a new environment:
cd FlagEmbedding/llm_embedder
conda env create -f environment.yaml --name llm-embedder
conda activate llm-embedder
To use BM25, you must download java11 and anserini, then add java to your PATH:
# feel free to alternate /data to your prefered location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/java11.tar.gz?download=true -O /data/java11.tar.gz
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/anserini.tar.gz?download=true -O /data/anserini.tar.gz
cd /data
tar -xzvf java11.tar.gz
tar -xzvf anserini.tar.gz
# below just temporarily set JAVA_HOME; it is RECOMMENDED that you store the lines the setting in ~/.bashrc
export JAVA_HOME=/data/jdk-11.0.2
export PATH=$JAVA_HOME/bin:$PATH
Data
You should download the data for fine-tuning & evaluation then untar the file at anywhere you prefer, e.g. /data, which results in a folder /data/llm-embedder:
# feel free to alternate /data to your prefered location
wget https://huggingface.co/datasets/namespace-Pt/projects/resolve/main/llm-embedder.tar.gz?download=true -O /data/llm-embedder.tar.gz
cd /data
tar -xzvf llm-embedder-eval.tar.gz
The corpus of QReCC for conversational search is too large (54M passages), we separately upload it to huggingface datasets namespace-Pt/qrecc-corpus. To evaluate the performance on conversational search, you should load it and save it as json file in the qrecc folder:
import datasets
# load dataset
qrecc_corpus = datasets.load_dataset("namespace-Pt/qrecc-corpus", split="train")
# save to jsonline format in YOUR data folder
qrecc_corpus.to_json("/data/llm-embedder/convsearch/qrecc/corpus.json", force_ascii=False, lines=True, orient="records")
The data formats for training and evaluation are as follows:
# training
{
"query": str,
"pos": List[str],
"neg": List[str],
"pos_index": Optional[List[int]], # Indices of the positives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
"neg_index": Optional[List[int]], # Indices of the negatives w.r.t. the corpus. When a global corpus is not available (e.g. long conversation), just ignore this field.
"teacher_scores": Optional[List[float]], # Scores from an LM or a reranker, used for distillation.
"answers": Optional[List[str]], # List of answers for the query, used for LM scoring.
}
# evaluation
{
"query": str,
"pos_index": Optional[List[int]], # Indices of the positives w.r.t. corpus. When there is no positives pre-defined (e.g. NQ), just ignore this field.
"answers": Optional[List[str]], # List of answers for computing NQ metrics.
"key": Optional[List[str]], # Retrieval results of the query. Usually used for RAG or reranking.
"key_index": Optional[List[int]], # Key indices w.r.t. the corpus.
}
Retriever
Below are several important arguments for training. The meaning and usage of other arguments can be inspected from code or running python run_dense.py --help from command line.
train_data: required, one or a list of json files with the aforementioned formatting.eval_data: optional, one json file with the aforementioned formatting. If aneval_datais speficied, the trainer will automatically do evaluation on theeval_data.corpus: optional, the global corpus wherepositives.
IMPORTANT NOTE
- For any path specified for
train_data,eval_data, andcorpus: if it is prefixed withllm-embedder, it will be solved to the relative path againstdata_root. Note that you can modify the default value ofdata_root, so that you don't need to type it for each command. - During fine-tuning, we save the output model in the
huggingface transformers🤗 format. To use it fromsentence_transformers, you should convert it tosentence_transformerscheckpoint in advance:
Then everything is the same as described in README.python scripts/ours2st.py --encoder data/outputs/your-output-dir/encoder
LLM-Embedder (Multi-Task Fine-Tune)
# Remember to modify the data_root to your data root in the script :)
bash scripts/llm-embedder.sh
Single Task Fine-Tune
Below we provide commands to fine-tune a retriever on a single task.
QA
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/nq \
--train_data llm-embedder:qa/nq/train.json \
--eval_data llm-embedder:qa/nq/test.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics nq \
--key_max_length 128 \
--query_max_length 32 \
--contrastive_weight 0 \
--stable_distill \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
In-Context Learning
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/icl \
--train_data llm-embedder:icl/icl/train.json \
--select_positive random \
--contrastive_weight 0 \
--stable_distill \
--save_steps 6000 \
--max_steps 6000 \
--data_root /data/llm-embedder
Long-Range Language Modeling
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/lrlm \
--train_data llm-embedder:lrlm/books3/train.json llm-embedder:lrlm/arxiv/train.json llm-embedder:lrlm/codeparrot/train.json \
--select_positive teacher \
--teacher_scores_margin 0.1 \
--contrastive_weight 0 \
--teacher_temperature 0.1 \
--save_steps 4000 \
--max_steps 4000 \
--data_root /data/llm-embedder
Long Chat
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/msc \
--train_data llm-embedder:chat/msc/train.json \
--select_positive teacher \
--select_negative random \
--contrastive_weight 0 \
--teacher_temperature 0.1 \
--save_steps 4000 \
--max_steps 4000 \
--data_root /data/llm-embedder
Tool
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/tool \
--train_data llm-embedder:tool/toolbench/train.json \
--eval_data llm-embedder:tool/toolbench/test.json \
--corpus llm-embedder:tool/toolbench/corpus.json \
--key_template {text} \
--metrics ndcg \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
Conversation Search
torchrun --nproc_per_node=8 run_dense.py \
--output_dir data/outputs/qrecc \
--train_data llm-embedder:conversation/qrecc/train.concat.json \
--eval_data llm-embedder:conversation/qrecc/test.concat.json \
--corpus llm-embedder:conversation/qrecc/corpus.json \
--key_template '{text}' \
--metrics mrr ndcg \
--cutoffs 3 10 100 \
--eval_steps 2000 \
--save_steps 2000 \
--max_steps 2000 \
--data_root /data/llm-embedder
Mine Negatives
# BGE (the result will be saved at llm-embedder:qa/nq/train.neg.bge.json)
torchrun --nproc_per_node=8 -m evaluation.eval_retrieval \
--eval_data llm-embedder:qa/nq/train.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics mrr recall collate_neg \
--save_name bge \
--data_root /data/llm-embedder
# BM25 (the result will be saved at llm-embedder:qa/nq/train.neg.bm25.json; anserini_dir is the folder where you untar anserini.tar.gz)
torchrun --nproc_per_node 8 -m evaluation.eval_retrieval \
--anserini_dir /data/anserini \
--retrieval_method bm25 \
--eval_data llm-embedder:qa/nq/train.json \
--corpus llm-embedder:qa/nq/corpus.json \
--metrics mrr recall collate_neg \
--save_name bm25 \
--data_root /data/llm-embedder
LM Scoring
Score positives and negatives in eval_data with p(o|q,k) where o is the desired output (i.e. answers field), q is the query, and k is a key (could be positive or negative).
torchrun --nproc_per_node=8 run_lm_score.py \
--eval_data llm-embedder:qa/msmarco/train.json \
--data_root /data/llm-embedder \
--model_name_or_path meta-llama/Llama-2-7b-chat-hf \
--save_name llama2-7b-chat
Results will be saved at /data/llm-embedder/qa/msmarco/train.scored.llama2-7b-chat.json
Known Issues
transformers==4.30.0raises error when using deepspeed schedulerconfig- modify line
1750intrainer.py
if use_accelerator_prepare: # NOTE: fix bug in transformers 4.30.0 # model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer) self.model.train() if hasattr(self.lr_scheduler, "step"): if self.use_apex: model = self.accelerator.prepare(self.model) else: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer) else: # to handle cases wherein we pass "DummyScheduler" such as when it is specified in DeepSpeed config. model, self.optimizer, self.lr_scheduler = self.accelerator.prepare( self.model, self.optimizer, self.lr_scheduler )- modify line