# Finetune

In this example, we show how to finetune the baai-general-embedding with your data.

## 1. Installation

```
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/research/baai_general_embedding
```

## 2. Data format

Train data should be a json file, where each line is a dict like this:

```
{"query": str, "pos": List[str], "neg":List[str]}
```

`query` is the query, `pos` is a list of positive texts, and `neg` is a list of negative texts. If you have no negative texts for a query, you can randomly sample some from the entire corpus as negatives. See [toy_finetune_data.jsonl](https://github.com/FlagOpen/FlagEmbedding/blob/master/examples/finetune/toy_finetune_data.jsonl) for a toy data file.

### Hard Negatives

Hard negative mining is a widely used method to improve the quality of sentence embeddings. You can mine hard negatives with the following commands:

```shell
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding/scripts
```

```bash
python -m FlagEmbedding.baai_general_embedding.finetune.hn_mine \
--model_name_or_path BAAI/bge-base-en-v1.5 \
--input_file toy_finetune_data.jsonl \
--output_file toy_finetune_data_minedHN.jsonl \
--range_for_sampling 2-200 \
--negative_number 15 \
--use_gpu_for_searching
```

- `input_file`: json data for finetuning. This script retrieves the top-k documents for each query and randomly samples negatives from them (excluding the positive documents).
- `output_file`: path to save the JSON data with mined hard negatives for finetuning
- `negative_number`: the number of sampled negatives
- `range_for_sampling`: where to sample negatives from. For example, `2-200` means sampling `negative_number` negatives from the top2-top200 documents. **You can set a larger value to reduce the difficulty of the negatives (e.g., set it to `60-300` to sample negatives from the top60-top300 passages)**
- `candidate_pool`: the pool to retrieve from. The default value is None, in which case the script retrieves from the combination of all `neg` in `input_file`. The format of this file is the same as [pretrain data](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain#2-data-format). If a candidate_pool is provided, the script retrieves negatives from this file.
- `use_gpu_for_searching`: whether to use faiss-gpu to retrieve negatives.

## 3. Train

```
torchrun --nproc_per_node {number of gpus} \
-m finetune.run \
--output_dir {path to save model} \
--model_name_or_path BAAI/bge-large-zh-v1.5 \
--train_data ./toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 5 \
--per_device_train_batch_size {large batch size; set 1 for toy data} \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""
```

**some important arguments**:
- `per_device_train_batch_size`: batch size in training. In most cases, a larger batch size brings stronger performance. You can expand it by enabling `--fp16`, `--deepspeed ./df_config.json` (df_config.json can refer to [ds_config.json](./ds_config.json)), `--gradient_checkpointing`, etc.
- `train_group_size`: the number of positives and negatives for a query in training. There is always one positive, so this argument controls the number of negatives (#negatives = train_group_size - 1). Note that the number of negatives should not be larger than the number of negatives in the data `"neg":List[str]`. Besides the negatives in this group, in-batch negatives will also be used in fine-tuning.
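  For intuition, here is a minimal sketch of how a single training group could be assembled from one line of the training data (illustrative only, not the repo's actual dataloader; sampling details may differ):

  ```python
  import random

  # One line of the jsonl training data (toy values).
  example = {
      "query": "what is panda?",
      "pos": ["The giant panda is a bear species endemic to China."],
      "neg": ["The red panda is a small mammal.", "pandas is a Python library.", "Bamboo is a grass."],
  }

  train_group_size = 2
  group = [random.choice(example["pos"])]                       # always exactly one positive
  group += random.sample(example["neg"], train_group_size - 1)  # the rest are sampled from "neg"
  ```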
- `negatives_cross_device`: share the negatives across all GPUs. This argument will extend the number of negatives.
- `learning_rate`: select an appropriate learning rate for your model. We recommend 1e-5/2e-5/3e-5 for large/base/small-scale models.
- `temperature`: it influences the distribution of similarity scores. **Recommended value: 0.01-0.1.**
- `query_max_len`: max length for queries. Please set it according to the average length of the queries in your data.
- `passage_max_len`: max length for passages. Please set it according to the average length of the passages in your data.
- `query_instruction_for_retrieval`: instruction for queries, which will be added to each query. You can also set it to `""` to add nothing to the query.
- `use_inbatch_neg`: use passages in the same batch as negatives. The default value is True.
- `save_steps`: how many training steps between saving checkpoints.

For more training arguments please refer to [transformers.TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments).

### 4. Model merging via [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/LM_Cocktail) [optional]

For more details please refer to [LM-Cocktail](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/LM_Cocktail).

Fine-tuning the base bge model can improve its performance on the target task, but may lead to severe degradation of the model's general capabilities beyond the targeted domain (e.g., lower performance on c-mteb tasks). By merging the fine-tuned model and the base model, LM-Cocktail can significantly enhance performance on the downstream task while maintaining performance on other unrelated tasks.

```python
from LM_Cocktail import mix_models, mix_models_with_data

# Mix fine-tuned model and base model; then save it to output_path: ./mixed_model_1
model = mix_models(
    model_names_or_paths=["BAAI/bge-large-en-v1.5", "your_fine-tuned_model"],
    model_type='encoder',
    weights=[0.5, 0.5],  # you can change the weights to get a better trade-off.
    output_path='./mixed_model_1')
```

If you have a new task and there is no data or resource that can be used for fine-tuning, you can try to use LM-Cocktail to merge existing models (from the open-source community or your models fine-tuned on other tasks) to produce a task-specific model. In this way, you just need to construct a few example data and don't need to fine-tune the base model.
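Under the hood, merging encoders with the same architecture boils down to a weighted average of their parameters. A minimal conceptual sketch of that idea (illustrative only, not the actual `LM_Cocktail` implementation; `your_fine-tuned_model` is a placeholder path):

```python
from transformers import AutoModel, AutoTokenizer

# Conceptual sketch of weight averaging; use mix_models from LM_Cocktail in practice.
base = AutoModel.from_pretrained("BAAI/bge-large-en-v1.5")
tuned = AutoModel.from_pretrained("your_fine-tuned_model")  # placeholder path
weights = [0.5, 0.5]

base_sd, tuned_sd = base.state_dict(), tuned.state_dict()
merged = {}
for name, param in base_sd.items():
    if param.dtype.is_floating_point:
        merged[name] = weights[0] * param + weights[1] * tuned_sd[name]
    else:
        merged[name] = param  # keep non-float buffers (e.g., position ids) unchanged

base.load_state_dict(merged)
base.save_pretrained("./mixed_model_1")
AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5").save_pretrained("./mixed_model_1")
```

`mix_models_with_data` (used below) builds on the same idea but chooses the merging weights automatically from a handful of example data.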
For example, you can merge the models from [huggingface](https://huggingface.co/Shitao) using the example data for your task:

```python
from LM_Cocktail import mix_models, mix_models_with_data

example_data = [
    {"query": "How does one become an actor in the Telugu Film Industry?", "pos": [" How do I become an actor in Telugu film industry?"], "neg": [" What is the story of Moses and Ramesses?", " Does caste system affect economic growth of India?"]},
    {"query": "Why do some computer programmers develop amazing software or new concepts, while some are stuck with basic programming work?", "pos": [" Why do some computer programmers develops amazing softwares or new concepts, while some are stuck with basics programming works?"], "neg": [" When visiting a friend, do you ever think about what would happen if you did something wildly inappropriate like punch them or destroy their furniture?", " What is the difference between a compliment and flirting?"]}
]

model = mix_models_with_data(
    model_names_or_paths=["BAAI/bge-base-en-v1.5", "Shitao/bge-hotpotqa", "Shitao/bge-quora"],
    model_type='encoder',
    example_ata=example_data,
    temperature=5.0,
    max_input_length=512,
    neg_number=2)
```

**Since there are only 9 `bge-*` models in this [repo](https://huggingface.co/Shitao), the performance may not be satisfactory when your task is different from all 9 fine-tuning tasks. You can fine-tune the base model on more tasks and merge them to achieve better performance on your task.**

### 5. Load your model

After fine-tuning the BGE model, you can load it easily in the same way as [here](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/baai_general_embedding#usage).

Please replace the `query_instruction_for_retrieval` with your instruction if you set a different value for the hyper-parameter `--query_instruction_for_retrieval` when fine-tuning.

### 6. Evaluate model

We provide [a simple script](https://github.com/FlagOpen/FlagEmbedding/blob/master/research/baai_general_embedding/finetune/eval_msmarco.py) to evaluate the model's performance. A brief summary of how the script works:

1. Load the model on all available GPUs through [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html).
2. Encode the corpus and offload the embeddings into a `faiss` Flat index. By default, `faiss` also places the index on all available GPUs.
3. Encode the queries and search the `100` nearest neighbors for each query.
4. Compute Recall and MRR metrics.

First, install `faiss`, a popular approximate nearest neighbor search library:

```bash
conda install -c conda-forge faiss-gpu
```

#### 6.1 MSMARCO dataset

The default evaluation data is MSMARCO, a widely used retrieval benchmark. You can check the data formats for the [msmarco corpus](https://huggingface.co/datasets/namespace-Pt/msmarco-corpus) and [evaluation queries](https://huggingface.co/datasets/namespace-Pt/msmarco).

Run the following command:

```bash
python -m finetune.eval_msmarco \
--encoder BAAI/bge-base-en-v1.5 \
--fp16 \
--add_instruction \
--k 100
```

**some important arguments:**
- `encoder`: specify the encoder model, which can be either a model on huggingface or a local one.
- `fp16`: use half precision for inference.
- `add_instruction`: add the retrieval instruction (`Represent this sentence for searching relevant passages: `).
- `k`: specify how many nearest neighbors to retrieve for each query.
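The reported metrics are standard Recall@k and MRR@k. For reference, a minimal sketch of how they can be computed for a single query (hypothetical variable names, not the evaluation script's actual code):

```python
def recall_at_k(retrieved, positives, k):
    # Fraction of ground-truth positives that appear in the top-k results.
    return len(set(retrieved[:k]) & set(positives)) / len(positives)

def mrr_at_k(retrieved, positives, k):
    # Reciprocal rank of the first relevant result within the top k (0 if none found).
    for rank, doc in enumerate(retrieved[:k], start=1):
        if doc in positives:
            return 1.0 / rank
    return 0.0

# Example: retrieved doc ids for one query, and its ground-truth positives.
retrieved = ["d7", "d2", "d9", "d4"]
positives = ["d2", "d4"]
print(recall_at_k(retrieved, positives, k=3))  # 0.5
print(mrr_at_k(retrieved, positives, k=3))     # 0.5
```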
The results should be similar to

```python
{
    'MRR@1': 0.2330945558739255,
    'MRR@10': 0.35786976395142633,
    'MRR@100': 0.3692618036917553,
    'Recall@1': 0.22606255969436478,
    'Recall@10': 0.6412965616045848,
    'Recall@100': 0.9012774594078318
}
```

#### 6.2 Your dataset

You should prepare two files in jsonl format:
- One is corpus_data, which contains the texts you want to search. A toy example: [toy_corpus.json](./toy_evaluation_data/toy_corpus.json)

```
{"content": "A is ..."}
{"content": "B is ..."}
{"content": "C is ..."}
{"content": "Panda is ..."}
{"content": "... is A"}
```

- The other is query_data, which contains the queries and the ground truth. A toy example: [toy_query.json](./toy_evaluation_data/toy_query.json)

```
{"query": "What is A?", "positive": ["A is ...", "... is A"]}
{"query": "What is B?", "positive": ["B is ..."]}
{"query": "What is C?", "positive": ["C is ..."]}
```

Then, pass the data paths to the evaluation script:

```bash
python -m FlagEmbedding.baai_general_embedding.finetune.eval_msmarco \
--encoder BAAI/bge-base-en-v1.5 \
--fp16 \
--add_instruction \
--k 100 \
--corpus_data ./toy_evaluation_data/toy_corpus.json \
--query_data ./toy_evaluation_data/toy_query.json
```
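If your documents and queries live in Python objects, the two files can be generated in a few lines. A minimal sketch with made-up data (paths and contents are only examples):

```python
import json
import os

os.makedirs("toy_evaluation_data", exist_ok=True)

# Made-up corpus texts and queries; replace with your own data.
corpus = ["A is ...", "B is ...", "C is ...", "Panda is ...", "... is A"]
queries = [
    {"query": "What is A?", "positive": ["A is ...", "... is A"]},
    {"query": "What is B?", "positive": ["B is ..."]},
]

with open("toy_evaluation_data/toy_corpus.json", "w") as f:
    for text in corpus:
        f.write(json.dumps({"content": text}) + "\n")  # one JSON object per line

with open("toy_evaluation_data/toy_query.json", "w") as f:
    for item in queries:
        f.write(json.dumps(item) + "\n")  # "positive" holds the texts of the relevant corpus entries
```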