<div align="center">
<h1>LLM-Embedder [<a href="https://arxiv.org/abs/2310.07554">paper</a>]</h1>
<img src="imgs/llm-embedder.png" width="60%" class="center">
</div>
This is the codebase for LLM-Embedder, a unified embedding model that comprehensively supports the retrieval-augmentation needs of large language models, including knowledge retrieval, memory retrieval, exemplar retrieval, and tool retrieval. It is fine-tuned over 6 tasks:
- *Question Answering (qa)*
- *Conversational Search (convsearch)*
- *Long Conversation (chat)*
- *Long-Range Language Modeling (lrlm)*
- *In-Context Learning (icl)*
- *Tool Learning (tool)*
## Roadmap
- Details about how to fine-tune the LLM-Embedder are [here](docs/fine-tune.md).
- Details about how to evaluate different retrievers on various retrieval-augmented scenarios are [here](docs/evaluation.md).
## Usage
### Using `FlagEmbedding`
```bash
pip install -U FlagEmbedding
```
```python
from FlagEmbedding import FlagModel

INSTRUCTIONS = {
    "qa": {
        "query": "Represent this query for retrieving relevant documents: ",
        "key": "Represent this document for retrieval: ",
    },
    "icl": {
        "query": "Convert this example into vector to look for useful examples: ",
        "key": "Convert this example into vector for retrieval: ",
    },
    "chat": {
        "query": "Embed this dialogue to find useful historical dialogues: ",
        "key": "Embed this historical dialogue for retrieval: ",
    },
    "lrlm": {
        "query": "Embed this text chunk for finding useful historical chunks: ",
        "key": "Embed this historical text chunk for retrieval: ",
    },
    "tool": {
        "query": "Transform this user request for fetching helpful tool descriptions: ",
        "key": "Transform this tool description for retrieval: ",
    },
    "convsearch": {
        "query": "Encode this query and context for searching relevant passages: ",
        "key": "Encode this passage for retrieval: ",
    },
}

# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]

# Encode for a specific task (qa, icl, chat, lrlm, tool, convsearch)
task = "qa"

# Load model (automatically uses the GPUs listed in `devices`)
model = FlagModel(
    'BAAI/llm-embedder',
    use_fp16=False,
    query_instruction_for_retrieval=INSTRUCTIONS[task]['query'],
    passage_instruction_for_retrieval=INSTRUCTIONS[task]['key'],
    devices=['cuda:0'],
)

query_embeddings = model.encode_queries(queries)
key_embeddings = model.encode_corpus(keys)

similarity = query_embeddings @ key_embeddings.T
print(similarity)
# [[0.8971, 0.8534]
#  [0.8462, 0.9091]]
```
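
Each row of `similarity` scores one query against every key, so top-k retrieval is just a sort over columns. A minimal sketch continuing from the snippet above (assuming, as here, that the embeddings come back as NumPy arrays):

```python
import numpy as np

# Rank keys for each query; a higher inner product means more relevant.
ranking = np.argsort(-similarity, axis=1)  # key indices, best first

for qi, query in enumerate(queries):
    best = ranking[qi, 0]
    print(f"{query!r} -> {keys[best]!r} (score={similarity[qi, best]:.4f})")
```
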
### Using `transformers`
```bash
pip install -U transformers
```
```python
import torch
from transformers import AutoTokenizer, AutoModel

INSTRUCTIONS = {
    "qa": {
        "query": "Represent this query for retrieving relevant documents: ",
        "key": "Represent this document for retrieval: ",
    },
    "icl": {
        "query": "Convert this example into vector to look for useful examples: ",
        "key": "Convert this example into vector for retrieval: ",
    },
    "chat": {
        "query": "Embed this dialogue to find useful historical dialogues: ",
        "key": "Embed this historical dialogue for retrieval: ",
    },
    "lrlm": {
        "query": "Embed this text chunk for finding useful historical chunks: ",
        "key": "Embed this historical text chunk for retrieval: ",
    },
    "tool": {
        "query": "Transform this user request for fetching helpful tool descriptions: ",
        "key": "Transform this tool description for retrieval: ",
    },
    "convsearch": {
        "query": "Encode this query and context for searching relevant passages: ",
        "key": "Encode this passage for retrieval: ",
    },
}

# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]

# Load model
tokenizer = AutoTokenizer.from_pretrained('BAAI/llm-embedder')
model = AutoModel.from_pretrained('BAAI/llm-embedder')

# Add instructions for a specific task (qa, icl, chat, lrlm, tool, convsearch)
instruction = INSTRUCTIONS["qa"]
queries = [instruction["query"] + query for query in queries]
keys = [instruction["key"] + key for key in keys]

# Tokenize sentences
query_inputs = tokenizer(queries, padding=True, return_tensors='pt')
key_inputs = tokenizer(keys, padding=True, return_tensors='pt')

# Encode
with torch.no_grad():
    query_outputs = model(**query_inputs)
    key_outputs = model(**key_inputs)

# CLS pooling
query_embeddings = query_outputs.last_hidden_state[:, 0]
key_embeddings = key_outputs.last_hidden_state[:, 0]

# Normalize
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
key_embeddings = torch.nn.functional.normalize(key_embeddings, p=2, dim=1)

similarity = query_embeddings @ key_embeddings.T
print(similarity)
# [[0.8971, 0.8534]
#  [0.8462, 0.9091]]
```
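
The snippet above runs on CPU and does not truncate. For long inputs or larger batches, here is a hedged variant of the query-encoding step, assuming the usual 512-token BERT-style limit for this encoder (check the model config if unsure):

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Truncate to the encoder's maximum length so long texts don't error out
query_inputs = tokenizer(
    queries, padding=True, truncation=True, max_length=512, return_tensors='pt'
).to(device)

with torch.no_grad():
    # CLS pooling, as in the example above
    query_embeddings = model(**query_inputs).last_hidden_state[:, 0]
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1).cpu()
```
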
### Using `sentence-transformers`
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

INSTRUCTIONS = {
    "qa": {
        "query": "Represent this query for retrieving relevant documents: ",
        "key": "Represent this document for retrieval: ",
    },
    "icl": {
        "query": "Convert this example into vector to look for useful examples: ",
        "key": "Convert this example into vector for retrieval: ",
    },
    "chat": {
        "query": "Embed this dialogue to find useful historical dialogues: ",
        "key": "Embed this historical dialogue for retrieval: ",
    },
    "lrlm": {
        "query": "Embed this text chunk for finding useful historical chunks: ",
        "key": "Embed this historical text chunk for retrieval: ",
    },
    "tool": {
        "query": "Transform this user request for fetching helpful tool descriptions: ",
        "key": "Transform this tool description for retrieval: ",
    },
    "convsearch": {
        "query": "Encode this query and context for searching relevant passages: ",
        "key": "Encode this passage for retrieval: ",
    },
}

# Define queries and keys
queries = ["test query 1", "test query 2"]
keys = ["test key 1", "test key 2"]

# Load model
model = SentenceTransformer('BAAI/llm-embedder', device="cpu")

# Add instructions for a specific task (qa, icl, chat, lrlm, tool, convsearch)
instruction = INSTRUCTIONS["qa"]
queries = [instruction["query"] + query for query in queries]
keys = [instruction["key"] + key for key in keys]

# Encode
query_embeddings = model.encode(queries)
key_embeddings = model.encode(keys)

similarity = query_embeddings @ key_embeddings.T
print(similarity)
# [[0.8971, 0.8534]
#  [0.8462, 0.9091]]
```
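
When the corpus is too large for a dense `@` product, the key embeddings can instead go into a nearest-neighbor index. A minimal sketch using FAISS (our assumption, not a dependency of this repo; `pip install faiss-cpu`). Note that inner product matches cosine similarity only when embeddings are L2-normalized, so verify that for your setup first:

```python
import faiss  # assumed extra dependency: pip install faiss-cpu
import numpy as np

# Build an exact inner-product index over the key embeddings
keys_f32 = np.asarray(key_embeddings, dtype="float32")
index = faiss.IndexFlatIP(keys_f32.shape[1])
index.add(keys_f32)

# Retrieve the top-1 key per query
scores, ids = index.search(np.asarray(query_embeddings, dtype="float32"), 1)
for query, i, score in zip(queries, ids[:, 0], scores[:, 0]):
    print(f"{query!r} -> {keys[i]!r} (score={score:.4f})")
```
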
## Contact
If you have any questions or suggestions about this project, feel free to open an issue or a pull request. You can also email Peitian Zhang (namespace.pt@gmail.com).
## Citation
If you find this repository useful, please consider giving it a star ⭐ and citing our work:
```
@misc{zhang2023retrieve,
  title={Retrieve Anything To Augment Large Language Models},
  author={Peitian Zhang and Shitao Xiao and Zheng Liu and Zhicheng Dou and Jian-Yun Nie},
  year={2023},
  eprint={2310.07554},
  archivePrefix={arXiv},
  primaryClass={cs.IR}
}
```