# Q&A Example

A vector database can help LLMs access external knowledge.

You can load `baai-general-embedding` as the encoder to generate the vectors.

Here is an example of building a bot that answers your questions using the knowledge in the Chinese Wikipedia.

Here's a description of the Q&A dialogue scenario using FlagEmbedding and a large language model:

1. **Data Preprocessing and Indexing:**
   - Download a Chinese Wikipedia dataset.
   - Encode the Chinese Wikipedia text using FlagEmbedding.
   - Build an index using BM25.
2. **Query Enhancement with a Large Language Model (LLM):**
   - Use an LLM to enhance and enrich the original user query based on the chat history.
   - The LLM can perform tasks such as text completion and paraphrasing to make the query more robust and comprehensive.
3. **Document Retrieval:**
   - Use BM25 to retrieve the top-n documents from the locally stored Chinese Wikipedia dataset based on the enhanced query.
4. **Embedding Retrieval:**
   - Perform embedding retrieval over the top-n retrieved documents using brute-force search to get the top-k documents.
5. **Answer Generation with the LLM:**
   - Present the question, the top-k retrieved documents, and the chat history to the LLM.
   - The LLM uses its understanding of language and context to provide an accurate and comprehensive answer to the user's question.

By following these steps, the Q&A system leverages FlagEmbedding, BM25 indexing, and a large language model to improve its accuracy: BM25 narrows the corpus cheaply, the embeddings rerank the candidates semantically, and the LLM grounds its answer in the retrieved evidence. Integrating these techniques yields a more reliable Q&A system that answers users' questions with supporting context.

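The retrieval core of steps 3–4 (BM25 candidate generation followed by a brute-force embedding rerank) can be sketched with toy data. The arrays, scores, and helper names below are illustrative stand-ins, not the example's actual code:

```python
import numpy as np

def bm25_topn(scores, n):
    """Return indices of the n highest lexical (BM25) scores."""
    return np.argsort(scores)[::-1][:n]

def rerank_topk(query_emb, doc_embs, candidates, k):
    """Brute-force cosine-similarity rerank of the BM25 candidates."""
    cand = doc_embs[candidates]
    sims = cand @ query_emb / (
        np.linalg.norm(cand, axis=1) * np.linalg.norm(query_emb))
    order = np.argsort(sims)[::-1][:k]
    return [candidates[i] for i in order]

# Toy stand-ins: 6 documents, 4-dim embeddings, fake BM25 scores.
rng = np.random.default_rng(0)
doc_embs = rng.normal(size=(6, 4))
bm25_scores = np.array([0.1, 2.0, 1.5, 0.2, 3.0, 0.05])
query_emb = rng.normal(size=4)

top_n = bm25_topn(bm25_scores, n=4)                    # lexical candidates
top_k = rerank_topk(query_emb, doc_embs, top_n, k=2)   # semantic rerank
```

The two-stage design keeps the expensive dense comparison limited to the n documents BM25 already considers plausible, which is why the brute-force search in step 4 stays affordable.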
### Installation
```shell
sudo apt install default-jdk
pip install -r requirements.txt
conda install -c anaconda openjdk
```
### Prepare Data
```shell
python pre_process.py --data_path ./data
```

This script downloads the dataset (Chinese Wikipedia), builds the BM25 index, runs embedding inference, and saves the results to `data_path`.

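The preprocessing script's internals are not shown here, but the embedding file it produces (`./data/emb/data.npy`, loaded later by `LocalDatasetLoader`) is just a `(num_docs, dim)` matrix saved with NumPy. A minimal sketch of that output format, using a deterministic stand-in encoder in place of the real `BAAI/bge-large-zh` model (the real script also builds a BM25 index):

```python
import numpy as np

def encode(texts, dim=64):
    """Stand-in encoder: hashes characters into a bag-of-chars vector.
    The real script would run the BGE model over each passage instead."""
    vecs = np.zeros((len(texts), dim), dtype=np.float32)
    for i, text in enumerate(texts):
        for ch in text:
            vecs[i, ord(ch) % dim] += 1.0
    # L2-normalize so a dot product equals cosine similarity
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.clip(norms, 1e-9, None)

passages = ["阿波罗17号于1972年12月登月。",   # "Apollo 17 landed on the moon in December 1972."
            "长城是中国古代的防御工事。"]      # "The Great Wall is an ancient Chinese fortification."
emb = encode(passages)
np.save("data.npy", emb)       # one row of embeddings per passage

loaded = np.load("data.npy")
```

Storing the whole matrix up front is what makes the later brute-force search a single matrix multiply against the query embedding.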
## Q&A Usage

### Run Directly
```shell
export OPENAI_API_KEY=...
python run.py --data_path ./data
```
This script will build a Q&A dialogue scenario.
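The internals of `run.py` are not shown here, but conceptually the dialogue loop it builds ties the numbered steps above together. In this sketch the LLM and retrieval calls are stubbed out with hypothetical helpers; the real script uses the OpenAI API (via `OPENAI_API_KEY`) and the index built during preprocessing:

```python
def rewrite_query(history, question):
    """Stub for step 2: a real implementation asks the LLM to rewrite
    the question so it is self-contained given the chat history."""
    context = " ".join(q for q, _ in history[-2:])
    return (context + " " + question).strip()

def generate_answer(docs, history, question):
    """Stub for step 5: a real implementation prompts the LLM with the
    retrieved documents, the chat history, and the question."""
    return f"Based on {len(docs)} documents: ..."

def chat(retrieve, history, question):
    query = rewrite_query(history, question)        # step 2: query enhancement
    docs = retrieve(query)                          # steps 3-4: retrieval
    reply = generate_answer(docs, history, question)  # step 5: generation
    history.append((question, reply))               # keep context for next turn
    return reply

history = []
reply = chat(lambda q: ["doc1", "doc2"], history, "上次有人登月是什么时候")
```

Carrying `history` through both the rewrite and the answer prompt is what lets follow-up questions ("那是谁去的？") resolve against earlier turns.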
### Quick Start
```python
from tool import LocalDatasetLoader, BMVectorIndex, Agent

# Load the preprocessed passages and their precomputed embeddings
loader = LocalDatasetLoader(data_path="./data/dataset",
                            embedding_path="./data/emb/data.npy")

# Combine the BM25 index with the BGE encoder for two-stage retrieval
index = BMVectorIndex(model_path="BAAI/bge-large-zh",
                      bm_index_path="./data/index",
                      data_loader=loader)

agent = Agent(index)

question = "上次有人登月是什么时候"  # "When was the last time someone landed on the moon?"
agent.Answer(question, RANKING=1000, TOP_N=5, verbose=False)
```