# Evaluate on MIRACL

[MIRACL](https://project-miracl.github.io/) (Multilingual Information Retrieval Across a Continuum of Languages) is a WSDM 2023 Cup challenge that focuses on search across 18 different languages. The organizers released a multilingual retrieval dataset containing train and dev sets for 16 “known languages” and only a dev set for 2 “surprise languages”. The topics are generated by native speakers of each language, who also label the relevance between the topics and a given document list. You can find the dataset on HuggingFace.

Note: We highly recommend running the MIRACL evaluation on GPU. For reference, the whole process takes about an hour on an 8xA100 40G node.

## 0. Installation

First install the libraries we are using:

In [None]:
%pip install FlagEmbedding pytrec_eval

## 1. Dataset

With its large number of passages and articles across 18 languages, MIRACL is a rich resource for training or evaluating multilingual models. The data can be downloaded from [Hugging Face](https://huggingface.co/datasets/miracl/miracl).

| Language        | # of Passages | # of Articles |
|:----------------|--------------:|--------------:|
| Arabic (ar)     |     2,061,414 |       656,982 |
| Bengali (bn)    |       297,265 |        63,762 |
| English (en)    |    32,893,221 |     5,758,285 |
| Spanish (es)    |    10,373,953 |     1,669,181 |
| Persian (fa)    |     2,207,172 |       857,827 |
| Finnish (fi)    |     1,883,509 |       447,815 |
| French (fr)     |    14,636,953 |     2,325,608 |
| Hindi (hi)      |       506,264 |       148,107 |
| Indonesian (id) |     1,446,315 |       446,330 |
| Japanese (ja)   |     6,953,614 |     1,133,444 |
| Korean (ko)     |     1,486,752 |       437,373 |
| Russian (ru)    |     9,543,918 |     1,476,045 |
| Swahili (sw)    |       131,924 |        47,793 |
| Telugu (te)     |       518,079 |        66,353 |
| Thai (th)       |       542,166 |       128,179 |
| Chinese (zh)    |     4,934,368 |     1,246,389 |

In [38]:
from datasets import load_dataset

lang = "en"
corpus = load_dataset("miracl/miracl-corpus", lang, trust_remote_code=True)['train']

Each passage in the corpus has three fields: `docid`, `title`, and `text`. A `docid` has the form `x#y`, where `x` is the id of the Wikipedia article and `y` is the index of the passage within that article. The `title` is the name of the article with id `x` that the passage belongs to, and `text` is the body of the passage.

In [39]:
corpus[0]

{'docid': '56672809#4',
 'title': 'Glen Tomasetti',
 'text': 'In 1967 Tomasetti was prosecuted after refusing to pay one sixth of her taxes on the grounds that one sixth of the federal budget was funding Australia\'s military presence in Vietnam. In court she argued that Australia\'s participation in the Vietnam War violated its international legal obligations as a member of the United Nations. Public figures such as Joan Baez had made similar protests in the USA, but Tomasetti\'s prosecution was "believed to be the first case of its kind in Australia", according to a contemporary news report. Tomasetti was eventually ordered to pay the unpaid taxes.'}
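As a minimal sketch of the `x#y` convention described above, a `docid` can be split into its article id and passage index. The helper name `split_docid` is hypothetical and not part of the dataset or FlagEmbedding.

```python
def split_docid(docid):
    """Split a MIRACL-style docid 'x#y' into (article_id, passage_index).

    Hypothetical helper for illustration only.
    """
    # Split on the last '#' so the article id is kept intact.
    article_id, passage_index = docid.rsplit("#", 1)
    return article_id, int(passage_index)


article_id, passage_index = split_docid("56672809#4")
print(article_id, passage_index)
```

Grouping passages by `article_id` in this way can be handy when inspecting which articles dominate the retrieved results.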

The qrels have the following form:

In [40]:
dev = load_dataset('miracl/miracl', lang, trust_remote_code=True)['dev']

In [41]:
dev[0]

{'query_id': '0',
 'query': 'Is Creole a pidgin of French?',
 'positive_passages': [{'docid': '462221#4',
   'text': "At the end of World War II in 1945, Korea was divided into North Korea and South Korea with North Korea (assisted by the Soviet Union), becoming a communist government after 1946, known as the Democratic People's Republic, followed by South Korea becoming the Republic of Korea. China became the communist People's Republic of China in 1949. In 1950, the Soviet Union backed North Korea while the United States backed South Korea, and China allied with the Soviet Union in what was to become the first military action of the Cold War.",
   'title': 'Eighth United States Army'},
  {'docid': '29810#23',
   'text': 'The large size of Texas and its location at the intersection of multiple climate zones gives the state highly variable weather. The Panhandle of the state has colder winters than North Texas, while the Gulf Coast has mild winters. Texas has wide variations in precipi

Each item has four fields: `query_id`, `query`, `positive_passages`, and `negative_passages`. Here, `query_id` and `query` are the id and text of the query. `positive_passages` and `negative_passages` are lists of passages, each with its `docid`, `title`, and `text`.

This structure is the same in the `train`, `dev`, `testA` and `testB` sets.
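Given this structure, relevance judgments can in principle be derived directly from the items. The sketch below assumes positives map to relevance 1 and negatives to 0. The function `build_qrels` and the toy records are hypothetical; this notebook itself reads the qrels from the official TSV file later on.

```python
def build_qrels(items):
    """Build a {query_id: {docid: relevance}} dict from MIRACL-style items.

    Assumption: positive passages get relevance 1, negative passages 0.
    """
    qrels = {}
    for item in items:
        judged = qrels.setdefault(item["query_id"], {})
        for passage in item.get("positive_passages", []):
            judged[passage["docid"]] = 1
        for passage in item.get("negative_passages", []):
            judged[passage["docid"]] = 0
    return qrels


# Toy records with made-up ids, shaped like the dev items shown above.
toy_items = [
    {
        "query_id": "q1",
        "query": "example query",
        "positive_passages": [{"docid": "10#0", "title": "A", "text": "..."}],
        "negative_passages": [{"docid": "11#2", "title": "B", "text": "..."}],
    }
]
toy_qrels = build_qrels(toy_items)
print(toy_qrels)
```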

Next, we extract the ids and text of the queries and the corpus. The qrels of the dev set are loaded later, in the evaluation step.

In [42]:
corpus_ids = corpus['docid']
corpus_text = []
for doc in corpus:
    corpus_text.append(f"{doc['title']} {doc['text']}".strip())

queries_ids = dev['query_id']
queries_text = dev['query']

## 2. Evaluate from scratch

### 2.1 Embedding

This demo uses bge-base-en-v1.5; feel free to switch to the model you prefer.

In [43]:
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['SETUPTOOLS_USE_DISTUTILS'] = ''

In [44]:
from FlagEmbedding import FlagModel

# get the BGE embedding model
model = FlagModel('BAAI/bge-base-en-v1.5')

# get the embedding of the queries and corpus
queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)

print("shape of the embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)

initial target device: 100%|██████████| 8/8 [00:29<00:00,  3.66s/it]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 52.84it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 55.15it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 56.49it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 55.22it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 49.22it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 54.69it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 49.16it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 50.77it/s]
Chunks: 100%|██████████| 8/8 [00:10<00:00,  1.27s/it]
pre tokenize: 100%|██████████| 16062/16062 [08:12<00:00, 32.58it/s]  
pre tokenize: 100%|██████████| 16062/16062 [08:44<00:00, 30.60it/s]
pre tokenize: 100%|██████████| 16062/16062 [08:39<00:00, 30.90it/s]
pre tokenize: 100%|██████████| 16062/16062 [09:04<00:00, 29.49it/s]
pre tokenize: 100%|██████████| 16062/16062 [09:27<00:00, 28.29it/s]
pre tokenize: 100%|████████

shape of the embeddings: (32893221, 768)
data type of the embeddings:  float16


### 2.2 Indexing

Create a Faiss index to store the embeddings.

In [45]:
import faiss
import numpy as np

# get the length of our embedding vectors, vectors by bge-base-en-v1.5 have length 768
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")

total number of vectors: 32893221


### 2.3 Searching

Use the Faiss index to search for each query.

In [46]:
from tqdm import tqdm

query_size = len(queries_embeddings)

all_scores = []
all_indices = []

for i in tqdm(range(0, query_size, 32), desc="Searching"):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    all_scores.append(score)
    all_indices.append(indice)

all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)

Searching: 100%|██████████| 25/25 [15:03<00:00, 36.15s/it]
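For intuition, a flat inner-product index performs an exhaustive search: it scores the query against every stored vector and keeps the top `k`. The pure-Python sketch below illustrates that idea on toy vectors. It is not how Faiss is implemented, and `flat_ip_search` is a hypothetical name.

```python
import heapq


def flat_ip_search(query, corpus_vectors, k):
    """Exhaustive top-k search by inner product (illustrative only)."""
    scored = (
        (sum(q * c for q, c in zip(query, vector)), idx)
        for idx, vector in enumerate(corpus_vectors)
    )
    # Keep the k (score, index) pairs with the highest scores.
    top = heapq.nlargest(k, scored)
    return [score for score, _ in top], [idx for _, idx in top]


toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_scores, toy_indices = flat_ip_search([1.0, 0.0], toy_corpus, k=2)
print(toy_scores, toy_indices)
```

Because every query is compared with every vector, the cost grows linearly with corpus size, which is consistent with the long search time over roughly 33 million passages above.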


Then map the search results back to the indices in the dataset.

In [47]:
results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    # use `doc_idx` rather than `index` to avoid shadowing the Faiss index
    for score, doc_idx in zip(scores, indices):
        if doc_idx != -1:
            results[queries_ids[idx]][corpus_ids[doc_idx]] = float(score)

### 2.4 Evaluating

Download the qrels file for evaluation:

In [48]:
endpoint = os.getenv('HF_ENDPOINT', 'https://huggingface.co')
file_name = "qrels.miracl-v1.0-en-dev.tsv"
# shell command that downloads the qrels file with wget
wget_cmd = f"wget {endpoint}/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/{file_name}"

os.system(wget_cmd)

--2024-11-21 10:26:16--  https://hf-mirror.com/datasets/miracl/miracl/resolve/main/miracl-v1.0-en/qrels/qrels.miracl-v1.0-en-dev.tsv
Resolving hf-mirror.com (hf-mirror.com)... 153.121.57.40, 133.242.169.68, 160.16.199.204
Connecting to hf-mirror.com (hf-mirror.com)|153.121.57.40|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 167817 (164K) [text/plain]
Saving to: ‘qrels.miracl-v1.0-en-dev.tsv’

     0K .......... .......... .......... .......... .......... 30%  109K 1s
    50K .......... .......... .......... .......... .......... 61% 44.5K 1s
   100K .......... .......... .......... .......... .......... 91% 69.6K 0s
   150K .......... ...                                        100% 28.0K=2.8s

2024-11-21 10:26:20 (58.6 KB/s) - ‘qrels.miracl-v1.0-en-dev.tsv’ saved [167817/167817]



0

Read the qrels from the file:

In [49]:
qrels_dict = {}
with open(file_name, "r", encoding="utf-8") as f:
    for line in f:
        qid, _, docid, rel = line.strip().split("\t")
        qid, docid, rel = str(qid), str(docid), int(rel)
        if qid not in qrels_dict:
            qrels_dict[qid] = {}
        qrels_dict[qid][docid] = rel

Finally, use the [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library to compute the selected metrics:

In [50]:
import pytrec_eval
from collections import defaultdict

ndcg_string = "ndcg_cut." + ",".join([str(k) for k in [10,100]])
recall_string = "recall." + ",".join([str(k) for k in [10,100]])

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {ndcg_string, recall_string}
)
scores = evaluator.evaluate(results)

all_ndcgs, all_recalls = defaultdict(list), defaultdict(list)
for query_id in scores.keys():
    for k in [10,100]:
        all_ndcgs[f"NDCG@{k}"].append(scores[query_id]["ndcg_cut_" + str(k)])
        all_recalls[f"Recall@{k}"].append(scores[query_id]["recall_" + str(k)])

ndcg, recall = (
    all_ndcgs.copy(),
    all_recalls.copy(),
)

for k in [10,100]:
    ndcg[f"NDCG@{k}"] = round(sum(ndcg[f"NDCG@{k}"]) / len(scores), 5)
    recall[f"Recall@{k}"] = round(sum(recall[f"Recall@{k}"]) / len(scores), 5)

print(ndcg)
print(recall)

defaultdict(<class 'list'>, {'NDCG@10': 0.46073, 'NDCG@100': 0.54336})
defaultdict(<class 'list'>, {'Recall@10': 0.55972, 'Recall@100': 0.83827})
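To make the two metrics concrete, here is a minimal per-query sketch of Recall@k and NDCG@k. It assumes the common trec_eval-style definition with linear gain and a `log2(rank + 1)` discount. The exact conventions inside pytrec_eval may differ in details such as tie handling, so treat this as an illustration rather than a reimplementation.

```python
import math


def recall_at_k(ranked_docids, rels, k):
    """Fraction of relevant documents that appear in the top k."""
    relevant = {docid for docid, rel in rels.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_docids, rels, k):
    """NDCG@k with linear gain; ranks are 1-based, so the discount is log2(rank + 1)."""
    dcg = sum(
        rels.get(docid, 0) / math.log2(pos + 2)
        for pos, docid in enumerate(ranked_docids[:k])
    )
    ideal_rels = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal_rels))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example with made-up ids: three relevant documents, two of them retrieved.
toy_rels = {"a": 1, "c": 1, "d": 1}
toy_ranking = ["a", "b", "c"]
print(recall_at_k(toy_ranking, toy_rels, 3))
print(ndcg_at_k(toy_ranking, toy_rels, 3))
```

The dataset-level numbers printed above are then simply the mean of such per-query values over all evaluated queries.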


## 3. Evaluate using FlagEmbedding

We provide standalone evaluation pipelines for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in the [example](../../examples/evaluation/miracl/eval_miracl.sh) folder.

In [None]:
import sys

arguments = """- \
    --eval_name miracl \
    --dataset_dir ./miracl/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./miracl/corpus_embd \
    --output_dir ./miracl/search_results \
    --search_top_k 100 \
    --cache_path ./cache/data \
    --overwrite True \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./miracl/miracl_eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
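The cell above fakes a command line inside the notebook: the leading `-` stands in for the program name in `sys.argv[0]`, and the rest is split on whitespace into flags and values. The sketch below shows the same mechanism with the standard `argparse` module and a small hypothetical parser. It is not the parser FlagEmbedding uses, and only a few of the flags are mirrored.

```python
import argparse

# Backslash-newline inside a regular string literal joins the lines,
# so this is one long line of whitespace-separated tokens.
toy_arguments = """- \
    --dataset_names en \
    --k_values 10 100 \
    --search_top_k 100
"""
toy_argv = toy_arguments.split()

# Hypothetical parser for illustration only.
toy_parser = argparse.ArgumentParser()
toy_parser.add_argument("--dataset_names", nargs="+")
toy_parser.add_argument("--k_values", nargs="+", type=int)
toy_parser.add_argument("--search_top_k", type=int)

# Skip the placeholder program name, as parsers do with sys.argv[0].
toy_args = toy_parser.parse_args(toy_argv[1:])
print(toy_args)
```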

In [3]:
from transformers import HfArgumentParser

from FlagEmbedding.evaluation.miracl import (
    MIRACLEvalArgs, MIRACLEvalModelArgs,
    MIRACLEvalRunner
)


parser = HfArgumentParser((
    MIRACLEvalArgs,
    MIRACLEvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: MIRACLEvalArgs
model_args: MIRACLEvalModelArgs

runner = MIRACLEvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()

initial target device: 100%|██████████| 2/2 [00:09<00:00,  4.98s/it]
pre tokenize: 100%|██████████| 16062/16062 [18:01<00:00, 14.85it/s]  
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████| 16062/16062 [18:44<00:00, 14.29it/s]92s/it]
Inference Embeddings:   0%|          | 42/16062 [00:54<8:28:19,  1.90s/it]You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 16062/16062 [48:29<00:00,  5.52it/s] 
Inference Embeddings: 100%|██████████| 16062/16062 [48:55<00:00,  5.47it/s]
Chunks: 100%|██████████| 2/2 [1:10:57<00:00, 2128.54s/it]

In [4]:
with open('miracl/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())

{
    "en-dev": {
        "ndcg_at_10": 0.46053,
        "ndcg_at_100": 0.54313,
        "map_at_10": 0.35928,
        "map_at_100": 0.38726,
        "recall_at_10": 0.55972,
        "recall_at_100": 0.83809,
        "precision_at_10": 0.14018,
        "precision_at_100": 0.02347,
        "mrr_at_10": 0.54328,
        "mrr_at_100": 0.54929
    }
}
