# Evaluate on MLDR

[MLDR](https://huggingface.co/datasets/Shitao/MLDR) is a Multilingual Long-Document Retrieval dataset built on Wikipedia, Wudao and mC4, covering 13 typologically diverse languages. Specifically, we sample lengthy articles from the Wikipedia, Wudao and mC4 datasets and randomly choose paragraphs from them. Then we use GPT-3.5 to generate questions based on these paragraphs. Each generated question, together with the article it was sampled from, forms a new text pair in the dataset.

## 0. Installation

First install the libraries we are using:

In [None]:
%pip install FlagEmbedding pytrec_eval

## 1. Dataset

Download the dataset of 13 different languages from [Hugging Face](https://huggingface.co/datasets/Shitao/MLDR).

| Language Code |  Language  |      Source      | #train  | #dev  | #test | #corpus | Avg. Length of Docs |
| :-----------: | :--------: | :--------------: | :-----: | :---: | :---: | :-----: | :-----------------: |
|      ar       |   Arabic   |    Wikipedia     |  1,817  |  200  |  200  |  7,607  |        9,428        |
|      de       |   German   |  Wikipedia, mC4  |  1,847  |  200  |  200  | 10,000  |        9,039        |
|      en       |  English   |    Wikipedia     | 10,000  |  200  |  800  | 200,000 |        3,308        |
|      es       |  Spanish   |  Wikipedia, mC4  |  2,254  |  200  |  200  |  9,551  |        8,771        |
|      fr       |   French   |    Wikipedia     |  1,608  |  200  |  200  | 10,000  |        9,659        |
|      hi       |   Hindi    |    Wikipedia     |  1,618  |  200  |  200  |  3,806  |        5,555        |
|      it       |  Italian   |    Wikipedia     |  2,151  |  200  |  200  | 10,000  |        9,195        |
|      ja       |  Japanese  |    Wikipedia     |  2,262  |  200  |  200  | 10,000  |        9,297        |
|      ko       |   Korean   |    Wikipedia     |  2,198  |  200  |  200  |  6,176  |        7,832        |
|      pt       | Portuguese |    Wikipedia     |  1,845  |  200  |  200  |  6,569  |        7,922        |
|      ru       |  Russian   |    Wikipedia     |  1,864  |  200  |  200  | 10,000  |        9,723        |
|      th       |    Thai    |       mC4        |  1,970  |  200  |  200  | 10,000  |        8,089        |
|      zh       |  Chinese   | Wikipedia, Wudao | 10,000  |  200  |  800  | 200,000 |        4,249        |
|     Total     |     -      |        -         | 41,434  | 2,600 | 3,800 | 493,709 |        4,737        |

First download the queries and corresponding qrels:

In [22]:
from datasets import load_dataset

lang = "en"
dataset = load_dataset('Shitao/MLDR', lang, trust_remote_code=True)

Each item has four parts: `query_id`, `query`, `positive_passages`, and `negative_passages`. `query_id` and `query` are the id and text of the query. `positive_passages` and `negative_passages` are lists of passages, each with its own `docid` and `text`.

In [23]:
dataset['dev'][0]

{'query_id': 'q-en-1',
 'query': 'What is the syntax for the shorthand of the conditional operator in PHP 5.3?',
 'positive_passages': [{'docid': 'doc-en-8',
   'text': 'In computer programming,  is a ternary operator that is part of the syntax for basic conditional expressions in several programming languages. It is commonly referred to as the conditional operator, inline if (iif), or ternary if. An expression  evaluates to  if the value of  is true, and otherwise to . One can read it aloud as "if a then b otherwise c".\n\nIt originally comes from CPL, in which equivalent syntax for e1 ? e2 : e3 was e1 → e2, e3.\n\nAlthough many ternary operators are possible, the conditional operator is so common, and other ternary operators so rare, that the conditional operator is commonly referred to as the ternary operator.\n\nVariations\nThe detailed semantics of "the" ternary operator as well as its syntax differs significantly from language to language.\n\nA top level distinction from one lang

Each passage in the corpus has two parts: `docid` and `text`. The `docid` has the form `doc-<language>-<id>`.

In [24]:
corpus = load_dataset('Shitao/MLDR', f"corpus-{lang}", trust_remote_code=True)['corpus']

In [33]:
corpus[0]

{'docid': 'doc-en-9633',
 'text': 'Mars Hill Church was a Christian megachurch, founded by Mark Driscoll, Lief Moi, and Mike Gunn. It was a multi-site church based in Seattle, Washington and grew from a home Bible study to 15 locations in 4 U.S. states. Services were offered at its 15 locations; the church also podcast content of weekend services, and of conferences, on the Internet with more than 260,000 sermon views online every week. In 2013, Mars Hill had a membership of 6,489 and average weekly attendance of 12,329. Following controversy in 2014 involving founding pastor Mark Driscoll, attendance dropped to 8,0009,000 people per week.\n\nAt the end of September, 2014, an investigation by the church elders found "bullying" and "patterns of persistent sinful behavior" by Driscoll. The church elders crafted a "restoration" plan to help Driscoll and save the church. Instead, Driscoll declined the restoration plan and resigned. On October 31, 2014, lead pastor Dave Bruskas announced pl
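Because every `docid` follows the `doc-<language>-<id>` pattern, it can be split back into its parts when needed, for example to check that a retrieved passage belongs to the expected language. The helper below is a minimal sketch; `parse_docid` is a hypothetical name and not part of the dataset or FlagEmbedding, and the id is kept as a string because nothing here guarantees it is always numeric.

```python
def parse_docid(docid: str) -> tuple[str, str]:
    """Split an MLDR-style docid into (language, id).

    Assumes the 'doc-<language>-<id>' layout described in this notebook.
    """
    # maxsplit=2 keeps any further dashes inside the id part
    prefix, language, raw_id = docid.split("-", 2)
    if prefix != "doc":
        raise ValueError(f"unexpected docid prefix: {docid!r}")
    return language, raw_id


print(parse_docid("doc-en-8"))     # ('en', '8')
print(parse_docid("doc-en-9633"))  # ('en', '9633')
```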

Then we extract the ids and text of the queries and the corpus, ready for embedding and searching.

In [25]:
corpus_ids = corpus['docid']
corpus_text = corpus['text']

queries_ids = dataset['dev']['query_id']
queries_text = dataset['dev']['query']

## 2. Evaluate from scratch

### 2.1 Embedding

In this demo we use bge-base-en-v1.5. Feel free to switch to the model you prefer.

In [26]:
import os 
os.environ['TRANSFORMERS_NO_ADVISORY_WARNINGS'] = 'true'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

In [27]:
from FlagEmbedding import FlagModel

# get the BGE embedding model
# optionally pass query_instruction_for_retrieval="Represent this sentence for searching relevant passages:"
model = FlagModel('BAAI/bge-base-en-v1.5')

# get the embedding of the queries and corpus
queries_embeddings = model.encode_queries(queries_text)
corpus_embeddings = model.encode_corpus(corpus_text)

print("shape of the embeddings:", corpus_embeddings.shape)
print("data type of the embeddings: ", corpus_embeddings.dtype)

pre tokenize: 100%|██████████| 1/1 [00:00<00:00, 60.08it/s]
pre tokenize: 100%|██████████| 782/782 [02:22<00:00,  5.50it/s]
Inference Embeddings: 100%|██████████| 782/782 [02:47<00:00,  4.66it/s]


shape of the embeddings: (200000, 768)
data type of the embeddings:  float16


### 2.2 Indexing

Create a Faiss index to store the embeddings.

In [28]:
import faiss
import numpy as np

# get the dimension of our embedding vectors; bge-base-en-v1.5 produces 768-dimensional vectors
dim = corpus_embeddings.shape[-1]

# create the faiss index and store the corpus embeddings into the vector space
index = faiss.index_factory(dim, 'Flat', faiss.METRIC_INNER_PRODUCT)
corpus_embeddings = corpus_embeddings.astype(np.float32)
# train and add the embeddings to the index
index.train(corpus_embeddings)
index.add(corpus_embeddings)

print(f"total number of vectors: {index.ntotal}")

total number of vectors: 200000
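To make it concrete what a `Flat` index with `METRIC_INNER_PRODUCT` does conceptually, the sketch below runs the same kind of exhaustive search in plain Python: score every stored vector against the query by inner product and keep the `k` best. This is an illustration only, not how Faiss is implemented; `flat_ip_search` and the toy vectors are made up for this example. When the embeddings are L2-normalized, the inner product equals cosine similarity.

```python
import heapq


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(corpus_vectors, query_vector, k):
    """Exhaustive top-k search by inner product (toy version)."""
    scored = [
        (inner_product(query_vector, vec), position)
        for position, vec in enumerate(corpus_vectors)
    ]
    # keep the k highest-scoring (score, position) pairs, best first
    top = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [s for s, _ in top], [p for _, p in top]


# hypothetical 2-dimensional "embeddings"
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.8, 0.6]

toy_scores, toy_indices = flat_ip_search(toy_corpus, toy_query, k=2)
print(toy_indices)  # [2, 0]
```

An exhaustive search like this is exact but scales linearly with the corpus size, which is acceptable for the 200,000 passages used here.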


### 2.3 Searching

Use the Faiss index to retrieve the top 100 passages for each query, processing the queries in batches of 32.

In [29]:
from tqdm import tqdm

query_size = len(queries_embeddings)

all_scores = []
all_indices = []

for i in tqdm(range(0, query_size, 32), desc="Searching"):
    j = min(i + 32, query_size)
    query_embedding = queries_embeddings[i: j]
    score, indice = index.search(query_embedding.astype(np.float32), k=100)
    all_scores.append(score)
    all_indices.append(indice)

all_scores = np.concatenate(all_scores, axis=0)
all_indices = np.concatenate(all_indices, axis=0)

Searching: 100%|██████████| 7/7 [00:01<00:00,  5.15it/s]


In [30]:
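results = {}
for idx, (scores, indices) in enumerate(zip(all_scores, all_indices)):
    results[queries_ids[idx]] = {}
    # use doc_idx rather than index so the Faiss index object is not overwritten
    for score, doc_idx in zip(scores, indices):
        # Faiss pads with -1 when fewer than k results are found
        if doc_idx != -1:
            results[queries_ids[idx]][corpus_ids[int(doc_idx)]] = float(score)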

### 2.4 Evaluating

Process the qrels into a nested dictionary that maps each query id to its labeled docids: 1 for positive passages and 0 for negative ones.

In [31]:
qrels_dict = {}
for data in dataset['dev']:
    qid = str(data["query_id"])
    if qid not in qrels_dict:
        qrels_dict[qid] = {}
    for doc in data["positive_passages"]:
        docid = str(doc["docid"])
        qrels_dict[qid][docid] = 1
    for doc in data["negative_passages"]:
        docid = str(doc["docid"])
        qrels_dict[qid][docid] = 0

Finally, use the [pytrec_eval](https://github.com/cvangysel/pytrec_eval) library to calculate the scores for the selected metrics:

In [32]:
import pytrec_eval
from collections import defaultdict

ndcg_string = "ndcg_cut." + ",".join([str(k) for k in [10,100]])
recall_string = "recall." + ",".join([str(k) for k in [10,100]])

evaluator = pytrec_eval.RelevanceEvaluator(
    qrels_dict, {ndcg_string, recall_string}
)
scores = evaluator.evaluate(results)

all_ndcgs, all_recalls = defaultdict(list), defaultdict(list)
for query_id in scores.keys():
    for k in [10,100]:
        all_ndcgs[f"NDCG@{k}"].append(scores[query_id]["ndcg_cut_" + str(k)])
        all_recalls[f"Recall@{k}"].append(scores[query_id]["recall_" + str(k)])

ndcg, recall = (
    all_ndcgs.copy(),
    all_recalls.copy(),
)

for k in [10,100]:
    ndcg[f"NDCG@{k}"] = round(sum(ndcg[f"NDCG@{k}"]) / len(scores), 5)
    recall[f"Recall@{k}"] = round(sum(recall[f"Recall@{k}"]) / len(scores), 5)

print(ndcg)
print(recall)

defaultdict(<class 'list'>, {'NDCG@10': 0.35304, 'NDCG@100': 0.38694})
defaultdict(<class 'list'>, {'Recall@10': 0.465, 'Recall@100': 0.625})
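For intuition about what these two metrics measure, here is a minimal pure Python sketch of Recall@k and NDCG@k for a single query with binary relevance labels. It follows the textbook definitions (gain of 1 for a relevant document, `log2(rank + 1)` discount) and is meant as an illustration rather than a replacement for pytrec_eval; the function names and document ids are hypothetical.

```python
import math


def recall_at_k(ranked_docids, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for docid in ranked_docids[:k] if docid in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked_docids, relevant, k):
    """Binary-relevance NDCG with a log2(rank + 1) discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, docid in enumerate(ranked_docids[:k], start=1)
        if docid in relevant
    )
    # the ideal ranking puts all relevant documents at the top
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["doc-a", "doc-b", "doc-c"]  # retrieved order, best first
relevant = {"doc-b"}                   # the single relevant document

print(recall_at_k(ranking, relevant, k=10))          # 1.0
print(round(ndcg_at_k(ranking, relevant, k=10), 5))  # 0.63093
```

The relevant document is found, so recall is 1.0, but NDCG is penalized because it sits at rank 2 instead of rank 1.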


## 3. Evaluate using FlagEmbedding

FlagEmbedding provides standalone evaluation pipelines for popular datasets and benchmarks. Try the following code to run the evaluation, or run the shell script provided in the [example](../../examples/evaluation/mldr/eval_mldr.sh) folder.

In [2]:
import sys

arguments = """- \
    --eval_name mldr \
    --dataset_dir ./mldr/data \
    --dataset_names en \
    --splits dev \
    --corpus_embd_save_dir ./mldr/corpus_embd \
    --output_dir ./mldr/search_results \
    --search_top_k 1000 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./mldr/mldr_eval_results.md \
    --eval_metrics ndcg_at_10 \
    --embedder_name_or_path BAAI/bge-base-en-v1.5 \
    --devices cuda:0 cuda:1 \
    --embedder_batch_size 1024
""".replace('\n','')

sys.argv = arguments.split()
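The cell above mimics a command line from inside a notebook. The trailing backslashes join the lines of the string, and `split()` turns it into a token list. argparse-style parsers skip the first element of `sys.argv` (normally the program name), which is why the string starts with a placeholder `-`. A shortened sketch with only two of the options shows the resulting tokens; the `toy_` names are used here only to avoid touching `sys.argv`.

```python
toy_arguments = """- \
    --eval_name mldr \
    --k_values 10 100
""".replace('\n', '')

toy_argv = toy_arguments.split()
print(toy_argv)
# ['-', '--eval_name', 'mldr', '--k_values', '10', '100']

# everything after the placeholder is what the parser actually sees
print(toy_argv[1:])
# ['--eval_name', 'mldr', '--k_values', '10', '100']
```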

In [4]:
from transformers import HfArgumentParser

from FlagEmbedding.evaluation.mldr import (
    MLDREvalArgs, MLDREvalModelArgs,
    MLDREvalRunner
)


parser = HfArgumentParser((
    MLDREvalArgs,
    MLDREvalModelArgs
))

eval_args, model_args = parser.parse_args_into_dataclasses()
eval_args: MLDREvalArgs
model_args: MLDREvalModelArgs

runner = MLDREvalRunner(
    eval_args=eval_args,
    model_args=model_args
)

runner.run()

  from .autonotebook import tqdm as notebook_tqdm
initial target device: 100%|██████████| 2/2 [00:07<00:00,  3.54s/it]
pre tokenize: 100%|██████████| 98/98 [01:01<00:00,  1.58it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
pre tokenize: 100%|██████████| 98/98 [01:07<00:00,  1.44it/s]09it/s]
You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Inference Embeddings: 100%|██████████| 98/98 [01:22<00:00,  1.19it/s]
Inference Embeddings: 100%|██████████| 98/98 [01:23<00:00,  1.17it/s]
Chunks: 100%|██████████| 2/2 [02:40<00:00, 80.21s/it] 
pre tokenize: 100%|██████████| 1/1 [00:00<00:00,  2.16it/s]
pre tokenize: 100%|██████████| 1/1 [00:00<0

In [5]:
with open('mldr/search_results/bge-base-en-v1.5/NoReranker/EVAL/eval_results.json', 'r') as content_file:
    print(content_file.read())

{
    "en-dev": {
        "ndcg_at_10": 0.35304,
        "ndcg_at_100": 0.38694,
        "map_at_10": 0.31783,
        "map_at_100": 0.32469,
        "recall_at_10": 0.465,
        "recall_at_100": 0.625,
        "precision_at_10": 0.0465,
        "precision_at_100": 0.00625,
        "mrr_at_10": 0.31783,
        "mrr_at_100": 0.32469
    }
}
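The NDCG and recall values match the from-scratch results in Section 2. The other numbers also follow a pattern: precision@k equals recall@k divided by k, and MAP equals MRR. That is what one would expect if each dev query has exactly one relevant passage, although this is an inference from the reported numbers rather than something the output states. The sketch below checks the arithmetic on the values copied from the JSON above; the helper name is hypothetical.

```python
import math

# values copied from the eval_results.json output above
reported = {
    "recall_at_10": 0.465, "precision_at_10": 0.0465,
    "recall_at_100": 0.625, "precision_at_100": 0.00625,
    "map_at_10": 0.31783, "mrr_at_10": 0.31783,
    "map_at_100": 0.32469, "mrr_at_100": 0.32469,
}


def consistent_with_single_positive(metrics, k_values=(10, 100), tol=1e-5):
    """Check precision@k == recall@k / k and MAP@k == MRR@k up to rounding."""
    for k in k_values:
        precision = metrics[f"precision_at_{k}"]
        recall = metrics[f"recall_at_{k}"]
        if not math.isclose(precision, recall / k, abs_tol=tol):
            return False
        if not math.isclose(metrics[f"map_at_{k}"], metrics[f"mrr_at_{k}"], abs_tol=tol):
            return False
    return True


print(consistent_with_single_positive(reported))  # True
```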
