# Hard Negatives

Hard negatives are negative samples that the model finds particularly difficult to distinguish from the positives. They typically lie close to the decision boundary or share many features with the positive samples. Hard negative mining is therefore widely used in machine learning to make the model focus on the subtle differences between similar instances, which leads to better discrimination.

In a text retrieval system, a hard negative is a document that shares some features with the query but does not actually satisfy the query's intent. During retrieval, such documents can rank higher than the real answers, so it is valuable to explicitly train the model on them.
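
To make the distinction concrete, here is a toy sketch with made-up 2-D vectors standing in for embeddings (the vectors and names are illustrative assumptions, not model output). The hard negative scores almost as high as the positive, while the easy negative is far away:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Made-up vectors, chosen only to illustrate the idea.
query = [1.0, 0.0]
docs = {
    "positive": [0.9, 0.1],
    "hard_negative": [0.8, 0.3],   # close to the query, but not the answer
    "easy_negative": [-0.2, 1.0],  # clearly unrelated
}

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Training only on easy negatives like the last one gives the model little signal, because it already separates them from the positive by a wide margin.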

## 1. Preparation

First, load an embedding model:

In [1]:
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5')


Then, load the queries and corpus from the dataset:

In [4]:
from datasets import load_dataset

corpus = load_dataset("BeIR/scifact", "corpus")["corpus"]
queries = load_dataset("BeIR/scifact", "queries")["queries"]

corpus_ids = corpus.select_columns(["_id"])["_id"]
corpus = corpus.select_columns(["text"])["text"]

We create a dictionary that maps the auto-generated ids (starting from 0) used by the FAISS index to the original corpus ids, for later use.

In [24]:
corpus_ids_map = {}
for i in range(len(corpus)):
    corpus_ids_map[i] = corpus_ids[i]
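
As a side note, the same position-to-id mapping can be built in one line with `enumerate`. The sketch below uses a made-up list of ids (`toy_corpus_ids` is a hypothetical stand-in for `corpus_ids`) to show that both forms give the same dictionary:

```python
# Toy stand-in for the dataset's "_id" column (made-up values).
toy_corpus_ids = ["4983", "5836", "7912"]

# Explicit loop, as in the cell above.
loop_map = {}
for i in range(len(toy_corpus_ids)):
    loop_map[i] = toy_corpus_ids[i]

# Equivalent one-liner: enumerate() yields (position, id) pairs.
enum_map = dict(enumerate(toy_corpus_ids))

print(loop_map == enum_map)
```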

## 2. Indexing

Use the embedding model to encode the corpus:

In [6]:
p_vecs = model.encode(corpus)

In [7]:
p_vecs.shape

(5183, 768)

Then create a FAISS index:

In [8]:
import torch, faiss
import numpy as np

# create a basic flat inner-product index whose dimension matches our embeddings
index = faiss.IndexFlatIP(len(p_vecs[0]))
# make sure the embeddings are float32
p_vecs = np.asarray(p_vecs, dtype=np.float32)
# use gpu to accelerate index searching
if torch.cuda.is_available():
    co = faiss.GpuMultipleClonerOptions()
    co.shard = True
    co.useFloat16 = True
    index = faiss.index_cpu_to_all_gpus(index, co=co)
# add all the embeddings to the index
index.add(p_vecs)
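
Conceptually, a flat inner-product index scores the query against every stored vector and returns the `k` highest-scoring positions. The pure-Python sketch below mimics that behavior on made-up vectors. `flat_ip_search` is a hypothetical helper written for illustration, not part of FAISS, and it ignores everything that makes FAISS fast:

```python
def flat_ip_search(query, vectors, k):
    """Brute-force inner-product search: return (scores, positions) of the top k."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    # Sort positions by descending score and keep the first k.
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order

# Made-up 2-D vectors, for illustration only.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]]
toy_scores, toy_ids = flat_ip_search([1.0, 0.0], toy_vectors, k=2)
print(toy_scores, toy_ids)
```

The returned positions are exactly the auto-generated ids that the mapping dictionary converts back to dataset ids.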

## 3. Searching

For better demonstration, let's use a single query:

In [9]:
query = queries[0]
query

{'_id': '0',
 'title': '',
 'text': '0-dimensional biomaterials lack inductive properties.'}

Get the id and content of that query, then use our embedding model to get its embedding vector.

In [20]:
q_id, q_text = query["_id"], query["text"]
# use encode_queries() to encode the query text
q_vec = model.encode_queries(queries=q_text)

Use the index to search for closest results:

In [31]:
_, ids = index.search(np.expand_dims(q_vec, axis=0), k=15)
# convert the auto-generated ids back to the ids in the original dataset
converted = [corpus_ids_map[i] for i in ids[0]]
converted

['4346436',
 '17388232',
 '14103509',
 '37437064',
 '29638116',
 '25435456',
 '32532238',
 '31715818',
 '23763738',
 '7583104',
 '21456232',
 '2121272',
 '35621259',
 '58050905',
 '196664003']

Next, load the relevance judgments (qrels) to find the labeled positive for this query:

In [32]:
qrels = load_dataset("BeIR/scifact-qrels")["train"]
pos_id = qrels[0]
pos_id

{'query-id': 0, 'corpus-id': 31715818, 'score': 1}
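
When mining negatives for many queries, it helps to group the qrels by query first. The following is a minimal sketch using toy rows in the same shape as the output above. Only the first row comes from the dataset; the second row and the name `positives` are assumptions made for illustration:

```python
from collections import defaultdict

# Toy qrels rows in the same shape as the output above.
toy_qrels = [
    {"query-id": 0, "corpus-id": 31715818, "score": 1},
    {"query-id": 1, "corpus-id": 12345, "score": 1},  # hypothetical row
]

# Map each query id to the set of corpus ids labeled as relevant.
positives = defaultdict(set)
for row in toy_qrels:
    if row["score"] > 0:
        positives[row["query-id"]].add(row["corpus-id"])

print(positives[0])
```

Note that the qrels store ids as integers while the queries and corpus store them as strings, so one side has to be cast before comparing.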

Lastly, we use the method of top-k shifted by N: skip the first 5 results and take the next 10 as candidates, dropping the labeled positive if it appears among them.

In [44]:
negatives = [doc_id for doc_id in converted[5:] if int(doc_id) != pos_id["corpus-id"]]
negatives

['25435456',
 '32532238',
 '23763738',
 '7583104',
 '21456232',
 '2121272',
 '35621259',
 '58050905',
 '196664003']

Now we have selected a group of hard negatives for the first query!
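
The selection step can be wrapped in a small reusable helper. This is a sketch under the assumption that a ranked list of string ids and the labeled positives are already available. `shifted_top_k` and its parameters are hypothetical names, and the ranked list is copied from the search output above:

```python
def shifted_top_k(ranked_ids, positive_ids, shift=5, k=10):
    """Take k candidates starting at rank `shift`, then drop labeled positives."""
    positives = {str(p) for p in positive_ids}
    window = ranked_ids[shift:shift + k]
    return [doc_id for doc_id in window if str(doc_id) not in positives]

# The 15 ids returned by the search above.
ranked = ['4346436', '17388232', '14103509', '37437064', '29638116',
          '25435456', '32532238', '31715818', '23763738', '7583104',
          '21456232', '2121272', '35621259', '58050905', '196664003']

mined = shifted_top_k(ranked, positive_ids=[31715818])
print(len(mined), mined[:3])
```

Because the positive sits inside the window for this query, the helper returns 9 negatives rather than 10.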

There are other ways to refine the selection of hard negatives. For example, the [implementation](https://github.com/FlagOpen/FlagEmbedding/blob/master/scripts/hn_mine.py) in our GitHub repo takes the top 200 shifted by 10, i.e. ranks 10 to 210, and then samples 15 negatives from those 200 candidates. The reason is that directly choosing the top K may introduce false negatives into the negative set: passages that are in fact relevant to the query but are not labeled as its answer. This can hurt the model's performance.
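
The range-then-sample idea can be sketched as follows. This is not the code from the linked script. It is a minimal illustration, and `sample_negatives` and its parameters are hypothetical names whose defaults follow the numbers given in the text:

```python
import random

def sample_negatives(ranked_ids, positive_ids, start=10, end=210, n=15, seed=None):
    """Randomly pick n negatives from ranks [start, end), excluding positives."""
    positives = {str(p) for p in positive_ids}
    candidates = [d for d in ranked_ids[start:end] if str(d) not in positives]
    if len(candidates) <= n:
        # Not enough candidates to sample from: return them all.
        return candidates
    return random.Random(seed).sample(candidates, n)

# Made-up ranking of 300 document ids, best match first.
toy_ranked = [str(i) for i in range(300)]
sampled = sample_negatives(toy_ranked, positive_ids=[42], seed=0)
print(len(sampled))
```

Skipping the very top ranks and sampling from a wider window trades some hardness for a lower chance of picking unlabeled positives.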