This guide contains instructions for running a BGE-base baseline for NFCorpus.
If you're a Waterloo student traversing the onboarding path (which starts here), make sure you've first done the previous step, a conceptual framework for retrieval. In general, don't try to rush through this guide by just blindly copying and pasting commands into a shell; that's what I call cargo culting. Instead, really try to understand what's going on.
If you've traversed the onboarding path, by now you've learned the basics of bag-of-words retrieval with BM25 using Lucene (via Anserini and Pyserini). Conceptually, you understand how it's a specific manifestation of a bi-encoder architecture where the vector representations are lexical and the weights are assigned in an unsupervised (or heuristic) manner.
In this guide, we're going to go through an example of retrieval using a learned, dense representation. These are often called "dense retrieval models" and informally referred to as "vector search". Coming back to the bi-encoder architecture from the previous step:
The document and query encoders are now transformer-based models that are trained on large amounts of supervised data. The outputs of the encoders are often called embedding vectors, or just embeddings for short.
For this guide, assume that we've already got trained encoders. How to actually train such models will be covered later.
Learning outcomes for this guide, building on previous steps in the onboarding path:
- Be able to use Pyserini to encode documents in NFCorpus with an existing dense retrieval model (BGE-base) and to build a Faiss index on the vector representations.
- Be able to use Pyserini to perform a batch retrieval run on queries from NFCorpus.
- Be able to evaluate the retrieved results above.
- Be able to generate the retrieved results above interactively by directly manipulating Pyserini Python classes.
In this lesson, we'll be working with NFCorpus, a full-text learning to rank dataset for medical information retrieval. The rationale is that the corpus is quite small — only 3633 documents — so the latency of CPU-based inference with neural models (i.e., the encoders) is tolerable, i.e., this lesson is doable on a laptop. It is not practical to work with the MS MARCO passage ranking corpus using CPUs.
Let's first start by fetching the data:
wget https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip -P collections
unzip collections/nfcorpus.zip -d collections
This just gives you an idea of what the corpus contains:
$ head -1 collections/nfcorpus/corpus.jsonl
{"_id": "MED-10", "title": "Statin Use and Breast Cancer Survival: A Nationwide Cohort Study from Finland", "text": "Recent studies have suggested that statins, an established drug group in the prevention of cardiovascular mortality, could delay or prevent breast cancer recurrence but the effect on disease-specific mortality remains unclear. We evaluated risk of breast cancer death among statin users in a population-based cohort of breast cancer patients. The study cohort included all newly diagnosed breast cancer patients in Finland during 1995\u20132003 (31,236 cases), identified from the Finnish Cancer Registry. Information on statin use before and after the diagnosis was obtained from a national prescription database. We used the Cox proportional hazards regression method to estimate mortality among statin users with statin use as time-dependent variable. A total of 4,151 participants had used statins. During the median follow-up of 3.25 years after the diagnosis (range 0.08\u20139.0 years) 6,011 participants died, of which 3,619 (60.2%) was due to breast cancer. After adjustment for age, tumor characteristics, and treatment selection, both post-diagnostic and pre-diagnostic statin use were associated with lowered risk of breast cancer death (HR 0.46, 95% CI 0.38\u20130.55 and HR 0.54, 95% CI 0.44\u20130.67, respectively). The risk decrease by post-diagnostic statin use was likely affected by healthy adherer bias; that is, the greater likelihood of dying cancer patients to discontinue statin use as the association was not clearly dose-dependent and observed already at low-dose/short-term use. The dose- and time-dependence of the survival benefit among pre-diagnostic statin users suggests a possible causal effect that should be evaluated further in a clinical trial testing statins\u2019 effect on survival in breast cancer patients.", "metadata": {"url": "http://www.ncbi.nlm.nih.gov/pubmed/25329299"}}
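If you want to check the document count quoted above for yourself, a short script along these lines can scan the JSONL file. This is only a sketch: the helper name summarize_corpus is made up for illustration, and it assumes every non-blank line is a JSON object shaped like the sample record shown above.

```python
import json
import os


def summarize_corpus(path):
    """Count documents and collect the top-level field names seen in a JSONL file."""
    num_docs = 0
    fields = set()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            doc = json.loads(line)
            num_docs += 1
            fields.update(doc.keys())
    return num_docs, sorted(fields)


if __name__ == '__main__':
    corpus_path = 'collections/nfcorpus/corpus.jsonl'
    if os.path.exists(corpus_path):
        print(summarize_corpus(corpus_path))
```

If the download and unzip steps worked, the count should match the corpus size mentioned earlier.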
We need to do a bit of data munging to get the queries into the right format (from json to tsv). Run the following Python script:
import json

# Write one "<query id><TAB><query text>" line per query.
with open('collections/nfcorpus/queries.tsv', 'w') as out:
    with open('collections/nfcorpus/queries.jsonl', 'r') as f:
        for line in f:
            l = json.loads(line)
            out.write(l['_id'] + '\t' + l['text'] + '\n')
Similarly, we need to munge the relevance judgments (qrels) into the right format. This command-line invocation does the trick:
tail -n +2 collections/nfcorpus/qrels/test.tsv | sed 's/\t/\tQ0\t/' > collections/nfcorpus/qrels/test.qrels
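To see what that one-liner does, here is a rough Python equivalent. It is a sketch under assumptions: the function name to_trec_qrels is invented for illustration, the sample rows are made-up values rather than real NFCorpus judgments, and the first line is assumed to be a header row (which is what tail -n +2 skips).

```python
def to_trec_qrels(lines):
    """Mimic `tail -n +2 | sed 's/\\t/\\tQ0\\t/'` on a list of lines."""
    converted = []
    for line in lines[1:]:  # tail -n +2: drop the first line
        line = line.rstrip('\n')
        # sed without the g flag replaces only the first tab on each line
        converted.append(line.replace('\t', '\tQ0\t', 1))
    return converted


sample = [
    'query-id\tcorpus-id\tscore\n',  # assumed header row
    'Q1\tD7\t2\n',                   # made-up judgment
    'Q1\tD9\t1\n',                   # made-up judgment
]
for row in to_trec_qrels(sample):
    print(row)
```

Each output row has four tab-separated columns: query id, the literal Q0, document id, and relevance grade.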
Okay, the data are ready now.
We can now "index" these documents using Pyserini:
python -m pyserini.encode \
input --corpus collections/nfcorpus/corpus.jsonl \
--fields title text \
output --embeddings indexes/nfcorpus.bge-base-en-v1.5 \
--to-faiss \
encoder --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--device cpu \
--pooling mean \
--fields title text \
--batch 32
We're using the BAAI/bge-base-en-v1.5 encoder, which can be found on HuggingFace.
Use --device cuda for faster computation if you have a CUDA-enabled GPU.
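A note on the --l2-norm option in the command above: it scales each embedding to unit length, and for unit-length vectors the inner product equals cosine similarity, so scores cannot exceed 1. The following is a minimal pure-Python sketch of that relationship. The two-dimensional vectors are toy values and the helper names are made up; real embeddings have hundreds of dimensions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [3.0, 4.0]  # toy "query embedding"
doc = [4.0, 3.0]    # toy "document embedding"

raw_score = dot(query, doc)
normalized_score = dot(l2_normalize(query), l2_normalize(doc))
print(raw_score, normalized_score, cosine(query, doc))
```

The raw inner product can be arbitrarily large, while the normalized score agrees with the cosine similarity. This is consistent with the BGE scores later in this guide staying below 1.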
Try it using the Contriever model!
python -m pyserini.encode \
input --corpus collections/nfcorpus/corpus.jsonl \
--fields title text \
output --embeddings indexes/faiss.nfcorpus.contriever-msmacro \
--to-faiss \
encoder --encoder facebook/contriever-msmarco \
--device cpu \
--pooling mean \
--fields title text \
--batch 32
We're using the facebook/contriever-msmarco encoder, which can be found on HuggingFace.
Use --device cuda for faster computation if you have a CUDA-enabled GPU.
Pyserini wraps Faiss, which is a library for efficient similarity search on dense vectors.
That is, once all the documents have been encoded (i.e., converted into representation vectors), they are passed to Faiss to manage (i.e., for storage and for search later on).
"Index" here is in quotes because, in reality, we're using something called a "flat" index (FlatIP to be exact), which just stores the vectors in fixed-width bytes, one after the other.
At search time, each document vector is sequentially compared to the query vector.
In other words, the library just performs brute force dot products of each query vector against all document vectors.
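The brute-force behavior described above can be sketched in a few lines of pure Python. This is an illustration of the idea, not how Faiss is implemented (Faiss does the same comparison in optimized native code); the function name flat_ip_search and the three-dimensional toy vectors are made up.

```python
import heapq


def flat_ip_search(query_vec, doc_vecs, k):
    """Score every document by inner product and return the k best (docid, score) pairs."""
    scored = (
        (docid, sum(q * d for q, d in zip(query_vec, vec)))
        for docid, vec in doc_vecs.items()  # one pass over *all* documents
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


doc_vecs = {  # toy "embeddings", made up for illustration
    'D1': [0.1, 0.9, 0.0],
    'D2': [0.8, 0.1, 0.1],
    'D3': [0.4, 0.4, 0.2],
}
top = flat_ip_search([1.0, 0.0, 0.0], doc_vecs, k=2)
print(top)
```

Because every document is scored, the cost of a query grows linearly with the size of the collection, which is tolerable for a corpus this small.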
The above indexing command takes around 30 minutes to run on a modern laptop, with most of the time occupied by performing neural inference using the CPU.
Adjust the batch parameter above accordingly for your hardware; 32 is the default, but reduce the value if you find that the encoding is taking too long.
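To get a feel for what the batch size means, here is a small sketch of splitting a collection into fixed-size groups. It only illustrates the arithmetic; the helper name batches and the placeholder document ids are made up, and how Pyserini schedules work internally is not shown here.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder ids; 3633 matches the NFCorpus document count given earlier.
docids = [f'doc{i}' for i in range(3633)]
sizes = [len(batch) for batch in batches(docids, 32)]
print(len(sizes), sizes[-1])  # prints "114 17"
```

With a batch size of 32, the corpus is processed in 114 groups, the last of which holds the remaining 17 documents.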
We can now perform retrieval in Pyserini using the following command:
python -m pyserini.search.faiss \
--encoder-class auto --encoder BAAI/bge-base-en-v1.5 --l2-norm \
--pooling mean \
--index indexes/nfcorpus.bge-base-en-v1.5 \
--topics collections/nfcorpus/queries.tsv \
--output runs/run.beir.bge-base-en-v1.5.nfcorpus.txt \
--batch 128 --threads 8 \
--hits 1000
(Adjust the batch and threads parameters above accordingly for your hardware; e.g., lower the settings on a laptop.)
The queries are in collections/nfcorpus/queries.tsv.
If you indexed with Contriever above, try retrieval with it too:
python -m pyserini.search.faiss \
--encoder-class contriever --encoder facebook/contriever-msmarco \
--index indexes/faiss.nfcorpus.contriever-msmacro \
--topics collections/nfcorpus/queries.tsv \
--output runs/run.beir-contriever-msmarco.nfcorpus.txt \
--batch 128 --threads 8 \
--hits 1000
(Adjust the batch and threads parameters above accordingly for your hardware; e.g., lower the settings on a laptop.)
As mentioned above, Pyserini wraps the Faiss library. With the flat index here, we're performing brute-force computation of dot products (albeit in parallel and with batching). As a result, we are performing exact search, i.e., we are finding the exact top-k documents that have the highest dot products.
The above retrieval command takes only a few minutes on a modern laptop.
Adjust the threads and batch parameters above accordingly for your hardware.
After the run finishes, we can evaluate the results using trec_eval:
python -m pyserini.eval.trec_eval \
-c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
runs/run.beir.bge-base-en-v1.5.nfcorpus.txt
The results will be something like:
Results:
ndcg_cut_10 all 0.3808
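The ndcg_cut.10 measure requested above is nDCG at rank 10, averaged over queries. As a rough illustration of the per-query computation, here is a minimal sketch using the common formulation with linear gain and a log2 rank discount. The function names and the toy judgments are made up, and trec_eval's own implementation handles details (such as tie-breaking) that this sketch ignores.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: the gain at rank r is divided by log2(r + 1)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_docids, judgments, k=10):
    """ranked_docids: docids in rank order; judgments: dict of docid -> graded relevance."""
    gains = [judgments.get(docid, 0) for docid in ranked_docids[:k]]
    ideal_gains = sorted(judgments.values(), reverse=True)[:k]
    ideal = dcg(ideal_gains)
    return dcg(gains) / ideal if ideal > 0 else 0.0


judgments = {'D1': 2, 'D2': 1, 'D3': 0}  # toy relevance grades
score = ndcg_at_k(['D2', 'D9', 'D1'], judgments)
print(round(score, 4))
```

A ranking that lists the judged documents in order of decreasing relevance scores 1.0; unjudged documents such as D9 in the toy example count as non-relevant.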
And if you've been following along with Contriever:
python -m pyserini.eval.trec_eval \
-c -m ndcg_cut.10 collections/nfcorpus/qrels/test.qrels \
runs/run.beir-contriever-msmarco.nfcorpus.txt
The results will be something like:
Results:
ndcg_cut_10 all 0.3306
If you've gotten here, congratulations! You've completed your first indexing and retrieval run using a dense retrieval model.
The final step, as with Lucene, is to learn to use the dense retriever interactively. This contrasts with the batch run above.
Here's the snippet of Python code that does what we want:
from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder
encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True)
searcher = FaissSearcher('indexes/nfcorpus.bge-base-en-v1.5', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')
The FaissSearcher provides search capabilities using Faiss as its underlying implementation.
The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.
1 MED-4555 0.791379
2 MED-4560 0.710725
3 MED-4421 0.688938
4 MED-4993 0.686238
5 MED-4424 0.686214
6 MED-1663 0.682199
7 MED-3436 0.680585
8 MED-2750 0.677033
9 MED-4324 0.675772
10 MED-2939 0.674646
You'll see that the ranked list is the same as the batch run you performed above:
$ grep PLAIN-3074 runs/run.beir.bge-base-en-v1.5.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 0.791378 Faiss
PLAIN-3074 Q0 MED-4560 2 0.710725 Faiss
PLAIN-3074 Q0 MED-4421 3 0.688938 Faiss
PLAIN-3074 Q0 MED-4993 4 0.686238 Faiss
PLAIN-3074 Q0 MED-4424 5 0.686214 Faiss
PLAIN-3074 Q0 MED-1663 6 0.682199 Faiss
PLAIN-3074 Q0 MED-3436 7 0.680585 Faiss
PLAIN-3074 Q0 MED-2750 8 0.677033 Faiss
PLAIN-3074 Q0 MED-4324 9 0.675772 Faiss
PLAIN-3074 Q0 MED-2939 10 0.674647 Faiss
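Each line of the run file has six whitespace-separated columns: query id, the literal Q0, document id, rank, score, and a run tag. Notice that a few scores differ from the interactive output in the sixth decimal place, so a comparison is best done with a small tolerance. Here is a sketch of such a check; the names RunEntry, parse_run_line, and same_ranking are made up for illustration, and the tolerance is an arbitrary choice.

```python
from collections import namedtuple

RunEntry = namedtuple('RunEntry', 'qid docid rank score tag')


def parse_run_line(line):
    """Split one run-file line into typed fields."""
    qid, _q0, docid, rank, score, tag = line.split()
    return RunEntry(qid, docid, int(rank), float(score), tag)


def same_ranking(interactive, batch_entries, tol=1e-5):
    """True if both lists contain the same docids in the same order with near-equal scores."""
    if len(interactive) != len(batch_entries):
        return False
    return all(
        docid == entry.docid and abs(score - entry.score) <= tol
        for (docid, score), entry in zip(interactive, batch_entries)
    )


# Values copied from the outputs shown in this guide.
batch_entries = [
    parse_run_line('PLAIN-3074 Q0 MED-4555 1 0.791378 Faiss'),
    parse_run_line('PLAIN-3074 Q0 MED-4560 2 0.710725 Faiss'),
]
interactive = [('MED-4555', 0.791379), ('MED-4560', 0.710725)]
print(same_ranking(interactive, batch_entries))  # prints "True"
```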
Again with Contriever!
Here's the snippet of Python code that does what we want:
from pyserini.search.faiss import FaissSearcher
from pyserini.encode import AutoQueryEncoder
encoder = AutoQueryEncoder('facebook/contriever-msmarco', device='cpu', pooling='mean')
searcher = FaissSearcher('indexes/faiss.nfcorpus.contriever-msmacro', encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.6f}')
The FaissSearcher provides search capabilities using Faiss as its underlying implementation.
The AutoQueryEncoder allows us to initialize an encoder using a HuggingFace model.
1 MED-4555 1.472201
2 MED-3180 1.125014
3 MED-1309 1.067153
4 MED-2224 1.059536
5 MED-4423 1.038440
6 MED-4887 1.032622
7 MED-2530 1.020758
8 MED-2372 1.016142
9 MED-1006 1.013599
10 MED-2587 1.010811
You'll see that the ranked list is the same as the batch run you performed above:
$ grep PLAIN-3074 runs/run.beir-contriever-msmarco.nfcorpus.txt | head -10
PLAIN-3074 Q0 MED-4555 1 1.472201 Faiss
PLAIN-3074 Q0 MED-3180 2 1.125014 Faiss
PLAIN-3074 Q0 MED-1309 3 1.067153 Faiss
PLAIN-3074 Q0 MED-2224 4 1.059537 Faiss
PLAIN-3074 Q0 MED-4423 5 1.038440 Faiss
PLAIN-3074 Q0 MED-4887 6 1.032622 Faiss
PLAIN-3074 Q0 MED-2530 7 1.020758 Faiss
PLAIN-3074 Q0 MED-2372 8 1.016142 Faiss
PLAIN-3074 Q0 MED-1006 9 1.013599 Faiss
PLAIN-3074 Q0 MED-2587 10 1.010811 Faiss
And that's it!
The next lesson will provide a deeper dive into dense and sparse representations.
Before you move on, however, add an entry in the "Reproduction Log" at the bottom of this page, following the same format: use yyyy-mm-dd, make sure you're using a commit id that's on the main trunk of Pyserini, and use its 7-hexadecimal prefix for the link anchor text.
Reproduction Log
- Results reproduced by @sahel-sh on 2023-08-04 (commit 19da81c)
- Results reproduced by @Mofetoluwa on 2023-08-05 (commit 6a2088b)
- Results reproduced by @Andrwyl on 2023-08-26 (commit d9da49e)
- Results reproduced by @yilinjz on 2023-08-30 (commit 42b3549)
- Results reproduced by @UShivani3 on 2023-09-01 (commit 42b3549)
- Results reproduced by @Edward-J-Xu on 2023-09-05 (commit 8063322)
- Results reproduced by @mchlp on 2023-09-07 (commit d8dc5b3)
- Results reproduced by @lucedes27 on 2023-09-10 (commit 54014af)
- Results reproduced by @MojTabaa4 on 2023-09-14 (commit d4a829d)
- Results reproduced by @Kshama on 2023-09-24 (commit 7d18f4b)
- Results reproduced by @MelvinMo on 2023-09-24 (commit 7d18f4b)
- Results reproduced by @ksunisth on 2023-09-27 (commit 142c774)
- Results reproduced by @maizerrr on 2023-10-01 (commit bdb9504)
- Results reproduced by @Stefan824 on 2023-10-04 (commit 4f3da10)
- Results reproduced by @shayanbali on 2023-10-13 (commit f1d623c)
- Results reproduced by @gituserbs on 2023-10-19 (commit f1d623c)
- Results reproduced by @shakibaam on 2023-11-04 (commit 01889cc)
- Results reproduced by @gitHubAndyLee2020 on 2023-11-05 (commit 01889cc)
- Results reproduced by @Melissa1412 on 2023-11-05 (commit acd969f)
- Results reproduced by @oscarbelda86 on 2023-11-13 (commit 086e16b)
- Results reproduced by @salinaria on 2023-11-14 (commit 086e16b)
- Results reproduced by @aliranjbari on 2023-11-15 (commit b02ac99)
- Results reproduced by @Seun-Ajayi on 2023-11-16 (commit 5d63bc5)
- Results reproduced by @AndreSlavescu on 2023-11-28 (commit 1219cdb)
- Results reproduced by @tudou0002 on 2023-11-28 (commit 723e06c)
- Results reproduced by @alimt1992 on 2023-11-29 (commit e6700f6)
- Results reproduced by @golnooshasefi on 2023-11-29 (commit 1219cdb)
- Results reproduced by @sueszli on 2023-12-01 (commit 170e271)
- Results reproduced by @kdricci on 2023-12-01 (commit a2049c4)
- Results reproduced by @ljk423 on 2023-12-04 (commit 35002ad)
- Results reproduced by @saharsamr on 2023-12-14 (commit 039c137)
- Results reproduced by @Panizghi on 2023-12-17 (commit 0f5db95)
- Results reproduced by @AreelKhan on 2023-12-22 (commit f75adca)
- Results reproduced by @wu-ming233 on 2023-12-31 (commit 38a571f)
- Results reproduced by @Yuan-Hou on 2024-01-02 (commit 38a571f)
- Results reproduced by @himasheth on 2024-01-10 (commit a6ed27e)
- Results reproduced by @Tanngent on 2024-01-13 (commit 57a00cf)
- Results reproduced by @BeginningGradeMaker on 2024-01-15 (commit d4ea011)
- Results reproduced by @ia03 on 2024-01-18 (commit 05ee8ef)
- Results reproduced by @AlexStan0 on 2024-01-20 (commit 833ee19)
- Results reproduced by @charlie-liuu on 2024-01-23 (commit 87a120e)
- Results reproduced by @dannychn11 on 2024-01-28 (commit 2f7702f)
- Results reproduced by @ru5h16h on 2024-02-20 (commit 758eaaa)
- Results reproduced by @ASChampOmega on 2024-02-23 (commit 442e7e1)
- Results reproduced by @16BitNarwhal on 2024-02-26 (commit 19fcd3b)
- Results reproduced by @HaeriAmin on 2024-02-27 (commit 19fcd3b)
- Results reproduced by @17Melissa on 2024-03-03 (commit a9f295f)
- Results reproduced by @devesh-002 on 2024-03-05 (commit 84c6742)
- Results reproduced by @chloeqxq on 2024-03-07 (commit 19fcd3b)
- Results reproduced by @xpbowler on 2024-03-11 (commit 19fcd3b)
- Results reproduced by @jodyz0203 on 2024-03-12 (commit 280e009)
- Results reproduced by @kxwtan on 2024-03-12 (commit 2bb342a)
- Results reproduced by @syedhuq28 on 2024-03-28 (commit 2bb342a)
- Results reproduced by @khufia on 2024-03-29 (commit 2bb342a)
- Results reproduced by @Lindaaa8 on 2024-03-29 (commit 7dda9f3)
- Results reproduced by @th13nd4n0 on 2024-04-05 (commit df3bc6c)
- Results reproduced by @a68lin on 2024-04-12 (commit 7dda9f3)
- Results reproduced by @DanielKohn1208 on 2024-04-22 (commit 184a212)
- Results reproduced by @emadahmed19 on 2024-04-28 (commit 9db2584)
- Results reproduced by @CheranMahalingam on 2024-05-05 (commit f817186)
- Results reproduced by @billycz8 on 2024-05-08 (commit c945c50)
- Results reproduced by @KenWuqianhao on 2024-05-11 (commit c945c50)
- Results reproduced by @hrouzegar on 2024-05-13 (commit bf68fc5)
- Results reproduced by @Yuv-sue1005 on 2024-05-15 (commit 9df4015)
- Results reproduced by @RohanNankani on 2024-05-17 (commit a91ef1d)
- Results reproduced by @IR3KT4FUNZ on 2024-05-25 (commit a6f4d6)
- Results reproduced by @bilet-13 on 2024-06-01 (commit b0c53f3)
- Results reproduced by @SeanSong25 on 2024-06-05 (commit b7e1da3)
- Results reproduced by @alireza-taban on 2024-06-11 (commit d814290)
- Results reproduced by @hosnahoseini on 2024-06-18 (commit 49d8c43)
- Results reproduced by @FaizanFaisal25 on 2024-07-07 (commit 3b9d541)
- Results reproduced by @Feng-12138 on 2024-07-11 (commit 3b9d541)
- Results reproduced by @XKTZ on 2024-07-13 (commit 544046e)
- Results reproduced by @MehrnazSadeghieh on 2024-07-19 (commit 26a2538)
- Results reproduced by @alireza-nasirian on 2024-07-19 (commit 544046e)
- Results reproduced by @MariaPonomarenko38 on 2024-07-19 (commit d4509dc)
- Results reproduced by @valamuri2020 on 2024-08-02 (commit 3f81997)
- Results reproduced by @daisyyedda on 2024-08-06 (commit d814290)
- Results reproduced by @emily-emily on 2024-08-16 (commit 1bbf7a7)
- Results reproduced by @nicoella on 2024-08-19 (commit e65dd95)
- Results reproduced by @natek-1 on 2024-08-19 (commit e65dd95)
- Results reproduced by @setarehbabajani on 2024-09-01 (commit 0dd5fa7)
- Results reproduced by @anshulsc on 2024-09-07 (commit 2e4fa5d)
- Results reproduced by @r-aya on 2024-09-08 (commit 2e4fa5d)
- Results reproduced by @Amirkia1998 on 2024-09-20 (commit 83537a3)
- Results reproduced by @pjyi2147 on 2024-09-20 (commit f511655)
- Results reproduced by @krishh-p on 2024-09-21 (commit f511655)
- Results reproduced by @andrewxucs on 2024-09-22 (commit dd57b7d)
- Results reproduced by @Hossein-Molaeian on 2024-09-22 (commit bc13901)
- Results reproduced by @AhmedEssam19 on 2024-09-30 (commit 07f04d4)
- Results reproduced by @sisixili on 2024-10-01 (commit 07f04d4)
- Results reproduced by @alirezaJvh on 2024-10-05 (commit 3f76099)
- Results reproduced by @Raghav0005 on 2024-10-09 (commit 7ed8369)
- Results reproduced by @Pxlin-09 on 2024-10-26 (commit af2d3c5)
- Results reproduced by @Samantha-Zhan on 2024-11-17 (commit a95b0e0)
- Results reproduced by @Divyajyoti02 on 2024-11-24 (commit f6f8ecc)
- Results reproduced by @b8zhong on 2024-11-24 (commit 778968f)
- Results reproduced by @vincent-4 on 2024-11-24 (commit 576fdaf)
- Results reproduced by @ShreyasP20 on 2024-11-28 (commit 576fdaf)
- Results reproduced by @nihalmenon on 2024-11-30 (commit 94492de)
- Results reproduced by @zdann15 on 2024-12-04 (commit 5e66e98)
- Results reproduced by @sherloc512 on 2024-12-05 (commit 5e66e98)
- Results reproduced by @Alireza-Zwolf on 2024-12-18 (commit 6cc23d5)
- Results reproduced by @Linsen-gao-457 on 2024-12-20 (commit 10606f0)
- Results reproduced by @robro612 on 2025-01-05 (commit 9268591)
- Results reproduced by @nourj98 on 2025-01-07 (commit 6ac07cc)
- Results reproduced by @mithildamani256 on 2025-01-13 (commit ad41512)
- Results reproduced by @ezafar on 2025-01-15 (commit e1a3386)
- Results reproduced by @ErfanSadraiye on 2025-01-16 (commit cb14c93)