PyGaggle: Neural Ranking Baselines on MS MARCO Passage Retrieval - Dev Subset
This page contains instructions for running various neural reranking baselines on the MS MARCO passage ranking task. Note that there is also a separate MS MARCO document ranking task.
Prior to running this, we suggest looking at our first-stage BM25 ranking instructions. We rerank the BM25 run files, which contain ~1000 passages per query, using both monoBERT and monoT5. Both are pointwise rerankers: each passage is scored independently of the others, using BERT or T5, respectively.
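To make "pointwise" concrete, here is a minimal sketch in Python. It is not PyGaggle code: overlap_score is a hypothetical toy scorer standing in for a monoBERT or monoT5 forward pass, and rerank_pointwise is an illustrative name. The point is only that every (query, passage) pair is scored on its own, and the ranking is obtained by sorting those scores.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Toy scorer (an assumption): fraction of query terms found in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    return len(query_terms & passage_terms) / max(len(query_terms), 1)


def rerank_pointwise(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score each (query, passage) pair independently, then sort by score, descending."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


candidates = [
    "the capital of france is paris",
    "bananas are rich in potassium",
    "paris is a city in france",
]
ranking = rerank_pointwise("capital of france", candidates, overlap_score)
for passage, score in ranking:
    print(f"{score:.2f}  {passage}")
```

Because no passage's score depends on any other passage, the candidates can be scored in any order and in batches of any size without changing the final ranking.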
Since it can take many hours to run these models on all 6980 queries in the MS MARCO dev set, we instead use a subset of 105 queries randomly sampled from the dev set. Running these instructions on the entire MS MARCO dev set should give about the same results as those reported in the corresponding papers.
Note 1: Run the following instructions at the root of this repo.
Note 2: Make sure that you have access to a GPU.
Note 3: PyGaggle must be installed from source, with the anserini-eval submodule pulled. To do this, first clone the repository recursively:
git clone --recursive https://github.com/castorini/pygaggle.git
Then install PyGaggle using:
pip install pygaggle/
- monoBERT-Large: Passage Re-ranking with BERT (Nogueira et al., 2019)
- monoT5-base: Document Ranking with a Pretrained Sequence-to-Sequence Model (Nogueira et al., 2020)
We're first going to download the queries, qrels and run files corresponding to the MS MARCO set considered. The run file is generated by following the BM25 ranking instructions. We'll store all these files in the data directory.
wget https://www.dropbox.com/s/5xa5vjbjle0c8jv/msmarco_ans_small.zip -P data
To confirm, msmarco_ans_small.zip should have an MD5 checksum of 65d8007bfb2c72b5fc384738e5572f74.
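If no md5sum tool is at hand, the checksum can also be verified from Python. The following is a minimal sketch using only the standard library; md5_of_file is a helper name chosen for this example, and the path assumes the archive was downloaded into data as above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


EXPECTED_MD5 = "65d8007bfb2c72b5fc384738e5572f74"  # value stated in this guide
archive = Path("data/msmarco_ans_small.zip")  # assumed download location

if archive.exists():
    actual = md5_of_file(archive)
    print("checksum OK" if actual == EXPECTED_MD5 else f"checksum mismatch: {actual}")
```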
Next, we extract the contents into data.
unzip data/msmarco_ans_small.zip -d data
As a sanity check, we can evaluate the first-stage retrieved documents using the official MS MARCO evaluation script.
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_small/qrels.dev.small.tsv data/msmarco_ans_small/run.dev.small.tsv
The output should be:
#####################
MRR @10: 0.15906651549508694
QueriesRanked: 105
#####################
Let's download and extract the pre-built MS MARCO index into indexes:
wget https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-msmarco-passage-20191117-0ed488.tar.gz -P indexes
tar xvfz indexes/index-msmarco-passage-20191117-0ed488.tar.gz -C indexes
Now, we can begin with re-ranking the set.
First, let's evaluate using monoBERT!
python -um pygaggle.run.evaluate_passage_ranker --split dev \
--method seq_class_transformer \
--model castorini/monobert-large-msmarco \
--dataset data/msmarco_ans_small/ \
--index-dir indexes/index-msmarco-passage-20191117-0ed488 \
--task msmarco \
--output-file runs/run.monobert.ans_small.dev.tsv
Upon completion, the following output will be visible:
precision@1 0.2761904761904762
recall@3 0.42698412698412697
recall@50 0.8174603174603176
recall@1000 0.8476190476190476
mrr 0.41089693612003686
mrr@10 0.4026795162509449
It takes about 52 minutes to re-rank this subset of MS MARCO using a P100.
The type of GPU will directly influence your inference time.
The default batch size (96) may cause a GPU out-of-memory (OOM) error.
In this case, passing a smaller batch size with the --batch-size option should help!
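As a rough illustration of the trade-off, a smaller batch size means more, smaller forward passes over the same candidates. The sketch below is an assumption-laden toy, not PyGaggle's internal batching: batched is a hypothetical helper, and the 1000 candidates stand in for the passages retrieved for a single query.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


passage_ids = list(range(1000))  # stand-in for one query's candidate passages
for size in (96, 32):
    batches = list(batched(passage_ids, size))
    print(f"batch size {size}: {len(batches)} batches, last batch holds {len(batches[-1])}")
```

With 1000 candidates, a batch size of 96 gives 11 batches and a batch size of 32 gives 32 batches. Peak GPU memory scales with the size of a single batch, so the smaller setting trades some throughput for a lower memory footprint.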
The re-ranked run file run.monobert.ans_small.dev.tsv will also be available in the runs directory upon completion.
We can use the official MS MARCO evaluation script to verify the MRR@10:
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_small/qrels.dev.small.tsv runs/run.monobert.ans_small.dev.tsv
You should see the same result. Great, let's move on to monoT5!
We use the monoT5-base variant as it is the easiest to run without access to larger GPUs/TPUs. Let us now re-rank the set:
python -um pygaggle.run.evaluate_passage_ranker --split dev \
--method t5 \
--model castorini/monot5-base-msmarco \
--dataset data/msmarco_ans_small \
--model-type t5-base \
--task msmarco \
--index-dir indexes/index-msmarco-passage-20191117-0ed488 \
--batch-size 32 \
--output-file runs/run.monot5.ans_small.dev.tsv
The following output will be visible after it has finished:
precision@1 0.26666666666666666
recall@3 0.4603174603174603
recall@50 0.8063492063492063
recall@1000 0.8476190476190476
mrr 0.3973368360121561
mrr@10 0.39044217687074834
It takes about 13 minutes to re-rank this subset of MS MARCO using a P100. Again, you might need to adjust the batch size to fit the GPU at hand.
Upon completion, the re-ranked run file run.monot5.ans_small.dev.tsv will be available in the runs directory.
We can use the official MS MARCO evaluation script to verify the MRR@10:
python tools/scripts/msmarco/msmarco_passage_eval.py data/msmarco_ans_small/qrels.dev.small.tsv runs/run.monot5.ans_small.dev.tsv
You should see the same result.
If you were able to replicate these results, please submit a PR adding to the replication log!
- Results replicated by @MXueguang on 2020-05-22 (commit 69de7db) (Tesla P4)
- Results replicated by @richard3983 on 2020-05-22 (commit 6e9dfc6) (Tesla P100)
- Results replicated by @HangCui0510 on 2020-05-29 (commit 591e7ff) (Tesla P100)
- Results replicated by @kelvin-jiang on 2020-05-31 (commit 82dc086) (GeForce RTX 2080 Ti)
- Results replicated by @justinborromeo on 2020-07-02 (commit 70b2a9f) (GeForce GTX 960M)
- Results replicated by @mrkarezina on 2020-07-19 (commit c1a54cb) (Tesla T4)
- Results replicated by @qguo96 on 2020-09-08 (commit 94befbd) (Tesla T4 on Colab)
- Results replicated by @yuxuan-ji on 2020-09-08 (commit 94befbd) (Tesla T4 on Colab)
- Results replicated by @LizzyZhang-tutu on 2020-09-09 (commit 8eeefa5) (Tesla T4 on Colab)
- Results replicated by @wiltan-uw on 2020-09-13 (commit 41513a9) (RTX 2070S)
- Results replicated by @jhuang265 on 2020-10-18 (commit e815051) (Tesla P100 on Colab)
- Results replicated by @stephaniewhoo on 2020-10-25 (commit e815051) (Tesla V100 on Compute Canada)
- Results replicated by @rayyang29 on 2020-11-05 (commit 19b16d2) (Tesla T4)
- Results replicated by @estella98 on 2020-11-10 (commit 5e1e0dd) (Tesla T4 on Colab)
- Results replicated by @rakeeb123 on 2020-12-10 (commit 9a1fe70) (GeForce 940MX and Tesla V100 on Compute Canada)
- Results replicated by @Dahlia-Chehata on 2021-01-01 (commit 968363e) (Tesla P100 on Compute Canada)
- Results replicated by @KaiSun314 on 2021-01-08 (commit c7fdc4f) (Nvidia GeForce GTX 1060)
- Results replicated by @wongalvis14 on 2021-02-22 (commit 7c0ebbe) (GeForce RTX 2080 Ti on Hydra)
- Results replicated by @saileshnankani on 2021-05-05 (commit 95b3da7) (Tesla T4 on Colab)
- Results replicated by @andrewyguo on 2021-05-05 (commit 6f0381e) (Tesla T4 on Colab)
- Results replicated by @larryli1999 on 2021-05-05 (commit 53b77f4) (Tesla T4 on Colab)
- Results replicated by @mzzchy on 2021-08-29 (commit 6b9c895) (GeForce GTX 1660 Ti)
- Results replicated by @AlexWang000 on 2021-10-22 (commit 63f92cf) (Tesla T4 on Colab)
- Results replicated by @manveertamber on 2021-12-08 (commit b3e11c4) (GeForce GTX 1660)
- Results replicated by @lingwei-gu on 2022-01-05 (commit d671f62) (Tesla T4 on Colab)
- Results replicated by @jx3yang on 2022-05-10 (commit a326d49) (Tesla T4 on Colab)
- Results replicated by @alvind1 on 2022-05-12 (commit 9d859a1) (Tesla T4 on Colab)
- Results replicated by @aivan6842 on 2022-08-09 (commit f54ae53) (GeForce RTX 3070)