Neural Pairwise Ranking Baselines on MS MARCO Passage Retrieval - with TPU
This page contains instructions for running duoT5 on the MS MARCO passage ranking task.
We will focus on using duoT5-3B to rerank, since it is difficult to run such a large model without a TPU. We also mention the changes required to run duoT5-base for those with a more constrained compute budget.
- duoT5: The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models (Pradeep et al., 2021)
Note that there are also separate documents for running MS MARCO ranking tasks on a regular GPU. Please see MS MARCO document ranking task, MS MARCO passage ranking task - Subset, and MS MARCO passage ranking task - Entire.
Prior to running this, we suggest looking at our second-stage pointwise ranking instructions. We use duoT5 to rerank the monoT5 run files, which contain ~1000 passages per query; we focus on the top 50 passages. duoT5 is a pairwise reranker: it estimates the probability that one document is more relevant than another. These pairwise scores are then aggregated into a single score for each document.
Since we will use some scripts from PyGaggle to process data and evaluate results, we first install it from source.
git clone --recursive
cd pygaggle
pip install .
We store all the files in the data/msmarco_passage directory:
export DATA_DIR=data/msmarco_passage
mkdir -p ${DATA_DIR}
We provide specific data prep instructions for the train and dev set.
First, download the MS MARCO train triples:
cd ${DATA_DIR}
tar -xvf triples.train.small.tar.gz
rm triples.train.small.tar.gz
cd ../../
Then convert the train triples file to the duoT5 input format:
python pygaggle/data/ --triples_train ${DATA_DIR}/triples.train.small.tsv --output_to_t5 ${DATA_DIR}/query_docs_triples.train.tsv
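As a rough illustration of what this conversion produces, the sketch below turns one (query, positive passage, negative passage) triple into two text-to-text training examples. The prompt template follows the one described in the duoT5 paper; the function name `triple_to_examples` and the tab-separated layout are assumptions made for illustration and may differ from what the PyGaggle script actually writes.

```python
def triple_to_examples(query, positive, negative):
    """Build two pairwise examples from one training triple (illustrative only)."""
    template = "Query: {q} Document0: {d0} Document1: {d1} Relevant:"
    return [
        # Document0 is the relevant passage, so the target is "true".
        (template.format(q=query, d0=positive, d1=negative), "true"),
        # Swapping the two passages flips the target to "false".
        (template.format(q=query, d0=negative, d1=positive), "false"),
    ]


examples = triple_to_examples(
    "what is a tpu", "A TPU is an accelerator.", "Tulips are flowers."
)
for text, target in examples:
    print(f"{text}\t{target}")
```

Emitting both orderings of each triple gives the model an equal number of "true" and "false" targets.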
Next, copy the duoT5 input file to Google Storage; TPU training reads data directly from there. Set GS_FOLDER to your Google Storage folder before running this command.
gsutil cp ${DATA_DIR}/query_docs_triples.train.tsv ${GS_FOLDER}/
This file is made available in our bucket.
We download the query, qrels, and corpus files corresponding to the MS MARCO passage dev set.
The run file is generated by following PyGaggle's monoT5 TPU instructions.
In short, the files are:
: 6,980 queries from the MS MARCO dev set.
: 7,437 pairs of query and relevant passage ids from the MS MARCO dev set.
collection.tar.gz: All passages (8,841,823) in the MS MARCO passage corpus. In this tsv file, the first column is the passage id, and the second is the passage text.
A more detailed description of the data is available here.
Let's start.
cd ${DATA_DIR}
tar -xvf collection.tar.gz
rm collection.tar.gz
cd ../../
As a sanity check, we can evaluate the second-stage retrieved documents using the official MS MARCO evaluation script. We use the monoT5-base run file for reranking with duoT5-base and the monoT5-3B run file for reranking with duoT5-3B.
export MODEL_NAME=<base or 3B>
python tools/scripts/msmarco/ ${DATA_DIR}/ ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv
In the case of monoT5-3B, the output should be:
MRR @10: 0.3983799517896949
QueriesRanked: 6980
In the case of monoT5-base, the output should be:
MRR @10: 0.38160657433938283
QueriesRanked: 6980
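For readers unfamiliar with the metric, MRR@10 averages, over all judged queries, the reciprocal rank of the first relevant passage within the top 10 (or 0 if none appears). The following is a simplified sketch, not the official evaluation script; the function name `mrr_at_10` and the in-memory dictionaries are assumptions for illustration.

```python
def mrr_at_10(qrels, run):
    """Mean reciprocal rank at cutoff 10.

    qrels: {query_id: set of relevant doc ids}
    run:   {query_id: list of doc ids in ranked order}
    """
    total = 0.0
    for query_id, relevant in qrels.items():
        ranked = run.get(query_id, [])[:10]
        for rank, doc_id in enumerate(ranked, start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(qrels)


qrels = {"q1": {"d3"}, "q2": {"d9"}}
run = {"q1": ["d1", "d3", "d2"], "q2": ["d4", "d5"]}
print(mrr_at_10(qrels, run))  # q1 contributes 1/2, q2 contributes 0
```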
Then, we prepare the query-doc0-doc1 pairs in the duoT5 input format.
python pygaggle/data/ --queries ${DATA_DIR}/ \
--run ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv \
--corpus ${DATA_DIR}/collection.tsv \
--t5_input ${DATA_DIR}/ \
--t5_input_ids ${DATA_DIR}/ \
--top_k 50
We will get two output files here:
: The query-doc0-doc1 triples for duoT5
: The query_ids, doc_id_0s, and doc_id_1s that map to the query-doc0-doc1 triples. We will use this to map query-doc0-doc1 triples to their corresponding duoT5 output scores.
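To make the role of the ids file concrete, here is a minimal sketch of how ordered candidate pairs could be enumerated for one query. It assumes that every ordered pair of distinct passages in the top k is scored, which is consistent with the symmetric aggregation used later; the function name `pair_ids` is hypothetical and this is not the PyGaggle script itself.

```python
from itertools import permutations


def pair_ids(query_id, ranked_doc_ids, top_k):
    """Return (query_id, doc_id_0, doc_id_1) for every ordered pair in the top_k."""
    top = ranked_doc_ids[:top_k]
    # permutations(top, 2) yields both (a, b) and (b, a), never (a, a).
    return [(query_id, d0, d1) for d0, d1 in permutations(top, 2)]


ids = pair_ids("q1", ["d1", "d2", "d3", "d4"], top_k=3)
for row in ids:
    print("\t".join(row))
```

Because line i of the ids file describes line i of the input file, the model's i-th output score can later be attached to the right query and passage pair.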
The files are made available in our bucket.
The duoT5 input file may be too large to fit in the instance's memory, so we split it into multiple files.
split --suffix-length 3 --numeric-suffixes --lines 500000 ${DATA_DIR}/ ${DATA_DIR}/
We will get 35 files after the split, with numeric suffixes 000 to 034.
Note that reranking might still run out of memory (OOM), in which case reduce the number of lines per file to fewer than 500000.
We copy these input files to Google Storage. TPU inference will read data directly from Google Storage.
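The file count can be checked with a little arithmetic. The sketch below assumes that every dev query has at least 50 candidates and that all ordered pairs of distinct passages are scored; under those assumptions the numbers match the 35 files mentioned above.

```python
import math

num_queries = 6980        # dev queries
top_k = 50                # --top_k value used when building the pairs
lines_per_file = 500000   # --lines value passed to split

pairs_per_query = top_k * (top_k - 1)  # ordered pairs, excluding self-pairs
total_pairs = num_queries * pairs_per_query
num_files = math.ceil(total_pairs / lines_per_file)

print(pairs_per_query, total_pairs, num_files)  # 2450 17101000 35
```

If you lower the number of lines per file to avoid OOM issues, the number of files grows accordingly, and the range used in the inference loop needs to be adjusted to match.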
export GS_FOLDER=<google storage folder to store input/output data>
gsutil cp ${DATA_DIR}/ ${GS_FOLDER}
Define environment variables.
export PROJECT_NAME=<gcloud project name>
export PROJECT_ID=<gcloud project id>
export INSTANCE_NAME=<name of vm to create>
export TPU_NAME=<name of tpu to create>
Create the VM.
gcloud beta compute --project=${PROJECT_NAME} instances create ${INSTANCE_NAME} --zone=europe-west4-a --machine-type=n1-standard-4 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=${PROJECT_ID}-compute@developer.gserviceaccount.com --scopes= --image=debian-10-buster-v20201112 --image-project=debian-cloud --boot-disk-size=25GB --boot-disk-type=pd-standard --boot-disk-device-name=${INSTANCE_NAME} --reservation-affinity=any
It is possible that the image and machine-type provided here are dated, so feel free to update them to whichever fits your needs.
After the VM is created, we can ssh to the machine.
Make sure to initialize PROJECT_NAME and TPU_NAME from within the machine too.
Then create a TPU.
curl -O && chmod a+x ctpu
./ctpu up --name=${TPU_NAME} --project=${PROJECT_NAME} --zone=europe-west4-a --tpu-size=v3-8 --tpu-only --noconf
Install required tools including Miniconda.
sudo apt-get update
sudo apt-get install git gcc screen --yes
curl -O
bash ./
source ~/.bashrc
Then create a Python virtual environment for the experiments and install dependencies.
conda init
conda create --yes --name py36 python=3.6
conda activate py36
conda install -c conda-forge httptools jsonnet --yes
pip install tensorflow tensorflow-text t5[gcp]
git clone
pip install --editable mesh
Let's first define the model type and checkpoint.
export MODEL_NAME=<base or 3B>
export MODEL_DIR=gs://castorini/duot5/experiments/${MODEL_NAME}
Then run the following command to start the process in the background and monitor the log:
for ITER in {000..034}; do
echo "Running iter: $ITER" >> out.log_eval_exp
nohup t5_mesh_transformer \
--tpu="${TPU_NAME}" \
--gcp_project="${PROJECT_NAME}" \
--tpu_zone="europe-west4-a" \
--model_dir="${MODEL_DIR}" \
--gin_file="gs://t5-data/pretrained_models/${MODEL_NAME}/operative_config.gin" \
--gin_file="infer.gin" \
--gin_file="beam_search.gin" \
--gin_param="utils.tpu_mesh_shape.tpu_topology = '2x2'" \
--gin_param="infer_checkpoint_step = 1150000" \
--gin_param=" = {'inputs': 512, 'targets': 2}" \
--gin_param="Bitransformer.decode.max_decode_length = 2" \
--gin_param="input_filename = '${GS_FOLDER}/${ITER}'" \
--gin_param="output_filename = '${GS_FOLDER}/${ITER}'" \
--gin_param="'tokens_per_batch', 65536)" \
--gin_param="Bitransformer.decode.beam_size = 1" \
--gin_param="Bitransformer.decode.temperature = 0.0" \
--gin_param="Unitransformer.sample_autoregressive.sampling_keep_top_k = -1" \
>> out.log_eval_exp 2>&1
done &
tail -100f out.log_eval_exp
Using a TPU v3-8, it takes approximately 12 hours to rerank with duoT5-base and 38 hours with duoT5-3B.
Note that we strongly encourage you to run any of the long processes in screen to make sure they don't get interrupted.
After reranking is done, let's copy the results from GS to our working directory, where we concatenate all the score files back into one file.
gsutil cp ${GS_FOLDER}/ ${DATA_DIR}/
cat ${DATA_DIR}/ > ${DATA_DIR}/
Then we convert the duoT5 output to the required MS MARCO format.
python pygaggle/data/ --t5_output ${DATA_DIR}/ \
--t5_output_ids ${DATA_DIR}/ \
--duo_run ${DATA_DIR}/run.duot5_${MODEL_NAME}.dev.tsv \
--input_run ${DATA_DIR}/run.monot5_${MODEL_NAME}.dev.tsv \
--aggregate sym-sum
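The `sym-sum` option refers to the symmetric sum aggregation described in the duoT5 paper: a passage's score is the sum, over all other candidates, of the probability that it beats them plus one minus the probability that they beat it. The sketch below illustrates the idea on plain probabilities; the function name `sym_sum` and the dictionary input are assumptions for illustration, and the actual conversion script may differ in details such as how it handles log-probabilities.

```python
from collections import defaultdict


def sym_sum(pair_scores):
    """Aggregate pairwise probabilities into one score per document.

    pair_scores: {(doc_i, doc_j): p_ij}, where p_ij is the estimated
    probability that doc_i is more relevant than doc_j.
    """
    totals = defaultdict(float)
    for (doc_i, doc_j), p in pair_scores.items():
        totals[doc_i] += p        # evidence that doc_i beats doc_j
        totals[doc_j] += 1.0 - p  # the same comparison from doc_j's side
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


pair_scores = {("a", "b"): 0.9, ("b", "a"): 0.2}
ranking = sym_sum(pair_scores)
print([doc_id for doc_id, _ in ranking])  # ['a', 'b']
```

Using both orderings of each pair makes the final score less sensitive to which passage happened to be presented first to the model.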
Now we can evaluate the reranked results using the official MS MARCO evaluation script.
python tools/scripts/msmarco/ ${DATA_DIR}/ ${DATA_DIR}/run.duot5_${MODEL_NAME}.dev.tsv
In the case of duoT5-3B, the output should be:
MRR @10: 0.40913556874516793
QueriesRanked: 6980
In the case of duoT5-base, the output should be:
MRR @10: 0.3929155864829223
QueriesRanked: 6980
If you were able to replicate any of these results, please submit a PR adding to the replication log, along with the model(s) you replicated. Please mention in your PR if you note any differences.