This repository contains the source code for our EMNLP 2024 paper *Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision*.
Install the dependencies:

```bash
pip install -r requirements.txt
```
We release the following fine-tuned models on the Hugging Face Hub:

- `fanjiang98/CLASS-XOR-Retrieve`: model fine-tuned on XOR-Retrieve.
- `fanjiang98/CLASS-XOR-Full`: model fine-tuned on XOR-Full.
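If you prefer to fetch a checkpoint locally before running the scripts, here is a minimal sketch using `huggingface_hub`; the `local_dir` target is our own choice for illustration, not something the scripts require:

```python
# download_checkpoint.py -- fetch a released checkpoint from the Hugging Face Hub.
# The local_dir below is an arbitrary choice for illustration.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fanjiang98/CLASS-XOR-Retrieve",
    local_dir="models/CLASS-XOR-Retrieve",
)
```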
Download the XOR-Retrieve data:

```bash
mkdir -p data/XOR-Retrieve
cd data/XOR-Retrieve
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_retrieve_eng_span.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_retrieve_eng_span_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/models/enwiki_20190201_w100.tsv -O psgs_w100.tsv
cd ../../
```
Download the XOR-Full data:

```bash
mkdir -p data/XOR-Full
cd data/XOR-Full
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_full.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_full_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/cora/models/all_w100.tsv
cd ../../
```
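As a quick sanity check after downloading, the sketch below prints the fields of the first record in each JSONL file and the first row of a corpus TSV, without loading the large files fully; the paths match the commands above:

```python
# sanity_check_data.py -- peek at the downloaded files without loading them fully.
import csv
import json

for path in (
    "data/XOR-Retrieve/xor_train_retrieve_eng_span.jsonl",
    "data/XOR-Full/xor_train_full.jsonl",
):
    with open(path, encoding="utf-8") as f:
        record = json.loads(f.readline())
    print(path, "->", sorted(record))  # field names of the first record

# The passage corpora are large TSVs; read only the first row to inspect the layout.
with open("data/XOR-Retrieve/psgs_w100.tsv", encoding="utf-8") as f:
    first_row = next(csv.reader(f, delimiter="\t"))
print("psgs_w100.tsv first row:", first_row[:3])
```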
Encode the queries:

```bash
bash scripts/XOR-Retrieve/encode_query.sh
```

Encode the corpus:

```bash
bash scripts/XOR-Retrieve/encode_corpus.sh
```

Note that `MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Retrieve`.
Run retrieval:

```bash
bash scripts/XOR-Retrieve/retrieve_hn.sh
```

Note that `MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Retrieve`.
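Conceptually, this step is a maximum inner product search between the query and passage embeddings produced above. Here is a minimal numpy sketch with made-up shapes, for intuition only; the actual sharded search is implemented by the script above:

```python
# mips_sketch.py -- illustrative maximum inner product search over dense embeddings;
# shapes and data here are made up, the real search runs inside retrieve_hn.sh.
import numpy as np

rng = np.random.default_rng(0)
passage_emb = rng.standard_normal((10_000, 768)).astype(np.float32)  # corpus vectors
query_emb = rng.standard_normal((8, 768)).astype(np.float32)         # query vectors

scores = query_emb @ passage_emb.T             # (num_queries, num_passages)
top_k = np.argsort(-scores, axis=1)[:, :100]   # 100 highest-scoring passages per query
print(top_k.shape)                             # (8, 100)
```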
We use the official scripts provided by XOR-TyDi QA for evaluation:

```bash
python3 evals/eval_xor_retrieve.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>
```
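For orientation, below is a hedged sketch of what a prediction file might look like. The authoritative schema is whatever `evals/eval_xor_retrieve.py` parses; the fields used here (`id`, `lang`, `ctxs`) are an assumption based on the XOR-TyDi QA release, so verify against the script before relying on them:

```python
# predictions_sketch.py -- ASSUMED schema for --pred_file; check
# evals/eval_xor_retrieve.py, which defines the real format.
import json

predictions = [
    {
        "id": "question-id",                 # assumed: question id from the dev file
        "lang": "ja",                        # assumed: question language code
        "ctxs": ["retrieved passage text"],  # assumed: ranked passage texts
    }
]
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False)
```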
This leads to the following results, where R@2k and R@5k denote answer recall within the top 2,000 and 5,000 retrieved tokens:

**R@2k**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg |
|---|---|---|---|---|---|---|---|---|
| CLASS-US | 54.5 | 67.4 | 58.6 | 47.7 | 51.6 | 59.9 | 65.6 | 57.9 |
| CLASS-ZS | 59.2 | 70.1 | 59.9 | 51.5 | 57.2 | 51.5 | 72.3 | 60.2 |
| CLASS | 66.7 | 79.6 | 64.3 | 58.1 | 66.0 | 64.1 | 77.7 | 68.1 |

**R@5k**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg |
|---|---|---|---|---|---|---|---|---|
| CLASS-US | 64.8 | 73.0 | 64.7 | 57.3 | 58.6 | 67.9 | 70.6 | 65.3 |
| CLASS-ZS | 66.7 | 78.6 | 66.6 | 60.2 | 63.2 | 58.2 | 78.2 | 67.4 |
| CLASS | 70.6 | 84.9 | 71.0 | 66.0 | 72.6 | 70.0 | 81.9 | 73.9 |
The procedure is the same as for XOR-Retrieve. Please find the corresponding scripts under `scripts/XOR-Full` and replace `MODEL_PATH` with `fanjiang98/CLASS-XOR-Full`.
Run the reader:

```bash
bash scripts/XOR-Full/eval_reader.sh
```

`MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Full`. We use the official scripts provided by XOR-TyDi QA for evaluation:

```bash
python3 evals/eval_xor_full.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>
```
This leads to the following results:
**F1 (per language)**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te |
|---|---|---|---|---|---|---|---|
| CORA | 42.9 | 26.9 | 41.4 | 36.8 | 30.4 | 33.9 | 30.9 |
| CLASS | 49.1 | 32.0 | 46.7 | 44.1 | 38.4 | 39.9 | 41.1 |

**Macro Average**

| Model | F1 | EM | BLEU |
|---|---|---|---|
| CORA | 34.7 | 25.8 | 23.3 |
| CLASS | 41.6 | 32.5 | 28.2 |
Please download the training data from OneDrive and put it in the corresponding directories under `data`.
- Stage-1 pre-training: `bash scripts/train_mss_distill_reader.sh`
- Stage-2 pre-training: `bash scripts/XOR-Retrieve/train_mss_iterative_reader.sh`
- Fine-tuning on Natural Questions (zero-shot model): `bash scripts/XOR-Retrieve/train_nq_iterative_reader.sh`
- Fine-tuning on the XOR-Retrieve training data (i.e., our released CLASS-XOR-Retrieve model): `bash scripts/XOR-Retrieve/train_iterative_reader.sh`
The training pipeline for XOR-Full is the same; please find the corresponding scripts under `scripts/XOR-Full` for steps 2, 3, and 4.
We use Slurm for training, with 32 80GB A100 GPUs for stage-1 pre-training and 16 A100 GPUs for the remaining stages.
Some of the code was adapted from https://github.com/jzbjyb/ReAtt.