This repository contains the source code for our EMNLP 2024 paper *Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision*.
Install the dependencies:

```bash
pip install -r requirements.txt
```
We release the following fine-tuned models on the Hugging Face Hub:

- `fanjiang98/CLASS-XOR-Retrieve`: model fine-tuned on XOR-Retrieve.
- `fanjiang98/CLASS-XOR-Full`: model fine-tuned on XOR-Full.
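If you prefer to fetch a checkpoint locally before running the scripts, here is a minimal sketch using `huggingface_hub`; the `local_dir` target is our own choice for illustration, not something the scripts require:

```python
# download_checkpoint.py -- fetch a released checkpoint from the Hugging Face Hub.
# The local_dir below is an arbitrary choice for illustration.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="fanjiang98/CLASS-XOR-Retrieve",
    local_dir="models/CLASS-XOR-Retrieve",
)
```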
Download the XOR-Retrieve data:

```bash
mkdir -p data/XOR-Retrieve
cd data/XOR-Retrieve
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_retrieve_eng_span.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_retrieve_eng_span_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/models/enwiki_20190201_w100.tsv -O psgs_w100.tsv
cd ../../
```
Download the XOR-Full data:

```bash
mkdir -p data/XOR-Full
cd data/XOR-Full
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_train_full.jsonl
wget https://nlp.cs.washington.edu/xorqa/XORQA_site/data/xor_dev_full_v1_1.jsonl
wget https://dl.fbaipublicfiles.com/dpr/data/retriever/nq-train.qa.csv
wget https://nlp.cs.washington.edu/xorqa/cora/models/all_w100.tsv
cd ../../
```
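As a quick sanity check after downloading, the sketch below prints the fields of the first record in each JSONL file and the first row of a corpus TSV, without loading the large files fully; the paths match the commands above:

```python
# sanity_check_data.py -- peek at the downloaded files without loading them fully.
import csv
import json

for path in (
    "data/XOR-Retrieve/xor_train_retrieve_eng_span.jsonl",
    "data/XOR-Full/xor_train_full.jsonl",
):
    with open(path, encoding="utf-8") as f:
        record = json.loads(f.readline())
    print(path, "->", sorted(record))  # field names of the first record

# The passage corpora are large TSVs; read only the first row to inspect the layout.
with open("data/XOR-Retrieve/psgs_w100.tsv", encoding="utf-8") as f:
    first_row = next(csv.reader(f, delimiter="\t"))
print("psgs_w100.tsv first row:", first_row[:3])
```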
Encode the queries:

```bash
bash scripts/XOR-Retrieve/encode_query.sh
```

Encode the corpus:

```bash
bash scripts/XOR-Retrieve/encode_corpus.sh
```

Note that `MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Retrieve`.
Run retrieval:

```bash
bash scripts/XOR-Retrieve/retrieve_hn.sh
```

Note that `MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Retrieve`.
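Conceptually, this step is a maximum inner product search between the query and passage embeddings produced above. Here is a minimal numpy sketch with made-up shapes, for intuition only; the actual sharded search is implemented by the script above:

```python
# mips_sketch.py -- illustrative maximum inner product search over dense embeddings;
# shapes and data here are made up, the real search runs inside retrieve_hn.sh.
import numpy as np

rng = np.random.default_rng(0)
passage_emb = rng.standard_normal((10_000, 768)).astype(np.float32)  # corpus vectors
query_emb = rng.standard_normal((8, 768)).astype(np.float32)         # query vectors

scores = query_emb @ passage_emb.T             # (num_queries, num_passages)
top_k = np.argsort(-scores, axis=1)[:, :100]   # 100 highest-scoring passages per query
print(top_k.shape)                             # (8, 100)
```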
We use the official scripts provided by XOR-TyDi QA for evaluation:

```bash
python3 evals/eval_xor_retrieve.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>
```
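For orientation, below is a hedged sketch of what a prediction file might look like. The authoritative schema is whatever `evals/eval_xor_retrieve.py` parses; the fields used here (`id`, `lang`, `ctxs`) are an assumption based on the XOR-TyDi QA release, so verify against the script before relying on them:

```python
# predictions_sketch.py -- ASSUMED schema for --pred_file; check
# evals/eval_xor_retrieve.py, which defines the real format.
import json

predictions = [
    {
        "id": "question-id",                 # assumed: question id from the dev file
        "lang": "ja",                        # assumed: question language code
        "ctxs": ["retrieved passage text"],  # assumed: ranked passage texts
    }
]
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False)
```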
This leads to the following results, where R@2k and R@5k denote answer recall within the top 2,000 and 5,000 retrieved tokens:

**R@2k**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg |
|---|---|---|---|---|---|---|---|---|
| CLASS-US | 54.5 | 67.4 | 58.6 | 47.7 | 51.6 | 59.9 | 65.6 | 57.9 |
| CLASS-ZS | 59.2 | 70.1 | 59.9 | 51.5 | 57.2 | 51.5 | 72.3 | 60.2 |
| CLASS | 66.7 | 79.6 | 64.3 | 58.1 | 66.0 | 64.1 | 77.7 | 68.1 |

**R@5k**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te | Avg |
|---|---|---|---|---|---|---|---|---|
| CLASS-US | 64.8 | 73.0 | 64.7 | 57.3 | 58.6 | 67.9 | 70.6 | 65.3 |
| CLASS-ZS | 66.7 | 78.6 | 66.6 | 60.2 | 63.2 | 58.2 | 78.2 | 67.4 |
| CLASS | 70.6 | 84.9 | 71.0 | 66.0 | 72.6 | 70.0 | 81.9 | 73.9 |
The procedure is the same as for XOR-Retrieve. Please find the corresponding scripts under `scripts/XOR-Full` and replace `MODEL_PATH` with `fanjiang98/CLASS-XOR-Full`.
Run the reader:

```bash
bash scripts/XOR-Full/eval_reader.sh
```

`MODEL_PATH` should be set to `fanjiang98/CLASS-XOR-Full`. We use the official scripts provided by XOR-TyDi QA for evaluation:

```bash
python3 evals/eval_xor_full.py \
    --data_file <path_to_input_data> \
    --pred_file <path_to_predictions>
```
This leads to the following results:
**F1 (per language)**

| Model | Ar | Bn | Fi | Ja | Ko | Ru | Te |
|---|---|---|---|---|---|---|---|
| CORA | 42.9 | 26.9 | 41.4 | 36.8 | 30.4 | 33.9 | 30.9 |
| CLASS | 49.1 | 32.0 | 46.7 | 44.1 | 38.4 | 39.9 | 41.1 |

**Macro Average**

| Model | F1 | EM | BLEU |
|---|---|---|---|
| CORA | 34.7 | 25.8 | 23.3 |
| CLASS | 41.6 | 32.5 | 28.2 |
Please download the training data from OneDrive and put it in the corresponding directories under `data`.
- Stage-1 pre-training: `bash scripts/train_mss_distill_reader.sh`
- Stage-2 pre-training: `bash scripts/XOR-Retrieve/train_mss_iterative_reader.sh`
- Fine-tuning on Natural Questions (zero-shot model): `bash scripts/XOR-Retrieve/train_nq_iterative_reader.sh`
- Fine-tuning on the XOR-Retrieve training data (i.e., our released CLASS-XOR-Retrieve model): `bash scripts/XOR-Retrieve/train_iterative_reader.sh`
The training pipeline for XOR-Full is the same; please find the corresponding scripts under `scripts/XOR-Full` for steps 2, 3, and 4.
We use Slurm for training, with 32 80GB A100 GPUs for stage-1 pre-training and 16 A100 GPUs for the remaining stages.
Some of the code was adapted from https://github.com/jzbjyb/ReAtt.