This repo is the implementation of the following paper:
Improving Generalization in Semantic Parsing by Increasing Natural Language Variation
Irina Saparina and Mirella Lapata
EACL'24
This dataset is released under the CC BY-SA 4.0 license: you must credit the original source and share any derivative works under the same license. Commercial use is permitted under these terms.
You can download the augmented Spider and evaluation datasets from Google Drive.
Preprocess Dr.Spider:
cd data/diagnostic-robustness-text-to-sql
python data_preprocess.py
Preprocess KaggleDBQA:
cd data/kaggle-dbqa
python preprocess.py
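Both preprocessing steps change into a dataset directory first. A minimal sketch for running them from the repository root without losing the current directory is shown here. The helper name `run_preprocess` is hypothetical and not part of this repo, and the sketch assumes `python` resolves to the interpreter of the project environment.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): run one preprocessing script
# inside its dataset directory. The subshell keeps the caller's working
# directory unchanged.
run_preprocess() {
    dir="$1"
    script="$2"
    if [ ! -f "$dir/$script" ]; then
        echo "skip: $dir/$script not found" >&2
        return 1
    fi
    ( cd "$dir" && python "$script" )
}

# Directory and script names are the ones listed above.
run_preprocess data/diagnostic-robustness-text-to-sql data_preprocess.py \
    || echo "Dr.Spider preprocessing did not run" >&2
run_preprocess data/kaggle-dbqa preprocess.py \
    || echo "KaggleDBQA preprocessing did not run" >&2
```

Running each script in a subshell means a failure in one dataset does not leave the shell in an unexpected directory for the next step.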
The T5 checkpoint is available on the HuggingFace Hub.
RESDSQL checkpoints are available on Google Drive. Download them and unzip the files into models/RESDSQL.
Create conda env:
conda env create -n nlvariation_env -f enviroment.yaml
conda activate nlvariation_env
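Later steps assume the environment is active. One way to guard a script against running in the wrong environment is sketched here. It relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` sets. The function name `require_conda_env` is hypothetical.

```shell
#!/bin/sh
# Hypothetical check (not part of this repo): succeed only if the named
# conda environment is the active one.
require_conda_env() {
    expected="$1"
    if [ "${CONDA_DEFAULT_ENV:-}" != "$expected" ]; then
        echo "expected conda env '$expected', found '${CONDA_DEFAULT_ENV:-none}'" >&2
        return 1
    fi
    echo "conda env '$expected' is active"
}
```

A script could call `require_conda_env nlvariation_env || exit 1` before installing dependencies or launching evaluation.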
Install RESDSQL dependencies:
cd RESDSQL
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
python nltk_downloader.py
Clone evaluation scripts:
mkdir picard/third_party
cd picard/third_party
git clone https://github.com/facebookincubator/hsthrift
git clone https://github.com/facebook/zstd
git clone https://github.com/facebook/wangle
git clone https://github.com/facebook/folly
git clone https://github.com/elementai/spider
git clone https://github.com/elementai/test-suite-sql-eval
git clone https://github.com/hasktorch/tokenizers
git clone https://github.com/facebook/fbthrift
git clone https://github.com/fmtlib/fmt
git clone https://github.com/rsocket/rsocket-cpp
git clone https://github.com/facebookincubator/fizz
cd ../../
mkdir RESDSQL/third_party
cd RESDSQL/third_party
git clone https://github.com/ElementAI/spider.git
git clone https://github.com/ElementAI/test-suite-sql-eval.git
mv ./test-suite-sql-eval ./test_suite
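Re-running the clone commands fails once a target directory exists. A small sketch of an idempotent variant follows. The helper name `clone_if_missing` is hypothetical; it only wraps `git clone <url> <directory>` and skips repositories that are already present.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): clone a repository unless
# its destination directory already exists.
clone_if_missing() {
    url="$1"
    # Default destination: last URL component with any .git suffix removed.
    dest="${2:-$(basename "$url" .git)}"
    if [ -d "$dest" ]; then
        echo "skip: $dest already exists"
        return 0
    fi
    git clone "$url" "$dest"
}
```

Inside `picard/third_party`, `clone_if_missing https://github.com/facebook/folly` would stand in for the plain clone. Inside `RESDSQL/third_party`, `clone_if_missing https://github.com/ElementAI/test-suite-sql-eval.git test_suite` would clone directly into the renamed directory, replacing the separate `mv` step.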
The code used for experiments with T5 and PICARD is a fork of the official PICARD implementation:
cd picard
You can run T5 evaluation with:
sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery
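To run the three evaluations in sequence and stop at the first failure, a wrapper along these lines may help. The function name `run_all_evals` is hypothetical. It assumes each config script exits with a non-zero status on error, which has not been verified for the scripts in this repo.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of this repo): run each given script in
# order and stop at the first one that is missing or fails.
run_all_evals() {
    for cfg in "$@"; do
        if [ ! -f "$cfg" ]; then
            echo "missing: $cfg" >&2
            return 1
        fi
        echo "running: $cfg"
        sh "$cfg" || return 1
    done
}
```

From the `picard` directory this could be called as `run_all_evals ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh`.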
You need to use Docker (see more info) to run PICARD. You can run evaluation with:
sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery
You can run training on the augmented dataset with:
python seq2seq/run_seq2seq.py configs/train_augs.json
The code used for experiments with RESDSQL is a fork of the official RESDSQL implementation:
cd RESDSQL
You can run RESDSQL evaluation with:
sh ./configs/dr_spider/eval_dr_spider_t5-spider-augs.sh # Dr.Spider
sh ./configs/kaggle/eval_kaggle_t5-spider-augs.sh # KaggleDBQA
sh ./configs/geoquery/eval_geoquery_t5-spider-augs.sh # GeoQuery
You can run training on the augmented dataset with:
sh ./configs/train_augs.sh
We used the following datasets: Spider, Dr.Spider, KaggleDBQA, and GeoQuery. The code is based on the official PICARD implementation and the official RESDSQL implementation (which includes NatSQL). We thank all authors for their work.