🏆 Award-Winning Project: Recipient of the MIT Open Data Prize 2022.
This project collects LegalNLP data to predict whether a legal article of WTO rulings can be applied to a given textual description of a dispute between two countries before the WTO judicial body.
Currently, there exists no publicly shared dataset that researchers can study together in the field of LegalNLP. One of the main reasons for this vacancy is the locality of law. This repo therefore aims to prepare a publicly available LegalNLP dataset within the field of International Law, which operates globally in English. Moreover, to help researchers understand how deep learning could be applied, the repo provides sample model code for the prepared dataset.
This project aims to achieve the following two main goals:
- Build a LegalNLP dataset so that everyone can participate in this legal prediction/classification agenda in an objective manner
- Perform classification with a neural network and surpass the naive baseline of the task, i.e. achieve AUC-ROC > 0.5
Basically, the WTO panel process determines whether a country's government measure at issue is contrary or not contrary to certain article(s) of the WTO rules, by explicitly stating, for example:
"Korea’s domestic support for beef in 1997 and 1998 exceeded the de minimis level contrary to Article 6 of the Agreement on Agriculture."
The government measure is the trickiest part of preparing the training data. It is usually descriptive and case-specific, and therefore hard to generalize across cases. Moreover, it has no strictly enforced formatting style; the format mainly depends on the preference of each panel body. Therefore, for the first version of the dataset, we simply include all the text strings found under the chapter named Factual Aspects in every Panel Report.
Normally, a description of the government measure can be found in the following classes of WTO documents:
- Factual Aspects in Panel Report [example]
- Request for Consultations [example]
- Request for the Establishment of a Panel [example]
- Download `train_data.json` and `test_data.json`. (A sketch of loading and inspecting the data follows this list.)
- Each data instance looks as follows:

  ```
  {
    "testid": [DS_number]_[Article Name],
    "gov": Textual description of the Government Measure,
    "art": Article contents corresponding to [Article Name],
    "label": [0] if not cited, [1] if cited
  }
  ```

  * `[DS_number]` is a unique identification code for each case requested to the WTO.
  * Examples of `[Article Name]` are Article I:1, Article III:4, etc.
- After downloading, place `train_data.json` and `test_data.json` in your preferred `PATH`. Then edit the `TRAININGSET_DIR` and `VALIDATIONSET_DIR` variables in `models/cite_wa/OneLabelTextCNN/train.py` with `PATH`.
- In case you don't have `GoogleNews-vectors-negative300.bin`, download `GoogleNews-vectors-negative300.bin.gz` from Google Drive and `gunzip` it to your preferred `PATH`. Then edit the `word2vec_path` argument at the last line of `models/cite_wa/OneLabelTextCNN/train.py` with `PATH`. (A sketch of loading the embeddings also follows this list.)
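Once the files are in place, a quick way to sanity-check them is to load one instance. A minimal sketch, assuming the top-level JSON object is a list of instances (adjust the indexing if the actual files nest them differently):

```python
import json

PATH = "/path/to/data"  # wherever you placed the downloaded files

with open(f"{PATH}/train_data.json") as f:
    train_data = json.load(f)

# Assuming a list of instances at the top level.
example = train_data[0]
print(example["testid"])     # "[DS_number]_[Article Name]"
print(example["label"])      # 0 (not cited) or 1 (cited)
print(example["gov"][:200])  # first 200 characters of the measure description
```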
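Likewise, you can verify that the unzipped embeddings load correctly before starting a run. A sketch using gensim (assumed to be available in the conda environment):

```python
from gensim.models import KeyedVectors

# This path should match the word2vec_path argument edited in train.py.
word2vec_path = "/path/to/GoogleNews-vectors-negative300.bin"
w2v = KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

print(w2v.vector_size)   # 300 for the GoogleNews vectors
print(w2v["trade"][:5])  # first five dimensions of one word's embedding
```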
To set up the environment and start training:

```
git clone https://github.com/syyunn/DeepWTO
cd DeepWTO
conda env create -f environment.yaml
python -m spacy download en  # download the spaCy model used for tokenization
cd models/cite_wa/OneLabelTextCNN
python train.py
```
Path | Description |
---|---|
DeepWTO | Main folder |
├ assests | Images required in the README |
├ models | TF model code for the different tasks, with data |
│ ├ | Prediction of which articles are cited, without the legal text (multi-label classification; deprecated) |
│ └ cite_wa | Prediction of whether an article is cited, given the article content (one-label classification) |
├ prep | Code to prepare the data for all the different tasks |
│ ├ download | Code to crawl/cleanse the data from the WTO database, plus crawl results |
│ ├ factual | Code to parse the Factual Aspects parts from the Panel Reports |
│ ├ label | Code and raw data to be used as labels |
│ │ └ cite | Code and raw data to prepare labels for the citability prediction task |
│ └ provision | PDF and TEXT files containing the raw data of the legal provisions |
├ utils | Simple utility code |
└ web | Front-end/server code to deploy the project (currently in progress) |
The model achieved an AUC-ROC of `0.8432` on the one-label classification task (cite_wa) with the test data. The maximum possible AUC-ROC is `1`.

Also, the model achieved an accuracy of `92.04%` on the test set, with the following statistics:

- Total correct predictions for label `[1]`: 37 out of 83
- Total correct predictions for label `[0]`: 2068 out of 2204

However, the preferred metric for model performance is AUC-ROC, because a naive baseline already achieves an accuracy of `96.37%` (2204/2287) by simply predicting label `[0]` for every case. Since only a small number of articles are cited (label `[1]`) among all the articles of each case, it is more informative to measure how precisely the model predicts label `[1]`, which AUC-ROC reflects. A small sketch of this point follows.
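To make the baseline argument concrete, here is a short sketch (using scikit-learn, which is not necessarily part of this repo's environment) reproducing the accuracy/AUC-ROC gap from the class balance reported above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Test-set class balance reported above: 83 cited, 2204 not cited.
y_true = np.array([1] * 83 + [0] * 2204)

# Naive baseline: predict label [0] for every case (constant zero score).
naive_scores = np.zeros_like(y_true, dtype=float)

print(accuracy_score(y_true, (naive_scores > 0.5).astype(int)))  # 0.9637 (2204/2287)
print(roc_auc_score(y_true, naive_scores))                       # 0.5, chance level
```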
- A more detailed and friendly explanation of the project can be found in this preprint (or in these slides).
- You may also want to refer to this paper, which explains how to use this data to find correlations between legal articles in WTO rulings.
Special thanks to RandolphIV, whose repo provided the awesome document classification model code that this repo referred to.