Most of the existing intermediary files have been removed from GitHub and instead are available from this reproduction package. Please move all of the following folders to the repo before re-running the scripts below: `clusters`, `dataset`, `downloads`, `manual`, `output`.
You can generate the benchmark with accessible datasets using the `build_benchmark.sh` script.
All folder names are arbitrary and can be modified in the `parameters.py` script.
Be sure to create the associated conda environment from the `environment.yml` file in the main repo using:

`conda env create --file environment.yml -n opend5`
The `pull_data.py` script contains individual functions for each dataset in the benchmark. Each function uses one of three methods for obtaining data:
- Download: It downloads a mirror of an existing dataset and leaves it mostly intact. Datasets are variously downloaded from GitHub repositories, Zenodo, Harvard Dataverse, sites hosted by authors, and other sources of reproduction material.
- Scrape: Some datasets (e.g. `open_review` or `admin_statemnts`) are constructed by collecting data from an API or crawler. These datasets should be handled in separate scripts whose names begin with `scrape`.
- Manual: For datasets without an easily accessible URL, source files are downloaded manually. The respective function should directly preprocess the downloaded dataset.
By default, manually downloaded datasets should be located in the `manual` folder under a subdirectory with the dataset's name. Automatically downloaded datasets should have a copy saved to the `downloads` folder (to preserve reproducibility). All formatted datasets should be written to the `outputs` folder.
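For orientation, here is a minimal sketch of what a download-style function following these conventions could look like; the dataset name, URL, folder constants, and preprocessing are placeholders rather than the actual contents of `pull_data.py`:

```python
# Hypothetical sketch of a "download"-style dataset function.
# The dataset name, URL, and schema are illustrative only.
import os
import pandas as pd
import requests

DOWNLOADS_DIR = "downloads"   # raw mirrors, kept for reproducibility
OUTPUT_DIR = "outputs"        # formatted datasets

def example_dataset():
    """Download a CSV mirror, keep a raw copy, and write a formatted version."""
    url = "https://example.org/example_dataset.csv"  # placeholder URL
    raw_path = os.path.join(DOWNLOADS_DIR, "example_dataset", "raw.csv")
    os.makedirs(os.path.dirname(raw_path), exist_ok=True)

    # Save an unmodified copy of the source file to the downloads folder.
    response = requests.get(url)
    response.raise_for_status()
    with open(raw_path, "wb") as f:
        f.write(response.content)

    # Light preprocessing, then write the formatted dataset to the outputs folder.
    df = pd.read_csv(raw_path)
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    df.to_csv(os.path.join(OUTPUT_DIR, "example_dataset.csv"), index=False)
```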
To generate just the datasets that are accessible but not licensed for sharing, use `pull_data.py --access`. To retrieve data from Twitter, get a valid API key and add it to the relevant scripts (`scrape_twitter_rumors.py` and `scrape_blm_countermovements.py`). You should then set the `status` field of these datasets in `datasets.yaml` to `accessible`.
Some tasks in the benchmark are generated automatically at scale from large datasets. `generate.py` contains helper functions for doing this. Functionalities include:
- For categorical features, contrasting each label against all remaining labels or creating every label-to-label pair.
- For discrete features, pairing labels step-wise (i.e. 1 with 2, 2 with 3, etc.).
For example usage, reference the `add_pairs.py` script, which should also hold any of your automatically generated pairs. Datasets should be stored in tabular form in the `datasets` folder.
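As a rough illustration of the pairing strategies above, the following sketch uses placeholder function names rather than the actual `generate.py` helpers:

```python
# Illustrative sketch of the pairing strategies described above;
# the function names are placeholders, not the generate.py API.
from itertools import combinations

def one_vs_rest_pairs(labels):
    """Contrast each label against all remaining labels combined."""
    unique = sorted(set(labels))
    return [(label, [l for l in unique if l != label]) for label in unique]

def categorical_pairs(labels):
    """Create every label-to-label pair for a categorical feature."""
    return list(combinations(sorted(set(labels)), 2))

def stepwise_pairs(values):
    """Pair adjacent values of a discrete feature (1 with 2, 2 with 3, ...)."""
    ordered = sorted(set(values))
    return list(zip(ordered, ordered[1:]))

# Example usage:
# categorical_pairs(["news", "sports", "tech"])
#   -> [("news", "sports"), ("news", "tech"), ("sports", "tech")]
# stepwise_pairs([1, 2, 3, 4]) -> [(1, 2), (2, 3), (3, 4)]
```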
Several of the D5 tasks are clusters generated from large corpora. We store these unlabeled collections of text in the `unlabeled` folder. `get_embeddings.py` embeds the text, and `create_cluster.py` uses those embeddings to create K-means clusters.
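A minimal sketch of this embed-then-cluster pipeline is below; the embedding model and clustering parameters are assumptions, not the settings used in `get_embeddings.py` or `create_cluster.py`:

```python
# Rough sketch of embedding texts and clustering them with K-means.
# The model name and number of clusters are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def embed_and_cluster(texts, n_clusters=10):
    # Embed each text into a dense vector (model choice is illustrative).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts, show_progress_bar=True)

    # Group the embeddings into K-means clusters.
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    return kmeans.fit_predict(embeddings)
```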
`make_benchmark.py` aggregates the full benchmark. If everything is set up properly, running the following command (from the main repo) should generate the full benchmark in the `benchmarks` folder:

`python scripts/make_benchmark.py --full`
`utils.py` contains helper functions that are used throughout. `test_discriminative.py` has weak classifiers which we use to measure problem difficulty.
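As an illustration of how a weak classifier can gauge problem difficulty, the sketch below scores how separable two text groups are with a TF-IDF plus logistic regression pipeline; this particular model choice is an assumption, not necessarily what `test_discriminative.py` uses:

```python
# Hedged sketch: measure difficulty as the cross-validated accuracy of a
# simple classifier separating corpus A from corpus B. Model choice is an
# assumption, not the exact classifiers in test_discriminative.py.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def discriminative_score(texts_a, texts_b):
    """Return mean cross-validated accuracy of separating corpus A from B."""
    texts = list(texts_a) + list(texts_b)
    labels = [0] * len(texts_a) + [1] * len(texts_b)
    clf = make_pipeline(
        TfidfVectorizer(max_features=5000),
        LogisticRegression(max_iter=1000),
    )
    # Higher accuracy suggests the two corpora are easier to tell apart.
    return cross_val_score(clf, texts, labels, cv=5).mean()
```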