Dataset recognition resources

Original resources

Resources and install path of the resources

Dataseer corpus (dataseer/), biomedicine domain: identification of data sentences, annotation of implicit/explicit data mentions, data types, and data acquisition devices (explicit dataset names are not annotated); non-public
https://github.com/xjaeh/ner_dataset_recognition (ner_dataset_recognition/), IR/ML/NLP domain, only explicitly named and reused datasets
https://www.kaggle.com/datasets/panhuitong/dmdd-corpus (dmdd/), close to the previous resource (Heddes et al., 2021): same IR/ML/NLP domain, only explicitly named and reused datasets; 450 manually annotated articles, but false negatives were not manually corrected
oddpub dataset https://osf.io/yv5rx/ (oddpub-dataset/), biomedicine domain, only article screening (no annotation), only datasets with open access statements, only explicit datasets
transparency-indicators dataset https://osf.io/e58ws/ (transparency-indicators-dataset/), biomedicine domain, only article screening (no annotation)
Coleridge corpus (coleridge/): partial (only a very small subset of named "datasets" is considered), no explicit annotation, and no valid definition of a dataset (e.g. research initiative names are treated as "datasets")
SciREX (https://github.com/allenai/SciREX), a dataset of 438 annotated arXiv documents in the ML domain only, with identification of named datasets (labeled "Material"); the reported IAA on 5 documents is an average Cohen's κ of 95%. One drawback is that the text is pre-tokenized, which is destructive: the original delimiters are lost, so the original text cannot be reconstructed
EneRex (https://github.com/DiscoveryAnalyticsCenter/EneRex) provides data sentences and dataset/software annotations (Brat format) for 147 full-text files; however, it covers only the arXiv computer science domain and only named datasets/software
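
The pre-tokenization drawback noted for SciREX can be illustrated with a minimal sketch. The sentence and the token list below are made up for illustration and are not taken from SciREX; the actual tokenizer output may differ. The point is only that a token list does not record the original whitespace, so a naive reconstruction does not match the source text and character offsets no longer line up.

```python
def detokenize(tokens):
    """Naive reconstruction: join tokens with single spaces."""
    return " ".join(tokens)


# Hypothetical example sentence (not from SciREX).
original = "We evaluate on CIFAR-10 (Krizhevsky, 2009)."

# Hypothetical tokenization of that sentence; delimiters are not stored.
tokens = ["We", "evaluate", "on", "CIFAR-10", "(", "Krizhevsky", ",", "2009", ")", "."]

rebuilt = detokenize(tokens)

# The rebuilt string contains spaces that were never in the original,
# so it differs from the source text.
print(rebuilt)
print(rebuilt == original)

# Character offsets of the same word differ between the two strings.
print(original.find("Krizhevsky"), rebuilt.find("Krizhevsky"))
```

Any span-offset annotation therefore has to be re-aligned against the original document, when that document is still available.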

Assemble resources

Survive in the Python dependency marshlands:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install dependencies:

pip3 install -r requirements.txt

Assemble resources in the same JSON format:

python3 assemble.py --output combined/

This will create under combined/ one JSON file per original corpus, all in the same JSON format using span offsets.
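
As a rough sanity check on span offsets, something like the sketch below could be used. The field names (`text`, `annotations`, `start`, `end`, `type`) and the sample record are assumptions made for illustration; the actual schema written by assemble.py is not described here and may differ.

```python
import json


def check_spans(document):
    """Return the annotations whose offsets do not match their surface text.

    Assumes a hypothetical layout: document["text"] holds the raw text and
    each annotation carries "start", "end" and the expected "text".
    """
    mismatches = []
    text = document["text"]
    for ann in document.get("annotations", []):
        if text[ann["start"]:ann["end"]] != ann["text"]:
            mismatches.append(ann)
    return mismatches


# Hypothetical record, not taken from the combined/ output.
sample = json.loads("""
{"text": "We train on the MNIST dataset.",
 "annotations": [
   {"start": 16, "end": 21, "text": "MNIST", "type": "dataset_name"}
 ]}
""")

# An empty list means every span matches its surface text.
print(check_spans(sample))
```

The same function could be applied to each record loaded from a file under combined/, once the field names are adapted to the real format.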

Recycled and upcycled resources

Sentences from https://github.com/xjaeh/ner_dataset_recognition have been reviewed and re-annotated to follow common dataset annotation principles: the annotations now cover new datasets (not just reused ones) and are made at the dataset level (avoiding a single annotation for a conjunction of datasets). They can be used to train public models for dataset name recognition.
Sentences from dataseer: labeling of data sentence information. Other annotations are implicit data (should be complete) and data acquisition devices (incomplete). Non-public: can be used for evaluation, but not for training public models (and of course can't be shared).
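
The dataset-level annotation principle mentioned for the first resource can be sketched as follows. The sentence, the helper name `dataset_spans`, and the tuple layout are hypothetical and serve only to contrast one span per dataset with a single span over a whole conjunction.

```python
def dataset_spans(sentence, names):
    """Return one (start, end, name) span per dataset name found in the sentence."""
    spans = []
    for name in names:
        start = sentence.find(name)
        if start >= 0:
            spans.append((start, start + len(name), name))
    return spans


# Hypothetical sentence mentioning two datasets in a conjunction.
sentence = "We evaluate on MNIST and CIFAR-10."

# Avoided: a single annotation covering the whole conjunction.
conjunction = "MNIST and CIFAR-10"
coarse = dataset_spans(sentence, [conjunction])

# Preferred: one annotation per dataset.
fine = dataset_spans(sentence, ["MNIST", "CIFAR-10"])

print(coarse)
print(fine)
```

With one span per dataset, each mention can be counted and matched against a list of dataset names independently.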
