kermitt2/dataset_recognition_resources

Dataset recognition resources

Original resources

Resources and install path of the resources

  • Dataseer corpus (dataseer/), biomedicine domain, focusing on the identification of data sentences, with annotations of implicit/explicit data mentions, data types and data acquisition devices (but no annotation of explicit dataset names), non-public
  • https://github.com/xjaeh/ner_dataset_recognition (ner_dataset_recognition/), IR/ML/NLP domain, only explicitly named and reused datasets
  • https://www.kaggle.com/datasets/panhuitong/dmdd-corpus (dmdd/) is close to the previous one (Heddes et al., 2021): same IR/ML/NLP domain, only explicitly named and reused datasets, 450 manually annotated articles, but false negatives were not manually corrected
  • oddpub dataset https://osf.io/yv5rx/ (oddpub-dataset/), biomedicine domain, only article screening (no annotation), only datasets with open access statements, only explicit datasets
  • transparency-indicators dataset https://osf.io/e58ws/ (transparency-indicators-dataset/), biomedicine domain, only article screening (no annotation)
  • Coleridge corpus (coleridge/), partial (only a very small subset of named "datasets" is considered), no explicit annotation, no valid definition of datasets (e.g. research initiative names are considered as "datasets")
  • SciREX, a dataset of 438 annotated arXiv documents, ML domain only, with identification of named datasets (label is "Material"), see https://github.com/allenai/SciREX (reported inter-annotator agreement on 5 documents is an average Cohen's κ of 95%). One drawback is that the text is pre-tokenized, which is destructive: the original delimiters are lost, so the original text cannot be reconstructed
  • EneRex (https://github.com/DiscoveryAnalyticsCenter/EneRex) has data sentences and dataset/software annotations (Brat format) for 147 full-text files, but covers only the arXiv computer science domain and only named datasets/software
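
The pre-tokenization drawback noted for SciREX can be illustrated with a small sketch. The tokenizer below is a deliberately naive stand-in (an assumption for illustration, not the tokenizer used by SciREX): two different source strings yield the same token list, so the original text cannot be recovered from tokens alone, whereas keeping character offsets preserves it.

```python
import re

TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Naive tokenizer (illustrative only): word runs and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)

def tokenize_with_offsets(text):
    """Same tokens, but each one keeps its (start, end) character offsets."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_PATTERN.finditer(text)]

original = "trained on ImageNet-1k (ILSVRC 2012)."
spaced = "trained on ImageNet - 1k ( ILSVRC 2012 ) ."

# Both strings collapse to the same token sequence: the delimiters are lost.
same_tokens = tokenize(original) == tokenize(spaced)

# Joining tokens with spaces gives the spaced variant, not the original text.
rejoined = " ".join(tokenize(original))

# With offsets, every token still maps back to its exact place in the source.
offsets_ok = all(original[s:e] == tok for tok, s, e in tokenize_with_offsets(original))
```

Storing offsets into the untouched text, rather than a token list, is what makes it possible to re-tokenize later with a different tokenizer without losing annotations.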

Assemble resources

Survive in the python dependency marshlands:

virtualenv --system-site-packages -p python3.8 env
source env/bin/activate

Install dependencies:

pip3 install -r requirements.txt 

Assemble resources in the same JSON format:

python3 assemble.py --output combined/

This will create under combined/ one JSON file per original corpus, all in the same JSON format using span offsets.
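
As a rough sketch of how span-offset annotations are typically consumed, the example below resolves each annotation back to its surface string by slicing the text. The record layout and the field names (`text`, `annotations`, `start`, `end`, `label`) are assumptions made for illustration; the actual schema written by assemble.py should be checked in the generated files.

```python
import json

# Hypothetical record: these field names are assumptions, not the
# documented output schema of assemble.py.
sample = json.loads("""
{
  "text": "We evaluate our model on the CoNLL-2003 corpus.",
  "annotations": [
    {"start": 29, "end": 39, "label": "dataset_name"}
  ]
}
""")

def resolve_spans(record):
    """Return (surface_string, label) pairs by slicing the text with each span's offsets."""
    text = record["text"]
    resolved = []
    for ann in record["annotations"]:
        start, end = ann["start"], ann["end"]
        # Offsets are treated as character positions, end-exclusive.
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span ({start}, {end}) is outside the text")
        resolved.append((text[start:end], ann["label"]))
    return resolved

print(resolve_spans(sample))
```

A check like this is a cheap way to catch offset drift after any text normalization step.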

Recycled and upcycled resources

  • sentences from https://github.com/xjaeh/ner_dataset_recognition have been reviewed and re-annotated to follow common dataset annotation principles: the annotations now also cover new datasets (not just reused ones), and annotation is at the level of individual datasets (avoiding a single annotation for a conjunction of several datasets). They can be used to train public models for dataset name recognition.

  • sentences from dataseer: labeling of data sentence information. Other annotations are implicit data (which should be complete) and data acquisition devices (incomplete). The corpus is non-public: it can be used for evaluation, but not for training public models (and of course it can't be shared).
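
To make the "annotation at dataset level" principle concrete, here is a minimal sketch that turns one span covering a conjunction of names into one span per dataset. It only illustrates the target annotation shape and is not the procedure used for the re-annotation described above; the function name and the separator rule (commas, "and", "or") are assumptions, and the rule would wrongly split a dataset name that itself contains "and" or "or".

```python
import re

# Separators between coordinated names: commas and the words "and"/"or".
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")

def split_conjunction_span(text, start, end):
    """Split a span covering several coordinated names into one (start, end) span per name."""
    segment = text[start:end]
    spans = []
    pos = 0
    for sep in SEPARATOR.finditer(segment):
        if sep.start() > pos:
            spans.append((start + pos, start + sep.start()))
        pos = sep.end()
    if pos < len(segment):
        spans.append((start + pos, end))
    return spans

text = "We train on MNIST, CIFAR-10 and SVHN."
coarse = "MNIST, CIFAR-10 and SVHN"
coarse_start = text.index(coarse)
coarse_end = coarse_start + len(coarse)

fine_spans = split_conjunction_span(text, coarse_start, coarse_end)
names = [text[s:e] for s, e in fine_spans]
print(names)
```

Each resulting span can then carry its own label, which keeps entity counts and evaluation scores comparable across corpora.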
