Releases: dimsum16/dimsum-data

Training/test data + scripts 1.5

28 Dec 21:10

Updates the counts in TAGSET.md to match the training data in the 1.4 release.

Training/test data + scripts 1.4

28 Dec 20:52
  • Data: Revises annotations in the Twitter portions of the training data (see README.md for a description)
  • Scripts: The evaluation script now prints a clearer error message when given malformed input

Training/test data + scripts 1.3

17 Dec 00:40
  • Data: Adds blind test set (see README.md for a description)
  • Scripts: Fixes a couple of bugs in the evaluation script, and updates sst2tags.py to support non-ASCII characters in tokens

Training data + scripts 1.2

09 Nov 21:36
  • Fixes several inconsistencies in the training data, especially in the treatment of auxiliaries and URLs and in the parent index for non-I/i tokens (now uniformly an explicit 0).
  • Adds scripts for evaluation and for conversion to/from a one-sentence-per-line format. The 9-column CoNLLesque format remains the official one for the task.
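
As a rough illustration of the parent-index convention described above, the sketch below reads a 9-column, tab-separated file with blank lines between sentences and flags any token whose tag is not I/i but whose parent index is not an explicit 0. It is not one of the released scripts. The column positions (`MWE_TAG_COL`, `PARENT_COL`) and the helper names are assumptions made for this example; README.md is the authoritative description of the format.

```python
from typing import Iterable, Iterator, List

# ASSUMPTION: 0-based column positions in the 9-column format.
# These are illustrative guesses; consult README.md for the real layout.
MWE_TAG_COL = 4   # MWE tag, e.g. O, B, I, o, b, i
PARENT_COL = 5    # offset of the parent token, 0 if there is none
NUM_COLS = 9


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of 9-field rows.

    Sentences are assumed to be separated by blank lines.
    """
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != NUM_COLS:
            raise ValueError(
                f"expected {NUM_COLS} columns, got {len(fields)}: {line!r}"
            )
        sentence.append(fields)
    if sentence:
        yield sentence


def bad_parent_rows(sentence: List[List[str]]) -> List[List[str]]:
    """Return rows tagged other than I/i whose parent index is not '0'."""
    return [
        row for row in sentence
        if row[MWE_TAG_COL] not in ("I", "i") and row[PARENT_COL] != "0"
    ]
```

To check a file, open it with `encoding="utf-8"` (tokens may contain non-ASCII characters), pass the file object to `read_sentences`, and call `bad_parent_rows` on each sentence. An empty result for every sentence means the file follows the explicit-0 convention under the column assumptions above.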

Training data 1.1

08 Oct 11:40

Previously, lemmas in the Twitter part of the training data were not true lemmas, only lowercased versions of the observed tokens. This release lemmatizes the whole training set consistently.

Training data v1.0

25 Sep 13:33

Training data for the DiMSUM shared task at SemEval 2016. The dataset combines and harmonizes existing corpora annotated for multiword expressions and noun and verb supersenses.