Adding WNUT_2020_NER dataset support #1942

aynetdia · 2020-11-05T17:21:40Z

This implementation appropriately prepares and loads the aforementioned dataset.

I could also commit the test function for the pipenv test run, if needed.

Edit: typo

flair/datasets/sequence_labeling.py

alanakbik · 2020-11-06T05:19:44Z

@aynetdia thanks for adding this! There is an import statement missing - can you add it?

Also, I notice the statistics of the corpus are slightly off. If I print the WNUT_2020_NER object, I get:

Corpus: 8075 train + 2740 dev + 2691 test sentences

But the task website at https://github.com/jeniyat/WNUT_2020_NER lists the following statistics:

train_data: 370 protocols with 8444 sentences
dev_data: 122 protocols with 2839 sentences
test_data: 123 protocols with 2813 sentences

Could you double-check?
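
For reference, the size of the gap per split can be worked out from the two sets of numbers quoted above. This is only a sketch: the variable names are made up for illustration, and the figures are copied from this comment rather than recomputed from the data.

```python
# Sentence counts printed by the WNUT_2020_NER corpus object (quoted above).
loaded = {"train": 8075, "dev": 2740, "test": 2691}

# Sentence counts listed on the task website (quoted above).
reported = {"train": 8444, "dev": 2839, "test": 2813}

# Sentences missing from each split: reported minus loaded.
missing = {split: reported[split] - loaded[split] for split in reported}
print(missing)  # {'train': 369, 'dev': 99, 'test': 122}
```

All three splits come up short, so the problem is not specific to one file.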

aynetdia · 2020-11-06T09:56:14Z

I was able to fix the corpus length problem, however only for the train_data and test_data. Now they contain 8444 and 2813 sentences respectively. The dev_data contains 2862 sentences instead of 2839 as specified in the GitHub repo.

Not sure what the reason for the discrepancy is - I went over the whole dev sample manually and couldn't notice any obvious irregularities.
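
One way to cross-check a count like this independently of the loader is to count sentences straight from the raw text. The sketch below assumes the data is in a CoNLL-style column format with blank lines between sentences; that layout, the sample tokens and labels, and both function names are assumptions for illustration, not taken from the dataset or from flair. It shows how two counting conventions can disagree on the same input, which is one possible source of small differences.

```python
def count_sentences(lines):
    """Count runs of consecutive non-blank lines (one run = one sentence)."""
    count = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            if not in_sentence:
                count += 1  # first token line of a new sentence
                in_sentence = True
        else:
            in_sentence = False  # a blank line ends the current sentence
    return count


def count_blank_lines(lines):
    """Naive alternative: treat every blank line as a sentence boundary."""
    return sum(1 for line in lines if not line.strip())


# Hypothetical sample with a doubled separator and a trailing blank line.
sample = [
    "Add\tB-Action",
    "water\tB-Reagent",
    "",
    "",
    "Mix\tB-Action",
    "well\tO",
    "",
]

print(count_sentences(sample))    # 2
print(count_blank_lines(sample))  # 3
```

If the two counts differ on the dev files, doubled or stray separator lines would be worth looking at.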

Edit: typo

alanakbik · 2020-11-09T10:23:55Z

@aynetdia thanks for updating the PR!

Just one more item: There is a folder for the 2020 version of the test data (called test_data_2020), which contains a lot more sentences. Their paper (see Table 1) distinguishes between the Test-18 and Test-20 datasets. Test-20 is what was used for evaluation and has 3562 sentences. Can you change it so that test_data_2020 is used instead of test_data?

alanakbik · 2020-11-09T19:57:01Z

@aynetdia thanks for adding this!

aynetdia added 2 commits November 5, 2020 18:12

Add WNUT_2020_NER dataset support

f298119

Update Tutorial 6

168122a

alanakbik requested changes Nov 6, 2020

flair/datasets/sequence_labeling.py

Added missing import and fixed the corpus length

fcc1dcb

Merge branch 'master' into wnut_2020_ner

1f77402

alanakbik approved these changes Nov 9, 2020

Changing to the 2020 version of the test data

a0935fb

alanakbik merged commit b1efb77 into flairNLP:master Nov 9, 2020
