Adding WNUT_2020_NER dataset support #1942

Merged: 5 commits, Nov 9, 2020


aynetdia
Collaborator

@aynetdia aynetdia commented Nov 5, 2020

This implementation prepares and loads the WNUT_2020_NER dataset.

I could also commit the test function for the pipenv tests, if needed.

Edit: typo

@alanakbik
Collaborator

@aynetdia thanks for adding this! There is an import statement missing - can you add it?

Also, I notice the statistics of the corpus are slightly off. If I print the WNUT_2020_NER object, I get:

Corpus: 8075 train + 2740 dev + 2691 test sentences

But the task website at https://github.com/jeniyat/WNUT_2020_NER lists the following statistics:

  • train_data: 370 protocols with 8444 sentences
  • dev_data: 122 protocols with 2839 sentences
  • test_data: 123 protocols with 2813 sentences

Could you double-check?
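
One way to double-check numbers like these is to count sentences directly in the raw files, independent of the loader. The following is a minimal sketch, assuming the data is in a CoNLL-style column format where blank lines separate sentences. The helper name `count_sentences` and the sample text are hypothetical and are not part of Flair or the dataset repository.

```python
from typing import Iterable


def count_sentences(lines: Iterable[str]) -> int:
    """Count blank-line-separated sentences in CoNLL-style column data."""
    count = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            # The first token line after a separator starts a new sentence.
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            # A blank line (or a run of blank lines) closes the current sentence.
            in_sentence = False
    return count


# Hypothetical sample: two sentences separated by two blank lines.
sample = "Add O\n5 B-Amount\nml I-Amount\n\n\nMix B-Action\nwell O\n"
print(count_sentences(sample.splitlines()))  # prints 2
```

Counting this way treats consecutive blank lines as a single separator, so a mismatch with the loader's count may point at how empty or whitespace-only lines are handled.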

@aynetdia
Collaborator Author

aynetdia commented Nov 6, 2020

I was able to fix the corpus length problem, but only for train_data and test_data. They now contain 8444 and 2813 sentences, respectively. The dev_data still contains 2862 sentences instead of the 2839 specified in the GitHub repo.

Not sure what the reason for the discrepancy is. I went over the whole dev sample manually and couldn't find any obvious irregularities.
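
For reference, the remaining gap can be made explicit with a small comparison. This is only a sketch using the numbers quoted in this thread; the function name `split_differences` is hypothetical.

```python
from typing import Dict

# Sentence counts listed in the task repository.
reported = {"train": 8444, "dev": 2839, "test": 2813}
# Sentence counts observed after the fix described above.
loaded = {"train": 8444, "dev": 2862, "test": 2813}


def split_differences(loaded: Dict[str, int], reported: Dict[str, int]) -> Dict[str, int]:
    """Return loaded minus reported for every split whose counts differ."""
    return {
        name: loaded[name] - reported[name]
        for name in reported
        if loaded[name] != reported[name]
    }


print(split_differences(loaded, reported))  # prints {'dev': 23}
```

So only the dev split is off, by 23 extra sentences.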

Edit: typo

@alanakbik
Collaborator

@aynetdia thanks for updating the PR!

Just one more item: There is a folder for the 2020 version of the test data (called test_data_2020), which contains a lot more sentences. Their paper (see Table 1) distinguishes between the Test-18 and Test-20 datasets. Test-20 is what was used for evaluation and has 3562 sentences. Can you change it so that test_data_2020 is used instead of test_data?

@alanakbik
Collaborator

@aynetdia thanks for adding this!

@alanakbik alanakbik merged commit b1efb77 into flairNLP:master Nov 9, 2020