Adding WNUT_2020_NER dataset support #1942
@aynetdia thanks for adding this! There is an import statement missing - can you add it? Also, I notice the statistics of the corpus are slightly off. If I print the Corpus, I get 8075 train + 2740 dev + 2691 test sentences, but the task website at https://github.com/jeniyat/WNUT_2020_NER lists the following statistics:
Could you double-check?
I was able to fix the corpus length problem, but only for train_data and test_data, which now contain 8444 and 2813 sentences respectively. The dev_data contains 2862 sentences instead of the 2839 specified in the GitHub repo. I'm not sure what the reason for the discrepancy is - I went over the whole dev sample manually and couldn't find any obvious irregularities. Edit: typo
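For anyone who wants to cross-check counts like these independently of the loader, here is a minimal sketch of a sentence counter for column-formatted (CoNLL-style) text. It assumes sentences are separated by blank lines, which may not match how every file in this dataset is laid out. The helper name `count_sentences` and the sample rows are made up for illustration and are not part of this PR or of flair.

```python
def count_sentences(lines):
    """Count sentences in column-formatted (CoNLL-style) data.

    Assumption: a sentence is a maximal run of non-blank lines.
    Whitespace-only lines count as separators, and several separators
    in a row do not create empty sentences.
    """
    count = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            # First token line after a separator starts a new sentence.
            if not in_sentence:
                count += 1
                in_sentence = True
        else:
            in_sentence = False
    return count


# Made-up sample with a whitespace-only line and a doubled blank line.
sample = "Mix O\nthe O\n\n \nIncubate B-Action\n\n\nStore B-Action\n"
n_sentences = count_sentences(sample.splitlines())
```

Comparing such a count with one that treats every blank line as a sentence boundary can show whether stray separators account for a difference, though it does not by itself explain the dev-set numbers above.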
@aynetdia thanks for updating the PR! Just one more item: There is a folder for the 2020 version of the test data (called |
@aynetdia thanks for adding this!
This implementation appropriately prepares and loads the aforementioned dataset.
I could also commit the test function for the pipenv tests, if needed.
Edit: typo