Add support for new ICDAR Europeana NER Dataset #2911

Merged 4 commits on Aug 18, 2022

@stefan-it commented on Aug 17, 2022

Hi,

this PR adds support for our recently released ICDAR Europeana NER dataset.

The dataset is based on the French and Dutch parts of the Europeana Newspapers NER dataset. We performed further preprocessing steps, such as sentence splitting and punctuation normalization, and introduced training, development and test splits.

The dataset is released with our ICDAR 2021 paper "Data Centric Domain Adaptation for Historical Text with OCR Errors" and can be found in our repo here.

Usage

The dataset can be used in Flair as follows:

from flair.datasets import NER_ICDAR_EUROPEANA

french_corpus = NER_ICDAR_EUROPEANA(language="fr")
dutch_corpus  = NER_ICDAR_EUROPEANA(language="nl")
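
For callers that take the language code from user input or a config file, a small wrapper can reject unsupported codes before Flair is touched. This is only a sketch: `load_icdar_europeana`, `print_summary` and `SUPPORTED_LANGUAGES` are hypothetical helper names that are not part of this PR, and the language list simply mirrors the two codes shown above.

```python
# Hypothetical helpers (not part of this PR). The only Flair names used are
# NER_ICDAR_EUROPEANA and the generic Corpus method make_label_dictionary.
SUPPORTED_LANGUAGES = ("fr", "nl")  # the two languages named in this PR


def load_icdar_europeana(language: str):
    """Return the Flair corpus for one of the supported language codes."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language {language!r}; expected one of {SUPPORTED_LANGUAGES}"
        )
    # Imported lazily so the argument check above works without Flair installed.
    from flair.datasets import NER_ICDAR_EUROPEANA

    return NER_ICDAR_EUROPEANA(language=language)


def print_summary(language: str) -> None:
    """Load a corpus and print its split sizes and NER label dictionary."""
    corpus = load_icdar_europeana(language)
    print(corpus)  # Flair prints the number of train/dev/test sentences
    print(corpus.make_label_dictionary(label_type="ner"))
```

Calling `print_summary("fr")` downloads the French data on first use, so it needs network access and an installed Flair version that includes this PR.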

@alanakbik

@stefan-it thanks for adding this!

@alanakbik merged commit 493b61f into master on Aug 18, 2022
@alanakbik deleted the add-ner-icdar-europeana-dataset branch on August 18, 2022 at 08:53