Add support for new ICDAR Europeana NER Dataset #2911

stefan-it · 2022-08-17T14:17:01Z

Hi,

this PR adds support for our recently released ICDAR Europeana NER dataset.

The dataset is based on the French and Dutch parts of the Europeana Newspapers NER dataset. We performed further preprocessing steps, such as sentence splitting, punctuation normalization and introducing training, development and test splits.
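To illustrate what these preprocessing steps look like in practice, here is a minimal sketch. It is not the pipeline used to build the released dataset: the function names, the punctuation mapping, the sentence-end rule and the split ratios are all assumptions chosen for illustration, and the input is assumed to be a flat list of (token, tag) pairs.

```python
import random

# Hypothetical mapping from typographic punctuation to ASCII equivalents.
PUNCT_MAP = {
    "\u201e": '"', "\u201c": '"', "\u201d": '"',  # low and curly double quotes
    "\u00ab": '"', "\u00bb": '"',                 # guillemets
    "\u2018": "'", "\u2019": "'",                 # curly single quotes
    "\u2013": "-", "\u2014": "-",                 # en and em dashes
}

# Assumed rule: a sentence ends at a token that is exactly one of these.
SENTENCE_END = {".", "!", "?"}


def normalize_token(token):
    """Replace each mapped punctuation character, leave the rest untouched."""
    return "".join(PUNCT_MAP.get(ch, ch) for ch in token)


def split_sentences(tagged_tokens):
    """Group a flat (token, tag) stream into sentences, normalizing tokens."""
    sentences, current = [], []
    for token, tag in tagged_tokens:
        current.append((normalize_token(token), tag))
        if token in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence without final punctuation
        sentences.append(current)
    return sentences


def make_splits(sentences, dev_ratio=0.1, test_ratio=0.1, seed=42):
    """Shuffle deterministically and cut into train, dev and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_ratio)
    n_test = int(len(shuffled) * test_ratio)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

A real pipeline for OCR'd newspaper text would need a more robust sentence splitter, since abbreviations and OCR noise break the simple end-of-sentence rule used here.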

The dataset is released with our "Data Centric Domain Adaptation for Historical Text with OCR Errors" ICDAR 2021 paper and can be found in our repo here.

Usage

The dataset can be used in Flair like this:

from flair.datasets import NER_ICDAR_EUROPEANA

french_corpus = NER_ICDAR_EUROPEANA(language="fr")
dutch_corpus  = NER_ICDAR_EUROPEANA(language="nl")

alanakbik · 2022-08-18T08:53:44Z

@stefan-it thanks for adding this!

stefan-it added 4 commits August 17, 2022 16:10

datasets: add support for ICDAR Europeana NER dataset

9e2d628

datasets: global import of new NER_ICDAR_EUROPEANA dataset

2f26e86

tests: add some testcases for tests/test_datasets.py dataset

150eeae

datasets: add missing module import for NER_ICDAR_EUROPEANA

20eaf90

alanakbik merged commit 493b61f into master Aug 18, 2022

alanakbik deleted the add-ner-icdar-europeana-dataset branch August 18, 2022 08:53
