Add support for HIPE 2022 #2675
@stefan-it awesome, thanks a lot for adding this! The unit tests fail due to two minor flake8 errors: an unused import, and a membership test that should use 'not in'. |
Thanks for these hints @alanakbik ! I added the missing import to the |
Awesome @stefan-it finally we have the ALIEN class in Flair :D |
@stefan-it thanks again for this. A few questions:
Here's my script to check if all annotations are there:

    from flair.datasets import NER_HIPE_2022

    for config in [
        # ("ajmc", "de"), no training split
        ("hipe2020", "de"),
        ("letemps", "fr"),
        ("newseye", "de"),
        # ("sonar", "de"), no training split
        ("topres19th", "en"),
    ]:
        print("\n\n ---- " + str(config))
        corpus = NER_HIPE_2022(
            dataset_name=config[0],
            language=config[1],
            add_document_separator=True,
            in_memory=False,
        )
        print(corpus)
        print(corpus.make_label_dictionary('ner'))

|
Hi @alanakbik
From the original NewsEye dataset:
HIPE 2022:
For HIPE 2022: the original end of paragraph (a newline between the sentences) in the original NewsEye dataset is removed, and "EndOfSentence" is added. |
@stefan-it thanks for the infos and the PR! |
Just a comment on the motivation for having dev and dev2 in the newseye data. NewsEye already published a public test set, for which people might (will) have published results. But NewsEye reserved a second, currently still private test set for HIPE 2022. In order not to confuse participants of the HIPE shared task, we felt it would be better not to call the published data set "testset". Additionally, we still wanted people to be able to evaluate on the "published newseye train/test/dev splits" even if they use the HIPE 2022 data packages. |
Another comment: EndOfSentence in MISC is now used for all datasets where we have relatively good automatic or manual sentence splitting. EndOfLine in MISC refers to layout information as before. The ajmc dataset currently consists of just a sample. Tomorrow, a proper train/dev split will be available. |
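To illustrate how the EndOfSentence flag could be consumed, here is a minimal sketch. It is not the Flair implementation: it assumes each token has already been reduced to a `(token, misc)` pair and that the flags in MISC are separated by `|`; `split_sentences` is a hypothetical helper name.

```python
def split_sentences(rows):
    """Group tokens into sentences using the EndOfSentence flag.

    `rows` is an iterable of (token, misc) pairs. Assumption: `misc` holds
    '|'-separated flags such as "EndOfLine" or "EndOfSentence".
    """
    sentences, current = [], []
    for token, misc in rows:
        current.append(token)
        if "EndOfSentence" in misc.split("|"):
            # The flag marks the last token of a sentence.
            sentences.append(current)
            current = []
    if current:
        # Keep trailing tokens that were never closed by a flag.
        sentences.append(current)
    return sentences
```

EndOfLine is deliberately ignored here, since it only carries layout information.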
@simon-clematide thanks for the info! @stefan-it does the class have to be adapted to make use of this information? |
Whenever there's a new version out, I will update split information and test cases 🤗 |
Hi @simon-clematide do you by any chance plan to perform de-hyphenation in the upstream data as well? If not, I'm going to add a flag to enable de-hyphenation. As far as I can tell this is only needed for NewsEye:
HIPE-2020:
|
Hi,
this PR adds support for the recently released HIPE 2022 Shared Tasks NER datasets.
HIPE 2022 is a lot more challenging than the previous CLEF-HIPE 2020 dataset, because it covers more datasets, more languages and different label sets.
The currently released v1.0 version includes the following datasets, languages and entity types:
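| Dataset | Languages | Entity types |
| --- | --- | --- |
| `ajmc` | `de`, `en` | `pers`, `work`, `loc`, `object`, `date`, `scope` |
| `hipe2020` | `de`, `en`, `fr` | `pers`, `org`, `prod`, `time`, `loc` |
| `letemps` | `fr` | `pers`, `loc` |
| `newseye` | `de`, `fi`, `fr`, `sv` | `PER`, `LOC`, `ORG`, `HumanProd` |
| `sonar` | `de` | `PER`, `LOC`, `ORG` |
| `topres19th` | `en` | `LOC`, `BUILDING`, `STREET` (not used: `ALIEN`, `OTHER`, `FICTION`) |

More details can be found in the dataset documentation in the HIPE 2022 repo.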
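The listing above can also be kept as a small lookup structure, which makes it easy to reject unsupported dataset/language pairs before any download starts. The sketch below is illustrative only: `HIPE_2022_V1` and `check_config` are hypothetical names, not part of Flair, and the values are simply transcribed from the v1.0 listing.

```python
# Hypothetical lookup, transcribed from the v1.0 listing above (not a Flair API).
HIPE_2022_V1 = {
    "ajmc": {"languages": ["de", "en"],
             "labels": ["pers", "work", "loc", "object", "date", "scope"]},
    "hipe2020": {"languages": ["de", "en", "fr"],
                 "labels": ["pers", "org", "prod", "time", "loc"]},
    "letemps": {"languages": ["fr"], "labels": ["pers", "loc"]},
    "newseye": {"languages": ["de", "fi", "fr", "sv"],
                "labels": ["PER", "LOC", "ORG", "HumanProd"]},
    "sonar": {"languages": ["de"], "labels": ["PER", "LOC", "ORG"]},
    "topres19th": {"languages": ["en"],
                   "labels": ["LOC", "BUILDING", "STREET"]},
}


def check_config(dataset_name: str, language: str) -> None:
    """Raise ValueError if the pair is not in the v1.0 listing."""
    if dataset_name not in HIPE_2022_V1:
        raise ValueError(f"unknown dataset: {dataset_name!r}")
    if language not in HIPE_2022_V1[dataset_name]["languages"]:
        raise ValueError(f"{dataset_name!r} has no {language!r} part in v1.0")


check_config("newseye", "fi")  # a listed pair passes silently
```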
In the current form, there's no "opinionated" corpus implementation. That means no special normalization (e.g. de-hyphenation) is done: all datasets are just processed as accurately as possible.
For example, the Finnish part of the `newseye` dataset is loaded with `dataset_name="newseye"` and `language="fi"`.
As HIPE 2022 is part of an ongoing Shared Task, there's no test data available in the current v1.0 release. It will be added in future versions of HIPE 2022.
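Loading the Finnish part of `newseye` can be sketched as follows. `NER_HIPE_2022` and its `dataset_name` and `language` arguments come from this PR; the wrapper `load_newseye_finnish` is a hypothetical helper that only exists to keep the Flair import lazy.

```python
def load_newseye_finnish():
    """Load the Finnish part of newseye (data is fetched on first use)."""
    # Imported lazily so that defining this helper does not require Flair.
    from flair.datasets import NER_HIPE_2022

    return NER_HIPE_2022(dataset_name="newseye", language="fi")


# Usage (needs Flair installed and network access):
# corpus = load_newseye_finnish()
# print(corpus)
```

Since v1.0 ships no test data, the resulting corpus is expected to contain only training and development splits.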
Caveats
Some datasets come with no training split, such as `ajmc`, the English part of `hipe2020`, and `sonar`.
The `newseye` datasets are special because they come with two different development sets: `dev` and `dev2`. For this reason, the `dev_split_name` argument can be used to control which development split should be used.
Example
The NER datasets in HIPE 2022 can be used with the `NER_HIPE_2022` implementation. Only two arguments are necessary to initialize a corpus:

- the dataset name (`dataset_name`)
- the language (`language`)

The `dev_split_name` argument is described in the previous section. Another useful option is `add_document_separator`: it adds special `-DOCSTART-` sentences to the corpus to mark document boundaries. These document boundaries are very helpful when using the previously added FLERT approach. To use it, initialize the dataset with `add_document_separator=True`.
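A minimal sketch of such an initialization, combining the document separator with the development split selection from the Caveats section. The wrapper `load_with_document_separators` is a hypothetical helper, and passing `"dev2"` as the value of `dev_split_name` is an assumption based on the split names given above.

```python
def load_with_document_separators(dataset_name, language, dev_split_name="dev"):
    """Load a HIPE 2022 corpus with -DOCSTART- sentences between documents."""
    # Imported lazily so that defining this helper does not require Flair.
    from flair.datasets import NER_HIPE_2022

    return NER_HIPE_2022(
        dataset_name=dataset_name,
        language=language,
        dev_split_name=dev_split_name,  # assumed values: "dev" or "dev2"
        add_document_separator=True,    # marks document boundaries for FLERT
    )


# Usage (needs Flair installed and network access):
# corpus = load_with_document_separators("newseye", "de", dev_split_name="dev2")
```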