Integrate BigBio NER data sets into HunFlair #3146

mariosaenger · 2023-03-15T14:46:19Z

This PR implements an adapter to integrate biomedical named entity recognition data sets provided by the BigScience biomedical initiative (also known as BigBio):

https://github.com/bigscience-workshop/biomedical

BigBio is an open library of biomedical data loaders built on Hugging Face's datasets library. It provides programmatic, harmonized access to over 120 biomedical datasets. The adapter covers the library's named entity recognition data sets, so users can easily work with these corpora in HunFlair (e.g. for model training or evaluation).
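To illustrate what such an adapter has to do at its core, here is a minimal sketch that turns character-offset entity annotations into token-level BIO tags, the kind of column data a CoNLL file holds. The record layout, the function names, and the whitespace tokenizer are assumptions made for this example; they are not the PR's implementation or the BigBio schema, and no data set is actually downloaded.

```python
from typing import List, Tuple

# Hypothetical, simplified annotation: (start offset, end offset, entity type).
# This layout is an assumption for illustration only.
Entity = Tuple[int, int, str]


def whitespace_tokens(text: str) -> List[Tuple[int, int, str]]:
    """Split on whitespace and keep each token's character offsets."""
    tokens = []
    position = 0
    for piece in text.split():
        start = text.index(piece, position)
        end = start + len(piece)
        tokens.append((start, end, piece))
        position = end
    return tokens


def to_conll(text: str, entities: List[Entity]) -> List[Tuple[str, str]]:
    """Assign a BIO tag to every token based on overlapping entity spans."""
    tagged = []
    for tok_start, tok_end, token in whitespace_tokens(text):
        tag = "O"
        for ent_start, ent_end, ent_type in entities:
            # A token belongs to an entity if the two character ranges overlap.
            if tok_start < ent_end and tok_end > ent_start:
                prefix = "B" if tok_start <= ent_start else "I"
                tag = f"{prefix}-{ent_type}"
                break
        tagged.append((token, tag))
    return tagged


text = "BRCA1 mutations increase breast cancer risk"
entities = [(0, 5, "Gene"), (25, 38, "Disease")]
conll_rows = to_conll(text, entities)
for token, tag in conll_rows:
    print(f"{token}\t{tag}")
```

A real adapter would additionally need a proper tokenizer and sentence splitter, and would have to decide how to handle nested or discontinuous entities, which this sketch ignores.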

alanakbik · 2023-03-15T14:48:53Z

@mariosaenger thanks for adding this! Does this mean that the "old" HUNER dataset classes (like HUNER_CELL_LINE) can be removed?

mariosaenger · 2023-03-15T14:54:07Z

Hi @alanakbik! No, the "old" data sets are still needed and used. In HUNER, in addition to the more technical harmonisation (e.g. a common data format), we also standardise data sets on a content-related / semantic level (e.g. by unifying different entity type labels).
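As a rough illustration of the semantic standardisation mentioned here, the sketch below maps corpus-specific entity labels onto one unified label set. The mapping table, the label names, and the choice to drop unmapped types are all hypothetical; they are not taken from HUNER.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from corpus-specific labels to unified entity types.
# The label names are illustrative and not taken from HUNER.
LABEL_MAP: Dict[str, str] = {
    "protein": "Gene",
    "DNA": "Gene",
    "cell_line": "CellLine",
}


def standardise(
    rows: List[Tuple[str, str]], label_map: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Rewrite BIO tags to the unified label set; unmapped types become 'O'."""
    result = []
    for token, tag in rows:
        if tag == "O":
            result.append((token, "O"))
            continue
        # Split "B-protein" into the BIO prefix and the corpus-specific type.
        prefix, _, raw_type = tag.partition("-")
        unified = label_map.get(raw_type)
        result.append((token, f"{prefix}-{unified}" if unified else "O"))
    return result


rows = [
    ("IL-2", "B-protein"),
    ("gene", "I-protein"),
    ("in", "O"),
    ("HeLa", "B-cell_line"),
    ("at", "O"),
    ("37C", "B-temperature"),
]
standardised = standardise(rows, LABEL_MAP)
```

A purely technical harmonisation would leave the labels `protein` and `cell_line` as they are, so models trained on one corpus could not be evaluated directly on another.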

alanakbik · 2023-03-15T14:55:47Z

Ah I see - but I guess the non-standardized corpora like JNLPBA can be replaced with the BIGBIO version, or are they also still needed?

mariosaenger · 2023-03-15T15:04:17Z

Good point. We have to check these corpora.

mariosaenger · 2023-03-23T15:07:04Z

Hej @alanakbik! We discussed deleting the data sets in our developer group. We would be rather reluctant to do this, as it would break existing implementations that reference these data sets. Furthermore, the BigBio datasets are (unfortunately) sometimes still of mixed quality, whereas our own data set implementations are more reliable by comparison.

However, if you insist on deleting these data sets: could we implement this in a separate PR and mark the data sets as deprecated first, since deleting the data sets would result in massive code changes?

alanakbik · 2023-03-24T06:34:41Z

No worries, we can keep the old classes in this case.

marctorsoc · 2023-03-30T15:42:54Z

@alanakbik after merging this, or some other PR, could we release a new version?

a) README says we're in 0.12.2, but the last one I can see is 0.12.1 :)
b) I'm eager to get hugging-face-hub unpinned (#3149) to unpin a bunch of deps in my repos

If that's not possible, when do you expect to release a new version? Is there a calendar?

alanakbik · 2023-03-30T15:46:30Z

It was just released on pip!

marctorsoc · 2023-03-30T16:36:46Z

amazing! thanks a ton

mariosaenger · 2023-04-03T12:24:50Z

Hej @alanakbik! I added a deprecated tag to all data sets that are available in BigBio. Are there any other things that need to be changed before the implementation goes into main?
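How the tag is implemented is not shown in this thread. As a generic sketch of the idea, a class decorator can emit a `DeprecationWarning` whenever a legacy data set class is instantiated, while the class itself keeps working. The decorator and the `OldCorpus` class below are hypothetical and use only the standard library.

```python
import functools
import warnings


def deprecated(reason: str):
    """Class decorator that warns whenever the decorated class is instantiated."""

    def decorator(cls):
        original_init = cls.__init__

        @functools.wraps(original_init)
        def wrapped_init(self, *args, **kwargs):
            warnings.warn(
                f"{cls.__name__} is deprecated: {reason}",
                category=DeprecationWarning,
                stacklevel=2,  # point the warning at the caller, not this wrapper
            )
            original_init(self, *args, **kwargs)

        cls.__init__ = wrapped_init
        return cls

    return decorator


@deprecated("use the BigBio-based corpus instead")
class OldCorpus:  # hypothetical stand-in for a legacy data set class
    def __init__(self, name: str = "old"):
        self.name = name


# Record warnings so the effect of the decorator can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    corpus = OldCorpus()
```

Existing code that references the old class continues to run; users only see a warning that points them to the replacement.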

alanakbik · 2023-04-11T21:57:29Z

Hello @mariosaenger, everything mostly looks good. However, the handling of different sentence splitters is suboptimal with the new classes: in the old HUNER classes, we appended the sentence splitter name to the generated files, which made it possible to switch sentence splitters.

Here is an illustration:

corpus = HUNER_GENE_CELL_FINDER()
print(corpus)

corpus = HUNER_GENE_CELL_FINDER(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

This prints two different corpus sizes, as the first corpus is loaded using the default SciSpaCy sentence splitter, and the second with a different splitter.

However, when doing this with the new classes:

corpus = HUNER_GENE_TMVAR_V3()
print(corpus)

corpus = HUNER_GENE_TMVAR_V3(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

The same corpus gets loaded twice. For the second corpus, the segtok sentence splitter is not applied.

Can you fix this? Easiest would probably be to use the same solution as for the old classes.

mariosaenger · 2023-04-12T07:36:54Z

@alanakbik I will have a look at it

mariosaenger · 2023-04-12T14:36:44Z

@alanakbik Fixed this issue. Now the new data sets work as expected 😉

2023-04-12 16:28:29,421 Reading data from /home/mario/.flair/datasets/huner_gene_tmvar_v3/SciSpacySentenceSplitter_core_sci_sm_0.2.5_SciSpacyTokenizer_core_sci_sm_0.2.5
2023-04-12 16:28:29,422 Train: /home/mario/.flair/datasets/huner_gene_tmvar_v3/SciSpacySentenceSplitter_core_sci_sm_0.2.5_SciSpacyTokenizer_core_sci_sm_0.2.5/train.conll
2023-04-12 16:28:29,422 Dev: None
2023-04-12 16:28:29,422 Test: None
Corpus: 4454 train + 495 dev + 550 test sentences
2023-04-12 16:28:31,072 Reading data from /home/mario/.flair/datasets/huner_gene_tmvar_v3/SegtokSentenceSplitter
2023-04-12 16:28:31,072 Train: /home/mario/.flair/datasets/huner_gene_tmvar_v3/SegtokSentenceSplitter/train.conll
2023-04-12 16:28:31,072 Dev: None
2023-04-12 16:28:31,072 Test: None
Corpus: 4364 train + 485 dev + 539 test sentences
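The log above shows one cache directory per sentence splitter configuration. A minimal sketch of that idea follows, assuming each splitter exposes a `name` that encodes its configuration. The splitter classes, the `conll_dir` helper, and the cache root are hypothetical stand-ins, not Flair's actual classes or the code from this PR.

```python
from pathlib import Path


class SegtokLikeSplitter:  # hypothetical stand-in for a rule-based splitter
    @property
    def name(self) -> str:
        return "SegtokSentenceSplitter"


class ModelBasedSplitter:  # hypothetical stand-in for a model-based splitter
    def __init__(self, model: str, version: str):
        self.model = model
        self.version = version

    @property
    def name(self) -> str:
        # Encode the configuration so different models get different caches.
        return f"SciSpacySentenceSplitter_{self.model}_{self.version}"


def conll_dir(cache_root: Path, dataset_name: str, splitter) -> Path:
    """Return the directory holding the CoNLL files for this splitter."""
    return cache_root / dataset_name.lower() / splitter.name


cache_root = Path("datasets")  # assumed cache root, for illustration only
default_dir = conll_dir(
    cache_root, "HUNER_GENE_TMVAR_V3", ModelBasedSplitter("core_sci_sm", "0.2.5")
)
segtok_dir = conll_dir(cache_root, "HUNER_GENE_TMVAR_V3", SegtokLikeSplitter())
print(default_dir / "train.conll")
print(segtok_dir / "train.conll")
```

Because the directory name depends on the splitter, switching splitters triggers a fresh conversion instead of silently reusing files that were generated with a different sentence segmentation.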

alanakbik · 2023-04-13T10:02:28Z

@mariosaenger thanks for adding this!

Mario Sänger and others added 9 commits December 15, 2022 17:58

Adapt first version of BigBio adapter implementation

6174381

Bug fix: only write dev/val split if it exists

20821ec

Merge branch 'master' into bigbio-integration

0108e6d

Added BigBio adapter classes for new datasets of Hunflair v2

35c9330

merged current master (07/02/23) into branch bigbio integration

d9a7f46

merged current master (07/02/23) into branch bigbio integration

c857602

added new BigBio dataset to HunFlair

804247b

finished BigBio integration for HunFlair v2

72a7ddf

Remove debugging code + fix local data set paths

9ed9726

Fix typo

ebaa95f

Add deprecated tag to data sets that are also available in BigBio

e134ee3

alanakbik added 2 commits April 11, 2023 22:48

Fix formatting

8c3b16d

Fix typing problems

abd4fc0

Revise BIGBIO_NER_CORPUS initialization: store conll files in separate directories per sentence splitter (configuration)

a503c5b

alanakbik merged commit dca69ab into master Apr 13, 2023

alanakbik deleted the bigbio-integration branch April 13, 2023 10:02
