Integrate BigBio NER data sets into HunFlair #3146
@mariosaenger thanks for adding this! Does this mean that the "old" HUNER dataset classes (like |
Hi @alanakbik! No, the "old" data sets are still needed and used. In HUNER, in addition to the more technical harmonisation (e.g. a common data format), we also standardise data sets on a content-related / semantic level (e.g. by unifying different entity type labels). |
Ah I see - but I guess the non-standardized corpora like JNLPBA can be replaced with the BIGBIO version, or are they also still needed? |
Good point. We have to check these corpora. |
Hej @alanakbik! We discussed the deletion of data sets in our developer group. We would be rather reluctant to do this, as it would break existing implementations that reference these data sets. Furthermore, the BigBio data sets are (unfortunately) sometimes still of mixed quality, whereas we are more confident in the quality of our own data set implementations. However, if you insist on deleting these data sets: could we do this in a separate PR and mark the data sets as deprecated first, since deleting them would result in massive code changes? |
No worries, we can keep the old classes in this case. |
@alanakbik after merging this, or some other PR, could we release a new version? The README says we're on 0.12.2, but the last release I can see is 0.12.1 :) If that's not possible, when do you expect to release a new version? Is there a release calendar? |
It was just released on pip! |
amazing! thanks a ton |
Hej @alanakbik! I added a deprecated tag to all data sets that are available in BigBio. Are there any other things that need to be changed before the implementation goes into main? |
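For readers unfamiliar with this kind of tagging, here is a minimal sketch of one way a class can be marked as deprecated using only the standard library. It is not the mechanism used in the PR: the `deprecated` decorator and the `OldCorpus` class below are hypothetical names chosen for illustration.

```python
import functools
import warnings


def deprecated(reason):
    """Hypothetical class decorator: warn whenever the class is instantiated."""

    def decorator(cls):
        original_init = cls.__init__

        @functools.wraps(original_init)
        def wrapped_init(self, *args, **kwargs):
            # stacklevel=2 points the warning at the caller's line
            warnings.warn(
                f"{cls.__name__} is deprecated: {reason}",
                DeprecationWarning,
                stacklevel=2,
            )
            original_init(self, *args, **kwargs)

        cls.__init__ = wrapped_init
        return cls

    return decorator


# Hypothetical stand-in for an older data set class
@deprecated("a BigBio-backed replacement is available")
class OldCorpus:
    def __init__(self, name="old"):
        self.name = name
```

Existing code that instantiates `OldCorpus` keeps working and only receives a warning, which is the point of deprecating before deleting.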
Hello @mariosaenger, everything mostly looks good. The handling of different sentence splitters however is suboptimal with the new classes: In the old Huner classes, we appended the sentence splitter name to the generated files. This made it possible to switch sentence splitters. Here is an illustration:

corpus = HUNER_GENE_CELL_FINDER()
print(corpus)
corpus = HUNER_GENE_CELL_FINDER(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

This prints two different corpus sizes, as the first corpus is loaded using the default SciSpaCy sentence splitter, and the second with a different splitter. However, when doing this with the new classes:

corpus = HUNER_GENE_TMVAR_V3()
print(corpus)
corpus = HUNER_GENE_TMVAR_V3(sentence_splitter=SegtokSentenceSplitter())
print(corpus)

The same corpus gets loaded twice. For the second corpus, the segtok sentence splitter is not applied. Can you fix this? Easiest would probably be to use the same solution as for the old classes. |
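The caching behaviour described in this comment can be sketched roughly as follows. This is a simplified illustration, not Flair's actual implementation: the splitter classes, the `name` attribute, the file naming scheme, and the `load_or_build` helper are all assumptions made for the example.

```python
from pathlib import Path


class DefaultSplitter:
    """Hypothetical stand-in for the default sentence splitter."""
    name = "default"


class OtherSplitter:
    """Hypothetical stand-in for an alternative sentence splitter."""
    name = "other"


def cached_train_file(data_folder: Path, splitter) -> Path:
    # The splitter name is part of the file name, so each splitter
    # gets its own cached file instead of sharing a single one.
    return data_folder / f"{splitter.name}_train.conll"


def load_or_build(data_folder: Path, splitter, build) -> str:
    """Return cached content, building it first if this splitter has no cache yet."""
    path = cached_train_file(data_folder, splitter)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(build(splitter))
    return path.read_text()
```

If the cache path ignored the splitter, the second call would find the file written by the first and silently reuse it, which matches the behaviour reported for the new classes.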
@alanakbik I will have a look at it |
…e directories per sentence splitter (configuration)
@alanakbik Fixed this issue. Now the new data sets work as expected 😉
|
This PR implements an adapter to integrate biomedical named entity recognition data sets provided by the BigScience biomedical initiative (also known as BigBio):
https://github.com/bigscience-workshop/biomedical
BigBio is an open library of biomedical data loaders built on Hugging Face's datasets library. It provides programmatic, harmonized access to over 120 biomedical data sets. The adapter covers the library's named entity recognition data sets, so users can easily work with these corpora in HunFlair (e.g. for model training or evaluation).
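To give a feel for what such an adapter has to do, here is a minimal sketch that turns character-offset entity annotations into token-level BIO tags. It does not reproduce the PR's code: the record layout (a `type` plus a list of `[start, end]` offsets per entity) is a simplified assumption loosely modeled on offset-based annotation schemas, the whitespace tokenizer is a placeholder, and `to_bio` is a hypothetical helper.

```python
import re


def to_bio(text, entities):
    """Convert character-offset entities into (token, BIO tag) pairs.

    Assumes each entity is a dict with a "type" string and an "offsets"
    list of [start, end) character spans (a simplified, assumed layout).
    """
    # Placeholder tokenizer: whitespace-separated tokens with their offsets
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    tags = ["O"] * len(tokens)

    for entity in entities:
        for start, end in entity["offsets"]:
            inside = False
            for i, (_, tok_start, tok_end) in enumerate(tokens):
                # A token belongs to the entity if the two spans overlap
                if tok_start < end and tok_end > start:
                    tags[i] = ("I-" if inside else "B-") + entity["type"]
                    inside = True

    return [(token, tag) for (token, _, _), tag in zip(tokens, tags)]


example_text = "BRCA1 mutations cause breast cancer"
example_entities = [
    {"type": "Gene", "offsets": [[0, 5]]},
    {"type": "Disease", "offsets": [[22, 35]]},
]
tagged = to_bio(example_text, example_entities)
# tagged == [("BRCA1", "B-Gene"), ("mutations", "O"), ("cause", "O"),
#            ("breast", "B-Disease"), ("cancer", "I-Disease")]
```

A real adapter would additionally apply the configured sentence splitter and tokenizer, and handle overlapping or discontinuous entities, which this sketch ignores.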