Update HunFlair tutorial to Flair 0.12 #3137

Merged
merged 4 commits into from
Mar 10, 2023
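
For readers skimming the diff, the import and call renames this PR applies to the tutorials can be summarized as a small substitution table. The sketch below is a hypothetical helper: the names `RENAMES` and `migrate_snippet` are illustrative and are not part of the PR or of Flair. It only performs plain string replacement on patterns that appear in the hunks of this PR. The `get_spans` to `get_labels` change is left out on purpose, since it also collapses a nested loop and is not a pure rename.

```python
# Hypothetical helper: (old, new) snippets copied from the hunks in this PR.
RENAMES = [
    ("from flair.models import MultiTagger", "from flair.nn import Classifier"),
    ("MultiTagger.load(", "Classifier.load("),
    ("from flair.tokenization import SciSpacySentenceSplitter",
     "from flair.splitter import SciSpacySentenceSplitter"),
    ("print(sentence.to_tagged_string())", "print(sentence)"),
]


def migrate_snippet(code: str) -> str:
    """Apply each rename as a plain string replacement, in table order."""
    for old, new in RENAMES:
        code = code.replace(old, new)
    return code


old_snippet = (
    "from flair.models import MultiTagger\n"
    'tagger = MultiTagger.load("hunflair")'
)
print(migrate_snippet(old_snippet))
```

Plain string replacement is enough here because every pattern is a full, unambiguous statement fragment; anything beyond these tutorial snippets would need manual review.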
55 changes: 37 additions & 18 deletions resources/docs/HUNFLAIR.md
@@ -23,44 +23,63 @@ Then, in your favorite virtual environment, simply do:
```
pip install flair
```
-Furthermore, we recommend to install [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing
-and tokenization of scientific / biomedical texts:
-```
-pip install scispacy==0.2.5
-pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
-```

-#### Example Usage
+#### Example 1: Biomedical NER
Let's run named entity recognition (NER) over an example sentence. All you need to do is
make a Sentence, load a pre-trained model and use it to predict tags for the sentence:
```python
from flair.data import Sentence
-from flair.models import MultiTagger
-from flair.tokenization import SciSpacyTokenizer
+from flair.nn import Classifier

-# make a sentence and tokenize with SciSpaCy
-sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
-                    use_tokenizer=SciSpacyTokenizer())
+# make a sentence
+sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load biomedical tagger
-tagger = MultiTagger.load("hunflair")
+tagger = Classifier.load("hunflair")

# tag sentence
tagger.predict(sentence)
```
Done! The Sentence now has entity annotations. Let's print the entities found by the tagger:
```python
-for annotation_layer in sentence.annotation_layers.keys():
-    for entity in sentence.get_spans(annotation_layer):
-        print(entity)
+for entity in sentence.get_labels():
+    print(entity)
```
This should print:
-~~~
+```console
Span[0:2]: "Behavioral abnormalities" → Disease (0.6736)
Span[9:12]: "Fragile X Syndrome" → Disease (0.99)
Span[4:5]: "Fmr1" → Gene (0.838)
Span[6:7]: "Mouse" → Species (0.9979)
-~~~
+```
+
+
+#### Example 2: Biomedical NER with Better Tokenization
+
+Scientific texts are difficult to tokenize. For this reason, we recommend to install [SciSpaCy](https://allenai.github.io/scispacy/) for improved pre-processing and tokenization of scientific / biomedical texts:
+```
+pip install scispacy==0.2.5
+pip install https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.2.5/en_core_sci_sm-0.2.5.tar.gz
+```
+
+Use this code to apply scientific tokenization:
+
+```python
+from flair.data import Sentence
+from flair.nn import Classifier
+from flair.tokenization import SciSpacyTokenizer
+
+# make a sentence and tokenize with SciSpaCy
+sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome",
+                    use_tokenizer=SciSpacyTokenizer())
+
+# load biomedical tagger
+tagger = Classifier.load("hunflair")
+
+# tag sentence
+tagger.predict(sentence)
+```
+

## Comparison to other biomedical NER tools
Tools for biomedical NER are typically trained and evaluated on rather small gold standard data sets.
15 changes: 7 additions & 8 deletions resources/docs/HUNFLAIR_TUTORIAL_1_TAGGING.md
@@ -7,9 +7,9 @@ Let's use the pre-trained *HunFlair* model for biomedical named entity recogniti
This model was trained over 24 biomedical NER data sets and can recognize 5 different entity types,
i.e. cell lines, chemicals, disease, gene / proteins and species.
```python
-from flair.models import MultiTagger
+from flair.nn import Classifier

-tagger = MultiTagger.load("hunflair")
+tagger = Classifier.load("hunflair")
```
All you need to do is use the predict() method of the tagger on a sentence.
This will add predicted tags to the tokens in the sentence.
@@ -23,7 +23,7 @@ sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fra
tagger.predict(sentence)

# print sentence with predicted tags
-print(sentence.to_tagged_string())
+print(sentence)
```
This should print:
~~~
@@ -40,7 +40,7 @@ Often named entities consist of multiple words spanning a certain text span in t
"_Behavioral Abnormalities_" or "_Fragile X Syndrome_" in our example sentence.
You can directly get such spans in a tagged sentence like this:
```python
-for disease in sentence.get_spans("hunflair-disease"):
+for disease in sentence.get_labels("hunflair-disease"):
    print(disease)
```
This should print:
@@ -71,9 +71,8 @@ You can retrieve all annotated entities of the other entity types in analogous w
for cell lines, `hunflair-chemical` for chemicals, `hunflair-gene` for genes and proteins, and `hunflair-species`
for species. To get all entities in one you can run:
```python
-for annotation_layer in sentence.annotation_layers.keys():
-    for entity in sentence.get_spans(annotation_layer):
-        print(entity)
+for entity in sentence.get_labels():
+    print(entity)
```
This should print:
~~~
@@ -117,7 +116,7 @@ abstract = "Fragile X syndrome (FXS) is a developmental disorder caused by a mut
To work with complete abstracts or full-text, we first have to split them into separate sentences.
Again we can apply the integration of the [SciSpaCy](https://allenai.github.io/scispacy/) library:
```python
-from flair.tokenization import SciSpacySentenceSplitter
+from flair.splitter import SciSpacySentenceSplitter

# initialize the sentence splitter
splitter = SciSpacySentenceSplitter()