
Releases: flairNLP/flair

Release 0.9

29 Aug 23:23
c5bed58

With release 0.9 we are refactoring Flair for simplicity and speed, making it faster and easier to scale to new NLP tasks. The first new tasks included in this release are Relation Extraction (RE), support for GLUE benchmark tasks and Entity Linking - all in beta for early adopters! We're working towards a Flair 1.0 release that will span the whole suite of standard NLP tasks. Also included is a new approach for Zero-Shot Sequence Labeling based on TARS! This release also includes a wealth of new datasets for all these tasks and tons of other new features and bug fixes.

Zero-Shot Sequence Labeling with TARS (#2260)

We extend the TARS zero-shot learning approach to sequence labeling and ship a pre-trained model for English NER. Try defining some classes and see if the model can find them:

# 1. Load zero-shot NER tagger
tars = TARSTagger.load('tars-ner')

# 2. Prepare some test sentences
sentences = [
    Sentence("The Humboldt University of Berlin is situated near the Spree in Berlin, Germany"),
    Sentence("Bayern Munich played against Real Madrid"),
    Sentence("I flew with an Airbus A380 to Peru to pick up my Porsche Cayenne"),
    Sentence("Game of Thrones is my favorite series"),
]

# 3. Define some classes of named entities such as "soccer teams", "TV shows" and "rivers"
labels = ["Soccer Team", "University", "Vehicle", "River", "City", "Country", "Person", "Movie", "TV Show"]
tars.add_and_switch_to_new_task('task 1', labels, label_type='ner')

# 4. Predict for these classes and print results
for sentence in sentences:
    tars.predict(sentence)
    print(sentence.to_tagged_string("ner"))

This should print:

The Humboldt <B-University> University <I-University> of <I-University> Berlin <E-University> is situated near the Spree <S-River> in Berlin <S-City> , Germany <S-Country>

Bayern <B-Soccer Team> Munich <E-Soccer Team> played against Real <B-Soccer Team> Madrid <E-Soccer Team>

I flew with an Airbus <B-Vehicle> A380 <E-Vehicle> to Peru <S-City> to pick up my Porsche <B-Vehicle> Cayenne <E-Vehicle>

Game <B-TV Show> of <I-TV Show> Thrones <E-TV Show> is my favorite series

So in these examples, we are finding entity classes such as "TV show" (Game of Thrones), "vehicle" (Airbus A380 and Porsche Cayenne), "soccer team" (Bayern Munich and Real Madrid) and "river" (Spree), even though the model was never explicitly trained for this. Note that this is ongoing research and the examples are a bit cherry-picked. We expect the zero-shot model to improve quite a bit until the next release.

New NLP Tasks and Datasets

We now prototypically support new tasks such as the GLUE benchmark, Relation Extraction and Entity Linking. With this, we ship the datasets and model classes you need to train your own models. But we are still tweaking these methods, meaning that we don't ship any pre-trained models as of yet.

GLUE Benchmark (#2149 #2363)

A standard benchmark to evaluate progress in language understanding, mostly consisting of single and pairwise sentence classification tasks.

New datasets in Flair:

  • 'GLUE_COLA' - The Corpus of Linguistic Acceptability from the GLUE benchmark
  • 'GLUE_MNLI' - The Multi-Genre Natural Language Inference Corpus from the GLUE benchmark
  • 'GLUE_RTE' - The RTE task from the GLUE benchmark
  • 'GLUE_QNLI' - The Stanford Question Answering Dataset formatted as an NLI task from the GLUE benchmark
  • 'GLUE_WNLI' - The Winograd Schema Challenge formatted as an NLI task from the GLUE benchmark
  • 'GLUE_MRPC' - The MRPC task from the GLUE benchmark
  • 'GLUE_QQP' - The Quora Question Pairs dataset where the task is to determine whether a pair of questions are semantically equivalent

Initialize datasets like so:

from flair.datasets import GLUE_QNLI

# load corpus
corpus = GLUE_QNLI()

# print corpus
print(corpus)

# print first sentence-pair of training data split
print(corpus.train[0])

# print all labels in corpus
print(corpus.make_label_dictionary("entailment"))

Relation Extraction (#2333 #2352)

Relation extraction classifies if and which relationship holds between two entities in a text.

Model class: RelationExtractor

Datasets in Flair:

Initialize datasets like so:

from flair.datasets import RE_ENGLISH_CONLL04

# initialize CoNLL 04 corpus for relation extraction
corpus = RE_ENGLISH_CONLL04()
print(corpus)

# print first sentence of training split with annotations
sentence = corpus.train[0]
print(sentence)

# print label dictionary
label_dict = corpus.make_label_dictionary("relation")
print(label_dict)
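
Since this task is in beta, no pre-trained RE model ships yet, but you can train your own. A minimal training sketch (the RelationExtractor constructor arguments shown here, such as entity_label_type, are assumptions and may differ slightly in your version):

from flair.datasets import RE_ENGLISH_CONLL04
from flair.embeddings import TransformerWordEmbeddings
from flair.models import RelationExtractor
from flair.trainers import ModelTrainer

# corpus and relation label dictionary as above
corpus = RE_ENGLISH_CONLL04()
label_dict = corpus.make_label_dictionary("relation")

# assumed arguments: token embeddings, the relation label type/dictionary and
# the entity label type whose spans are paired into relation candidates
model = RelationExtractor(
    embeddings=TransformerWordEmbeddings("bert-base-uncased", fine_tune=True),
    label_dictionary=label_dict,
    label_type="relation",
    entity_label_type="ner",
)

# standard Flair fine-tuning loop
trainer = ModelTrainer(model, corpus)
trainer.train("resources/relation-extraction/conll04",
              learning_rate=5e-5,
              mini_batch_size=4,
              max_epochs=10)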

Entity Linking (#2375)

Entity Linking goes one step further than NER and uniquely links entities to knowledge bases such as Wikipedia.

Model class: EntityLinker

Datasets in Flair:

from flair.datasets import NEL_ENGLISH_REDDIT

# load corpus
corpus = NEL_ENGLISH_REDDIT()

# print corpus
print(corpus)

# print a sentence of training data split
print(corpus.train[3])
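
Entity Linking is likewise in beta, so we don't ship a pre-trained linker yet. A minimal training sketch, under the assumption that the linking annotations use the label type "nel" and that EntityLinker accepts token embeddings plus a label dictionary (the argument names here are assumptions and may differ):

from flair.datasets import NEL_ENGLISH_REDDIT
from flair.embeddings import TransformerWordEmbeddings
from flair.models import EntityLinker
from flair.trainers import ModelTrainer

corpus = NEL_ENGLISH_REDDIT()

# assumed label type "nel" for the linking annotations
label_dict = corpus.make_label_dictionary("nel")

# assumed constructor: token embeddings plus the linking label dictionary
linker = EntityLinker(
    word_embeddings=TransformerWordEmbeddings("bert-base-uncased", fine_tune=True),
    label_dictionary=label_dict,
    label_type="nel",
)

trainer = ModelTrainer(linker, corpus)
trainer.train("resources/entity-linking/reddit",
              learning_rate=5e-5,
              mini_batch_size=4,
              max_epochs=10)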

New NER Datasets

Other datasets

New Functionality

Support for Arabic NER (#2188)

Flair now supports NER and POS tagging for Arabic. To tag an Arabic sentence, just load the appropriate model:

# load model
tagger = SequenceTagger.load('ar-ner')

# make Arabic sentence
sentence = Sentence("احب برلين")

# predict NER tags
tagger.predict(sentence)

# print sentence with predicted tags
for entity in sentence.get_labels('ner'):
    print(entity)

This should print:

LOC [برلين (2)] (0.9803) 

More flexibility on main metric (#2161)

When training models, you can now choose any standard evaluation metric for model selection (previously it was fixed to micro F1). When calling the trainer, simply pass the desired metric as main_evaluation_metric like so:

trainer.train('resources/taggers/your_model',
              learning_rate=0.1,
              mini_batch_size=32,
              max_epochs=10,
              main_evaluation_metric=("macro avg", 'f1-score'),
              )

In this example, we now use macro F1 instead of the default micro F1.

Add handling for mapping labels to 'O' (#2254)

In ColumnDataset, labels can be remapped to other labels. Sometimes, however, you may not wish to use all label types in a given dataset.
You can now remap such labels to 'O' and thereby exclude them.

For instance, to load CoNLL-03 without MISC, do:

corpus = CONLL_03(
    label_name_map={'MISC': 'O'}
)
print(corpus.make_label_dictionary('ner'))
print(corpus.train[0].to_tagged_string('ner'))

Other

  • add per-label thresholds for prediction (#2366)
  • add support for Spanish clinical Flair embeddings (#2323)
  • added 'mean', 'max' pooling strategy for TransformerDocumentEmbeddings class (#2180)
  • new DocumentCNNEmbeddings class to embed text with a trainable CNN (#2141)
  • allow negative ...

Release 0.8

05 Mar 11:57
2fde646

Release 0.8 adds major new features to Flair, including our best named entity recognition (NER) models yet and the ability to host, share and test Flair models on the HuggingFace model hub! In addition, there is a host of improvements, new features and new datasets to check out!

FLERT (#2031 #2032 #2104)

This release adds the "FLERT" approach to train sequence tagging models using cross-sentence features as presented in our recent paper. This yields new state-of-the-art models which we include in Flair, as well as the features to easily train your own "FLERT" models.

Pre-trained FLERT models (#2130)

We add 5 new NER models for English (4-class and 18-class), German, Dutch and Spanish (4-class each). Load for instance with:

from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("ner-large")

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

If you want to test these models in action, for instance the new large English Ontonotes model with 18 classes, you can now use the hosted inference API on the HF model hub, like here.

Contextualized Sentences

In order to enable cross-sentence context, we made some changes to the Sentence object and data readers:

  1. Sentence objects now have next_sentence() and previous_sentence() methods that are set automatically if loaded through ColumnCorpus. This is a pointer system to navigate through sentences in a corpus:
# load corpus
corpus = MIT_MOVIE_NER_SIMPLE(in_memory=False)

# get a sentence
sentence = corpus.test[123]
print(sentence)
# get the previous sentence
print(sentence.previous_sentence())
# get the sentence after that
print(sentence.next_sentence())
# get the sentence after the next sentence
print(sentence.next_sentence().next_sentence())

This allows dynamic computation of contexts in the embedding classes.

  2. Sentence objects now have the is_document_boundary field which is set through the ColumnCorpus. In some datasets, there are sentences like "-DOCSTART-" that just indicate document boundaries. This is now recorded as a boolean in the object (see the sketch below).
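
For example, you can use this flag to check how many of the "sentences" in a split are just boundary markers (a small sketch; CONLL_03 is assumed here as an example of a corpus with document boundaries):

from flair.datasets import CONLL_03

corpus = CONLL_03()

# count the "-DOCSTART-"-style boundary sentences in the training split
num_boundaries = sum(
    1 for i in range(len(corpus.train)) if corpus.train[i].is_document_boundary
)
print(f"{num_boundaries} of {len(corpus.train)} training sentences mark document boundaries")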

Refactored TransformerWordEmbeddings (breaking)

TransformerWordEmbeddings was refactored for dynamic context, robustness to long sentences and readability. The names of some constructor arguments have changed for clarity: pooling_operation is now subtoken_pooling (to make clear that we pool subtokens), use_scalar_mean is now layer_mean (we only do a simple layer mean) and use_context can now optionally take an integer to indicate the length of the context. Default arguments are also changed.

For instance, to create embeddings with a document-level context of 64 subtokens, init like this:

embeddings = TransformerWordEmbeddings(
    model='bert-base-uncased',
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=64,
)

Train your Own FLERT Models

You can train a FLERT-model like this:

import torch

from flair.datasets import CONLL_03
from flair.embeddings import TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer


corpus = CONLL_03()

use_context = 64
hf_model = 'xlm-roberta-large'

embeddings = TransformerWordEmbeddings(
    model=hf_model,
    layers="-1",
    subtoken_pooling="first",
    fine_tune=True,
    use_context=use_context,
)

tag_dictionary = corpus.make_tag_dictionary('ner')

# init bare-bones tagger (no reprojection, LSTM or CRF)
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
    use_crf=False,
    use_rnn=False,
    reproject_embeddings=False,
)

# train with XLM parameters (AdamW, 20 epochs, small LR)
trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)
from torch.optim.lr_scheduler import OneCycleLR

trainer.train("resources/flert",
              learning_rate=5.0e-6,
              mini_batch_size=4,
              mini_batch_chunk_size=1,
              max_epochs=20,
              scheduler=OneCycleLR,
              embeddings_storage_mode='none',
              weight_decay=0.,
              )

We recommend training FLERT this way if accuracy is by far the most important feature you need. FLERT is quite slow since it works on the document level.

HuggingFace model hub integration (#2040 #2108 #2115)

We now host Flair sequence tagging models on the HF model hub (thanks for all the support @huggingface!).

Overview of all models. There is a dedicated 'Flair' tag on the hub, so to get a list of all Flair models, check here.

The hub allows all users to upload and share their own models. Even better, you can enable the Inference API and so test all models online without downloading and running them. For instance, you can test our new very powerful English 18-class NER model here.

To load any sequence tagger on the model hub, use the string identifier when instantiating a model. For instance, to load our English ontonotes model with the id "flair/ner-english-ontonotes-large", do

from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = SequenceTagger.load("flair/ner-english-ontonotes-large")

# make example sentence
sentence = Sentence("On September 1st George won 1 dollar while watching Game of Thrones.")

# predict NER tags
tagger.predict(sentence)

# print sentence
print(sentence)

# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
    print(entity)

Other New Features

New Task: Recognizing Textual Entailment (#2123)

Thanks to @marcelmmm we now support training textual entailment tasks (in fact, all pairwise sentence classification tasks) in Flair.

For instance, if you want to train an RTE task of the GLUE benchmark use this script:

import torch

from flair.data import Corpus
from flair.datasets import GLUE_RTE
from flair.embeddings import TransformerDocumentEmbeddings

# 1. get the entailment corpus
corpus: Corpus = GLUE_RTE()

# 2. make the label dictionary from the corpus
label_dictionary = corpus.make_label_dictionary()

# 3. initialize text pair tagger
from flair.models import TextPairClassifier

tagger = TextPairClassifier(
    document_embeddings=TransformerDocumentEmbeddings(),
    label_dictionary=label_dictionary,
)

# 4. train trainer with AdamW
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus, optimizer=torch.optim.AdamW)

# 5. run training
trainer.train('resources/taggers/glue-rte-english',
              learning_rate=2e-5,
              mini_batch_chunk_size=2, # this can be removed if you have a big GPU
              train_with_dev=True,
              max_epochs=3)

Add possibility to specify empty label name to CSV corpora (#2068)

Some CSV classification datasets contain a value that means "no class". We now extend the CSVClassificationDataset so that it is possible to specify which value should be skipped using the no_class_label argument.

For instance:

# load corpus
corpus = CSVClassificationCorpus(
    data_folder='resources/tasks/code/',
    train_file='java_io.csv',
    skip_header=True,
    column_name_map={3: 'text', 4: 'label', 5: 'label', 6: 'label', 7: 'label', 8: 'label', 9: 'label'},
    no_class_label='NONE',
)

This causes all entries of NONE in one of the label columns to be skipped.

More options for splits in corpora and training (#2034)

For various reasons, we might want to have a Corpus that does not define all three splits (train/dev/test). For instance, we might want to train a model over the entire dataset and not hold out any data for validation/evaluation.

We add several ways of doing so.

  1. If a dataset has predefined splits, like most NLP datasets, you can pass the arguments train_with_test and train_with_dev to the ModelTrainer. This causes the trainer to train over all three splits (and do no evaluation):
trainer.train(f"path/to/your/folder",
    learning_rate=0.1,
    mini_batch_size=16,
    train_with_dev=True,
    train_with_test=True,
)
  2. You can also now create a Corpus with fewer splits, without having the missing splits automatically sampled from the training data. Pass sample_missing_splits=False as argument to do this. For instance, to load the SemCor WSD corpus only as training data, do:
semcor = WSD_UFSAC(train_file='semcor.xml', sample_missing_splits=False, autofind_splits=False)

Add TFIDF Embeddings (#2086)

We added some old-school embeddings (thanks @yosipk), namely the legendary TF-IDF document embeddings. These are often good baselines, and additionally they keep NLP veterans nostalgic, if not happy.

To initialize these embeddings, you must pass the train split of your training corpus, i.e.

embeddings = DocumentTFIDFEmbeddings(corpus.train, max_features=10000)

This triggers the process where the most common words are used ...
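
Once fitted, the embeddings behave like any other document embedding in Flair. A small usage sketch (assumes the embeddings object fitted above):

from flair.data import Sentence

# embed a new sentence with the fitted TF-IDF document embeddings
sentence = Sentence("Flair is a very simple framework for state-of-the-art NLP .")
embeddings.embed(sentence)

# the TF-IDF document vector is now attached to the sentence
print(sentence.embedding.shape)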


Release 0.7

01 Dec 19:35
69b6692

Release 0.7 adds major few-shot and zero-shot learning capabilities to Flair with our new TARS approach, plus support for the Universal Proposition Banks, new NER datasets and lots of other new features!

Few-Shot and Zero-Shot Classification with TARS (#1917 #1926)

With TARS we add a major new feature to Flair for zero-shot and few-shot classification. Details on the approach can be found in our paper Halder et al. (2020). Our approach allows you to classify text in cases in which you have little or even no training data at all.

This example illustrates how you predict new classes without training data:

# 1. Load our pre-trained TARS model for English
tars = TARSClassifier.load('tars-base')

# 2. Prepare a test sentence
sentence = flair.data.Sentence("I am so glad you liked it!")

# 3. Define some classes that you want to predict using descriptive names
classes = ["happy", "sad"]

#4. Predict for these classes
tars.predict_zero_shot(sentence, classes)

# Print sentence with predicted labels
print(sentence)

For a full overview of TARS features, please refer to our new TARS tutorial.

Other New Features

Option to set Flair seed (#1979)

Adds the possibility to set a seed via wrapping the Hugging Face Transformers library helper method (thanks @stefan-it).

By specifying a seed with:

import flair

flair.set_seed(42)

you can make experimental runs reproducible. The wrapped set_seed method sets seeds for random, numpy and torch. More details here.

Control multi-word behavior in UD datasets (#1981)

To better handle multi-words in UD corpora, we introduce the split_multiwords constructor argument to all UD corpora, which by default is set to True. It controls the handling of multiwords that are split into different tokens. For instance, the German "am" is split into two different tokens: "am" -> "an" + "dem". Or the French "aux" -> "à" + "les".

If split_multiwords is set to True, they are split as in UD. If set to False, we keep the original multiword as a single token. Example:

# default mode: multiwords are split
corpus = UD_GERMAN(split_multiwords=True)
# print sentence 179
print(corpus.dev[179].to_plain_string())

# alternative mode: multiwords are kept as original
corpus = UD_GERMAN(split_multiwords=False)
# print sentence 179
print(corpus.dev[179].to_plain_string())  

This prints

Ein Hotel zu dem Wohlfühlen.

Ein Hotel zum Wohlfühlen.

The latter is how it appears in text, the former is after splitting of multiwords.

Pass pretokenized sentence to Sentence object (#1965)

You can now pass a pretokenized sequence as a list of words (thanks @ulf1):

from flair.data import Sentence
sentence = Sentence(['The', 'grass', 'is', 'green', '.'])
print(sentence)

This should print:

Sentence: "The grass is green ."   [− Tokens: 5]

Map label names in sequence labeling datasets (#1988)

You can now pass a label map to sequence labeling datasets to change label names (thanks @pharnisch).

# print tag dictionary with mapped names
corpus = CONLL_03_DUTCH(label_name_map={'PER': 'person', 'ORG': 'organization', 'LOC': 'location', 'MISC': 'other'})
print(corpus.make_tag_dictionary('ner'))

# print tag dictionary with original names
corpus = CONLL_03_DUTCH()
print(corpus.make_tag_dictionary('ner'))

Data Sets

Universal Proposition Banks (#1870 #1866 #1888)

Flair 0.7 adds support for 7 Universal Proposition Banks to train your own multilingual semantic role labelers (thanks to @Dabendorf).

Load for instance with:

# load English Universal Proposition Bank
corpus = UP_ENGLISH()
print(corpus)

# make dictionary of frames
frame_dictionary = corpus.make_tag_dictionary('frame')
print(frame_dictionary)

Now available for Finnish, Chinese, Italian, French, German, Spanish and English

NER Corpora

We add support for 6 new NER corpora:

Arabic NER Corpus (#1901)

Added the ANER corpus for Arabic NER (thanks to @megantosh).

# load Arabic NER corpus
corpus = ANER_CORP()
print(corpus)

Movie NER Corpora (#1912)

Added the MIT movie reviews corpora annotated with NER information, in the simple and complex variant (thanks to @pharnisch).

# load simple movie NER corpus
corpus = MITMovieNERSimple()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

# load complex movie NER corpus
corpus = MITMovieNERComplex()
print(corpus)
print(corpus.make_tag_dictionary('ner'))   

Added SEC Filings NER corpus (#1922)

Added a corpus of SEC filings annotated with 4-class NER tags (thanks to @samahakk).

# load SEC filings corpus
corpus = SEC_FILLINGS()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

WNUT 2020 NER dataset support (#1942)

Added a corpus of wet lab protocols annotated with NER information, used for the WNUT 2020 challenge (thanks to @aynetdia).

# load wet lab protocol data
corpus = WNUT_2020_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

Weibo NER dataset support (#1944)

Added a dataset for NER on Chinese social media (thanks to @87302380).

# load Weibo NER data
corpus = WEIBO_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

Added Finnish NER corpus (#1946)

Added the TURKU corpus for Finnish NER (thanks to @melvelet).

# load Finnish NER data
corpus = TURKU_NER()
print(corpus)
print(corpus.make_tag_dictionary('ner'))

Universal Dependency Treebanks

We add support for 11 new UD treebanks:

Load each with language name, for instance:

# load Gothic UD treebank data
corpus = UD_GOTHIC()
print(corpus)
print(corpus.test[0])

Added GoEmotions text classification corpus (#1914)

Added GoEmotions dataset containing 58k Reddit comments labeled with 27 emotion categories. Load with:

# load GoEmotions corpus
corpus = GO_EMOTIONS()
print(corpus)
print(corpus.make_label_dictionary())

Enhancements and bug fixes

  • Add handling for micro-average precision and recall (#1935)
  • Make dev and test splits in treebanks optional (#1951)
  • Updated communicative functions model (#1857)
  • Biomedical Data: Explicit encodings for Windows Support (#1893)
  • Fix wrong abstract method (#1923 #1940)
  • Improve tutorial (#1939)
  • Fix requirements (#1971 )

Release 0.6.1

23 Sep 10:40
0ac2704

Release 0.6.1 is a bugfix release that fixes the issues caused by moving the server that originally hosted the Flair models. Additionally, this release adds a ton of new NER datasets, including the XTREME corpus for 40 languages, and a new model for NER on German-language legal text.

New Model: Legal NER (#1872)

Adds a legal NER model for German, trained on the German legal NER dataset available here, which can be loaded in Flair with the LER_GERMAN corpus object.

It uses German Flair and FastText embeddings and achieves an F1 score of 96.35.

Use like this:

# load German LER tagger
tagger = SequenceTagger.load('de-ler')

# example text
text = "vom 6. August 2020. Alle Beschwerdeführer befinden sich derzeit gemeinsam im Urlaub auf der Insel Mallorca , die vom Robert-Koch-Institut als Risikogebiet eingestuft wird. Sie wollen am 29. August 2020 wieder nach Deutschland einreisen, ohne sich gemäß § 1 Abs. 1 bis Abs. 3 der Verordnung zur Testpflicht von Einreisenden aus Risikogebieten auf das SARS-CoV-2-Virus testen zu lassen. Die Verordnung sei wegen eines Verstoßes der ihr zugrunde liegenden gesetzlichen Ermächtigungsgrundlage, des § 36 Abs. 7 IfSG , gegen Art. 80 Abs. 1 Satz 1 GG verfassungswidrig."

sentence = Sentence(text)

# predict and print entities
tagger.predict(sentence)

for entity in sentence.get_spans('ner'):
    print(entity)

New Datasets

Add XTREME and WikiANN corpora for multilingual NER (#1862)

These huge corpora provide training data for NER in 176 languages. You can either load the language-specific parts of it by supplying a language code:

# load German Xtreme
german_corpus = XTREME('de')
print(german_corpus)

# load French Xtreme
french_corpus = XTREME('fr')
print(french_corpus)

Or you can load the default 40 languages at once into one huge MultiCorpus by not providing a language ID:

# load Xtreme MultiCorpus for all
multi_corpus = XTREME()
print(multi_corpus)

Add Twitter NER Dataset (#1850)

Dataset of tweets annotated with NER tags. Load with:

# load twitter dataset
corpus = TWITTER_NER()

# print example tweet
print(corpus.test[0])

Add German Europarl NER Dataset (#1849)

Dataset of German-language speeches in the European Parliament annotated with standard NER tags like person and location. Load with:

# load corpus
corpus = EUROPARL_NER_GERMAN()
print(corpus)

# print first test sentence
print(corpus.test[1])

Add MIT Restaurant NER Dataset (#1177)

Dataset of English restaurant reviews annotated with entities like "dish", "location" and "rating". Load with:

# load restaurant dataset
corpus = MIT_RESTAURANTS()

# print example sentence
print(corpus.test[0])  

Add Universal Propositions Banks for French and German (#1866)

Our kickoff into supporting the Universal Proposition Banks adds the first two UP datasets to Flair. Load with:

# load German UP
corpus = UP_GERMAN()
print(corpus)

# print example sentence
print(corpus.dev[1])

Add Universal Dependencies Dataset for Chinese (#1880)

Adds the Kyoto dataset for Chinese. Load with:

# load Chinese UD dataset
corpus = UD_CHINESE_KYOTO()

# print example sentence
print(corpus.test[0])  

Bug fixes

  • Move models to HU server (#1834 #1839 #1842)
  • Fix deserialization issues in transformer tokenizers #1865
  • Documentation fixes (#1819 #1821 #1836 #1852)
  • Add link to a repo with examples of Flair on GCP (#1825)
  • Correct variable names (#1875)
  • Fix problem with custom delimiters in ColumnDataset (#1876)
  • Fix offensive language detection model (#1877)
  • Correct Dutch NER model (#1881)

Release 0.6

17 Aug 13:34
1a12954

Release 0.6 is a major biomedical NLP upgrade for Flair, adding state-of-the-art models for biomedical NER, support for 31 biomedical NER corpora, clinical POS tagging, speculation and negation detection in biomedical literature, and many other features such as multi-tagging and one-cycle learning.

Biomedical Models and Datasets:

Most of the biomedical models and datasets were developed together with the Knowledge Management in Bioinformatics group at the HU Berlin, in particular @leonweber and @mariosaenger. This page gives an overview of the new models and datasets, and example tutorials. Some highlights:

Biomedical NER models (#1790)

Flair now has pre-trained models for biomedical NER trained over unified versions of 31 different biomedical corpora. Because they are trained on so many different datasets, the models are shown to be very robust on new datasets, outperforming all previously available off-the-shelf tools. If you want to load a model to detect "diseases" in text, for instance, do:

# make a sentence
sentence = Sentence("Behavioral abnormalities in the Fmr1 KO2 Mouse Model of Fragile X Syndrome")

# load disease tagger and predict
tagger = SequenceTagger.load("hunflair-disease")
tagger.predict(sentence)

Done! Let's print the diseases found by the tagger:

for entity in sentence.get_spans():
    print(entity)

This should print:

Span [1,2]: "Behavioral abnormalities"   [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome"   [− Labels: Disease (0.99)]

You can also get one model that finds 5 biomedical entity types (diseases, genes, species, chemicals and cell lines), like this:

# load bio-NER tagger and predict
tagger = MultiTagger.load("hunflair")
tagger.predict(sentence)

This should print:

Span [1,2]: "Behavioral abnormalities"   [− Labels: Disease (0.6736)]
Span [10,11,12]: "Fragile X Syndrome"   [− Labels: Disease (0.99)]
Span [5]: "Fmr1"   [− Labels: Gene (0.838)]
Span [7]: "Mouse"   [− Labels: Species (0.9979)]

So it now also finds genes and species. As explained here these models work best if you use them together with a biomedical tokenizer.

Biomedical NER datasets (#1790)

Flair now supports 31 biomedical NER datasets out of the box, both in their standard versions as well as the "Huner" splits for reproducibility of experiments. For a full list of datasets, refer to this page.

You can load a dataset like this:

# load one of the bioinformatics corpora
corpus = JNLPBA()

# print statistics and one sentence
print(corpus)
print(corpus.train[0])

We also include "huner" corpora that combine many different biomedical datasets into a single corpus. For instance, if you execute the following line:

# load combined chemicals corpus
corpus = HUNER_CHEMICAL()

This loads a combination of 6 different corpora that contain annotations of chemicals into a single corpus. This allows you to train stronger cross-corpus models since you now combine training data from many sources. See more info here.
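
A minimal cross-corpus training sketch on top of this combined corpus, using the standard Flair training setup (the embedding stack here is just an example, not the configuration of the released HunFlair models):

from flair.datasets import HUNER_CHEMICAL
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# combined chemicals corpus and its tag dictionary
corpus = HUNER_CHEMICAL()
tag_dictionary = corpus.make_tag_dictionary(tag_type='ner')

# example embedding stack (classic word + contextual string embeddings)
embeddings = StackedEmbeddings([
    WordEmbeddings('glove'),
    FlairEmbeddings('news-forward'),
    FlairEmbeddings('news-backward'),
])

# standard BiLSTM-CRF sequence tagger
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type='ner',
)

trainer = ModelTrainer(tagger, corpus)
trainer.train('resources/taggers/huner-chemical', max_epochs=50)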

POS model for Portuguese clinical text (#1789)

Thanks to @LucasFerroHAILab, we now include a model for part-of-speech tagging in Portuguese clinical text. Run this model like this:

# load your tagger
tagger = SequenceTagger.load('pt-pos-clinical')

# example sentence
sentence = Sentence('O vírus Covid causa fortes dores .')
tagger.predict(sentence)
print(sentence)

You can find more details in their paper here.

Model for negation and speculation in biomedical literature (#1758)

Using the BioScope corpus, we trained a model to recognize negation and speculation in biomedical literature. Use it like this:

sentence = Sentence("The picture most likely reflects airways disease")

tagger = SequenceTagger.load("negation-speculation")
tagger.predict(sentence)

for entity in sentence.get_spans():
    print(entity)

This should print:

Span [4,5,6,7]: "likely reflects airways disease"   [− Labels: SPECULATION (0.9992)]

This indicates that this portion of the sentence is speculation.

Other New Features:

MultiTagger (#1791)

We added support for tagging text with multiple models at the same time. This can save memory usage and increase tagging speed.

For instance, if you want to POS tag, chunk, NER and detect frames in your text at the same time, do:

# load tagger for POS, chunking, NER and frame detection
tagger = MultiTagger.load(['pos', 'upos', 'chunk', 'ner', 'frame'])

# example sentence
sentence = Sentence("George Washington was born in Washington")

# predict
tagger.predict(sentence)

print(sentence) 

This will give you a sentence annotated with 5 different layers of annotation.
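
You can then read out each layer separately, assuming each layer is stored under the name it was loaded with (a small sketch):

# print the named entity spans found by the 'ner' model in the MultiTagger
for entity in sentence.get_spans('ner'):
    print(entity)

# token-level layers such as POS can be read per token
for token in sentence:
    print(token.get_tag('pos'))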

Sentence splitting

Flair now includes convenience methods for sentence splitting. For instance, to use segtok to split and tokenize a text into sentences, use the following code:

from flair.tokenization import SegtokSentenceSplitter

# example text with many sentences
text = "This is a sentence. This is another sentence. I love Berlin."

# initialize sentence splitter
splitter = SegtokSentenceSplitter()

# use splitter to split text into list of sentences
sentences = splitter.split(text)  

We also ship other splitters, such as SpacySentenceSplitter (requires SpaCy to be installed).
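
The resulting list of Sentence objects can be passed straight to a tagger, since predict() also accepts lists of sentences:

from flair.models import SequenceTagger

# tag all split sentences in one call
tagger = SequenceTagger.load('ner')
tagger.predict(sentences)

for sentence in sentences:
    print(sentence.to_tagged_string())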

Japanese tokenization (#1786)

Thanks to @himkt we now have expanded support for Japanese tokenization in Flair. For instance, use the following code to tokenize a Japanese sentence without installing extra libraries:

from flair.data import Sentence
from flair.tokenization import JapaneseTokenizer

# init japanese tokenizer
tokenizer = JapaneseTokenizer("janome")

# make sentence (and tokenize)
sentence = Sentence("私はベルリンが好き", use_tokenizer=tokenizer)

# output tokenized sentence
print(sentence)

One-Cycle Learning (#1776)

Thanks to @lucaventurini2, Flair now supports one-cycle learning, which may give quicker convergence. For instance, train a model in 20 epochs using the code below:

from torch.optim.lr_scheduler import OneCycleLR

# train as always
trainer = ModelTrainer(tagger, corpus)

# set one cycle LR as scheduler
trainer.train('onecycle_ner',
              scheduler=OneCycleLR,
              max_epochs=20)

Improvements:

Changes in convention

Turn on tokenizer by default in Sentence object (#1806)

The Sentence object now executes tokenization (use_tokenizer=True) by default:

# Tokenizes by default
sentence = Sentence("I love Berlin.")
print(sentence)

# i.e. this is equivalent to
sentence = Sentence("I love Berlin.", use_tokenizer=True)
print(sentence)

# i.e. if you don't want to use tokenization, set it to False
sentence = Sentence("I love Berlin.", use_tokenizer=False)
print(sentence)

TransformerWordEmbeddings now handle long documents by default

Previously, you had to set allow_long_sentences=True to enable handling of long sequences (greater than 512 subtokens) in TransformerWordEmbeddings. This is no longer necessary, as this value is now set to True by default.
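
If you prefer the previous truncating behavior, you can still switch the heuristic off explicitly:

from flair.embeddings import TransformerWordEmbeddings

# fall back to truncating sequences longer than the model's maximum length
embeddings = TransformerWordEmbeddings('bert-base-uncased', allow_long_sentences=False)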

Bug fixes

  • Fix serialization of BytePairEmbeddings (#1802)
  • Fix issues with loading models that use ELMoEmbeddings (#1803)
  • Allow longer lengths in transformers that can handle more than 512 subtokens (#1804)
  • Fix encoding for WASSA datasets (#1766)
  • Update BPE package (#1764)
  • Improve documentation (#1752 #1778)
  • Fix evaluation of TextClassifier if no label_type is passed (#1748)
  • Remove torch version checks that throw errors (#1744)
  • Update DaNE dataset URL (#1800)
  • Fix weight extraction error for empty sentences (#1805)

Release 0.5.1

05 Jul 21:39
13f5e8d

Release 0.5.1 with new features, datasets and models, including support for sentence transformers, transformer embeddings for arbitrary length sentences, new Dutch NER models, new tasks and more refactorings of evaluation and training routines to better organize the code!

New Features and Enhancements:

TransformerWordEmbeddings can now process long sentences (#1680)

Adds a heuristic as a workaround to the max sequence length of some transformer embeddings, making it possible to now embed sequences of arbitrary length if you set allow_long_sentences=True, like so:

from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    allow_long_sentences=True,  # set allow_long_sentences to True to enable this feature
)

Setting random seeds (#1671)

It is now possible to set seeds when loading and downsampling corpora, so that the sample is always the same:

# set a random seed 
import random
random.seed(4)

# load and downsample corpus
corpus = SENTEVAL_MR(filter_if_longer_than=50).downsample(0.1)

# print first sentence of dev and test 
print(corpus.dev[0])
print(corpus.test[0])

Make reprojection layer optional (#1676)

Makes the reprojection layer optional in the SequenceTagger. You can control this behavior through the reproject_embeddings parameter. If you set it to True, embeddings are reprojected via a linear map to a vector of identical size. If set to False, no reprojection happens. If you set this parameter to an integer, the linear map maps embedding vectors to vectors of this size.

# tagger with standard reprojection
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=True,
)

# tagger without reprojection
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=False,
)

# reprojection to vectors of length 128
tagger = SequenceTagger(
    hidden_size=256,
    [...]
    reproject_embeddings=128,
)

Set label name when predicting (#1671)

You can now optionally specify the "label name" of the predicted label. This may be useful if you want to, for instance, run two different NER models on the same sentence:

sentence = Sentence('I love Berlin')

# load two NER taggers
tagger_1 = SequenceTagger.load('ner')
tagger_2 = SequenceTagger.load('ontonotes-ner')

# specify label name of tagger_1 to be 'conll03_ner'
tagger_1.predict(sentence, label_name='conll03_ner')

# specify label name of tagger_2 to be 'onto_ner'
tagger_2.predict(sentence, label_name='onto_ner')

print(sentence)

This lets you distinguish between the predictions when tagging the same sentence with multiple NER taggers. Also note that it is no longer possible to pass a string to the predict method - you must now pass a Sentence.

Sentence Transformers (#1696)

Adds the SentenceTransformerDocumentEmbeddings class so you can get embeddings from the sentence-transformers library. Use as follows:

from flair.data import Sentence
from flair.embeddings import SentenceTransformerDocumentEmbeddings

# init embedding
embedding = SentenceTransformerDocumentEmbeddings('bert-base-nli-mean-tokens')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)

You can find a full list of their pretrained models here.

Other enhancements

  • Update to transformers 3.0.0 (#1727)
  • Better Memory mode presets for classification corpora (#1701)
  • ClassificationDataset now also accepts lines with a "\t" separator in addition to blank spaces (#1654)
  • Change default fine-tuning in DocumentPoolEmbeddings to "none" (#1675)
  • Short-circuit the embedding loop (#1684)
  • Add option to pass kwargs into transformer models when initializing model (#1694)

New Datasets and Models

Two new dutch NER models (#1687)

The new default model is a BERT-based RNN model with the highest accuracy:

from flair.data import Sentence
from flair.models import SequenceTagger

# load the default BERT-based model
tagger = SequenceTagger.load('nl-ner')

# tag sentence
sentence = Sentence('Ik hou van Amsterdam')
tagger.predict(sentence)

You can also load a Flair-based RNN model (might be faster on some setups):

# load the Flair-based RNN model
tagger = SequenceTagger.load('nl-ner-rnn')

Corpus of communicative functions (#1683) and pre-trained model (#1706)

Adds a corpus of communicative functions in scientific literature, described in this LREC paper and available here. Load with:

corpus = COMMUNICATIVE_FUNCTIONS()
print(corpus)

We also ship a pre-trained model on this corpus, which you can load with:

# load communicative function tagger
tagger = TextClassifier.load('communicative-functions')

# example sentence
sentence = Sentence("However, previous approaches are limited in scalability .")

# predict and print labels
tagger.predict(sentence)
print(sentence.labels)

Keyword Extraction Corpora (#1629) and pre-trained model (#1689)

Added 3 datasets available for keyphrase extraction via sequence labeling: Inspec, SemEval-2017 and Processed SemEval-2010

Load like this:

inspec_corpus = INSPEC()
semeval_2010_corpus = SEMEVAL2010()
semeval_2017 = SEMEVAL2017()

We also ship a pre-trained model on this corpus, which you can load with:

# load keyphrase tagger
tagger = SequenceTagger.load('keyphrase')

# example sentence
sentence = Sentence("Here, we describe the engineering of a new class of ECHs through the "
                    "functionalization of non-conductive polymers with a conductive choline-based "
                    "bio-ionic liquid (Bio-IL).", use_tokenizer=True)

# predict and print labels
tagger.predict(sentence)
print(sentence)

Swedish NER (#1652)

Adds a corpus for Swedish NER using the dataset https://github.com/klintan/swedish-ner-corpus/. Load with:

corpus = NER_SWEDISH()
print(corpus)

German Legal Named Entity Recognition (#1697)

Adds corpus of legal named entities for German. Load with:

corpus = LER_GERMAN()
print(corpus)

Refactoring of evaluation

We made a number of refactorings to the evaluation routines in Flair. In short: whenever possible, we now use the evaluation methods of sklearn (instead of our own implementations, which kept causing issues). This applies to text classification and (most) sequence tagging.

A notable exception is "span-F1" which is used to evaluate NER because there is no good way of counting true negatives. After this PR, our implementation should now exactly mirror the original conlleval script of the CoNLL-02 challenge. In addition to using our reimplementation, an output file is now automatically generated that can be directly used with the conlleval script.

In more detail, this PR makes the following changes:

  • Span is now a list of Token and can now be iterated like a sentence
  • flair.DataLoader is now used throughout
  • The evaluate() interface in the Model base class is changed so that it no longer requires a data loader, but can run either over a list of Sentence or a Dataset (see the sketch after this list)
  • SequenceTagger.evaluate() now explicitly distinguishes between F1 and Span-F1. In the latter case, no TN are counted (#1663) and a non-sklearn implementation is used.
  • In the evaluate() method of the SequenceTagger and TextClassifier, we now explicitly call the .predict() method.
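
As a small sketch of the new interface, evaluation can now be run directly over a dataset split (the exact return values shown here are an assumption and may differ between versions):

from flair.datasets import CONLL_03_DUTCH
from flair.models import SequenceTagger

corpus = CONLL_03_DUTCH()
tagger = SequenceTagger.load('nl-ner')

# evaluate() now runs directly over a Dataset (or a list of Sentence)
result, eval_loss = tagger.evaluate(corpus.test)
print(result.detailed_results)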

Bug fixes:

  • Fix figsize issue (#1622)
  • Allow strings to be passed instead of Path (#1637)
  • Fix segtok tokenization issue (#1653)
  • Serialize dropout in SequenceTagger (#1659)
  • Fix serialization error in DocumentPoolEmbeddings (#1671)
  • Fix subtokenization issues in transformers (#1674)
  • Add new datasets to init.py (#1677)
  • Fix deprecation warnings due to invalid escape sequences. (#1678)
  • Fix PooledFlairEmbeddings deserialization error (#1604)
  • Fix transformer tokenizer deserialization (#1686)
  • Fix issues caused by embedding mode and lambda functions in ELMoEmbeddings (#1692)
  • Fix serialization error in PooledFlairEmbeddings (#1593)
  • Fix mean pooling in PooledFlairEmbeddings (#1698)
  • Fix condition to assign whitespace_after attribute in the build_spacy_tokenizer wrapper (#1700)
  • Fix WIKINER encoding for windows (#1713)
  • Detect and ignore empty sentences in BERT embeddings (#1716)
  • Fix error in returning multiple classes (#1717)

Release 0.5

24 May 12:00
63aeabf

Release 0.5 with tons of new models, embeddings and datasets, support for fine-tuning transformers, greatly improved sentiment analysis models for English, tons of new features and big internal refactorings to better organize the code!

New Fine-tuneable Transformers (#1494 #1544)

Flair 0.5 adds support for transformers and fine-tuning with two new embeddings classes: TransformerWordEmbeddings and TransformerDocumentEmbeddings, for word- and document-level transformer embeddings respectively. Both classes can be initialized with a model name that indicates what type of transformer (BERT, XLNet, RoBERTa, etc.) you wish to use (check the full list here).

Transformer Word Embeddings

If you want to embed the words in a sentence with transformers, do it like this:

from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

If instead you want to use RoBERTa, do:

from flair.embeddings import TransformerWordEmbeddings

# init embedding
embedding = TransformerWordEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed words in sentence
embedding.embed(sentence)

Transformer Document Embeddings

To get a single embedding for the whole document with BERT, do:

from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('bert-base-uncased')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)

If instead you want to use RoBERTa, do:

from flair.embeddings import TransformerDocumentEmbeddings

# init embedding
embedding = TransformerDocumentEmbeddings('roberta-base')

# create a sentence
sentence = Sentence('The grass is green .')

# embed the sentence
embedding.embed(sentence)

Text classification by fine-tuning a transformer

Importantly, you can now fine-tune transformers to get state-of-the-art accuracies in text classification tasks.
Use TransformerDocumentEmbeddings for this and set fine_tune=True. Then, use the following example code:

from torch.optim.adam import Adam

from flair.data import Corpus
from flair.datasets import TREC_6
from flair.embeddings import TransformerDocumentEmbeddings
from flair.models import TextClassifier
from flair.trainers import ModelTrainer

# 1. get the corpus
corpus: Corpus = TREC_6()

# 2. create the label dictionary
label_dict = corpus.make_label_dictionary()

# 3. initialize transformer document embeddings (many models are available)
document_embeddings = TransformerDocumentEmbeddings('distilbert-base-uncased', fine_tune=True)

# 4. create the text classifier
classifier = TextClassifier(document_embeddings, label_dictionary=label_dict)

# 5. initialize the text classifier trainer with Adam optimizer
trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

# 6. start the training
trainer.train('resources/taggers/trec',
              learning_rate=3e-5, # use very small learning rate
              mini_batch_size=16,
              mini_batch_chunk_size=4, # optionally set this if transformer is too much for your machine
              max_epochs=5, # terminate after 5 epochs
              )

New Taggers, Embeddings and Datasets

Flair 0.5 adds a ton of new taggers, embeddings and datasets.

New Taggers

New sentiment models (#1613)

We added new sentiment models for English. The new models are trained over a combined corpus of sentiment datasets, including Amazon product reviews. So they should be applicable to more domains than the old sentiment models, which were only trained with movie reviews.

There are two new models, a transformer-based model you can load like this:

# load tagger
classifier = TextClassifier.load('sentiment')

# predict for example sentence
sentence = Sentence("enormously entertaining for moviegoers of any age .")
classifier.predict(sentence)

# check prediction
print(sentence)

And a faster, slightly less accurate model based on RNNs you can load like this:

classifier = TextClassifier.load('sentiment-fast')

Fine-grained POS models for English (#1625)

Adds fine-grained POS models for English, so you now have the option between 'pos' and 'upos' models for fine-grained and universal POS tags respectively. Load like this:

# Fine-grained POS model
tagger = SequenceTagger.load('pos')

# Fine-grained POS model (fast variant)
tagger = SequenceTagger.load('pos-fast')

# Universal POS model
tagger = SequenceTagger.load('upos')

# Universal POS model (fast variant)
tagger = SequenceTagger.load('upos-fast')

Added Malayalam POS and XPOS tagger model (#1522)

Added taggers for historical German speech and thought (#1532)

New Embeddings

Added language models for historical German by @redewiedergabe (#1507)

Load the language models with:

embeddings_forward = FlairEmbeddings('de-historic-rw-forward')
embeddings_backward = FlairEmbeddings('de-historic-rw-backward')

Added Malayalam flair embeddings models (#1458)

embeddings_forward = FlairEmbeddings('ml-forward')
embeddings_backward = FlairEmbeddings('ml-backward')

Added Flair Embeddings from CLEF HIPE Shared Task (#1554)

Adds the recently trained Flair embeddings on historic newspapers for German/English/French provided by the CLEF HIPE shared task.

New Datasets

Added NER dataset for Finnish (#1620)

You can now load a Finnish NER corpus with

ner_finnish = flair.datasets.NER_FINNISH()

Added DaNE dataset (#1425)

You can now load a Danish NER corpus with

dane = flair.datasets.DANE()

Added SentEval classification datasets (#1454)

Adds 6 SentEval classification datasets to Flair:

senteval_corpus_1 = flair.datasets.SENTEVAL_CR()
senteval_corpus_2 = flair.datasets.SENTEVAL_MR()
senteval_corpus_3 = flair.datasets.SENTEVAL_SUBJ()
senteval_corpus_4 = flair.datasets.SENTEVAL_MPQA()
senteval_corpus_5 = flair.datasets.SENTEVAL_SST_BINARY()
senteval_corpus_6 = flair.datasets.SENTEVAL_SST_GRANULAR()

Added Sentiment Datasets (#1545)

Adds two new sentiment datasets to Flair, namely AMAZON_REVIEWS, a very large corpus of Amazon reviews with sentiment labels, and SENTIMENT_140, a corpus of tweets labeled with sentiment.

amazon_reviews = flair.datasets.AMAZON_REVIEWS()
sentiment_140 = flair.datasets.SENTIMENT_140()

Added BIOfid dataset (#1589)

biofid = flair.datasets.BIOFID()

Refactorings

Any DataPoint can now be labeled (#1450)

Refactored the DataPoint class and classes that inherit from it (Token, Sentence, Image, Span, etc.) so that all have the same methods for adding and accessing labels.

  • DataPoint base class now defines labeling methods (closes #1449)
  • Labels can no longer be passed to the Sentence constructor, so instead of:
sentence_1 = Sentence("this is great", labels=[Label("POSITIVE")])

you should now do:

sentence_1 = Sentence("this is great")
sentence_1.add_label('sentiment', 'POSITIVE')

or:

sentence_1 = Sentence("this is great").add_label('sentiment', 'POSITIVE')

Note that Sentence labels now have a label_type (in the example that's 'sentiment').

  • The Corpus method _get_class_to_count is renamed to _count_sentence_labels
  • The Corpus method _get_tag_to_count is renamed to _count_token_labels
  • Span is now a DataPoint (so it has an embedding and labels)

Embeddings module was split into smaller submodules (#1588)

Split the previously huge embeddings.py into several submodules organized in an embeddings/ folder. The submodules are:

  • token.py for all TokenEmbeddings classes
  • document.py for all DocumentEmbeddings classes
  • image.py for all ImageEmbeddings classes
  • legacy.py for embeddings that are now deprecated
  • base.py for remaining basic classes

All embeddings are still exposed through the embeddings package, so the command to load them doesn't change, e.g.:

from flair.embeddings import FlairEmbeddings
embeddings = FlairEmbeddings('news-forward')

so specifying the submodule is not needed.

Datasets module was split into smaller submodules (#1510)

Split the previously huge datasets.py into several submodules organized in a datasets/ folder. The submodules are:

  • sequence_labeling.py for all sequence labeling datasets
  • document_classification.py for all document classification datasets
  • treebanks.py for all dependency parsed corpora (UD treebanks)
  • text_text.py for all bi-text datasets (currently only parallel corpora)
  • text_image.py for all paired text-image datasets (currently only Feidegger)
  • base.py for remaining basic classes

All datasets are still exposed through the datasets package, so it is still possible to load corpora with

from flair.datasets import TREC_6

without specifying the submodule.

Other refactorings

  • Refactor datasets for code legibility (#1394)

Small refactorings on flair.datasets for easier code legibility and fewer redundancies, removing about 100 lines of code: (1) Moved the default sampling logic from all corpora classes to the parent Corpus class. You can now instantiate a Corpus only with a train file which will trigger the sampling. (2) Move...


Release 0.4.5

24 Jan 16:08
a1ef91a

This is an enhancement release that slims down Flair for quicker/easier installation and smaller library size. It also makes Flair compatible with torch 1.4.0 and adds enhancements that reduce model size and improve runtime speed for some embeddings. New features include the ability to steer the precision/recall tradeoff during training of models and support for CamemBERT embeddings.

Memory, Runtime and Dependency Improvements

Slim down dependency tree (#1296 #1299 #1335 #1336)

We want to keep the list of dependencies of Flair generally small to avoid errors like #1245 and keep the library small and quick to set up. So we removed dependencies that were each only used for one particular feature, namely:

  • ipython and ipython-genutils, only used for visualization settings in iPython notebooks
  • tiny_tokenizer, used for Japanese tokenization (replaced with instructions for how to install for all users who want to use Japanese tokenizers)
  • pymongo, used for MongoDB datasets (replaced with instructions for how to install for all users who want to use MongoDB datasets)
  • torchvision, now only loaded when needed

We also relaxed version requirements for easier installation on Google CoLab (#1335 #1336)

Dramatic speed-up of BERT embeddings (#1308)

@shoarora optimized the BERTEmbeddings implementation by removing redundant calls. This was shown to lead to dramatic speed improvements.

Reduce size of models that use WordEmbeddings (#1315)

@timnon added a method to replace word embeddings in a trained model with an SQLite database to dramatically reduce memory usage. This creates the class WordEmbeddingsStore, which can be used to replace a WordEmbeddings instance in a Flair model via duck-typing. By using this, @timnon was able to reduce our NER server's memory consumption from 6 GB to 600 MB (a 10x decrease) by adding a few lines of code. It can be tested using the following lines (also in the docstring). First create a headless version of a model without word embeddings:

from flair.inference_utils import WordEmbeddingsStore
from flair.models import SequenceTagger
import pickle
tagger = SequenceTagger.load("multi-ner-fast")
WordEmbeddingsStore.create_stores(tagger)
pickle.dump(tagger, open("multi-ner-fast-headless.pickle", "wb"))

and then to run the stored headless model without word embeddings, use:

from flair.data import Sentence
tagger = pickle.load(open("multi-ner-fast-headless.pickle", "rb"))
WordEmbeddingsStore.load_stores(tagger)
text = "Schade um den Ameisenbären. Lukas Bärfuss veröffentlicht Erzählungen aus zwanzig Jahren."
sentence = Sentence(text)
tagger.predict(sentence)

New Features

Prioritize precision/recall or specific classes during training (#1345)

@klasocki added ways to steer the precision/recall tradeoff during training of models, as well as prioritize certain classes. This option was added to the SequenceTagger and the TextClassifier.

You can steer the precision/recall tradeoff by adding the beta parameter, which indicates how much more important recall is than precision. So if you set beta=0.5, precision becomes twice as important as recall. If you set beta=2, recall becomes twice as important as precision. Do it like this:

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    beta=0.5)

If you want to prioritize classes, you can pass a loss_weights dictionary to the model classes. For instance, to prioritize learning the NEGATIVE class in a sentiment tagger, do:

tagger = TextClassifier(
    document_embeddings=embeddings,
    label_dictionary=tag_dictionary,
    loss_weights={'NEGATIVE': 10.})

which will increase the importance of class NEGATIVE by a factor of 10.

CamemBERT Embeddings (#1297)

@stefan-it added support for the recently proposed French language model: CamemBERT.

Thanks to the awesome 🤗/Transformers library, CamemBERT can be used in Flair like in this example:

from flair.data import Sentence
from flair.embeddings import CamembertEmbeddings

embedding = CamembertEmbeddings()

sentence = Sentence("J'aime le camembert !")
embedding.embed(sentence)

for token in sentence.tokens:
  print(token.embedding)

Bug fixes and enhancements

  • Fix new RNN format for torch 1.4.0 (#1360, #1382 )
  • Fix memory issue in PooledFlairEmbeddings (#1337 #1339)
  • Correct subtoken mapping function for GPT-2 and RoBERTa (#1242)
  • Update the transformers library to the latest 2.3 version (#1333)
  • Add staticmethod decorator to some functions (#1257)
  • Add a warning if validation data is too small (#1115)
  • Remove leftover printline from MUSE embeddings (#1224)
  • Correct generate_text() UTF-8 conversion (#1238)
  • Clarify documentation (#1295 #1332)
  • Replace sklearn by scikit-learn (#1321)
  • Fix off-by-one error in progress logging (#1334)
  • Fix typo and annotation (#1341)
  • Various improvements (#1347)
  • Make load_big_file work with read-only file (#1353)
  • Rename tiny_tokenizer to konoha (#1363)
  • Make test loss plotting optional (#1372)
  • Add pretty print function for Dictionary (#1375)

Release 0.4.4

20 Oct 22:22
e223601

Release 0.4.4 introduces dramatic improvements in inference speed for taggers (thanks to many contributions by @pommedeterresautee), Flair embeddings in 300 languages (thanks @stefan-it), modular tokenization and many new features and refactorings.

Speed optimizations

Many refactorings by @pommedeterresautee to improve inference speed of sequence tagger (#1038 #1053 #1068 #1093 #1130), Flair embeddings (#1074 #1095 #1107 #1132 #1145), word embeddings (#1084),
embeddings memory management (#1082 #1117), general optimizations (#1112) and classification (#1187).

The combined improvements increase inference speed by a factor of 2-3!

New features

Modular tokenization (#1022)

You can now pass custom tokenizers to Sentence objects and Dataset loaders to use different tokenizers than the included segtok library by implementing a tokenizer method. Currently, in-built support exists for whitespace tokenization, segtok tokenization and Japanese tokenization with mecab (requires mecab to be installed). In the future, we expect support for additional external tokenizers to be added.

For instance, if you wish to use Japanese tokenization performed by MeCab, you can instantiate the Sentence object like this:

from flair.data import build_japanese_tokenizer
from flair.data import Sentence

# instantiate Japanese tokenizer
japanese_tokenizer = build_japanese_tokenizer()

# init sentence and pass this tokenizer
sentence = Sentence("私はベルリンが好きです。", use_tokenizer=japanese_tokenizer)
print(sentence)
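
As a further illustration, a custom tokenizer is simply a callable that maps text to a list of Token objects. A minimal sketch (purely illustrative; the built-in whitespace tokenizer already covers this case):

from flair.data import Sentence, Token

# a custom tokenizer is just a method that turns text into a list of Token objects
def whitespace_tokenizer(text: str):
    return [Token(word) for word in text.split()]

# init sentence and pass this tokenizer
sentence = Sentence("I love Berlin .", use_tokenizer=whitespace_tokenizer)
print(sentence)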

Flair Embeddings for 300 languages (#1146)

Thanks to @stefan-it, there is now a massively multilingual Flair embeddings model that covers 300 languages. See #1099 for more info on these embeddings and this repo for more details.

This replaces the old multilingual Flair embeddings that were trained for 6 languages. Load them with:

from flair.embeddings import FlairEmbeddings

embeddings_fw = FlairEmbeddings('multi-forward')
embeddings_bw = FlairEmbeddings('multi-backward')

Multilingual Character Dictionaries (#1157)

Adds two multilingual character dictionaries computed by @stefan-it.

Load with

from flair.data import Dictionary

dictionary = Dictionary.load('chars-large')
print(len(dictionary.idx2item))

dictionary = Dictionary.load('chars-xl')
print(len(dictionary.idx2item))

Batch-growth annealing (#1138)

The paper Don't Decay the Learning Rate, Increase the Batch Size makes the case for increasing the batch size over time instead of annealing the learning rate.

This version adds the possibility to have arbitrarily large mini-batch sizes with an accumulating gradient strategy. It introduces the parameter mini_batch_chunk_size that you can set to break down large mini-batches into smaller chunks for processing purposes.

So let's say you want to have a mini-batch size of 128, but your memory cannot handle more than 32 samples at a time. Then you can train like this:

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set large mini-batch size
    mini_batch_size=128,
    # set chunk size to lower memory requirements
    mini_batch_chunk_size=32,
)

Because we can now use arbitrarily large mini-batch sizes, we can execute the annealing strategy from the paper above. Do it like this:

trainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/experiment/folder",
    # set initial mini-batch size
    mini_batch_size=32,
    # choose batch growth annealing 
    batch_growth_annealing=True,
)

Document-level sequence labeling (#1194)

Introduces the option for reading entire documents into one Sentence object for sequence labeling. This option is now supported for CONLL_03, CONLL_03_GERMAN and CONLL_03_DUTCH datasets which indicate document boundaries.

Here's how to train a model on CoNLL-03 on the document level:

from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings
from flair.models import SequenceTagger

# read CoNLL-03 with document_as_sequence=True
corpus = CONLL_03(in_memory=True, document_as_sequence=True)

# what tag do we want to predict?
tag_type = 'ner'

# make the tag dictionary from the corpus
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)

# init simple tagger with GloVe embeddings
tagger: SequenceTagger = SequenceTagger(
    hidden_size=256,
    embeddings=WordEmbeddings('glove'),
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

# initialize trainer
from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# start training
trainer.train(
    'path/to/your/experiment',
    # set a much smaller mini-batch size because documents are huge
    mini_batch_size=2,
)

Option to evaluate on training split (#1202)

Previously, the ModelTrainer only allowed monitoring of dev and test splits during training. Now, you can also monitor the train split to better check if your method is overfitting.
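
For example (a sketch, assuming the option is exposed as the monitor_train parameter of train()):

from flair.trainers import ModelTrainer

trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/experiment/folder',
    # also evaluate on the training split after each epoch (parameter name assumed)
    monitor_train=True,
)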

Support for Danish tagging (#1183)

Adds support for Danish POS and NER thanks to @AmaliePauli!

Use like this:

from flair.data import Sentence
from flair.models import SequenceTagger

# example sentence
sentence = Sentence("København er en fantastisk by .")

# load Danish NER model and predict
ner_tagger = SequenceTagger.load('da-ner')
ner_tagger.predict(sentence)

# print annotations (NER)
print(sentence.to_tagged_string())

# load Danish POS model and predict
pos_tagger = SequenceTagger.load('da-pos')
pos_tagger.predict(sentence)

# print annotations (NER + POS)
print(sentence.to_tagged_string())

Support for DistilBERT embeddings (#1044)

You can use them like this:

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

embeddings = BertEmbeddings("distilbert-base-uncased")

s = Sentence("Berlin and Munich are nice cities .")
embeddings.embed(s)

for token in s.tokens:
  print(token.embedding)
  print(token.embedding.shape)

MongoDataset for reading text classification data from a Mongo database (#1192)

Adds the option of reading data from MongoDB. See the documentation on how to use this feature.

Feidegger corpus (#1199)

Adds a dataset downloader for the Feidegger corpus consisting of text-image pairs. Instantiate the corpus like this:

from flair.datasets import FeideggerCorpus

# instantiate Feidegger corpus
corpus = FeideggerCorpus()

# print a text-image pair
print(corpus.train[0])

Refactorings

Refactor checkpointing mechanism (#1101)

Refactored the checkpointing mechanism and slimmed down interfaces / code required to load checkpoints.

In detail:

  • The methods save_checkpoint and load_checkpoint are no longer part of the flair.nn.Model interface. Instead, saving and restoring checkpoints is now (fully) performed by the ModelTrainer.
  • The optimizer state and scheduler state are removed from the ModelTrainer constructor since they are no longer required here.
  • Loading a checkpoint is now one line of code (previously two lines).
# 1. initialize trainer as always with a model and a corpus
from flair.trainers import ModelTrainer
trainer: ModelTrainer = ModelTrainer(model, corpus)

# 2. train your model for 2 epochs
trainer.train(
    'experiment/folder',
    max_epochs=2,
    # example checkpointing
    checkpoint=True,
)

# 3. load last checkpoint with one line of code
trainer = ModelTrainer.load_checkpoint('experiment/folder/checkpoint.pt', corpus)

# 4. continue training for 2 extra epochs
trainer.train('experiment/folder_2',  max_epochs=4) 

Refactor data sampling during training (#1154)

Adds a FlairSampler interface to better enable passing custom samplers to the ModelTrainer.

For instance, if you want to always shuffle your dataset in chunks of 5 to 10 sentences, you provide a sampler like this:

# your trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)

# execute training run
trainer.train('path/to/experiment/folder',
              max_epochs=150,
              # sample data in chunks of 5 to 10
              sampler=ChunkSampler(block_size=5, plus_window=5)
              )

Other refactorings

  • Switch everything to batch first mode (#1077)

  • Refactor classification to be more consistent with SequenceTagger (#1151)

  • PyTorch-Transformers -> Transformers (#1163)

  • In-place transpose of tensors (#1047)

Enhancements

Documentation fixes (#1045 #1098 #1121 #1157 #1160 #1168 )

Add option to set rnn_type used in SequenceTagger (#1113)
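
For instance, to use a GRU instead of the default LSTM (a sketch, assuming rnn_type accepts the string 'GRU'):

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
    # swap the default LSTM for a GRU
    rnn_type='GRU',
)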

Accept string as input in NER predict (#1142)

Example usage:

from flair.models import SequenceTagger

# init tagger
tagger = SequenceTagger.load('ner')

# predict over list of strings
sentences = tagger.predict(
    [
        'George Washington went to Berlin .', 
        'George Berlin lived in Washington .'
    ]
)

# output predictions
for sentence in sentences:
    print(sentence.to_tagged_string())

Enable One-hot Embeddings of other Tags (#1191)
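
A sketch of how this could look, assuming the tag to embed is selected via a field parameter on OneHotEmbeddings (parameter name assumed, not confirmed here):

from flair.embeddings import OneHotEmbeddings

# one-hot embed the POS tag of each token instead of its text
# ('field' is an assumed parameter name based on this feature's description)
tag_embeddings = OneHotEmbeddings(corpus, field='pos')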

Bug fixes

  • Fix the learning rate finder (#1119)
  • Fix OneHotEmbeddings on Cuda (#1147)
  • Fix encoding error in CSVClassificationDataset (#1055)
  • Fix encoding errors related to old windows chars (#1135)
  • Fix length error in CharacterEmbeddings (#1088 )
  • Fix tokenizer inserting empty tokens into Sentence objects (#1226)
  • Ensure StackedEmbeddings always has the same embedding order (#1114)
  • Use $HOME instead of ~ for cache_root (#1134)

Release 0.4.3

26 Aug 18:26
ff9846d
Compare
Choose a tag to compare

Release 0.4.3 includes a host of new features, including transformer-based embeddings (RoBERTa, XLNet, XLM, etc.), fine-tuneable FlairEmbeddings, crosslingual MUSE embeddings, new data loading/sampling methods, speed/memory optimizations, bug fixes and enhancements. It also begins a refactoring of interfaces that prepares Flair for more general applicability to other types of downstream tasks.

Embeddings

Transformer embeddings (#941 #972 #993)

Updates the old pytorch-pretrained-BERT library to the latest version of pytorch-transformers to support various new Transformer-based architectures for embeddings.

A total of 7 (new/updated) transformer-based embeddings can be used in Flair now:

from flair.embeddings import (
    BertEmbeddings,
    OpenAIGPTEmbeddings,
    OpenAIGPT2Embeddings,
    TransformerXLEmbeddings,
    XLNetEmbeddings,
    XLMEmbeddings,
    RoBERTaEmbeddings,
)

bert_embeddings = BertEmbeddings()
gpt1_embeddings = OpenAIGPTEmbeddings()
gpt2_embeddings = OpenAIGPT2Embeddings()
txl_embeddings = TransformerXLEmbeddings()
xlnet_embeddings = XLNetEmbeddings()
xlm_embeddings = XLMEmbeddings()
roberta_embeddings = RoBERTaEmbeddings()

Detailed benchmarks on the downsampled CoNLL-2003 NER dataset for English can be found in #873 .

Crosslingual MUSE Embeddings (#853)

Use the new MuseCrosslingualEmbeddings class to embed any sentence in one of 30 languages into the same embedding space. Behind the scenes, the class first detects the language of the sentence to be embedded and then embeds it with the appropriate language embeddings. If you train a classifier or sequence labeler with (only) this class, it will automatically work across all 30 languages, though quality may vary widely.

Here's how to embed:

import torch

from flair.data import Sentence
from flair.embeddings import MuseCrosslingualEmbeddings

# initialize embeddings
embeddings = MuseCrosslingualEmbeddings()

# two sentences in different languages
sentence_1 = Sentence("This red shoe is new .")
sentence_2 = Sentence("Dieser rote Schuh ist rot .")

# language code is auto-detected
print(sentence_1.get_language_code())
print(sentence_2.get_language_code())

# embed sentences
embeddings.embed([sentence_1, sentence_2])

# print similarities
cos = torch.nn.CosineSimilarity(dim=0, eps=1e-6)
for token_1, token_2 in zip(sentence_1, sentence_2):
    print(f"'{token_1.text}' and '{token_2.text}' similarity: {cos(token_1.embedding, token_2.embedding)}")

FastTextEmbeddings (#879 )

Adds FastTextEmbeddings, which are capable of handling out-of-vocabulary (OOV) words. Be warned though that these embeddings are huge: BytePairEmbeddings are much smaller and reportedly of similar quality, so it is probably advisable to use those instead.
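
If you still want to try them, a minimal sketch (assuming the constructor takes a path to a local fastText .bin model; the path below is illustrative, not a real download):

from flair.data import Sentence
from flair.embeddings import FastTextEmbeddings

# load a local fastText binary model (illustrative path)
embeddings = FastTextEmbeddings('path/to/cc.en.300.bin')

# OOV words still receive a vector thanks to subword information
sentence = Sentence("Berlinification is not a real word .")
embeddings.embed(sentence)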

Fine-tuneable FlairEmbeddings (#922)

You can now fine-tune FlairEmbeddings on downstream tasks. You can fine-tune an existing LM by simply passing the fine_tune parameter in the FlairEmbeddings constructor, like this:

embeddings = FlairEmbeddings('news-forward', fine_tune=True)

You can also use this option to task-train a wholly new language model by passing an empty LanguageModel to the FlairEmbeddings constructor and the fine_tune parameter, like this:

# make an empty language model
language_model = LanguageModel(
    Dictionary.load('chars'),
    is_forward_lm=True,
    hidden_size=256,
    nlayers=1)

# init FlairEmbeddings to task-train this model
embeddings = FlairEmbeddings(language_model, fine_tune=True)

Optimizations

Automatic mixed precision support (#934)

Mixed precision training can significantly speed up training. It can now be enabled by setting use_amp=True in the trainer classes. For instance for training language models you can do:

# train your language model
trainer = LanguageModelTrainer(language_model, corpus)

trainer.train('resources/taggers/language_model',
              sequence_length=256,
              mini_batch_size=256,
              max_epochs=10,
              use_amp=True)

In our experiments, we saw a 3x speedup when training large language models, though results vary depending on model size and experimental setup.

Control memory / speed tradeoff during training (#891 #809)

This release introduces the embeddings_storage_mode parameter to the ModelTrainer class and predict() methods. This parameter can be one of 'none', 'cpu' and 'gpu' and allows you to control the tradeoff between memory usage and speed during training:

  • If set to 'none' all embeddings are deleted after usage - this has lowest memory requirements but means that embeddings need to be recomputed at each epoch of training potentially causing a slowdown.
  • If set to 'cpu' all embeddings are moved to CPU memory after usage. During training, this means that they only need to be moved back to GPU for the forward pass, and not recomputed so in many cases this is faster, but requires memory.
  • If set to 'gpu' all embeddings stay on GPU memory after computation. This eliminates memory shuffling during training, causing a speedup. However this option requires enough GPU memory to be available for all embeddings of the dataset.

To use this option during training, simply set the parameter:

# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    "path/to/your/model",
    embeddings_storage_mode='gpu',
)

This release also removes the FlairEmbeddings-specific disk-caching mechanism. In the future, a more general caching mechanism applicable to all embedding types may potentially be added as a fourth memory management option.

Speed-ups on in-memory datasets (#792)

A new DataLoader abstract base class used in Flair will speed up data loading for in-memory datasets.

Refactoring of interfaces (#891 #843)

This release also slims down interfaces of flair.nn.Model and adds a new DataPoint interface that is currently implemented by the Token and Sentence classes. The idea is to widen the applicability of Flair to other data types and other tasks. In the future, the DataPoint interface will for example also be implemented by an Image object and new downstream tasks added to Flair.

The release also slims down the evaluate() method in the flair.nn.Model interface to take a DataLoader instead of a group of parameters, and refactors the logging header logic. Both refactorings prepare for adding new downstream tasks to Flair in the near future.

Other features

Training Classifiers with CSV files (#826 #952 #967)

Adds the CSVClassificationCorpus so you can train classifiers directly from CSVs instead of first having to convert to FastText format. To load a CSV, you need to pass a column_name_map (like in ColumnCorpus) that indicates which column(s) in the CSV hold the text and which hold the label(s):

corpus = CSVClassificationCorpus(
    # path to the data folder containing train / test / dev files
    data_folder='path/to/data',
    # indicates which columns are text and labels
    column_name_map={4: "text", 1: "label_topic", 2: "label_subtopic"},
    # if CSV has a header, you can skip it
    skip_header=True)

Data sampling (#908)

We added the first (of many) data samplers that can be passed to the ModelTrainer to influence training. The ImbalancedClassificationDatasetSampler, for instance, will upsample rare classes and downsample common classes in a classification dataset. This may help with imbalanced datasets. Call it like this:

# initialize trainer
trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/folder',
    learning_rate=0.1,
    mini_batch_size=32,
    sampler=ImbalancedClassificationDatasetSampler,
)

There are also two experimental chunk samplers (ChunkSampler and ExpandingChunkSampler) that split a dataset into chunks and shuffle them. This preserves some of the ordering of the original data while also randomizing it.

Visualization

  • Adds HTML visualization of sequence labeling (#933). Call like this:
from flair.data import Sentence
from flair.models import SequenceTagger
from flair.visual.ner_html import render_ner_html

tagger = SequenceTagger.load('ner')

sentence = Sentence(
    "Thibaut Pinot's challenge ended on Friday due to injury, and then Julian Alaphilippe saw "
    "his lead fall away. The BBC's Hugh Schofield in Paris reflects on 34 years of hurt."
)

tagger.predict(sentence)
html = render_ner_html(sentence)

with open("sentence.html", "w") as writer:
    writer.write(html)
  • Plotter now returns images for use in iPython notebooks (#943)
  • Initial TensorBoard support (#924); see the sketch after this list
  • Add pointer to Flair Visualizer (#1014)
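
For the TensorBoard support, a minimal sketch, assuming the flag is exposed as use_tensorboard on the ModelTrainer constructor (name assumed, not confirmed here):

from flair.trainers import ModelTrainer

# log training curves for inspection in TensorBoard (flag name assumed)
trainer: ModelTrainer = ModelTrainer(tagger, corpus, use_tensorboard=True)
trainer.train('path/to/experiment/folder')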

Additional parameterization options

  • CharacterEmbeddings now let you specify number of hidden states and embedding size (#834)
embedding = CharacterEmbeddings(char_embedding_dim=64, hidden_size_char=64)
  • Adds configuration option for minimal learning rate stopping criterion (#871); see the sketch after this list
  • num_workers is a parameter of LanguageModelTrainer (#962 )
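
For the learning rate stopping criterion, a sketch assuming the option is the min_learning_rate parameter of train() (name assumed):

trainer: ModelTrainer = ModelTrainer(tagger, corpus)
trainer.train(
    'path/to/experiment/folder',
    learning_rate=0.1,
    # stop training once annealing drops the learning rate below this value (name assumed)
    min_learning_rate=0.001,
)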

Bug fixes / enhancements

  • Updates old pretrained models to remove old bugs / performance issues (#1017)
  • Fix error in RNN initialization in DocumentRNNEmbeddings (#793)
  • ELMoEmbeddings now use flair.device param (#825)
  • Fix download of TREC_6 dataset (#896)
  • Fix download of UD_GERMAN-HDT (#980)
  • Fix download of WikiNER_German (#1006)
  • Fix error in ColumnCorpus in which words that begin with hashtags were skipped as comments (#956)
  • Fix max_tokens_per_doc param in ClassificationCorpus (#991)
  • Simplify split rule in ColumnCorpus (#990)
  • Fix import error message for ELMoEmbeddings (#1019)
  • References to Persian language unified across embedd...