GH-1492: added new BERT embeddings implementation #1494

Merged: 18 commits into master, Apr 3, 2020

Conversation

@kishaloyhalder (Collaborator)

Sources for the simplified BertEmbeddings class.

# check if this token initiates a new word, internal subtokens start with ##
if not subtoken.startswith('##') and len(token_subtoken_embeddings) > 0:
@alanakbik (Collaborator)

This does not always work. For instance the code

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

train_sentence = Sentence('JAKARTATown 1996-08-27')
embeddings = BertEmbeddings(pooling_operation='first')
embeddings.embed(train_sentence)

will break because Flair considers this to be two tokens ("JAKARTATown" and "1996-08-27"), while BERT subtokenizes it as ['[CLS]', 'jakarta', '##town', '1996', '-', '08', '-', '27', '[SEP]']. So 1996-08-27 is split, but there are no ## prefixes to indicate this.
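For reference, the subtokenization above can be reproduced directly with the Hugging Face tokenizer (bert-base-uncased is assumed here, since the subtokens are lowercased):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("JAKARTATown 1996-08-27"))
# per the subtokenization above (minus the special tokens):
# ['jakarta', '##town', '1996', '-', '08', '-', '27']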

@stefan-it you use the method _get_transformer_sentence_embeddings to go from subtokens to tokens for all transformer embeddings except for BERT. Can this method also be used for BERT or is there something different here?

@stefan-it (Member), Mar 26, 2020

@alanakbik yes, it can be used here. The method should be robust enough to handle cases where the BERT tokenizer discards tokens (e.g. special characters; in such cases the <unk> token is used to get a correct embedding for the "real" Flair token).
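To illustrate the idea (a rough sketch of this alignment strategy, not the actual _get_transformer_sentence_embeddings code): each Flair token is re-tokenized on its own, and when the tokenizer returns nothing, the unknown token keeps the alignment intact.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def align_subtokens(flair_tokens):
    """Sketch: map each Flair token to its subtokens, falling back to the
    unknown token when the tokenizer discards the input entirely."""
    aligned = []
    for token_text in flair_tokens:
        subtokens = tokenizer.tokenize(token_text)
        if not subtokens:  # e.g. control characters such as '\x96'
            subtokens = [tokenizer.unk_token]
        aligned.append((token_text, subtokens))
    return aligned

print(align_subtokens(["JAKARTATown", "1996-08-27", "\x96"]))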

@alanakbik (Collaborator)

Ah cool - I'll refactor it in, then! Do you have an example at hand of a case in which the BERT tokenizer discards tokens?

@stefan-it (Member), Mar 27, 2020

At the moment I've only seen '\x96', '\u200e', '\x95', '\xad' or '\x80' as problematic tokens (or rather, control characters):

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokenizer.tokenize("\x96")

# returns []

I found these tokens in the GermEval 2014 dataset (yes, this is bad) when fine-tuning BERT models with the Transformers NER code.

@stefan-it (Member)

I think we can heavily refactor the Transformer-based model inputs (like input ids and masks), because they're almost the same across all architectures. A good reference is the Hugging Face fine-tuning code for sequence labeling:

https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py#L172-L178

The crucial part is always the different tokenization schemes - GPT-2 tokenization is horrible...
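For context, the shared part looks roughly like this (a sketch using encode_plus, the API of the transformers version of that era; exact keyword arguments vary between releases):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# the same tensors drive (almost) every architecture, only the tokenization differs
encoded = tokenizer.encode_plus("The grass is green", add_special_tokens=True, return_tensors="pt")
print(encoded["input_ids"])       # subtoken ids, including the special tokens
print(encoded["attention_mask"])  # 1 for real subtokens, 0 for padding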

@alanakbik (Collaborator)

Yes, I was just looking at the HuggingFace AutoModel classes: https://huggingface.co/transformers/model_doc/auto.html - based on this I'm trying to write a single TransformerWordEmbeddings class for all transformer embeddings, instantiated with the model name. However, as you write, the big problem is matching all the different subtokenization schemes and their special symbols to our own tokens.
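A quick way to see the subtokenization problem across architectures (a small illustration, not part of the PR code):

from transformers import AutoTokenizer

for model_name in ["bert-base-uncased", "roberta-base", "xlnet-base-cased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(model_name, tokenizer.tokenize("The grass is green"))

# BERT marks word-internal pieces with '##', RoBERTa/GPT-2 prefix tokens that follow
# whitespace with 'Ġ', and XLNet uses '▁' - which is why mapping subtokens back to
# Flair tokens needs tokenizer-specific handling.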

@alanakbik (Collaborator)

Hello @stefan-it @kishaloyhalder, based on our discussions I added a TransformerWordEmbeddings class, adapting @stefan-it's code for the various transformer embeddings classes. It uses the AutoModel classes from transformers, plus some logic to get from subtokens to tokens.

You can initialize various embeddings by passing the Hugging Face model name, e.g.:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# example sentence
sentence = Sentence('The grass is green')

# a BERT model
embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a RoBERTa model
embeddings = TransformerWordEmbeddings(model="distilroberta-base", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# GPT-2
embeddings = TransformerWordEmbeddings(model="gpt2", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a T5 model
embeddings = TransformerWordEmbeddings(model="t5-small", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

Something like this would have the advantage that if more models are added to transformers and they are supported through the AutoModel logic (and don't have a crazy new tokenization scheme), we would not need to update the code to support them. What do you think?

If we prefer the model to be reflected in the class name, we could additionally create subclasses - I put in the example subclass BERTEmbeddings in the code for discussion.

@kishaloyhalder this class also only does word-level embeddings, but a similar class TransformerDocumentEmbeddings could be written, minus all the tokenization logic but plus logic for the sentence classification token.
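A minimal sketch of what the document-level variant could boil down to (assuming a BERT-style model where the hidden state of the first special token, [CLS], serves as the sentence representation; other architectures would need different pooling):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

input_ids = torch.tensor([tokenizer.encode("The grass is green", add_special_tokens=True)])
with torch.no_grad():
    hidden_states = model(input_ids)[0]  # [1, num_subtokens, hidden_size]

# document embedding = hidden state of the [CLS] token at position 0
document_embedding = hidden_states[0, 0]
print(document_embedding.shape)  # torch.Size([768]) for bert-base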

@stefan-it (Member)

Thanks @alanakbik, I will test it :)

One question: could we also refactor the unit tests I wrote for these embeddings (tests/test_transformer_embeddings.py)?

The tests mainly check whether the alignment between a Flair token and its subword token(s) is working and whether the correct embeddings are taken from Transformers. I can help with that!
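For reference, the kind of check meant here could look like this minimal sketch (not the existing tests/test_transformer_embeddings.py, just an illustration using the class from this PR; the hidden size 768 assumes bert-base):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

def test_bert_word_embeddings_alignment():
    embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation="first")
    sentence = Sentence("The grass is green")
    embeddings.embed(sentence)

    # every Flair token should receive a non-empty embedding of the model's hidden size
    for token in sentence:
        assert len(token.embedding) == 768  # hidden size of bert-base
        assert token.embedding.abs().sum().item() > 0.0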

@alanakbik (Collaborator)

@stefan-it thanks! Yes, help would be very much appreciated here!

@alanakbik (Collaborator)

@stefan-it I'm thinking of merging this PR already, since I've started testing training classifiers with fine-tuning and made some modifications to other classes (datasets and model trainer) along the way. The idea would be to merge now and add more things to the transformer embeddings in later PRs, or fix problems as they become apparent when using this to train classifiers. For instance, one TODO is better truncation logic for longer sentences. This would leave the original transformer embeddings classes as they are for now (so that all models trained with them still work) but move them into a deprecated state at some point. Is this ok from your side?

There is also a refactoring of embeddings.py coming up (the file is way too large) that will turn it into a folder embeddings/ with submodules word.py (all word embeddings), document.py (all document embeddings), image.py (image embeddings) and legacy.py (all deprecated embeddings, which would then include the old transformer embedding classes).

@stefan-it (Member)

@alanakbik This is totally ok! I'm currently working on a refactoring of the unit tests for these Transformer-based embeddings.

Refactoring the current embeddings.py is also a great idea 👍

alanakbik merged commit e9b5c2a into master on Apr 3, 2020
@stefan-it (Member) commented Apr 3, 2020

Currently testing the PR on CoNLL with bert-base-cased and scalar mix - after the training has finished I'll train another model with the new fine-tuning option for comparison.

@alanakbik (Collaborator)

@stefan-it awesome! To set the Adam optimizer and a smaller learning rate you can use parameters like this:

import torch
from flair.trainers import ModelTrainer

# model and corpus as set up for your training run
trainer = ModelTrainer(model, corpus, optimizer=torch.optim.Adam)
trainer.train(
    f'path/to/output/folder',
    learning_rate=3e-5,  # fine-tuning always uses very small learning rate
    min_learning_rate=3e-6, # needs to be set otherwise it quits immediately
    mini_batch_size=256, # set this high if corpus is large, otherwise lower
    mini_batch_chunk_size=2, # set if mini-batch size is too large for memory
    anneal_with_restarts=True,
    anneal_factor=0.1,
    patience=1,
    max_epochs=20,
)

@stefan-it (Member)

Just an update:

Model without fine-tuning reaches 94.34 (dev) and 90.67 (test). It's ~0.7% behind the old implementation.

For fine-tuning I used the same parameters as above: 93.75 (dev) and 89.91 (test) with the BERT base cased model.

But I'll do more experiments soon :)

@alanakbik (Collaborator)

Oha that is not good :/ thanks for looking into this!

@stefan-it (Member)

Just came up with the following idea for an integration test 😅

I will use the CoNLL dataset and embed each sentence with the old and new implementation. Then the embeddings for each token can be compared to spot any differences.

Will report back today when I've finished the comparison.

After that I could look into the fine-tuning code.

Btw: great resource for fine-tuning Transformer-based models: "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" :)

@alanakbik (Collaborator)

Great idea, thanks! Yes, maybe the new way of determining tokens does not work as well as before.

Also thanks for the pointer to the paper - exactly what we were looking for!

@alanakbik (Collaborator)

@stefan-it did you have a chance to do the comparison?

@stefan-it (Member) commented Apr 9, 2020

@alanakbik

I used the following script for comparison:

import torch

import flair.datasets
from flair.data import Corpus
from flair.embeddings import (
    BertEmbeddings,
    TransformerWordEmbeddings,
)

# 1. get the corpus
corpus_old: Corpus = flair.datasets.CONLL_03()

bert_model = "bert-base-cased"
layers = "0,1,2,3,4,5,6,7,8,9,10,11,12"
use_scalar_mix = True

embeddings_old = BertEmbeddings(bert_model_or_path=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)
embeddings_new = TransformerWordEmbeddings(model=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)

corpus_new: Corpus = flair.datasets.CONLL_03()

def compare(corpus_old_split, corpus_new_split):

    for sentence_old, sentence_new in zip(corpus_old_split, corpus_new_split):
        embeddings_old.embed(sentence_old)
        embeddings_new.embed(sentence_new)

        mismatched_tokens = []

        for token_old, token_new in zip(sentence_old.tokens, sentence_new.tokens):
            #print(token_old.text, token_new.text)
            assert token_old.text == token_new.text

            if not torch.equal(token_old.embedding, token_new.embedding):
                mismatched_tokens.append(token_old.text)

        if mismatched_tokens:
            print("Mismatch for sentence:", sentence_old)
            print("Mismatched tokens:", mismatched_tokens)
            print("")

compare(corpus_old.train, corpus_new.train)

It's running at the moment 😅

@stefan-it (Member) commented Apr 9, 2020

BERT cased and uncased embeddings are identical 👍

OpenAI GPT is not working. I'll do experiments for all Transformer-based embeddings and report back the results here:

| Model | Identical |
| --- | --- |
| BERT, base, cased | ✅ |
| BERT, base, uncased | ✅ |
| OpenAI GPT | ❎ (errors, not working) |
| OpenAI GPT-2 | ❎ (mismatches) |
| Transformer-XL | ❎ (attention_mask error) |
| XLNet | ❎ (mismatches) |
| RoBERTa | ❎ (mismatches) |
| CamemBERT | ❎ (2 mismatches in train set) |
| XLM-RoBERTa | |

@alanakbik (Collaborator)

@stefan-it thanks for looking into this! Good to hear that at least for BERT it is working, but strange then that the results on NER differ. Any ideas why this could be the case?

@stefan-it (Member)

@alanakbik Maybe this is batch-size related; I'm currently running another run with the same hyper-parameters that I used for previous experiments.

After that I will run experiments with the other Transformer-based models to see if the new implementation needs some adjustments :)

@stefan-it (Member)

Another run for BERT (base, cased): 94.47 (dev) and 91.07 (test).

RoBERTa (large) does not look OK: 94.99 (dev) and 91.00 (test), compared to previous 96.31 (dev) and 92.31 (test).

@alanakbik (Collaborator)

Thanks for looking into this. Checking on RoBERTa, it seems that the difference is in how sentences are tokenized. The original RoBERTaEmbeddings class uses the tokenizer.tokenize() method and then adds begin and end markers to the subtokens. The TransformerWordEmbeddings class uses tokenizer.encode() with add_special_tokens set to True (the default), so the begin and end tokens are added by the tokenizer.

The result is nearly exactly the same, except for the first subtoken, which in the old method does not get a Ġ-prefix, but in the new one does. See this code:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# old way
print(tokenizer.tokenize("CRICKET MATCH"))

# new way
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))

Old way outputs ['CR', 'ICK', 'ET', 'ĠM', 'ATCH'] ('<s>' and '</s>' get added later)
New way outputs ['<s>', 'ĠCR', 'ICK', 'ET', 'ĠM', 'ATCH', '</s>']

So the difference is the first subtoken. Intuitively, the second seems more consistent in that each new word gets prefixed with a Ġ-, but it seems the old one gives better results. Any ideas which tokenization scheme is correct?

@alanakbik (Collaborator)

It is similar for XLNetEmbeddings. The original class will give us the following tokenization:

['<s>', '▁CR', 'ICK', 'ET', '▁M', 'ATCH', '</s>']

while the new class will give us the following:

['▁CR', 'ICK', 'ET', '▁M', 'ATCH', '<sep>', '<cls>']

The question is: which one is correct?
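Both outputs can be reproduced the same way as the RoBERTa check above (xlnet-base-cased assumed):

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

# old way: plain tokenize(), with '<s>' and '</s>' added by the embeddings class afterwards
print(tokenizer.tokenize("CRICKET MATCH"))

# new way: encode() appends XLNet's own special tokens, '<sep>' and '<cls>', at the end
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))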

@stefan-it (Member)

@alanakbik I reported this behavior a while ago here: huggingface/transformers#1196

Btw: I removed the if isinstance(self.tokenizer, GPT2Tokenizer): check and re-ran the experiment with RoBERTa: 95.68 (dev) and 92.01 (test), which is now only ~0.3% behind the older experiments :)
