GH-1492: added new BERT embeddings implementation #1494

Merged: 18 commits into master, Apr 3, 2020

Conversation

@kishaloyhalder (Collaborator)

Sources for the simplified BertEmbeddings class.

# check if this token initiates a new word, internal subtokens start with ##
if not subtoken.startswith('##') and len(token_subtoken_embeddings) > 0:
@alanakbik (Collaborator)

This does not always work. For instance the code

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

train_sentence = Sentence('JAKARTATown 1996-08-27')
embeddings = BertEmbeddings(pooling_operation='first')
embeddings.embed(train_sentence)

will break because Flair considers this to be two tokens ("JAKARTATown" and "1996-08-27"), while BERT subtokenizes it as ['[CLS]', 'jakarta', '##town', '1996', '-', '08', '-', '27', '[SEP]']. So 1996-08-27 is split, but there are no ## prefixes to indicate this.
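For reference, the subtokenization above can be reproduced directly with the Hugging Face tokenizer (bert-base-uncased is assumed here, since the subtokens are lowercased):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("JAKARTATown 1996-08-27"))
# per the subtokenization above (minus the special tokens):
# ['jakarta', '##town', '1996', '-', '08', '-', '27']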

@stefan-it you use the method _get_transformer_sentence_embeddings to go from subtokens to tokens for all transformer embeddings except for BERT. Can this method also be used for BERT or is there something different here?

@stefan-it (Member), Mar 26, 2020

@alanakbik yes, it can be used here. The method should be robust enough to handle cases where the BERT tokenizer discards tokens (e.g. special characters; in such cases the <unk> token is used to get a correct embedding for the "real" Flair token).
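To illustrate the idea (a rough sketch of this alignment strategy, not the actual _get_transformer_sentence_embeddings code): each Flair token is re-tokenized on its own, and when the tokenizer returns nothing, the unknown token keeps the alignment intact.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def align_subtokens(flair_tokens):
    """Sketch: map each Flair token to its subtokens, falling back to the
    unknown token when the tokenizer discards the input entirely."""
    aligned = []
    for token_text in flair_tokens:
        subtokens = tokenizer.tokenize(token_text)
        if not subtokens:  # e.g. control characters such as '\x96'
            subtokens = [tokenizer.unk_token]
        aligned.append((token_text, subtokens))
    return aligned

print(align_subtokens(["JAKARTATown", "1996-08-27", "\x96"]))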

@alanakbik (Collaborator)

Ah cool - I'll refactor it in, then! Do you have an example at hand of a case in which the BERT tokenizer discards tokens?

@stefan-it (Member), Mar 27, 2020

At the moment I've only seen '\x96', '\u200e', '\x95', '\xad' or '\x80' as problematic tokens (or rather, control characters):

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokenizer.tokenize("\x96")

# returns []

I found these tokens in the GermEval 2014 dataset (yes, this is bad) when fine-tuning BERT models with the Transformers NER code.

@stefan-it (Member)

I think we can heavily refactor the Transformer-based model inputs (like input ids and masks), because they're almost the same across all architectures. A good reference is the Hugging Face fine-tuning code for sequence labeling:

https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py#L172-L178

The crucial part is always the different tokenization schemes - GPT-2 tokenization is horrible...
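For context, the shared part looks roughly like this (a sketch using encode_plus, the API of the transformers version of that era; exact keyword arguments vary between releases):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# the same tensors drive (almost) every architecture, only the tokenization differs
encoded = tokenizer.encode_plus("The grass is green", add_special_tokens=True, return_tensors="pt")
print(encoded["input_ids"])       # subtoken ids, including the special tokens
print(encoded["attention_mask"])  # 1 for real subtokens, 0 for padding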

@alanakbik (Collaborator)

Yes, I was just looking at the HuggingFace AutoModel classes: https://huggingface.co/transformers/model_doc/auto.html - based on this I'm trying to write a single TransformerWordEmbeddings class for all transformer embeddings, instantiated with the model name. However, as you write, the big problem is matching all the different subtokenization schemes and their special symbols to our own tokens.
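A quick way to see the subtokenization problem across architectures (a small illustration, not part of the PR code):

from transformers import AutoTokenizer

for model_name in ["bert-base-uncased", "roberta-base", "xlnet-base-cased", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print(model_name, tokenizer.tokenize("The grass is green"))

# BERT marks word-internal pieces with '##', RoBERTa/GPT-2 prefix tokens that follow
# whitespace with 'Ġ', and XLNet uses '▁' - which is why mapping subtokens back to
# Flair tokens needs tokenizer-specific handling.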

@alanakbik (Collaborator)

Hello @stefan-it @kishaloyhalder, based on our discussions I added a TransformerWordEmbeddings class, adapting @stefan-it's code for the various transformer embeddings classes. It uses the AutoModel classes from transformers, plus some logic to get from subtokens to tokens.

You can initialize various embeddings by passing the Hugging Face model name, e.g.:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# example sentence
sentence = Sentence('The grass is green')

# a BERT model
embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a RoBERTa model
embeddings = TransformerWordEmbeddings(model="distilroberta-base", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# GPT-2
embeddings = TransformerWordEmbeddings(model="gpt2", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a T5 model
embeddings = TransformerWordEmbeddings(model="t5-small", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

Something like this would have the advantage that if more models are added to transformers and they are supported through the AutoModel logic (and don't have a crazy new tokenization scheme), we would not need to update the code to support them. What do you think?

If we prefer the model to be reflected in the class name, we could additionally create subclasses - I put in the example subclass BERTEmbeddings in the code for discussion.

@kishaloyhalder this class also only does word-level embeddings, but a similar class TransformerDocumentEmbeddings could be written, minus all the tokenization logic but plus logic for the sentence classification token.
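A minimal sketch of what the document-level variant could boil down to (assuming a BERT-style model where the hidden state of the first special token, [CLS], serves as the sentence representation; other architectures would need different pooling):

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

input_ids = torch.tensor([tokenizer.encode("The grass is green", add_special_tokens=True)])
with torch.no_grad():
    hidden_states = model(input_ids)[0]  # [1, num_subtokens, hidden_size]

# document embedding = hidden state of the [CLS] token at position 0
document_embedding = hidden_states[0, 0]
print(document_embedding.shape)  # torch.Size([768]) for bert-base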

@stefan-it (Member)

Thanks @alanakbik, I will test it :)

One question: could we also refactor the unit tests I wrote for these embeddings (tests/test_transformer_embeddings.py)?

The tests mainly check whether the alignment between a Flair token and its subword token(s) is working and whether the correct embeddings are taken from Transformers. I can help with that!
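For reference, the kind of check meant here could look like this minimal sketch (not the existing tests/test_transformer_embeddings.py, just an illustration using the class from this PR; the hidden size 768 assumes bert-base):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

def test_bert_word_embeddings_alignment():
    embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation="first")
    sentence = Sentence("The grass is green")
    embeddings.embed(sentence)

    # every Flair token should receive a non-empty embedding of the model's hidden size
    for token in sentence:
        assert len(token.embedding) == 768  # hidden size of bert-base
        assert token.embedding.abs().sum().item() > 0.0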

@alanakbik (Collaborator)

@stefan-it thanks! Yes, help would be very much appreciated here!

@alanakbik (Collaborator)

@stefan-it I'm thinking of merging this PR already, since I've started testing training classifiers with fine-tuning and made some modifications to other classes (datasets and model trainer) along the way. The idea would be to merge now and add more things to the transformer embeddings in later PRs, or fix problems as they become apparent when using this to train classifiers. For instance, one TODO is better truncation logic for longer sentences. This would leave the original transformer embeddings classes as they are for now (so that all models trained with them still work) but move them into a deprecated state at some point. Is this ok from your side?

There is also a refactoring of embeddings.py coming up (the file is way too large) that will turn it into a folder embeddings/ with submodules word.py (all word embeddings), document.py (all document embeddings), image.py (image embeddings) and legacy.py (all deprecated embeddings, which would then include the old transformer embedding classes).

@stefan-it (Member)

@alanakbik This is totally ok! I'm currently working on a refactoring of the unit tests for these Transformer-based embeddings.

Refactoring the current embeddings.py is also a great idea 👍

alanakbik merged commit e9b5c2a into master on Apr 3, 2020
@stefan-it (Member) commented Apr 3, 2020

Currently testing the PR on CoNLL with bert-base-cased and scalar mix - after the training has finished I'll train another model with the new fine-tuning option for comparison.

@alanakbik (Collaborator)

@stefan-it awesome! To set the Adam optimizer and a smaller learning rate you can use parameters like this:

import torch
from flair.trainers import ModelTrainer

# model and corpus as set up for your training run
trainer = ModelTrainer(model, corpus, optimizer=torch.optim.Adam)
trainer.train(
    f'path/to/output/folder',
    learning_rate=3e-5,  # fine-tuning always uses very small learning rate
    min_learning_rate=3e-6, # needs to be set otherwise it quits immediately
    mini_batch_size=256, # set this high if corpus is large, otherwise lower
    mini_batch_chunk_size=2, # set if mini-batch size is too large for memory
    anneal_with_restarts=True,
    anneal_factor=0.1,
    patience=1,
    max_epochs=20,
)

@stefan-it (Member)

Just an update:

Model without fine-tuning reaches 94.34 (dev) and 90.67 (test). It's ~0.7% behind the old implementation.

For fine-tuning I used the same parameters as above: 93.75 (dev) and 89.91 (test) with the BERT base cased model.

But I'll do more experiments soon :)

@alanakbik (Collaborator)

Oha that is not good :/ thanks for looking into this!

@stefan-it (Member)

Just came up with the following idea for an integration test 😅

I will use the CoNLL dataset and embed each sentence with the old and new implementation. Then the embeddings for each token can be compared to spot any differences.

Will report back today when I've finished the comparison.

After that I could look into the fine-tuning code.

Btw: great resource for fine-tuning Transformer-based models: "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" :)

@alanakbik (Collaborator)

Great idea, thanks! Yes, maybe the new way of determining tokens does not work as well as before.

Also thanks for the pointer to the paper - exactly what we were looking for!

@alanakbik (Collaborator)

@stefan-it did you have a chance to do the comparison?

@stefan-it (Member) commented Apr 9, 2020

@alanakbik

I used the following script for comparison:

import torch

import flair.datasets
from flair.data import Corpus
from flair.embeddings import (
    BertEmbeddings,
    TransformerWordEmbeddings,
)

# 1. get the corpus
corpus_old: Corpus = flair.datasets.CONLL_03()

bert_model = "bert-base-cased"
layers = "0,1,2,3,4,5,6,7,8,9,10,11,12"
use_scalar_mix = True

embeddings_old = BertEmbeddings(bert_model_or_path=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)
embeddings_new = TransformerWordEmbeddings(model=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)

corpus_new: Corpus = flair.datasets.CONLL_03()

def compare(corpus_old_split, corpus_new_split):

    for sentence_old, sentence_new in zip(corpus_old_split, corpus_new_split):
        embeddings_old.embed(sentence_old)
        embeddings_new.embed(sentence_new)

        mismatched_tokens = []

        for token_old, token_new in zip(sentence_old.tokens, sentence_new.tokens):
            #print(token_old.text, token_new.text)
            assert token_old.text == token_new.text

            if not torch.equal(token_old.embedding, token_new.embedding):
                mismatched_tokens.append(token_old.text)

        if mismatched_tokens:
            print("Mismatch for sentence:", sentence_old)
            print("Mismatched tokens:", mismatched_tokens)
            print("")

compare(corpus_old.train, corpus_new.train)

It's running at the moment 😅

@stefan-it (Member) commented Apr 9, 2020

BERT cased and uncased embeddings are identical 👍

OpenAI GPT is not working. I'll do experiments for all Transformer-based embeddings and report back the results here:

| Model | Identical |
| --- | --- |
| BERT, base, cased | ✅ |
| BERT, base, uncased | ✅ |
| OpenAI GPT | ❎ (errors, not working) |
| OpenAI GPT-2 | ❎ (mismatches) |
| Transformer-XL | ❎ (attention_mask error) |
| XLNet | ❎ (mismatches) |
| RoBERTa | ❎ (mismatches) |
| CamemBERT | ❎ (2 mismatches in train set) |
| XLM-RoBERTa | |

@alanakbik (Collaborator)

@stefan-it thanks for looking into this! Good to hear that at least for BERT it is working, but strange then that the results on NER differ. Any ideas why this could be the case?

@stefan-it (Member)

@alanakbik Maybe this is batch-size related; I'm currently running another run with the same hyper-parameters that I used for previous experiments.

After that I will run experiments with the other Transformer-based models to see if the new implementation needs some adjustments :)

@stefan-it (Member)

Another run for BERT (base, cased): 94.47 (dev) and 91.07 (test).

RoBERTa (large) does not look OK: 94.99 (dev) and 91.00 (test), compared to previous 96.31 (dev) and 92.31 (test).

@alanakbik (Collaborator)

Thanks for looking into this. Checking on RoBERTa, it seems that the difference is in how sentences are tokenized. The original RoBERTaEmbeddings class uses the tokenizer.tokenize() method and then adds begin and end markers to the subtokens. The TransformerWordEmbeddings class uses tokenizer.encode() with add_special_tokens set to True (the default), so the begin and end tokens are added by the tokenizer.

The result is nearly exactly the same, except for the first subtoken, which in the old method does not get a Ġ-prefix, but in the new one does. See this code:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# old way
print(tokenizer.tokenize("CRICKET MATCH"))

# new way
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))

Old way outputs ['CR', 'ICK', 'ET', 'ĠM', 'ATCH'] ('<s>' and '</s>' get added later)
New way outputs ['<s>', 'ĠCR', 'ICK', 'ET', 'ĠM', 'ATCH', '</s>']

So the difference is the first subtoken. Intuitively, the second seems more consistent in that each new word gets prefixed with a Ġ-, but it seems the old one gives better results. Any ideas which tokenization scheme is correct?

@alanakbik (Collaborator)

It is similar for XLNetEmbeddings. The original class will give us the following tokenization:

['<s>', '▁CR', 'ICK', 'ET', '▁M', 'ATCH', '</s>']

while the new class will give us the following:

['▁CR', 'ICK', 'ET', '▁M', 'ATCH', '<sep>', '<cls>']

The question is: which one is correct?
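Both outputs can be reproduced the same way as the RoBERTa check above (xlnet-base-cased assumed):

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')

# old way: plain tokenize(), with '<s>' and '</s>' added by the embeddings class afterwards
print(tokenizer.tokenize("CRICKET MATCH"))

# new way: encode() appends XLNet's own special tokens, '<sep>' and '<cls>', at the end
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))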

@stefan-it (Member)

@alanakbik I reported this behavior a while ago here: huggingface/transformers#1196

Btw: I removed the if isinstance(self.tokenizer, GPT2Tokenizer): check and re-ran the experiment with RoBERTa: 95.68 (dev) and 92.01 (test), which is now only ~0.3% behind the older experiments :)
