GH-1492: added new BERT embeddings implementation #1494
Conversation
flair/embeddings.py
sentence_index
]
# check if this token initiates a new word, internal subtokens start with ##
if not subtoken.startswith('##') and len(token_subtoken_embeddings) > 0:
This does not always work. For instance, the code

from flair.data import Sentence
from flair.embeddings import BertEmbeddings

train_sentence = Sentence('JAKARTATown 1996-08-27')
embeddings = BertEmbeddings(pooling_operation='first')
embeddings.embed(train_sentence)

will break, because Flair considers this to be two tokens ("JAKARTATown" and "1996-08-27"), while BERT subtokenizes it as ['[CLS]', 'jakarta', '##town', '1996', '-', '08', '-', '27', '[SEP]']. So "1996-08-27" is split, but there are no ## prefixes to indicate this.
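For illustration, a rough and purely hypothetical sketch (not Flair's code; the function name is made up) of a length-based alignment that also covers subtokens without a ## prefix:

def align_subtokens(tokens, subtokens):
    """Map each original (lowercased) token to the subtokens that cover it."""
    alignment = []
    sub_idx = 0
    for token in tokens:
        covered = 0
        group = []
        while covered < len(token) and sub_idx < len(subtokens):
            piece = subtokens[sub_idx]
            group.append(piece)
            # count characters, stripping the '##' continuation marker
            covered += len(piece[2:] if piece.startswith("##") else piece)
            sub_idx += 1
        alignment.append(group)
    return alignment

tokens = ["jakartatown", "1996-08-27"]
subtokens = ["jakarta", "##town", "1996", "-", "08", "-", "27"]
print(align_subtokens(tokens, subtokens))
# [['jakarta', '##town'], ['1996', '-', '08', '-', '27']]

This only works when the subtoken characters exactly reproduce the token text, so discarded characters would still need special handling.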
@stefan-it you use the method _get_transformer_sentence_embeddings
to go from subtokens to tokens for all transformer embeddings except for BERT. Can this method also be used for BERT or is there something different here?
@alanakbik yes, it can be used here. The method should be robust enough to handle cases where the BERT tokenizer discards tokens (e.g. special characters); in that case the <unk> token is used to get a correct embedding for the "real" Flair token.
Ah cool - I'll refactor it in, then! Do you have an example at hand for a case in which the BERT tokenizer discards tokens?
At the moment I've only seen '\x96', '\u200e', '\x95', '\xad' or '\x80' as problematic tokens (or rather, control characters):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("\x96")
# returns []

I found these tokens in the GermEval 2014 dataset (yes, this is bad) when fine-tuning BERT models with the Transformers NER code.
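To make the fallback described above concrete, here is a hedged sketch (not Flair's actual code; the helper name is made up): if the tokenizer returns no subtokens for a Flair token, substitute the unknown token so the token still receives an embedding.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def subtokenize_with_fallback(token_text):
    # use the model's unknown token when the tokenizer drops the input entirely
    subtokens = tokenizer.tokenize(token_text)
    return subtokens if subtokens else [tokenizer.unk_token]

print(subtokenize_with_fallback("\x96"))   # ['[UNK]']
print(subtokenize_with_fallback("hello"))  # ['hello']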
I think we can heavily refactor the Transformer-based model inputs (like input ids and masks), because they're almost the same across all architectures. A good reference is the Hugging Face fine-tuning code for sequence labeling:
https://github.com/huggingface/transformers/blob/master/examples/ner/run_ner.py#L172-L178
The crucial part is always the different ways of tokenization - GPT-2 tokenization is horrible...
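As a minimal sketch of that point (assuming a recent transformers version; the model name is just an example), the encoded inputs look the same regardless of architecture:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer.encode_plus(
    "The grass is green",
    add_special_tokens=True,
    return_attention_mask=True,
)
print(encoding["input_ids"])       # subtoken ids, including [CLS] and [SEP]
print(encoding["attention_mask"])  # 1 for real subtokens, 0 for padding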
Yes, I was just looking at the HuggingFace AutoModel classes: https://huggingface.co/transformers/model_doc/auto.html - based on this I'm trying to write a single TransformerWordEmbeddings class for all transformer embeddings, instantiated with the model name. However, as you write, the big problem is matching all the different subtokenization schemes and their special symbols to our own tokens.
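For reference, a minimal sketch of the AutoModel idea (assuming a recent transformers version; not the final Flair implementation): the same few lines load any supported architecture by name and return its hidden states.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # could equally be "roberta-base", "gpt2", ...
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

input_ids = torch.tensor([tokenizer.encode("The grass is green", add_special_tokens=True)])
with torch.no_grad():
    outputs = model(input_ids)

hidden_states = outputs[0]  # shape: [batch size, number of subtokens, hidden size]
print(hidden_states.shape)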
Hello @stefan-it @kishaloyhalder based on our discussions I added a TransformerWordEmbeddings class. You can initialize various embeddings by passing the Huggingface model name, i.e.:

# example sentence
sentence = Sentence('The grass is green')

# a BERT model
embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a RoBERTa model
embeddings = TransformerWordEmbeddings(model="distilroberta-base", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# GPT-2
embeddings = TransformerWordEmbeddings(model="gpt2", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a T5 model
embeddings = TransformerWordEmbeddings(model="t5-small", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

Something like this would have the advantage that if more models are added to transformers and they are supported through the AutoModel logic (and don't have a crazy new tokenization scheme), we would not need to update the code to support them. What do you think? If we prefer the model to be reflected in the class name, we could additionally create subclasses - I put in an example subclass. @kishaloyhalder this class also only does word-level embeddings, but a similar class could cover document-level embeddings.
Thanks @alanakbik, I will test it :) One question: could we also refactor the unit tests I wrote for these embeddings? The tests mainly check if the alignment between Flair tokens and subword token(s) is working and if the correct embedding is used from Transformers. I can help with that!
@stefan-it thanks! Yes, help would be very much appreciated here!
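For what it's worth, a hedged sketch of what such an alignment test might look like (hypothetical, not the actual unit tests from the existing suite):

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

def test_bert_subtoken_alignment():
    embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1")
    sentence = Sentence("JAKARTATown 1996-08-27")
    embeddings.embed(sentence)

    assert len(sentence.tokens) == 2
    for token in sentence.tokens:
        # every Flair token should receive exactly one vector of the model's size ...
        assert token.embedding.shape[0] == embeddings.embedding_length
        # ... that is not all zeros
        assert token.embedding.abs().sum().item() > 0.0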
@stefan-it I'm thinking of merging this PR already, since I've started testing training classifiers with fine-tuning and made some modifications to other classes (datasets and model trainer) along the way. So the idea would be to merge now and add more things to the transformer embeddings in later PRs, or fix problems as they become apparent when using this to train classifiers. For instance, a TODO is better truncation logic for longer sentences. This would leave the original transformer embeddings classes as they are for now (so that all models trained with them still work) but move them into a deprecated state at some point. Is this ok from your side? There is also a refactoring coming up of the …
@alanakbik This is totally ok! I'm currently working on a refactoring of the unit tests for these Transformer-based embeddings. Refactoring of the current …
Currently testing the PR on CoNLL with …
@stefan-it awesome! To set Adam optimizer and smaller learning rates you can use parameters like this:

trainer = ModelTrainer(model, corpus, optimizer=torch.optim.Adam)
trainer.train(
    f'path/to/output/folder',
    learning_rate=3e-5,       # fine-tuning always uses very small learning rate
    min_learning_rate=3e-6,   # needs to be set otherwise it quits immediately
    mini_batch_size=256,      # set this high if corpus is large, otherwise lower
    mini_batch_chunk_size=2,  # set if mini-batch size is too large for memory
    anneal_with_restarts=True,
    anneal_factor=0.1,
    patience=1,
    max_epochs=20,
)
Just an update: the model without fine-tuning reaches 94.34 (dev) and 90.67 (test). It's ~0.7% behind the old implementation. For fine-tuning I used the same parameters as above: 93.75 (dev) and 89.91 (test) with the BERT base cased model. But I'll do more experiments soon :)
Oha, that is not good :/ Thanks for looking into this!
Just came up with the following idea for an integration test 😅 I will use the CoNLL dataset and embed each sentence with the old and the new implementation. Then the embeddings for each token can be compared to spot any differences. Will report back today when I've finished the comparison. After that I could look into the fine-tuning code. Btw: a great resource for fine-tuning Transformer-based models: "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" :)
Great idea, thanks! Yes, maybe the new way of determining tokens does not work as well as before. Also thanks for the pointer to the paper - exactly what we were looking for!
@stefan-it did you have a chance to do the comparison?
I used the following script for comparison:

from typing import List

import torch

import flair.datasets
from flair.data import Corpus
from flair.embeddings import (
    BertEmbeddings,
    TransformerWordEmbeddings,
)

# 1. get the corpus
corpus_old: Corpus = flair.datasets.CONLL_03()

bert_model = "bert-base-cased"
layers = "0,1,2,3,4,5,6,7,8,9,10,11,12"
use_scalar_mix = True

embeddings_old = BertEmbeddings(bert_model_or_path=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)
embeddings_new = TransformerWordEmbeddings(model=bert_model, layers=layers, use_scalar_mix=use_scalar_mix)

corpus_new: Corpus = flair.datasets.CONLL_03()

def compare(corpus_old_split, corpus_new_split):
    for sentence_old, sentence_new in zip(corpus_old_split, corpus_new_split):
        embeddings_old.embed(sentence_old)
        embeddings_new.embed(sentence_new)

        mismatched_tokens = []

        for token_old, token_new in zip(sentence_old.tokens, sentence_new.tokens):
            # print(token_old.text, token_new.text)
            assert token_old.text == token_new.text
            if not torch.equal(token_old.embedding, token_new.embedding):
                mismatched_tokens.append(token_old.text)

        if mismatched_tokens:
            print("Mismatch for sentence:", sentence_old)
            print("Mismatched tokens:", mismatched_tokens)
            print("")

compare(corpus_old.train, corpus_new.train)

It's running at the moment 😅
BERT cased and uncased embeddings are identical 👍 OpenGPT is not working. I'll do experiments for all Transformer-based embeddings and report back the results here.
@stefan-it thanks for looking into this! Good to hear that at least for BERT it is working, but strange then that the results on NER differ. Any ideas why this could be the case?
@alanakbik Maybe this is batch-size related; I'm currently running another run with the same hyper-parameters that I used for previous experiments. After that I will run experiments with the other Transformer-based models to see if the new implementation needs some adjustments :)
Another run for BERT (base, cased): 94.47 (dev) and 91.07 (test). RoBERTa (large) does not look OK: 94.99 (dev) and 91.00 (test), compared to previous 96.31 (dev) and 92.31 (test).
Thanks for looking into this. Checking on RoBERTa, it seems that the difference is how the sentences are tokenized: the original class calls tokenizer.tokenize(), while the new one uses tokenizer.encode() and converts the ids back to tokens. The result is nearly exactly the same, except for the first subtoken, which in the old method does not get a Ġ-prefix, but in the new one does. See this code:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# old way
print(tokenizer.tokenize("CRICKET MATCH"))
# new way
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))

The old way outputs ['CR', 'ICK', 'ET', 'ĠM', 'ATCH'] ('<s>' and '</s>' get added later). So the difference is the first subtoken. Intuitively, the second seems more consistent in that each new word gets prefixed with Ġ, but it seems the old one gives better results. Any ideas which tokenization scheme is correct?
It is similar for XLNetEmbeddings: the original class and the new class give us different tokenizations for the same sentence. The question is: which one is correct?
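A hedged sketch (illustrative, not code from this thread) of how the two tokenization paths can be compared for XLNet, mirroring the RoBERTa snippet above:

from transformers import XLNetTokenizer

tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")

# old way: plain subword tokenization
print(tokenizer.tokenize("CRICKET MATCH"))
# new way: encode to ids, then map the ids back to subword strings
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("CRICKET MATCH")))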
@alanakbik I reported this behavior a while ago here: huggingface/transformers#1196. Btw: I did remove the …