
Problem with max_sequence_length in BertEmbeddings #1519

Closed
ayushjaiswal opened this issue Apr 9, 2020 · 7 comments · Fixed by #1680
Labels
bug Something isn't working


@ayushjaiswal

Currently, BertEmbeddings does not account for the maximum sequence length supported by the underlying (transformers) BertModel. Since BERT creates subtokens, it becomes somewhat challenging to check the sequence length and trim sentences externally before feeding them to BertEmbeddings in flair.

I see a problem in https://github.com/flairNLP/flair/blob/master/flair/embeddings.py#L2678--L2687

        # first, find longest sentence in batch
        longest_sentence_in_batch: int = len(
            max(
                [
                    self.tokenizer.tokenize(sentence.to_tokenized_string())
                    for sentence in sentences
                ],
                key=len,
            )
        )

This is passed to

        # prepare id maps for BERT model
        features = self._convert_sentences_to_features(
            sentences, longest_sentence_in_batch
        )

which sets max_sequence_length in:

https://github.com/flairNLP/flair/blob/master/flair/embeddings.py#L2620-L2622

    _convert_sentences_to_features(
        self, sentences, max_sequence_length: int
    )

But this does not account for, or check against, the maximum sequence length supported by the BERT model, which is accessible in either of the above functions via self.model.config.max_position_embeddings.
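A minimal sketch of the fix being asked for: cap the batch's longest subtoken length at the model's positional limit before building the feature maps. The helper name `effective_max_length` and its signature are hypothetical, written here only to illustrate the check; `max_position_embeddings` mirrors the transformers config attribute mentioned above.

```python
def effective_max_length(subtoken_lengths, max_position_embeddings, num_special_tokens=2):
    """Return the max_sequence_length to actually use for a batch.

    subtoken_lengths: subtoken count of each tokenized sentence in the batch.
    Caps the longest sentence at the model's positional limit, leaving room
    for special tokens such as [CLS] and [SEP].
    """
    longest = max(subtoken_lengths)
    return min(longest, max_position_embeddings - num_special_tokens)
```

For a BERT base model with 512 position embeddings, a batch containing a 600-subtoken sentence would be capped at 510 rather than overflowing the embedding matrix.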

@ayushjaiswal ayushjaiswal added the bug Something isn't working label Apr 9, 2020
@alanakbik
Collaborator

Hi @ayushjaiswal we are in the process of refactoring the transformer-based embeddings classes. See #1494. Instead of separate classes for each transformer embedding, we will have a unified class that takes the transformer model name as a string in the constructor. So initialization will look like this:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# example sentence
sentence = Sentence('The grass is green')

# a BERT model
embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a RoBERTa model
embeddings = TransformerWordEmbeddings(model="distilroberta-base", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

There is now also a corresponding TransformerDocumentEmbeddings class in case you want document embeddings out of the transformer.

We're also looking at different ways for handling overlong sequences as part of the refactoring. We will add handling for this soon.

@ayushjaiswal
Author

@alanakbik Thanks for the quick response! Great to hear about the refactoring and handling of overlong sequences. self.model.config.max_position_embeddings definitely needs to be accounted for so that the forward pass of the BertModel never receives sequences longer than that limit. Currently, when the length does exceed the limit, a RuntimeError occurs, caused by a CUDA assertion failure that corrupts the CUDA context and requires re-initialization of the CUDA session. Even if the input sequence is trimmed, I suspect it will create a problem with assigning embeddings to Sentence tokens. It seems somewhat tricky 😅
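The alignment problem described above can be sketched as follows: if truncation happens mid-token, the surviving subtokens no longer map cleanly back to original Sentence tokens. One way around this (a hypothetical helper, not flair code) is to cut at an original-token boundary, so every retained token still receives an embedding.

```python
def truncate_at_token_boundary(subtokens_per_token, limit):
    """Truncate at a whole-token boundary within a subtoken budget.

    subtokens_per_token: how many subtokens each original token produced.
    limit: maximum number of subtokens the model accepts.
    Returns (tokens retained, subtokens consumed).
    """
    kept, used = 0, 0
    for n in subtokens_per_token:
        if used + n > limit:
            break  # including this token would split it across the cut
        used += n
        kept += 1
    return kept, used
```

For example, tokens producing [1, 3, 2, 4] subtokens under a limit of 6 keep the first three tokens (6 subtokens) and drop the fourth entirely, rather than keeping a fragment of it.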

@plc-dev

plc-dev commented Apr 15, 2020

@alanakbik
Maybe a sliding-window approach, as implemented here, might be a good way to tackle the length limitation of BERT.
I've resorted to using the linked package instead of flair a lot, solely for this feature, as the results seem better compared to simply truncating the sentences.

Would love to see this feature in flair!
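The sliding-window idea suggested above can be sketched without any library dependencies: split an over-long subtoken sequence into overlapping windows, embed each window separately, then combine (e.g. average) the embeddings at overlapping positions. The function below is an illustrative chunking step only, not the linked package's implementation.

```python
def sliding_windows(subtoken_ids, window, stride):
    """Split an over-long subtoken sequence into overlapping windows.

    window: maximum subtokens the model accepts per pass.
    stride: how far the window advances each step (stride < window
    gives overlap, so boundary tokens get context from both sides).
    """
    if len(subtoken_ids) <= window:
        return [subtoken_ids]
    windows = []
    start = 0
    while start < len(subtoken_ids):
        windows.append(subtoken_ids[start:start + window])
        if start + window >= len(subtoken_ids):
            break
        start += stride
    return windows
```

A sequence of 10 subtokens with window 4 and stride 2 yields four windows; each interior position appears in two windows, and its final embedding could be the mean of the two.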

@alanakbik
Collaborator

Thanks for the pointer - yes this looks promising so we might integrate it!

@ayushjaiswal
Author

Looking forward to this 😄

@ayushjaiswal
Author

@alanakbik is there any update on this? 🙂

@alanakbik
Collaborator

Unfortunately, we haven't gotten around to this yet. But you could try the recently added "longformer" models which can handle longer sequences:

embeddings = TransformerWordEmbeddings('allenai/longformer-base-4096')
embeddings.embed(sentence)
