
Problem with max_sequence_length in BertEmbeddings #1519

Closed
ayushjaiswal opened this issue Apr 9, 2020 · 7 comments · Fixed by #1680
Labels
bug Something isn't working


@ayushjaiswal

Currently, BertEmbeddings does not account for the maximum sequence length supported by the underlying (transformers) BertModel. Since BERT creates subtokens, it becomes somewhat challenging to check the sequence length and trim sentences externally before feeding them to BertEmbeddings in flair.

I see a problem in https://github.com/flairNLP/flair/blob/master/flair/embeddings.py#L2678--L2687

        # first, find longest sentence in batch
        longest_sentence_in_batch: int = len(
            max(
                [
                    self.tokenizer.tokenize(sentence.to_tokenized_string())
                    for sentence in sentences
                ],
                key=len,
            )
        )

This is passed to

        # prepare id maps for BERT model
        features = self._convert_sentences_to_features(
            sentences, longest_sentence_in_batch
        )

which sets max_sequence_length in:

https://github.com/flairNLP/flair/blob/master/flair/embeddings.py#L2620-L2622

    _convert_sentences_to_features(
        self, sentences, max_sequence_length: int
    )

But this does not account for, or check against, the maximum sequence length supported by the BERT model, which is accessible in either of the above functions via self.model.config.max_position_embeddings.
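A minimal sketch of the fix being asked for: cap the batch's longest subtoken length at the model's positional limit before building the feature maps. The helper name `effective_max_length` and its signature are hypothetical, written here only to illustrate the check; `max_position_embeddings` mirrors the transformers config attribute mentioned above.

```python
def effective_max_length(subtoken_lengths, max_position_embeddings, num_special_tokens=2):
    """Return the max_sequence_length to actually use for a batch.

    subtoken_lengths: subtoken count of each tokenized sentence in the batch.
    Caps the longest sentence at the model's positional limit, leaving room
    for special tokens such as [CLS] and [SEP].
    """
    longest = max(subtoken_lengths)
    return min(longest, max_position_embeddings - num_special_tokens)
```

For a BERT base model with 512 position embeddings, a batch containing a 600-subtoken sentence would be capped at 510 rather than overflowing the embedding matrix.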

@ayushjaiswal ayushjaiswal added the bug Something isn't working label Apr 9, 2020
@alanakbik
Collaborator

Hi @ayushjaiswal we are in the process of refactoring the transformer-based embeddings classes. See #1494. Instead of separate classes for each transformer embedding, we will have a unified class that takes the transformer model name as a string in the constructor. So initialization will look like this:

from flair.data import Sentence
from flair.embeddings import TransformerWordEmbeddings

# example sentence
sentence = Sentence('The grass is green')

# a BERT model
embeddings = TransformerWordEmbeddings(model="bert-base-uncased", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

# a RoBERTa model
embeddings = TransformerWordEmbeddings(model="distilroberta-base", layers="-1", pooling_operation='first')
embeddings.embed(sentence)

There is now also a corresponding TransformerDocumentEmbeddings class in case you want document embeddings out of the transformer.

We're also looking at different ways for handling overlong sequences as part of the refactoring. We will add handling for this soon.

@ayushjaiswal
Author

@alanakbik Thanks for the quick response! Great to hear about the refactoring and handling of overlong sequences. self.model.config.max_position_embeddings definitely needs to be accounted for so that the forward pass of the BertModel never receives sequences longer than that limit. Currently, when the length does exceed the limit, a RuntimeError occurs, caused by a CUDA assertion failure that corrupts the CUDA context and requires re-initialization of the CUDA session. Even if the input sequence is trimmed, I suspect it will create a problem with assigning embeddings to Sentence tokens. It seems somewhat tricky 😅
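The alignment problem described above can be sketched as follows: if truncation happens mid-token, the surviving subtokens no longer map cleanly back to original Sentence tokens. One way around this (a hypothetical helper, not flair code) is to cut at an original-token boundary, so every retained token still receives an embedding.

```python
def truncate_at_token_boundary(subtokens_per_token, limit):
    """Truncate at a whole-token boundary within a subtoken budget.

    subtokens_per_token: how many subtokens each original token produced.
    limit: maximum number of subtokens the model accepts.
    Returns (tokens retained, subtokens consumed).
    """
    kept, used = 0, 0
    for n in subtokens_per_token:
        if used + n > limit:
            break  # including this token would split it across the cut
        used += n
        kept += 1
    return kept, used
```

For example, tokens producing [1, 3, 2, 4] subtokens under a limit of 6 keep the first three tokens (6 subtokens) and drop the fourth entirely, rather than keeping a fragment of it.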

@plc-dev

plc-dev commented Apr 15, 2020

@alanakbik
Maybe a sliding-window approach, as implemented here, might be a good way to tackle the length limitation of BERT.
I've resorted to using the linked package instead of flair a lot, solely for this feature, as the results seem better compared to simply truncating the sentences.

Would love to see this feature in flair!
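The sliding-window idea suggested above can be sketched without any library dependencies: split an over-long subtoken sequence into overlapping windows, embed each window separately, then combine (e.g. average) the embeddings at overlapping positions. The function below is an illustrative chunking step only, not the linked package's implementation.

```python
def sliding_windows(subtoken_ids, window, stride):
    """Split an over-long subtoken sequence into overlapping windows.

    window: maximum subtokens the model accepts per pass.
    stride: how far the window advances each step (stride < window
    gives overlap, so boundary tokens get context from both sides).
    """
    if len(subtoken_ids) <= window:
        return [subtoken_ids]
    windows = []
    start = 0
    while start < len(subtoken_ids):
        windows.append(subtoken_ids[start:start + window])
        if start + window >= len(subtoken_ids):
            break
        start += stride
    return windows
```

A sequence of 10 subtokens with window 4 and stride 2 yields four windows; each interior position appears in two windows, and its final embedding could be the mean of the two.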

@alanakbik
Collaborator

Thanks for the pointer - yes this looks promising so we might integrate it!

@ayushjaiswal
Author

Looking forward to this 😄

@ayushjaiswal
Author

@alanakbik is there any update on this? 🙂

@alanakbik
Collaborator

Unfortunately, we haven't gotten around to this yet. But you could try the recently added "longformer" models which can handle longer sequences:

embeddings = TransformerWordEmbeddings('allenai/longformer-base-4096')
embeddings.embed(sentence)
