
Embedding time cost for charlm_embedding #50

Closed

petermartigny opened this issue Aug 7, 2018 · 5 comments

Comments

@petermartigny

I have started playing with the embeddings tutorials and noticed that when using only GloVe vectors, getting the embedding is very fast (it's just a lookup table, with no context), so it's usable in applications. However, when using charlm_embedding_forward and/or charlm_embedding_backward (which use context), it is much more time consuming. This might be a bottleneck when dealing with long texts with many sentences.

Example:

from time import time

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# embeddings initialized as in the embeddings tutorial
glove_embedding = WordEmbeddings('glove')
charlm_embedding_forward = CharLMEmbeddings('news-forward')
charlm_embedding_backward = CharLMEmbeddings('news-backward')

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding])
stacked_embeddings.embed(sentence)
print(time() - start)

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding, charlm_embedding_forward])
stacked_embeddings.embed(sentence)
print(time() - start)

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding, charlm_embedding_forward,
                                                   charlm_embedding_backward])
stacked_embeddings.embed(sentence)
print(time() - start)

This prints the following on my machine:

0.000461578369140625
0.017933368682861328
0.03269362449645996

Moreover, in long texts sentences are typically much longer than in this example.

@alanakbik
Collaborator

alanakbik commented Aug 7, 2018

Hello Peter,

thanks for your interest! Does your machine have a GPU? If not, you could try the embeddings we've included for faster CPU processing. You can initialize them as follows (add '-fast'):

charlm_embedding_forward = CharLMEmbeddings('news-forward-fast')
charlm_embedding_backward = CharLMEmbeddings('news-backward-fast')

They should be a lot faster. If you have a GPU, you can make use of batch processing for long texts: split your text into batches of, say, 16 or 32 sentences and pass them batch-wise into the embed method. This should also considerably improve speed.

sentence_1 = Sentence('The grass is green .')
...
sentence_32 = Sentence('The sky is blue .')
sentences = [sentence_1, ... , sentence_32]
stacked_embeddings.embed(sentences)
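
For a longer document, the batching could look like the following minimal sketch. The batch size of 32, the raw_sentences list, and the choice of the fast CPU character-LM embeddings are assumptions for illustration, not part of the original reply:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# assumed setup: GloVe plus the fast CPU character-LM embeddings mentioned above
stacked_embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    CharLMEmbeddings('news-forward-fast'),
    CharLMEmbeddings('news-backward-fast'),
])

# hypothetical long document, already split into one string per sentence
raw_sentences = ['The grass is green .', 'The sky is blue .']  # ... many more
sentences = [Sentence(text) for text in raw_sentences]

# embed 32 sentences at a time instead of one by one
batch_size = 32
for i in range(0, len(sentences), batch_size):
    stacked_embeddings.embed(sentences[i:i + batch_size])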

Also see #7 for batching during inference of trained models. Hope this helps!

@petermartigny
Author

Hi Alan,

Thank you very much for your quick feedback!
Following your suggestions I can now work with the embeddings on long texts very quickly.

@nilansaha

This does not take care of padding, which is a problem if we are using this for downstream NLP tasks with our own models. Is there any way to pad and get the embeddings for batches together? Is there any separate embedding for pad tokens?

TL;DR: how do I embed sentences in batches so they all have the same number of embeddings (i.e. padding)?

@alanakbik
Collaborator

You would have to do the padding as an extra step. Here is an example of how we do it:

flair/flair/embeddings.py

Lines 3288 to 3319 in 323d60b

# embed words in the sentence
self.embeddings.embed(sentences)

lengths: List[int] = [len(sentence.tokens) for sentence in sentences]
longest_token_sequence_in_batch: int = max(lengths)

pre_allocated_zero_tensor = torch.zeros(
    self.embeddings.embedding_length * longest_token_sequence_in_batch,
    dtype=torch.float,
    device=flair.device,
)

all_embs: List[torch.Tensor] = list()
for sentence in sentences:
    all_embs += [
        emb for token in sentence for emb in token.get_each_embedding()
    ]
    nb_padding_tokens = longest_token_sequence_in_batch - len(sentence)

    if nb_padding_tokens > 0:
        t = pre_allocated_zero_tensor[
            : self.embeddings.embedding_length * nb_padding_tokens
        ]
        all_embs.append(t)

sentence_tensor = torch.cat(all_embs).view(
    [
        len(sentences),
        longest_token_sequence_in_batch,
        self.embeddings.embedding_length,
    ]
)

Basically, we first initialize a tensor of all zeros for the entire batch, then fill in the embeddings for all words at their respective positions. Positions with no words stay zero, so the resulting tensor is effectively zero-padded.
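
A self-contained sketch of the same idea, written against the public API instead of from inside a model, might look like this. The choice of embeddings and the example sentences are placeholders:

from typing import List
import torch

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# assumed embedding setup; any stack of token embeddings works the same way
embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    CharLMEmbeddings('news-forward-fast'),
])

sentences = [Sentence('The grass is green .'), Sentence('The sky is blue .'), Sentence('Hello .')]
embeddings.embed(sentences)

lengths: List[int] = [len(sentence.tokens) for sentence in sentences]
longest = max(lengths)

# build one (longest, embedding_length) tensor per sentence, zero-padded at the end
padded: List[torch.Tensor] = []
for sentence in sentences:
    embs = torch.stack([token.get_embedding() for token in sentence])
    if len(sentence) < longest:
        padding = torch.zeros(
            longest - len(sentence), embeddings.embedding_length, device=embs.device
        )
        embs = torch.cat([embs, padding])
    padded.append(embs)

# final batch tensor: (batch_size, longest_token_sequence, embedding_length)
sentence_tensor = torch.stack(padded)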

@nilansaha

Ah, I see. Just to confirm: you just use zeros as the embedding for the pad token? @alanakbik
