
Embedding time cost for charlm_embedding #50

Closed

petermartigny opened this issue Aug 7, 2018 · 5 comments

Comments

@petermartigny

I have started playing with the embeddings tutorials and noticed that when using only GloVe vectors, getting the embedding is very fast (it's just a lookup table, with no context), so it's usable in applications. However, when using charlm_embedding_forward and/or charlm_embedding_backward (which use context), it is much more time consuming. This might be a bottleneck when dealing with long texts with many sentences.

Example:

from time import time

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# embeddings initialized as in the embeddings tutorial
glove_embedding = WordEmbeddings('glove')
charlm_embedding_forward = CharLMEmbeddings('news-forward')
charlm_embedding_backward = CharLMEmbeddings('news-backward')

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding])
stacked_embeddings.embed(sentence)
print(time() - start)

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding, charlm_embedding_forward])
stacked_embeddings.embed(sentence)
print(time() - start)

start = time()
sentence = Sentence('The grass is green .')
stacked_embeddings = StackedEmbeddings(embeddings=[glove_embedding, charlm_embedding_forward,
                                                   charlm_embedding_backward])
stacked_embeddings.embed(sentence)
print(time() - start)

This prints the following on my machine:

0.000461578369140625
0.017933368682861328
0.03269362449645996

Moreover, in long texts sentences are typically much longer than in this example.

@alanakbik
Collaborator

alanakbik commented Aug 7, 2018

Hello Peter,

thanks for your interest! Does your machine have a GPU? If not, you could try the embeddings we've included for faster CPU processing. You can initialize them as follows (add '-fast'):

charlm_embedding_forward = CharLMEmbeddings('news-forward-fast')
charlm_embedding_backward = CharLMEmbeddings('news-backward-fast')

They should be a lot faster. If you have a GPU, you can make use of batch processing for long texts: split your text into batches of, say, 16 or 32 sentences and pass them batch-wise into the embed method. This should also considerably improve speed.

sentence_1 = Sentence('The grass is green .')
...
sentence_32 = Sentence('The sky is blue .')
sentences = [sentence_1, ... , sentence_32]
stacked_embeddings.embed(sentences)
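
For a longer document, the batching could look like the following minimal sketch. The batch size of 32, the raw_sentences list, and the choice of the fast CPU character-LM embeddings are assumptions for illustration, not part of the original reply:

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# assumed setup: GloVe plus the fast CPU character-LM embeddings mentioned above
stacked_embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    CharLMEmbeddings('news-forward-fast'),
    CharLMEmbeddings('news-backward-fast'),
])

# hypothetical long document, already split into one string per sentence
raw_sentences = ['The grass is green .', 'The sky is blue .']  # ... many more
sentences = [Sentence(text) for text in raw_sentences]

# embed 32 sentences at a time instead of one by one
batch_size = 32
for i in range(0, len(sentences), batch_size):
    stacked_embeddings.embed(sentences[i:i + batch_size])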

Also see #7 for batching during inference of trained models. Hope this helps!

@petermartigny
Author

Hi Alan,

Thank you very much for your quick feedback!
Following your suggestions I can now work with the embeddings on long texts very quickly.

@nilansaha

This does not take care of padding, which is a problem if we are using this for downstream NLP tasks with our own models. Is there any way to pad and get the embeddings for batches together? Is there any separate embedding for pad tokens?

TL;DR: how do I embed sentences in batches so they all have the same number of embeddings (i.e. padding)?

@alanakbik
Collaborator

You would have to do the padding as an extra step. Here is an example of how we do it:

flair/flair/embeddings.py

Lines 3288 to 3319 in 323d60b

# embed words in the sentence
self.embeddings.embed(sentences)

lengths: List[int] = [len(sentence.tokens) for sentence in sentences]
longest_token_sequence_in_batch: int = max(lengths)

pre_allocated_zero_tensor = torch.zeros(
    self.embeddings.embedding_length * longest_token_sequence_in_batch,
    dtype=torch.float,
    device=flair.device,
)

all_embs: List[torch.Tensor] = list()
for sentence in sentences:
    all_embs += [
        emb for token in sentence for emb in token.get_each_embedding()
    ]
    nb_padding_tokens = longest_token_sequence_in_batch - len(sentence)

    if nb_padding_tokens > 0:
        t = pre_allocated_zero_tensor[
            : self.embeddings.embedding_length * nb_padding_tokens
        ]
        all_embs.append(t)

sentence_tensor = torch.cat(all_embs).view(
    [
        len(sentences),
        longest_token_sequence_in_batch,
        self.embeddings.embedding_length,
    ]
)

Basically, we first initialize a tensor of all zeros for the entire batch, then fill in the embeddings for all words at their respective positions. Positions with no words stay zero, so the resulting tensor is effectively zero-padded.
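
A self-contained sketch of the same idea, written against the public API instead of from inside a model, might look like this. The choice of embeddings and the example sentences are placeholders:

from typing import List
import torch

from flair.data import Sentence
from flair.embeddings import WordEmbeddings, CharLMEmbeddings, StackedEmbeddings

# assumed embedding setup; any stack of token embeddings works the same way
embeddings = StackedEmbeddings(embeddings=[
    WordEmbeddings('glove'),
    CharLMEmbeddings('news-forward-fast'),
])

sentences = [Sentence('The grass is green .'), Sentence('The sky is blue .'), Sentence('Hello .')]
embeddings.embed(sentences)

lengths: List[int] = [len(sentence.tokens) for sentence in sentences]
longest = max(lengths)

# build one (longest, embedding_length) tensor per sentence, zero-padded at the end
padded: List[torch.Tensor] = []
for sentence in sentences:
    embs = torch.stack([token.get_embedding() for token in sentence])
    if len(sentence) < longest:
        padding = torch.zeros(
            longest - len(sentence), embeddings.embedding_length, device=embs.device
        )
        embs = torch.cat([embs, padding])
    padded.append(embs)

# final batch tensor: (batch_size, longest_token_sequence, embedding_length)
sentence_tensor = torch.stack(padded)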

@nilansaha

Ah, I see. Just to confirm: you just use zeros as the embedding for the pad token? @alanakbik
