don't project unmasked tokens for MLM loss (#859)
Summary:
This saves ~4-5 GB of GPU memory when training RoBERTa-large with `seq_len=512`.

I am able to fit `--max-sentences=16` on `volta32gb` for `roberta-large`.
Pull Request resolved: fairinternal/fairseq-py#859

Differential Revision: D17435814

fbshipit-source-id: 2663909768fac0ef0102107613770ee01b1f8c00
Naman Goyal authored and facebook-github-bot committed Sep 18, 2019
1 parent 31dd13f commit 718677e
Showing 2 changed files with 16 additions and 9 deletions.
8 changes: 6 additions & 2 deletions fairseq/criterions/masked_lm.py
@@ -30,8 +30,11 @@ def forward(self, model, sample, reduce=True):
         3) logging outputs to display while training
         """
         # compute MLM loss
-        logits = model(**sample['net_input'], return_all_hiddens=False)[0]
+        masked_tokens = sample['target'].ne(self.padding_idx)
+        logits = model(**sample['net_input'], masked_tokens=masked_tokens)[0]
         targets = model.get_targets(sample, [logits])
+        targets = targets[masked_tokens]
+
         loss = F.nll_loss(
             F.log_softmax(
                 logits.view(-1, logits.size(-1)),
@@ -43,7 +46,7 @@ def forward(self, model, sample, reduce=True):
             ignore_index=self.padding_idx,
         )

-        sample_size = targets.ne(self.padding_idx).int().sum().item()
+        sample_size = masked_tokens.int().sum().item()

         logging_output = {
             'loss': utils.item(loss.data) if reduce else loss.data,
@@ -64,6 +67,7 @@ def aggregate_logging_outputs(logging_outputs):

         agg_output = {
             'loss': loss / sample_size / math.log(2),
+            'nll_loss': sum(log.get('nll_loss', 0) for log in logging_outputs) / sample_size / math.log(2) if ntokens > 0 else 0.,
             'ntokens': ntokens,
             'nsentences': nsentences,
             'sample_size': sample_size,
17 changes: 10 additions & 7 deletions fairseq/models/roberta/model.py
@@ -201,14 +201,17 @@ def __init__(self, embed_dim, output_dim, activation_fn, weight=None):
         self.weight = weight
         self.bias = nn.Parameter(torch.zeros(output_dim))

-    def forward(self, features, **kwargs):
+    def forward(self, features, masked_tokens=None, **kwargs):
+        # Only project the masked tokens while training,
+        # saves both memory and computation
+        if masked_tokens is not None:
+            features = features[masked_tokens, :]
+
         x = self.dense(features)
         x = self.activation_fn(x)
         x = self.layer_norm(x)
-
         # project back to size of vocabulary with bias
         x = F.linear(x, self.weight) + self.bias
-
         return x


@@ -265,7 +268,7 @@ def __init__(self, args, dictionary):
             weight=self.sentence_encoder.embed_tokens.weight,
         )

-    def forward(self, src_tokens, features_only=False, return_all_hiddens=False, **unused):
+    def forward(self, src_tokens, features_only=False, return_all_hiddens=False, masked_tokens=None, **unused):
         """
         Args:
             src_tokens (LongTensor): input tokens of shape `(batch, src_len)`
@@ -283,7 +286,7 @@ def forward(self, src_tokens, features_only=False, return_all_hiddens=False, **unused):
         """
         x, extra = self.extract_features(src_tokens, return_all_hiddens)
         if not features_only:
-            x = self.output_layer(x)
+            x = self.output_layer(x, masked_tokens=masked_tokens)
         return x, extra

     def extract_features(self, src_tokens, return_all_hiddens=False, **unused):
@@ -293,8 +296,8 @@ def extract_features(self, src_tokens, return_all_hiddens=False, **unused):
         features = inner_states[-1]
         return features, {'inner_states': inner_states if return_all_hiddens else None}

-    def output_layer(self, features, **unused):
-        return self.lm_head(features)
+    def output_layer(self, features, masked_tokens=None, **unused):
+        return self.lm_head(features, masked_tokens)

     def max_positions(self):
         """Maximum output length supported by the encoder."""
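As a minimal, self-contained sketch of the idea in this diff (a toy PyTorch illustration, not the fairseq code; batch and sequence sizes chosen to mirror the commit summary, with a small hidden dimension to keep it light), the encoder output is indexed by the mask of corrupted positions before the vocabulary projection, so the |V|-sized logits are only materialized for the ~15% of positions that actually contribute to the MLM loss:

    import torch
    import torch.nn.functional as F

    batch, seq_len, dim, vocab, padding_idx = 16, 512, 64, 50265, 1

    features = torch.randn(batch, seq_len, dim)          # encoder outputs
    weight = torch.randn(vocab, dim)                     # tied output embedding
    targets = torch.full((batch, seq_len), padding_idx, dtype=torch.long)
    targets[:, ::7] = 42                                 # ~15% of positions carry MLM targets

    # Before the commit: logits for every position, shape (batch, seq_len, vocab),
    # would be materialized -- here 16 * 512 * 50265 ~ 412M elements.
    masked_tokens = targets.ne(padding_idx)

    # After the commit: gather the masked positions first, shape (n_masked, vocab).
    logits = F.linear(features[masked_tokens, :], weight)

    loss = F.nll_loss(
        F.log_softmax(logits, dim=-1),
        targets[masked_tokens],
        reduction='sum',
    )
    print(logits.shape)   # torch.Size([1184, 50265]) -- ~60M elements instead of ~412M

The gradients are unchanged, because the positions that are dropped never receive a gradient from the MLM loss anyway.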

5 comments on commit 718677e

@kalpitdixit

Hi,
The ELECTRA paper (https://openreview.net/pdf?id=r1xMH1BtvB), on page 8, discusses ALL-TOKENS, which learns from the MASK-TOKENS and the OTHER-TOKENS as well. They show that it gets better performance than learning from MASK-TOKENS alone.

I think that's what this codebase had before this commit, and that this commit shifts it from ALL-TOKENS to MASK-TOKENS only.

  1. Is that also how you see this code commit?
  2. Have you tried pretraining runs comparing the two methods?
  3. Saving 4-5 GB is good, and a higher batch size is better. Did you get a chance to see how that affects downstream performance?

Thanks,
Kalpit

@ngoyal2707
Contributor

hey @kalpitdixit

  1. Not really, our objective has always applied the loss on 15% of the tokens, same as the original BERT.
  2. No, we haven't. I agree it'd be great to compare them apples to apples (in compute-restricted settings). If you happen to have such a comparison, feel free to add the numbers or make a PR that makes it configurable (see the sketch below for one possible shape); I am happy to review and merge it.
  3. Since our objective has always applied to only 15% of the tokens, this change was a pure performance optimization and didn't change downstream performance.
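For reference, a rough sketch of what a configurable objective might compute (plain PyTorch, not fairseq code and not an existing option in this repo; the `loss_on_all_tokens` switch is hypothetical):

    import torch
    import torch.nn.functional as F

    def mlm_loss(logits, targets, masked_positions, loss_on_all_tokens=False):
        # logits:           (batch, seq_len, vocab) model outputs
        # targets:          (batch, seq_len) original token ids for every position
        # masked_positions: (batch, seq_len) bool, True where the input was corrupted
        if loss_on_all_tokens:
            # ELECTRA-style ALL-TOKENS: loss on every position
            # (a real version would still exclude padding positions)
            mask = torch.ones_like(masked_positions)
        else:
            # default MASK-TOKENS: loss only on the ~15% corrupted positions
            mask = masked_positions
        lprobs = F.log_softmax(logits[mask], dim=-1)
        return F.nll_loss(lprobs, targets[mask], reduction='sum')

Note that the ALL-TOKENS variant gives up the memory saving from this commit, since the vocabulary projection then has to be computed for every position again.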

@kalpitdixit

Hi,

  1. OK, makes sense. I went through the PR again and see what you did there. It's a neat change: don't compute the final |Vocab|-sized output for the unmasked tokens, which won't be used in backprop anyway.
  2. I have used ALL-TOKENS, but not in a comparable setting. It did pay off for downstream performance.

Another, unrelated thing that surprised me in the RoBERTa paper is that you were able to use 1,024 GPUs in parallel to complete 1M steps in just a day, i.e. I am surprised by how fast each iteration is. How did you manage to reduce the time per iteration so much? (I'd imagine the GPU-GPU sync time wouldn't allow such speed.)
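As a rough back-of-the-envelope check on point 1 above and on the ~4-5 GB figure from the commit summary (assuming RoBERTa's 50,265-entry BPE vocabulary, fp16 activations, and that a handful of vocabulary-sized buffers are alive at once, e.g. logits, log-probabilities and their gradients):

    vocab = 50265                  # RoBERTa BPE vocabulary size
    positions = 16 * 512           # --max-sentences=16, seq_len=512
    masked = int(0.15 * positions) # ~15% of positions are masked
    bytes_per_elem = 2             # fp16

    full = positions * vocab * bytes_per_elem / 2**30
    only_masked = masked * vocab * bytes_per_elem / 2**30
    print(f'per buffer, all positions: {full:.2f} GiB')         # ~0.77 GiB
    print(f'per buffer, masked only:   {only_masked:.2f} GiB')  # ~0.11 GiB

With several such buffers held during the forward and backward pass, the difference lands in the same ballpark as the few GB reported in the summary.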

@ngoyal2707
Contributor

  1. I see. I might give that a shot at some point, but no promises.

Sorry if it was unclear in the RoBERTa paper, but 1 day is for 100k updates. The final model took about 5 days on 1,024 GPUs for 500k updates.
We have done several optimizations in the code since then and can bring the whole training time down to 3-4 days on 512 GPUs.
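For a rough sense of the per-update throughput those numbers imply (assuming the ~8K-sequence, 512-token batches reported in the RoBERTa paper):

    updates_per_sec = 100_000 / 86_400       # 100k updates per day -> ~1.16 updates/s
    tokens_per_update = 8192 * 512           # ~8K sequences x 512 tokens each
    tokens_per_sec = updates_per_sec * tokens_per_update
    print(f'{tokens_per_sec / 1e6:.1f}M tokens/s total, '
          f'{tokens_per_sec / 1024 / 1e3:.1f}K tokens/s per GPU over 1,024 GPUs')
    # -> ~4.9M tokens/s total, ~4.7K tokens/s per GPU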

@kalpitdixit

OK, I had understood the paper differently. Still, it's impressive speed. Do you use some custom architecture for this? Can you share any details?
