don't project unmasked tokens for mlm loss (#859)
Summary: This saves ~4-5 GB of GPU memory while training RoBERTa-large with `seq_len=512`. I am able to fit `--max-sentences=16` on `volta32gb` for `roberta-large`.
Pull Request resolved: fairinternal/fairseq-py#859
Differential Revision: D17435814
fbshipit-source-id: 2663909768fac0ef0102107613770ee01b1f8c00
1 parent 31dd13f · commit 718677e · 2 changed files with 16 additions and 9 deletions.
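The idea behind the change, sketched minimally below (the shapes and variable names are illustrative, not the exact fairseq code): index out the masked positions before the vocabulary projection, so the large `hidden_dim × vocab_size` matmul and its output activation only cover the ~15% of positions that carry MLM targets, instead of every position in the batch.

```python
import torch
import torch.nn as nn

hidden_dim, vocab_size = 1024, 50265          # roberta-large-sized, for illustration
lm_head = nn.Linear(hidden_dim, vocab_size)   # stand-in for the MLM output projection

features = torch.randn(16, 512, hidden_dim)   # encoder output: (batch, seq_len, hidden)
masked_tokens = torch.rand(16, 512) < 0.15    # bool mask of positions with MLM targets

# Before: project every position, producing a (16, 512, vocab_size) activation.
# logits_all = lm_head(features)

# After: select the masked positions first, then project only those.
logits = lm_head(features[masked_tokens])     # (num_masked, vocab_size), much smaller
```

With `seq_len=512` and ~15% masking, the projection output per sentence shrinks from 512 × |V| to roughly 77 × |V| logits, which is plausibly where most of the reported 4-5 GB saving comes from.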
Hi,
The ELECTRA paper (https://openreview.net/pdf?id=r1xMH1BtvB), on page 8, talks about an ALL-TOKENS objective, which learns from the MASK-TOKENS and the OTHER-TOKENS as well. They show that it gets better performance than learning from just the MASK-TOKENS.
I think that's what this codebase had before this commit, and that this commit shifts from ALL-TOKENS to OTHER-TOKENS.
Thanks,
Kalpit
hey @kalpitdixit
The MLM loss only covers the masked 15% of the tokens, same as original BERT.
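For context, a minimal sketch of the BERT-style masking scheme being referred to: 15% of positions are selected for prediction, and of those, 80% are replaced with the mask token, 10% with a random token, and 10% left unchanged. The helper below is illustrative only (the names and the exact probabilities follow the standard BERT recipe, not necessarily fairseq's flags):

```python
import torch

def bert_style_masking(tokens, mask_idx, vocab_size, mask_prob=0.15):
    """Illustrative BERT-style masking: returns (masked_input, targets).

    Positions not selected for prediction get target -100, so a loss with
    ignore_index=-100 skips them, i.e. the loss only covers ~15% of tokens.
    """
    targets = torch.full_like(tokens, -100)
    select = torch.rand(tokens.shape) < mask_prob       # ~15% of positions
    targets[select] = tokens[select]

    masked_input = tokens.clone()
    split = torch.rand(tokens.shape)
    masked_input[select & (split < 0.8)] = mask_idx     # 80% -> <mask>
    rand_pos = select & (split >= 0.8) & (split < 0.9)  # 10% -> random token
    masked_input[rand_pos] = torch.randint(vocab_size, tokens.shape)[rand_pos]
    # remaining 10% of selected positions keep their original token
    return masked_input, targets
```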
Hi,
Another, unrelated thing that surprised me in the RoBERTa paper was that you were able to use 1,024 GPUs in parallel to complete 1M steps in just a day, i.e. I am surprised by how fast each iteration is. How did you manage to reduce the time per iteration so much? I'd have imagined that the GPU-to-GPU sync time wouldn't allow such speed.
Sorry if it was unclear in the RoBERTa paper, but 1 day is for 100k updates. The final model took about 5 days on 1,024 GPUs for 500k updates.
We have made several optimizations in the code since then and are able to bring the whole training time down to 3-4 days on 512 GPUs.
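(As a rough back-of-the-envelope check on those numbers: 100k updates in about 24 hours is 100,000 / 86,400 ≈ 1.2 updates per second, i.e. a bit under a second per synchronized step across the 1,024 GPUs, gradient all-reduce included.)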
OK, I understood the paper differently. Still, it's impressive speed. Do you use some custom architecture for this? Can you share any details?