Improved LAMB optimizer. #1334
Conversation
Using Fused Adam with weight decay kernel.
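(For reference, the per-layer LAMB update that such a fused Adam-with-weight-decay kernel would compute, roughly following the LAMB paper — $g_t$ is the gradient, $\lambda$ the weight decay, $\eta$ the learning rate; the zero-norm guards on the trust ratio are omitted here:)

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \qquad \hat v_t = v_t / (1-\beta_2^t) \\
r_t &= \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, w_{t-1} \\
w_t &= w_{t-1} - \eta\, \frac{\lVert w_{t-1} \rVert}{\lVert r_t \rVert}\, r_t
\end{aligned}
```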
@spidydev thank you for the pull request! Could you provide the speedup as a percentage of total time, for both the CPU and the GPU version, with eager, no eager, and XLA? See #1156.

Would it be possible to keep the Python version too? See how we do it for activations. It lets the optimizer run on TPUs and ROCm. It's also useful because when we compile against a specific TensorFlow version, the result cannot be used with any other TensorFlow version (see #1317); for example, people compiling TF from source won't be able to use your implementation.

As you can see, using a custom op for something in TensorFlow Addons has a high UX and maintenance cost. We need to make sure the speedup is worth it before accepting to maintain a custom op.
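(To illustrate the "keep the Python version too" pattern referenced above, here is a minimal sketch of an activations-style fallback; the library and op names `_fused_lamb_ops.so` / `fused_apply_lamb` are hypothetical, not this PR's actual build targets:)

```python
import tensorflow as tf

# Hypothetical shared-object name; the real file would come from the
# Bazel build of the custom kernel in this PR.
try:
    _lamb_so = tf.load_op_library("_fused_lamb_ops.so")
except (tf.errors.NotFoundError, OSError):
    # No compiled kernel for this platform / TF build (TPU, ROCm,
    # TF compiled from source, ...): fall back to plain TF ops.
    _lamb_so = None


def apply_update(var, grad, lr):
    """Use the fused kernel when it loaded, else a pure-Python path."""
    if _lamb_so is not None:
        # Hypothetical op exposed by the compiled library.
        return _lamb_so.fused_apply_lamb(var.handle, grad, lr)
    # Fallback built from standard ops, so it runs anywhere; a plain
    # gradient step standing in for the full LAMB update.
    return var.assign_sub(lr * grad)
```

The point of the pattern is that the compiled path is an optimization rather than a hard dependency: users on unsupported platforms transparently get the portable path.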
@gabrieldemarmiesse Thanks for all the pointers. Will get back with more info/updates.
@gabrieldemarmiesse Added some numbers in the description. Thanks!
@tensorflow/sig-addons-maintainers is the speedup significant enough for merging? Even if we merge, is it worth having a C++ CPU version? Few people in their right mind would train a model on CPU and expect maximum speed. @spidydev, we'll vote in private and then tell you whether we think the speedup is worth the maintenance cost.
@spidydev, in the end we'll keep the proposed CUDA implementation, but we prefer to keep the pure Python implementation for the CPU. @Squadrick has volunteered to review your pull request.
This PR is blocked until that is resolved. |
@gabrieldemarmiesse @Squadrick Quick question about the comment "we'll prefer to keep the pure Python implementation for CPU": can you give some pointers on how to do this? AFAIK, if we do not register a CPU kernel, TF is going to throw the error "No kernel registered".
@spidydev, in theory we'd like to do that, but I have no idea how to do it in practice. I hope it's possible. We would need it for activation functions too.
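(For the record, a sketch of what the pure-Python CPU path could look like, built only from standard TF ops; hyperparameter plumbing is simplified compared to `tfa.optimizers.LAMB`, and `lamb_step` is an illustrative name, not the PR's API:)

```python
import tensorflow as tf

def lamb_step(var, m, v, grad, step, lr=1e-3, beta_1=0.9, beta_2=0.999,
              epsilon=1e-6, weight_decay=0.01):
    """One LAMB update from standard ops only, so it runs on any device."""
    # Adam-style first/second moments with bias correction.
    m.assign(beta_1 * m + (1.0 - beta_1) * grad)
    v.assign(beta_2 * v + (1.0 - beta_2) * tf.square(grad))
    m_hat = m / (1.0 - beta_1 ** step)
    v_hat = v / (1.0 - beta_2 ** step)
    # Adam direction plus decoupled weight decay.
    update = m_hat / (tf.sqrt(v_hat) + epsilon) + weight_decay * var
    # Layer-wise trust ratio ||w|| / ||update||, guarded against zero norms.
    w_norm = tf.norm(var)
    u_norm = tf.norm(update)
    trust = tf.where(w_norm > 0.0,
                     tf.where(u_norm > 0.0, w_norm / u_norm, 1.0), 1.0)
    var.assign_sub(lr * trust * update)

# Example: one step on a toy variable.
w = tf.Variable([1.0, -2.0])
m = tf.Variable(tf.zeros(2))
v = tf.Variable(tf.zeros(2))
lamb_step(w, m, v, grad=tf.constant([0.1, 0.3]), step=1.0)
```

Wrapped in `tf.function`, the same code runs in graph mode wherever the standard kernels are registered, which is what makes it a workable CPU/TPU/ROCm path.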
@spidydev Thank you for your contribution. @tomerk As this optimizer is used by the TensorFlow Model SIG, if they are not interested in reviewing these changes I will close the PR.
Thank you for the contribution. I suggest moving this PR to https://github.com/keras-team/keras-nlp
Results training BERT-Large (334,882,823 trainable params), execution time in seconds: