Add Novograd Optimizer #836
Merged
Commits (62)
fdc6e89
Add initial novograd
0xshreyash 60bee25
Add tests from rectified adam
0xshreyash b692a70
Add build and __init__
0xshreyash b622e2a
Code format
0xshreyash b3ea7ef
Fix errors
0xshreyash d1f2582
More fixes
0xshreyash fa184d7
Add back one - beta_1_t
0xshreyash 863b4dc
Fix some sparse errors
0xshreyash 53fca2c
Fix some sparse errors
0xshreyash 54e38ec
More fixes
0xshreyash 156af57
More sparse fixes
0xshreyash b2ed149
Change tests
0xshreyash ceb97ba
Fix ordering
0xshreyash 0fe393d
More test fixes
0xshreyash 731f132
Account for learning rate
0xshreyash fa17484
Fix error
0xshreyash 2071669
Sparse fix
0xshreyash f725cdd
Fix weight decay dense
0xshreyash a989460
More complete testing for desne resource apply
0xshreyash 1b60c3c
Add linear model test
0xshreyash 5f7a005
Increase number of epochs for novograd
0xshreyash 993f335
Increae error threshold
0xshreyash 2a386c0
More epochs
0xshreyash a06074b
More linear updates
0xshreyash 78c1c47
More changes to linear test
0xshreyash 2749a96
Update another dense test
0xshreyash da62d39
Tests
0xshreyash 6020c2c
Revert change to swa_test
0xshreyash 6cbb605
Possibly fix all tests
0xshreyash 1652150
Documentation and cleanup
0xshreyash e7b15e1
Attempt to reduce tolerance for linear test
0xshreyash 1af1b0b
Reduce even further
0xshreyash 5503694
Even further
0xshreyash c347854
Pushed as far as possible
0xshreyash b31b440
Pylint and sanity check
0xshreyash 4cc524a
More epochs and change beta_1 and beta_2
0xshreyash 23c8594
More epochs and change beta_1 and beta_2
0xshreyash 743e829
More epochs and change beta_1 and beta_2
0xshreyash 372e814
Fix typo
0xshreyash a03c56e
Make current values more important
0xshreyash e7e5f01
Make current values more important
0xshreyash 86b6edf
Increase threshold
0xshreyash 97aa225
Remove learning rate
0xshreyash 8634520
Update tests
0xshreyash 0c24301
Update other tests
0xshreyash 7dd3fb6
Update grad_averaging logic
0xshreyash 10afb3f
Update grad_averaging logic
0xshreyash 6c6012f
Add amsgrad
0xshreyash d66a2e8
Tests update
0xshreyash dfc5fc5
Tests update
0xshreyash f3c6b5e
Tests update
0xshreyash 43925fe
Code format
0xshreyash 5c85a8f
Code format
0xshreyash 515b34b
Use keras training ops
0xshreyash ab08e5f
Address comments
0xshreyash 15dbe26
Address comments
0xshreyash a4d4783
Tests for grad_averaging
0xshreyash e434084
Fix grad_averaging test
0xshreyash 82071be
Test fix
0xshreyash 7a30de0
Change default epsilon value
0xshreyash 185fe92
Fix code format
0xshreyash 977cb92
docs: add TODO
0xshreyash
@@ -9,6 +9,7 @@
 | lazy_adam | Saishruthi Swaminathan | [email protected] |
 | lookahead | Zhao Hanguang | [email protected] |
 | moving_average | Dheeraj R. Reddy | [email protected] |
+| novograd | Shreyash Patodia | [email protected] |
 | rectified_adam | Zhao Hanguang | [email protected] |
 | stochastic_weight_averaging | Shreyash Patodia | [email protected] |
 | weight_decay_optimizers | Phil Jund | [email protected] |

@@ -25,6 +26,7 @@
 | lazy_adam | LazyAdam | https://arxiv.org/abs/1412.6980 |
 | lookahead | Lookahead | https://arxiv.org/abs/1907.08610v1 |
 | moving_average | MovingAverage | |
+| novograd | NovoGrad | https://nvidia.github.io/OpenSeq2Seq/html/optimizers.html |
 | rectified_adam | RectifiedAdam | https://arxiv.org/pdf/1908.03265v1.pdf |
 | stochastic_weight_averaging | SWA | https://arxiv.org/abs/1803.05407.pdf |
 | weight_decay_optimizers | SGDW, AdamW, extend_with_decoupled_weight_decay | https://arxiv.org/pdf/1711.05101.pdf |
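As a quick orientation for readers of the table above, the snippet below is roughly how the new entry would be used from the addons package once the corresponding export in `__init__.py` (added in this PR) is in place. The model itself is a made-up example, not part of the PR:

```python
import tensorflow as tf
import tensorflow_addons as tfa

# A throwaway model, just to show the optimizer being wired in.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(1),
])

# NovoGrad is registered as a Keras serializable, so it can be passed to
# compile() like any built-in optimizer.
model.compile(
    optimizer=tfa.optimizers.NovoGrad(
        learning_rate=1e-3, weight_decay=1e-3, grad_averaging=True),
    loss='mse')
```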
@@ -0,0 +1,246 @@
# Copyright 2019 The TensorFlow Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""NovoGrad for TensorFlow."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
# TODO: Find public API alternatives to these
from tensorflow.python.training import training_ops


@tf.keras.utils.register_keras_serializable(package='Addons')
class NovoGrad(tf.keras.optimizers.Optimizer):
    """The NovoGrad optimizer was first proposed in [Stochastic Gradient
    Methods with Layerwise Adaptive Moments for Training of Deep
    Networks](https://arxiv.org/pdf/1905.11286.pdf).

    NovoGrad is a first-order SGD-based algorithm, which computes second
    moments per layer instead of per weight as in Adam. Compared to Adam,
    NovoGrad takes less memory and has been found to be more numerically
    stable. More specifically, we compute (for more details refer to this
    [link](https://nvidia.github.io/OpenSeq2Seq/html/optimizers.html)):

    Second order moment = exponential moving average of the layer-wise
    square of the grads:
        v_t <-- beta_2 * v_{t-1} + (1 - beta_2) * (g_t)^2
    First order moment in one of four modes:
        1. moment of grads normalized by v_t:
            m_t <- beta_1 * m_{t-1} + [g_t / (sqrt(v_t) + epsilon)]
        2. moment similar to Adam: exponential moving average of grads
           normalized by v_t (set grad_averaging = True to use this):
            m_t <- beta_1 * m_{t-1} +
                   [(1 - beta_1) * (g_t / (sqrt(v_t) + epsilon))]
        3. weight decay adds a w_d term after grads are rescaled by
           1/sqrt(v_t) (set weight_decay > 0 to use this):
            m_t <- beta_1 * m_{t-1} +
                   [(g_t / (sqrt(v_t) + epsilon)) + (w_d * w_{t-1})]
        4. weight decay + exponential moving average from Adam:
            m_t <- beta_1 * m_{t-1} +
                   [(1 - beta_1) * ((g_t / (sqrt(v_t) + epsilon)) +
                   (w_d * w_{t-1}))]
    Weight update:
        w_t <- w_{t-1} - lr_t * m_t

    Example of usage:
    ```python
    opt = tfa.optimizers.NovoGrad(
        lr=1e-3,
        beta_1=0.9,
        beta_2=0.999,
        weight_decay=0.001,
        grad_averaging=False)
    ```
    """

    def __init__(self,
                 learning_rate=0.001,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-7,
                 weight_decay=0.0,
                 grad_averaging=False,
                 amsgrad=False,
                 name='NovoGrad',
                 **kwargs):
        r"""Construct a new NovoGrad optimizer.

        Args:
            learning_rate: A `Tensor`, a floating point value, or a schedule
                that is a `tf.keras.optimizers.schedules.LearningRateSchedule`.
                The learning rate.
            beta_1: A float value or a constant float tensor.
                The exponential decay rate for the 1st moment estimates.
            beta_2: A float value or a constant float tensor.
                The exponential decay rate for the 2nd moment estimates.
            epsilon: A small constant for numerical stability.
            weight_decay: A floating point value. Weight decay for each param.
            grad_averaging: determines whether to use Adam style exponential
                moving averaging for the first order moments.
            amsgrad: boolean. Whether to apply the AMSGrad variant, which
                keeps the maximum of past second moments.
            name: Optional name for the operations created when applying
                gradients. Defaults to "NovoGrad".
            **kwargs: keyword arguments. Allowed to be {`clipnorm`,
                `clipvalue`, `lr`, `decay`}. `clipnorm` is clip gradients
                by norm; `clipvalue` is clip gradients by value; `decay` is
                included for backward compatibility to allow time inverse
                decay of learning rate. `lr` is included for backward
                compatibility; it is recommended to use `learning_rate`
                instead.
        """
        super(NovoGrad, self).__init__(name, **kwargs)
        if weight_decay < 0.0:
            raise ValueError('Weight decay rate cannot be negative')
        self._set_hyper('learning_rate', kwargs.get('lr', learning_rate))
        self._set_hyper('decay', self._initial_decay)
        self._set_hyper('beta_1', beta_1)
        self._set_hyper('beta_2', beta_2)
        self._set_hyper('weight_decay', weight_decay)
        self._set_hyper('grad_averaging', grad_averaging)
        self.amsgrad = amsgrad
        self.epsilon = epsilon or tf.keras.backend.epsilon()

    def _create_slots(self, var_list):
        # Create slots for the first and second moments.
        # Separate for-loops to respect the ordering of slot variables from v1.
        for var in var_list:
            self.add_slot(var=var, slot_name='m', initializer='zeros')
        for var in var_list:
            self.add_slot(
                var=var,
                slot_name='v',
                initializer=tf.zeros(shape=[], dtype=var.dtype))
        if self.amsgrad:
            for var in var_list:
                self.add_slot(var, 'vhat')

    def _prepare_local(self, var_device, var_dtype, apply_state):
        super(NovoGrad, self)._prepare_local(var_device, var_dtype,
                                             apply_state)
        beta_1_t = tf.identity(self._get_hyper('beta_1', var_dtype))
        beta_2_t = tf.identity(self._get_hyper('beta_2', var_dtype))
        apply_state[(var_device, var_dtype)].update(
            dict(
                epsilon=tf.convert_to_tensor(self.epsilon, var_dtype),
                beta_1_t=beta_1_t,
                beta_2_t=beta_2_t,
                one_minus_beta_2_t=1 - beta_2_t,
                one_minus_beta_1_t=1 - beta_1_t,
            ))

    def set_weights(self, weights):
        params = self.weights
        # If the weights are generated by Keras V1 optimizer, it includes
        # vhats even without amsgrad, i.e., V1 optimizer has 3x + 1 variables,
        # while V2 optimizer has 2x + 1 variables. Filter vhats out for
        # compatibility.
        num_vars = int((len(params) - 1) / 2)
        if len(weights) == 3 * num_vars + 1:
            weights = weights[:len(params)]
        super(NovoGrad, self).set_weights(weights)

    def _resource_apply_dense(self, grad, var, apply_state=None):
        var_device, var_dtype = var.device, var.dtype.base_dtype
        coefficients = ((apply_state or {}).get((var_device, var_dtype))
                        or self._fallback_apply_state(var_device, var_dtype))
        weight_decay = self._get_hyper('weight_decay')
        grad_averaging = self._get_hyper('grad_averaging')

        v = self.get_slot(var, 'v')
        g_2 = tf.reduce_sum(tf.square(tf.cast(grad, tf.float32)))
        v_t = tf.cond(
            tf.equal(self.iterations, 0),
            lambda: g_2,
            lambda: v * coefficients['beta_2_t'] +
            g_2 * coefficients['one_minus_beta_2_t'])
        v_t = v.assign(v_t, use_locking=self._use_locking)

        if self.amsgrad:
            vhat = self.get_slot(var, 'vhat')
            vhat_t = vhat.assign(
                tf.maximum(vhat, v_t), use_locking=self._use_locking)
            grad = grad / (tf.sqrt(vhat_t) + self.epsilon)
        else:
            grad = grad / (tf.sqrt(v_t) + self.epsilon)
        grad = tf.cond(
            tf.greater(weight_decay, 0),
            lambda: grad + weight_decay * var,
            lambda: grad)
        grad = tf.cond(
            tf.logical_and(grad_averaging, tf.not_equal(self.iterations, 0)),
            lambda: grad * coefficients['one_minus_beta_1_t'],
            lambda: grad)
        m = self.get_slot(var, 'm')
        return training_ops.resource_apply_keras_momentum(
            var.handle,
            m.handle,
            coefficients['lr_t'],
            grad,
            coefficients['beta_1_t'],
            use_locking=self._use_locking,
            use_nesterov=False)

    def _resource_apply_sparse(self, grad, var, indices, apply_state=None):
        var_device, var_dtype = var.device, var.dtype.base_dtype
        coefficients = ((apply_state or {}).get((var_device, var_dtype))
                        or self._fallback_apply_state(var_device, var_dtype))
        weight_decay = self._get_hyper('weight_decay')
        grad_averaging = self._get_hyper('grad_averaging')

        v = self.get_slot(var, 'v')
        g_2 = tf.reduce_sum(tf.square(tf.cast(grad, tf.float32)))
        # v is just a scalar and does not need to involve sparse tensors.
        v_t = tf.cond(
            tf.equal(self.iterations, 0),
            lambda: g_2,
            lambda: v * coefficients['beta_2_t'] +
            g_2 * coefficients['one_minus_beta_2_t'])
        v_t = v.assign(v_t, use_locking=self._use_locking)

        if self.amsgrad:
            vhat = self.get_slot(var, 'vhat')
            vhat_t = vhat.assign(
                tf.maximum(vhat, v_t), use_locking=self._use_locking)
            grad = grad / (tf.sqrt(vhat_t) + self.epsilon)
        else:
            grad = grad / (tf.sqrt(v_t) + self.epsilon)
        grad = tf.cond(
            tf.greater(weight_decay, 0),
            lambda: grad + weight_decay * var,
            lambda: grad)
        grad = tf.cond(
            tf.logical_and(grad_averaging, tf.not_equal(self.iterations, 0)),
            lambda: grad * coefficients['one_minus_beta_1_t'],
            lambda: grad)
        m = self.get_slot(var, 'm')
        return training_ops.resource_sparse_apply_keras_momentum(
            var.handle,
            m.handle,
            coefficients['lr_t'],
            tf.gather(grad, indices),
            indices,
            coefficients['beta_1_t'],
            use_locking=self._use_locking,
            use_nesterov=False)

    def get_config(self):
        config = super(NovoGrad, self).get_config()
        config.update({
            'learning_rate': self._serialize_hyperparameter('learning_rate'),
            'beta_1': self._serialize_hyperparameter('beta_1'),
            'beta_2': self._serialize_hyperparameter('beta_2'),
            'epsilon': self.epsilon,
            'weight_decay': self._serialize_hyperparameter('weight_decay'),
            'grad_averaging': self._serialize_hyperparameter('grad_averaging'),
        })
        return config
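For reference, the layer-wise update implemented above can be written out in a few lines of plain NumPy. This sketch is illustrative only (not part of the PR): it follows the equations in the class docstring, with `grad_averaging` and `weight_decay` both enabled, and all names are made up for the example.

```python
import numpy as np

def novograd_step(w, g, m, v, t, lr=1e-3, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, weight_decay=1e-3, grad_averaging=True):
    """One NovoGrad update for a single layer's weights w with gradient g.

    m is the per-layer first moment (same shape as w), v is the per-layer
    *scalar* second moment, t is the iteration counter starting at 0.
    """
    g_2 = np.sum(np.square(g))                 # layer-wise squared grad norm
    v = g_2 if t == 0 else beta_2 * v + (1.0 - beta_2) * g_2
    g_hat = g / (np.sqrt(v) + epsilon)         # normalize by layer-wise v_t
    if weight_decay > 0:
        g_hat = g_hat + weight_decay * w       # decoupled weight decay term
    if grad_averaging and t > 0:
        g_hat = (1.0 - beta_1) * g_hat         # Adam-style averaging
    m = beta_1 * m + g_hat                     # momentum accumulation
    w = w - lr * m                             # weight update
    return w, m, v
```

Note that the class above delegates the last two steps to the Keras momentum kernel, which folds the learning rate into the accumulator; with a constant learning rate the result is the same up to that scaling.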
Seems that private API is the only way to import this. @seanpmorgan Sean, what do you think about this?
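For context, the non-Nesterov case of that kernel could in principle be reproduced with public variable ops only. The sketch below is a hypothetical fallback, not what the PR does, and the helper name is made up:

```python
def _apply_keras_momentum_fallback(var, accum, lr, grad, momentum):
    # Hypothetical public-API stand-in for the non-Nesterov case of
    # training_ops.resource_apply_keras_momentum:
    #   accum <- momentum * accum - lr * grad
    #   var   <- var + accum
    accum.assign(momentum * accum - lr * grad)
    return var.assign_add(accum)
```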
Hi @shreyashpatodia, could we add some comments here? Like
Done.