WIP: Epsilon Backstitch #1511
base: master
Conversation
@@ -195,7 +201,10 @@ def train_new_models(dir, iter, srand, num_jobs,
                        fr_shft=frame_shift, l2=l2_regularize,
                        xent_reg=xent_regularize, leaky=leaky_hmm_coefficient,
                        parallel_train_opts=run_opts.parallel_train_opts,
                        momentum=momentum, max_param_change=max_param_change,
                        momentum=momentum,
                        backstitch_training_scale=backstitch_training_scale,
See if it's feasible to just pass around a string 'backstitch_opts_str'. These functions have too many args.
[not urgent, you can do the experiments first; we may abandon the epsilon stuff if there turns out to be no advantage in it.]
                             that backstitch training is applied on.""")
    self.parser.add_argument("--trainer.optimization.backstitch-training-epsilon",
                             type=float, dest='backstitch_training_epsilon',
                             default=0.0, help="""epsilon of backstitch
Eventually, we should probably make this default to something like 0.1 or 0.05 (and have it be the scale that determines whether or not backstitch is used at all). But we'll have to do some experiments first, to see whether the epsilon version is even better, and what the value should be.
src/nnet3/nnet-chain-training.cc
Outdated
@@ -68,23 +68,42 @@ void NnetChainTrainer::Train(const NnetChainExample &chain_eg) {
                                  &request);
  const NnetComputation *computation = compiler_.Compile(request);

  NnetComputer computer(nnet_config.compute_config, *computation,
  if (nnet_config.backstitch_training_scale > 0.0 &&
      nnet_config.backstitch_training_epsilon > 0.0 && num_minibatches_processed_
no need to check that the epsilon is > 0.0 for this purpose. Let's just assert that it's always >0.0. (change the default to a nonzero value).
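A minimal sketch of what that might look like (this is not the PR's code; the option names come from the diff above, and the exact placement inside NnetChainTrainer::Train() is an assumption):

// Sketch only: assert once that epsilon is sane (default now nonzero),
// then gate the backstitch pass on the scale alone.
KALDI_ASSERT(nnet_config.backstitch_training_epsilon > 0.0);
if (nnet_config.backstitch_training_scale > 0.0
    /* && ... the minibatch-interval condition from the diff above ... */) {
  // ... backstitch branch ...
}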
src/nnet3/nnet-chain-training.cc
Outdated
@@ -167,6 +188,14 @@ void NnetChainTrainer::UpdateParamsWithMaxChange() {
          << "UpdatableComponent; change this code.";
    BaseFloat max_param_change_per_comp = uc->MaxChange();
    KALDI_ASSERT(max_param_change_per_comp >= 0.0);
    if (nnet_config.backstitch_training_scale > 0.0 &&
You need to do a much better job, via comments, of explaining what is going on here. Even I don't understand this. It's possible that the code structure here is the problem, and it will have to be changed so that it's easier to understand.
Here is my proposal.
We take UpdateParamsWithMaxChange out of the class (probably declare it in nnet-utils.h), remove the is_backstitch_step argument, take out the ScaleNnet() from it [that can be done separately by the
calling code], and give it this interface:
/**
This function does the operation '*nnet += scale * delta_nnet', while respecting
any max-parameter-change (max-param-change) specified in the updatable
components, plus the global max-param-change specified as 'max_param_change'.
With max-changes taken into account, the operation of this function is equivalent
to the following, although it's done more efficiently:
\code
Nnet temp_nnet(delta_nnet);
ScaleNnet(1.0 / max_change_scale, &temp_nnet);
[ Scale down parameters for each component of temp_nnet as needed so
their Euclidean norms do not exceed their per-component max-changes ]
[ Scale down temp_nnet as needed so its Euclidean norm does not exceed
the global max-change ]
ScaleNnet(max_change_scale, &temp_nnet); // undo the previous scaling.
AddNnet(temp_nnet, scale, nnet);
\endcode
@param [in] delta_nnet The copy of '*nnet' neural network that contains
the proposed change in parameters. Normally this will previously
have been set to: (delta_nnet =
parameter-derivative-on-current-minibatch * learning-rate per parameter),
with any natural gradient applied as specified in the components; but
this may be different if momentum or backstitch are used.
@param [in] max_param_change The global max-param-change specified on the
command line (e.g. 2.0), which specifies the largest change allowed
to '*nnet' in Euclidean norm. If <= 0, no global max-param-change will
be enforced (but any max-change values specified in the components
will still be enforced; see UpdatableComponent::MaxChange(), and
search for 'max-change' in the configs or nnet3-info output).
@param [in] max_change_scale This value, which will normally be 1.0, is used to scale
all per-component max-change values and the global 'max_param_change',
before applying them (so we use 'max_change_scale * uc->MaxChange()'
as the per-component max-change, and 'max_change_scale * max_param_change'
as the global max-change).
@param [in] scale This value, which will normally be 1.0, is a scaling factor used when adding
to 'nnet', applied after any max-changes.
@param [in,out] nnet The nnet which we add to.
*/
void UpdateNnetWithMaxChange(const Nnet &delta_nnet,
BaseFloat max_param_change,
BaseFloat max_change_scale,
BaseFloat scale, Nnet *nnet);
You may have to have the backstitch code in a separate code branch from the regular/momentum
training code, to keep the simple case simple.
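For concreteness, here is a rough sketch (not code from this PR) of how the calling code might drive the proposed interface for the two backstitch updates. The member name delta_nnet_ and the scale choices (-alpha, then 1 + alpha) are assumed from the standard backstitch formulation; the epsilon variant would adjust these scales.

// Sketch only: 'delta_nnet_' is assumed to hold the learning-rate-scaled
// gradient for the current minibatch; the epsilon term is omitted here.
BaseFloat alpha = nnet_config.backstitch_training_scale;  // e.g. 0.2

// First (backstitch) pass: step *against* the gradient with scale -alpha;
// per-component and global max-changes are scaled by alpha.
UpdateNnetWithMaxChange(*delta_nnet_, nnet_config.max_param_change,
                        /*max_change_scale=*/ alpha,
                        /*scale=*/ -alpha, nnet_);
ScaleNnet(0.0, delta_nnet_);  // clear the delta; done by the caller as suggested

// ... recompute the gradient on the same minibatch into delta_nnet_ ...

// Second pass: step forward with scale (1 + alpha), max-changes scaled likewise.
UpdateNnetWithMaxChange(*delta_nnet_, nnet_config.max_param_change,
                        /*max_change_scale=*/ 1.0 + alpha,
                        /*scale=*/ 1.0 + alpha, nnet_);
ScaleNnet(0.0, delta_nnet_);

Scaling the max-change limits by the same factor as each step keeps the constraints proportional to the size of that pass's update, which is presumably the intent of the max_change_scale argument.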
src/nnet3/nnet-chain-training.cc
Outdated
                          nnet_);
  scale_backstitch = (is_backstitch_step ? (-backstitch_ratio + 1 +
      nnet_config.backstitch_training_epsilon) / backstitch_ratio :
      nnet_config.momentum);
This code wouldn't work with momentum (you'd probably need a separate copy of the nnet for that), so better to use 0.0 instead of nnet_config.momentum, and assert that momentum is zero. After restructuring there may be 2 paths for momentum vs backstitch, anyway.
I can't figure out whether this code is correct. My rule when I can't easily figure out correctness is to ask for the code to be restructured, because if present me can't figure it out, then future me definitely won't be able to, not to mention present or future anyone-else.
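Applied to the expression in the diff above, the immediate change would be roughly the following (a sketch, keeping the variable names from the diff):

// Hypothetical rewrite: momentum is asserted away rather than folded into
// the backstitch scale expression.
KALDI_ASSERT(nnet_config.momentum == 0.0);
BaseFloat scale_backstitch =
    (is_backstitch_step ?
     (-backstitch_ratio + 1 + nnet_config.backstitch_training_epsilon) /
         backstitch_ratio :
     0.0);  // was nnet_config.momentum; momentum is unsupported on this path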
The reorganized code (alpha=alpha_old+alpha_old^2=0.24, epsilon=0.2) gave the same result as the old backstitch code (alpha_old=0.2).
ok, good; try with smaller epsilon, e.g. 0.1 or less, if you haven't already. [but don't waste CPUs: if it seems to be progressing the same as the old code for the same parameters, using old runs might be OK.]
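For reference, the numbers quoted above are consistent with the stated reparameterization:

\alpha = \alpha_{\mathrm{old}} + \alpha_{\mathrm{old}}^{2} = 0.2 + 0.2^{2} = 0.24,
\qquad \epsilon = 0.2, \quad \alpha_{\mathrm{old}} = 0.2 .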
(force-pushed from c85b29c to 36806ed)
@danpovey the latest backstitch code has been tested on the chain model.
btw, I'm testing it as well, actually in combination with dropout. Just FYI.
y.
Yiming, have you modified the code with the random-seed stuff so that it's
fully compatible with dropout? [Yenda, unless we take care to make the
random seed the same in the 1st and second passes of it, it may not give
as much improvement as it could.] Yiming, please get on that right away
if you haven't already done it.
Dan
Yes, I have done that in the commit "resets random generator's seed for RandomComponent..."
src/nnet3/nnet-chain-training.cc
Outdated
void NnetChainTrainer::TrainInternal(const NnetChainExample &eg,
                                     const NnetComputation &computation,
                                     bool is_backstitch_step) {
  srand(srand_seed_ + num_minibatches_processed_);
You shouldn't do this if you're not doing backstitch, because it's not free of cost.
So only execute these two statements if you're doing backstitch; calling it outside of (and just before) TrainInternal will make more sense.
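A sketch of the suggested placement on the caller side (assumptions: only srand() appears in the quoted diff, so the second statement is presumed to be a call like ResetGenerators() that re-seeds the RandomComponents; the surrounding control flow is illustrative):

// Caller side (assumed): pay the re-seeding cost only when backstitch is in
// use, and do it just before each TrainInternal() call so both passes of the
// same minibatch draw identical dropout masks / RandomComponent state.
if (nnet_config.backstitch_training_scale > 0.0) {
  srand(srand_seed_ + num_minibatches_processed_);
  ResetGenerators(nnet_);  // assumed helper that re-seeds the RandomComponents
}
TrainInternal(eg, *computation, is_backstitch_step);

Because the seed (srand_seed_ + num_minibatches_processed_) is the same before both passes of a given minibatch, the dropout masks line up between the backstitch step and the following forward step.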
@@ -158,13 +169,12 @@ class NnetTrainer {

  ~NnetTrainer();
 private:
  void ProcessOutputs(const NnetExample &eg,
                      NnetComputer *computer);
  void TrainInternal(const NnetExample &eg,
we'll need documentation for this function.
I just noticed that this PR is to master, and it isn't even up to date with master. Can you please first try to merge with master, and then try to merge with the kaldi_52 branch? We'll be wanting to merge this with kaldi_52. You will likely have conflicts.
…to ensure consistency, e.g., dropout mask, in backstitch training
(force-pushed from cd59e5a to 73787c6)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
For discussion. Modified on top of #1391