
Support for Scheduler taking a value to the step function #2849

Closed
Alicegaz opened this issue Aug 6, 2020 · 11 comments · Fixed by Alicegaz/pytorch-lightning#1
Labels: discussion (In a discussion stage), feature (Is an improvement or enhancement), won't fix (This will not be worked on)

Comments


Alicegaz commented Aug 6, 2020

Looking at the code in training_loop.py, it seems the only scheduler that can take a value in its step function is ReduceLROnPlateau. However, the CosineAnnealingWarmRestarts scheduler and custom schedulers can also take an epoch/step id or any other value in their step function.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9ab071588bbe0e24441ae0f07a271bded0e4d9c3/pytorch_lightning/trainer/training_loop.py#L1148
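For context, here is a minimal plain-PyTorch sketch (not Lightning code) of the different step() signatures in play; the schedulers and values are only illustrative:

from torch import nn, optim

model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ReduceLROnPlateau expects a metric value in step() ...
plateau = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min')
plateau.step(0.42)  # e.g. a validation loss

# ... while CosineAnnealingWarmRestarts accepts a (fractional) epoch,
# and plain _LRScheduler subclasses such as StepLR take no value at all.
warm_restarts = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
warm_restarts.step(1.5)  # e.g. epoch + batch_idx / num_batches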


github-actions bot commented Aug 6, 2020

Hi! Thanks for your contribution, great first issue!

@awaelchli

@ananyahjha93 and I discussed this and we think that this change is not necessary, as it does not align with the PyTorch interface of the _LRScheduler base class. Passing the epoch to step is already deprecated, and _LRScheduler does not take metrics in step. The only exceptions are ReduceLROnPlateau and custom schedulers implemented by users. PL supports the former directly, and for the custom ones I recommend that the user simply override the optimizer_step hook in LightningModule and pass the metric into the scheduler there.
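A rough sketch of that workaround follows. The optimizer_step hook signature has changed across Lightning versions, so the extra arguments are absorbed with *args/**kwargs here, and the model and scheduler choices are illustrative only:

import torch
from torch import nn
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(4, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.net(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # Keep the scheduler on the module instead of returning it to the Trainer,
        # so the hook below fully controls what gets passed to step().
        self.warm_restarts = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
        return optimizer

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_idx, *args, **kwargs):
        optimizer.step()
        # pass a fractional epoch, which CosineAnnealingWarmRestarts accepts
        self.warm_restarts.step(epoch + batch_idx / self.trainer.num_training_batches)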

@awaelchli

@PyTorchLightning/core-contributors


YannDubs commented Aug 9, 2020

I think it would be a mistake / lack of generality not to let the user decide when a metric should be given to the scheduler. You only need to change one line of code to let the user decide for themselves:

Currently: https://github.com/PyTorchLightning/pytorch-lightning/blob/071e09fe38d6170d8b019101a3c1871aeb1b2e99/pytorch_lightning/trainer/optimizers.py#L97

def configure_schedulers(self, schedulers: list):
        # Convert each scheduler into dict structure with relevant information
        lr_schedulers = []
        default_config = {'interval': 'epoch',  # default every epoch
                          'frequency': 1,  # default every epoch/batch
                          'reduce_on_plateau': False,  # most often not ReduceLROnPlateau scheduler
                          'monitor': 'val_loss'}  # default value to monitor for ReduceLROnPlateau
        for scheduler in schedulers:
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('Lr scheduler should have key `scheduler`',
                                     ' with item being a lr scheduler')
                scheduler['reduce_on_plateau'] = isinstance(
                    scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)

A one-line change lets the user decide (don't override "reduce_on_plateau", so the user can set it themselves to ensure that a metric is given to step()):

[...]
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('...')
                if "reduce_on_plateau" not in scheduler:
                    scheduler['reduce_on_plateau'] = isinstance(
                        scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)

Of course it would then make sense to rename 'reduce_on_plateau' to something like 'access_metric'.

Advantages:

  • The user would then be able to specify, from the scheduler dictionary, that the metric should be used (just like 'interval': 'epoch')
  • Everything is already there ('monitor': 'val_loss', dealing with step(metric)); it just lets the user use it for their own use cases
  • It's more general than assuming that ReduceLROnPlateau will be the only scheduler that depends on the validation metric (a hypothetical usage sketch follows below)
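As a purely hypothetical illustration of what this would look like on the user side, assuming the proposed key is named access_metric (MetricAwareScheduler is a made-up custom scheduler, not a real class):

def configure_optimizers(self):
    optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
    scheduler = {
        'scheduler': MetricAwareScheduler(optimizer),  # hypothetical metric-aware scheduler
        'interval': 'epoch',
        'monitor': 'val_loss',
        'access_metric': True,  # proposed key: ask Lightning to call scheduler.step(val_loss)
    }
    return [optimizer], [scheduler]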

@ananyahjha93 ananyahjha93 added the discussion and feature labels and removed the docs label Aug 9, 2020

ananyahjha93 commented Aug 9, 2020

@YannDubs this is actually a good solution. Would you submit a PR for it (renaming reduce_on_plateau to access_metric), making changes to the place you mentioned above and to https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L1252 where scheduler.step() is called?

Also, add proper documentation for the changes in the PR as they are going to be significant.

@ananyahjha93 ananyahjha93 self-assigned this Aug 9, 2020
@awaelchli

+1, good suggestion


Alicegaz commented Aug 10, 2020

@YannDubs, @ananyahjha93, @awaelchli regarding your concerns: the 'monitor' value in a scheduler dictionary already carries the meaning of passing a value to the step function, so there is no mistake. Why should we introduce another key, 'access_metric'? It would add redundancy, since 'monitor' would only be referenced inside the check for 'access_metric' and is not used anywhere else.

Also, YannDubs's solution does not resolve the problem discussed in this issue. Unlike the suggestion in my PR #2851, with this change we would still not be able to do the following:

iters = len(dataloader)
for epoch in range(20):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample['inputs'], sample['labels']
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step(epoch + i / iters)

Check out the official PyTorch code for CosineAnnealingWarmRestarts.
I suggest reopening PR #2851! The only thing we can do to make that PR clearer is to add a check that registered schedulers only take supported types in step ("ReduceOnPlateau": "val_*", 'CosineAnnealingWithWarmRestarts': ['epoch', 'iter_frac'], ...) and to raise an exception for custom schedulers when they are given unsupported values.
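Roughly what such a check could look like; the mapping, names, and error type below are illustrative only, not part of any PR:

from torch.optim import lr_scheduler

# Map known scheduler types to the kinds of values their step() accepts.
SUPPORTED_STEP_ARGS = {
    lr_scheduler.ReduceLROnPlateau: {'monitor'},                       # step(metric)
    lr_scheduler.CosineAnnealingWarmRestarts: {'epoch', 'iter_frac'},  # step(epoch + i / iters)
}

def check_step_arg(scheduler, arg_kind):
    for cls, allowed in SUPPORTED_STEP_ARGS.items():
        if isinstance(scheduler, cls):
            if arg_kind not in allowed:
                raise ValueError(
                    f'{type(scheduler).__name__} does not support step({arg_kind})')
            return
    # Custom schedulers: nothing to verify against, leave the choice to the user.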

@williamFalcon

@ananyahjha93 let’s look at both solutions and aim for a general one.

I’d prefer to let the user have the flexibility on this


YannDubs commented Aug 10, 2020

@Alicegaz using 'access_metric' allows you to keep a meaningful default for 'monitor' (i.e. val_loss), but if the number of keys in the dict is a concern, I agree it can be dropped.

For your second point, I agree my method doesn't help (not sure how important that is given that PyTorch is removing those arguments, but it might be!). I was trying to say that even though PyTorch is deprecating arguments to the step function, PL should at least enable other schedulers to depend on the (validation) metric, which is easily done with essentially no change to the code.

@ananyahjha93

@Alicegaz yes, the idea was to use only access_metric rather than duplicating monitor, since access_metric is more general than the specific case of ReduceLROnPlateau. But for the second case, PyTorch is deprecating passing epochs to the step() function, so I am not sure we should keep supporting that; it just requires users to add the extra logic themselves if they are inheriting from the _LRScheduler class.

But I would also agree that custom schedulers should have support for passing arguments (not specifically epochs) to their step function.
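For illustration only, here is a toy example of such a custom scheduler (not from the PR or from PyTorch); the point is simply a step() that takes a value:

from torch.optim.optimizer import Optimizer

class FractionalWarmup:
    """Toy scheduler whose step() takes a fractional epoch."""

    def __init__(self, optimizer: Optimizer, warmup_epochs: float = 1.0):
        self.optimizer = optimizer
        self.warmup_epochs = warmup_epochs
        self.base_lrs = [group['lr'] for group in optimizer.param_groups]

    def step(self, epoch_frac: float) -> None:
        # Linearly scale the learning rate up to its base value during warmup.
        scale = min(1.0, epoch_frac / self.warmup_epochs)
        for group, base_lr in zip(self.optimizer.param_groups, self.base_lrs):
            group['lr'] = base_lr * scale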


stale bot commented Oct 22, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix label Oct 22, 2020
@stale stale bot closed this as completed Oct 29, 2020