fix Misstep #2478
Conversation
@@ -504,6 +505,9 @@ def run_training_epoch(self):
# epoch end hook
self.run_on_epoch_end_hook(model)

# increment the global step by one to progress to the next epoch
self.global_step += 1
I think you need to do self.increment_accumulated_grad_global_step() here, otherwise the total_train_batch_idx does not get updated. I think that's why the tests are failing.
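For readers following along, here is a minimal sketch of what the suggested helper might be doing - the method name and the total_train_batch_idx counter come from the comment above, while the body below is an assumption rather than the actual Lightning source:

# Hypothetical sketch (not the real Lightning implementation): the helper is
# assumed to advance the global step and the running batch counter together,
# so the two stay in sync across epochs.
def increment_accumulated_grad_global_step(self):
    self.global_step += 1            # move on to the next logging step
    self.total_train_batch_idx += 1  # keep the overall batch counter in sync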
I made the change and it works as expected. However, the tests are still failing. Actually, I think I am confused about the test: I ran the tests locally on my pull request and got this summary at the end:
I also ran the tests on the original PyTorch Lightning master branch and got this summary at the end:
Am I making a mistake somewhere? Should I have based my branch on something other than master?
# check if dictionary keys are unique
agg_keys = set([key for met in self._metrics_to_agg for key in met.keys()])
num_keys = sum([len(met) for met in self._metrics_to_agg])

# exclude 'epoch' because it is a metric automatically added in by log_metrics and will count as a
# duplicate. If you want to get rid of this, I would suggest you should get rid of `scalar_metrics[
Are the changes here really related to the global step issue?
If possible, could we make a separate PR for this? We only want to fix one thing at a time.
I can see it being a separate issue, but this piece of code is necessary for the fix to run smoothly. Otherwise, once we fix the step issue (i.e. combining the last training batch, training_epoch_end, and validation_epoch_end metrics in the same step), there will be an error combining the three pieces of data, because the original merge_dicts() in _reduce_agg_metrics() requires that all the metric dictionaries have the same keys - which is usually not the case for these three types of data.
I don't mind doing a separate pull request for this, but it would probably cause most of the tests on this pull request to fail, and we would then have to fix them in the new pull request.
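To make the constraint concrete, here is a rough, self-contained sketch of the duplicate-key check being discussed - the helper and variable names are illustrative rather than the exact PR code; only the special-casing of the automatically added 'epoch' key reflects the change quoted above:

# Illustrative sketch: 'epoch' is added to every metrics dict automatically
# by log_metrics, so it would always register as a duplicate; it is excluded
# before checking whether the remaining keys are unique.
def keys_are_unique(metrics_to_agg):
    keys = [key for met in metrics_to_agg for key in met if key != 'epoch']
    return len(keys) == len(set(keys))

For example, keys_are_unique([{'loss': 0.5, 'epoch': 1}, {'val_loss': 0.4, 'epoch': 1}]) returns True even though both dicts carry 'epoch'.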
This pull request is now in conflict... :(
@ameliatqy I have rebased it onto master - can we finish it in this release?
Hello @ameliatqy! Thanks for updating this PR.
Comment last updated at 2020-08-18 02:14:57 UTC
Codecov Report
@@           Coverage Diff           @@
##           master   #2478    +/-   ##
=======================================
- Coverage      90%     90%     -0%
=======================================
  Files          81      81
  Lines        7672    7689    +17
=======================================
+ Hits         6906    6918    +12
- Misses        766     771     +5
I have worked on the pull request and all the checks have passed. How do I get it approved? I see that I need three approving reviews - am I responsible for finding people to review it, or will reviewers be assigned by someone else?
Sorry for the late response. I have two questions:
This pull request is now in conflict... :(
Please rebase on master, since this code is super old haha
Interesting, I thought I rebased my branch today XD I just rebased it again - does it look right? Rebasing is obviously not my strong point XD
This pull request is now in conflict... :(
@ameliatqy how about this one - mind rebasing and resolving the conflicts? 🐰
Sure, I'd be happy to. But to fully resolve this pull request, could someone answer my question in #2478 (comment)? I can't finish up the pull request without an answer to it.
@@ -73,8 +74,6 @@ def log_metrics(self, metrics, grad_norm_dic, step=None):
# log actual metrics
if self.is_global_zero and self.logger is not None:
    self.logger.agg_and_log_metrics(scalar_metrics, step=step)
    self.logger.save()
How is this related to #3398 (comment)?
@SkafteNicki Maybe you know something about this also?
This is now fixed on master. There is no offset anymore, and training epoch end logs at the correct step.
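As background for the behavior discussed in this thread, here is a hedged, standalone sketch of the step-aggregation idea: metric dicts reported for the same step are merged and only flushed when a new step arrives. The class below is a toy stand-in, not the actual LightningLoggerBase code (the real logger reduces same-key values with aggregation functions rather than overwriting them):

# Toy aggregating logger: dicts logged at the same step are merged, and the
# combined result is flushed once a later step arrives.
class AggregatingLogger:
    def __init__(self):
        self._step = None
        self._agg = {}

    def agg_and_log_metrics(self, metrics, step):
        if self._step is not None and step != self._step:
            self.flush()
        self._step = step
        self._agg.update(metrics)

    def flush(self):
        if self._agg:
            print(f"step {self._step}: {self._agg}")  # stand-in for a real backend
            self._agg = {}

With this scheme, the last training batch, training_epoch_end, and validation_epoch_end metrics logged at the same step end up in a single combined record, which is the behavior the fix aims for.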
What does this PR do?
Fixes #2455. This used to be #2475, but I messed up the rebase and, to be safe, created a new pull request. This is a draft PR.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Yes, I am new to contributing to open source and actually find it quite fun. It may be time to consider a new hobby...