[Horovod] Fix Reduce for Horovod #6585
Conversation
Can we please add a test for this case?
Co-authored-by: Jirka Borovec <[email protected]>
… into horovod-reduce
Hello @amogkam! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2021-04-19 10:28:00 UTC
Great catch! Yes, we definitely missed this.
Co-authored-by: Carlos Mocholí <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #6585 +/- ##
=======================================
- Coverage 92% 85% -7%
=======================================
Files 194 196 +2
Lines 12403 13047 +644
=======================================
- Hits 11433 11139 -294
- Misses 970 1908 +938
great fix!
horovod ftw
I changed the test to re-use the same code we already have for ddp, and also added it to tests/checkpointing/test_checkpoint_callback_frequency.py:139:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pytorch_lightning/trainer/connectors/env_vars_connector.py:40: in insert_env_defaults
return fn(self, **kwargs)
pytorch_lightning/trainer/trainer.py:308: in __init__
replace_sampler_ddp, deterministic, precision, amp_backend, amp_level, plugins
pytorch_lightning/trainer/connectors/accelerator_connector.py:127: in __init__
self.set_distributed_mode()
pytorch_lightning/trainer/connectors/accelerator_connector.py:546: in set_distributed_mode
self._set_horovod_backend()
pytorch_lightning/trainer/connectors/accelerator_connector.py:567: in _set_horovod_backend
self.check_horovod()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pytorch_lightning.trainer.connectors.accelerator_connector.AcceleratorConnector object at 0x7f78aee90950>
def check_horovod(self):
"""Raises a `MisconfigurationException` if the Trainer is not configured correctly for Horovod."""
if not _HOROVOD_AVAILABLE:
raise MisconfigurationException(
'Requested `distributed_backend="horovod"`, but Horovod is not installed.'
"Install with \n $HOROVOD_WITH_PYTORCH=1 pip install horovod[pytorch]"
)
if self.num_gpus > 1 or self.num_nodes > 1:
raise MisconfigurationException(
> "Horovod does not support setting num_nodes / num_gpus explicitly. Use "
"horovodrun / mpirun to configure the number of processes."
)
E pytorch_lightning.utilities.exceptions.MisconfigurationException: Horovod does not support setting num_nodes / num_gpus explicitly. Use horovodrun / mpirun to configure the number of processes.
pytorch_lightning/trainer/connectors/accelerator_connector.py:601: MisconfigurationException
I guess the test needs to use?: But this doesn't allow passing a custom
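The failure above comes from passing `gpus`/`num_nodes` to the Trainer while the backend is Horovod, which the `check_horovod` guard rejects. A minimal sketch of one way a shared test could build its Trainer kwargs conditionally — `trainer_kwargs` is a hypothetical helper, not code from the repository:

```python
# Hypothetical helper (an assumption, not the repo's code): Horovod takes its
# process count from horovodrun/mpirun, so `gpus` and `num_nodes` must not be
# passed to the Trainer explicitly (see the MisconfigurationException above).

def trainer_kwargs(accelerator, gpus=1, num_nodes=1):
    """Build Trainer kwargs, omitting gpus/num_nodes for Horovod."""
    kwargs = {"accelerator": accelerator}
    if accelerator != "horovod":
        kwargs["gpus"] = gpus
        kwargs["num_nodes"] = num_nodes
    return kwargs

print(trainer_kwargs("horovod", gpus=2))  # {'accelerator': 'horovod'}
```

With this, the same test body can be reused across backends without tripping the Horovod guard.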
This change recently got merged here: #6958
  @pytest.mark.parametrize(['k', 'epochs', 'val_check_interval', 'expected'], [(1, 1, 1.0, 1), (2, 2, 0.3, 5)])
- def test_top_k_ddp(save_mock, tmpdir, k, epochs, val_check_interval, expected):
+ def test_top_k_distributed(save_mock, tmpdir, accelerator, k, epochs, val_check_interval, expected):
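The diff above renames the ddp-only test and adds an `accelerator` parameter. A hedged sketch of how that parametrization could look with pytest — the accelerator values and the stripped-down signature are assumptions, not the repository's exact code:

```python
# Sketch only: parametrizing the renamed test over the accelerator as well.
# The fixture wiring (save_mock, tmpdir) is omitted for brevity.
import pytest

@pytest.mark.parametrize("accelerator", ["ddp", "horovod"])
@pytest.mark.parametrize(
    ["k", "epochs", "val_check_interval", "expected"],
    [(1, 1, 1.0, 1), (2, 2, 0.3, 5)],
)
def test_top_k_distributed(accelerator, k, epochs, val_check_interval, expected):
    ...  # body reused from the existing ddp test
```

Stacked `parametrize` decorators produce the cross product, so each (k, epochs, ...) case runs once per accelerator.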
I would not include horovod in the parameterization here yet, otherwise we risk getting another flaky test. I believe we need to improve our testing integration with horovod first. #6935
Since the main fix of this PR was merged, I suggest closing this one.
What does this PR do?
#6410 added an additional `reduce_boolean_decision` step in ModelCheckpoint and EarlyStopping. However, Horovod's reduce functionality is not currently working, which prevents the Horovod backend from being used with ModelCheckpoint. This PR fixes the bug in Horovod's reduce function.

cc @tchaton
Fixes #<issue_number>
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing make sure you have read Review guidelines. In short, see the following bullet-list:
Did you have fun?
Make sure you had fun coding 🙃