move PerformanceEMA to utils, TrainingAverager to optim, update utils #405
Conversation
justheuristic commented on Nov 7, 2021 (edited)
- implement and test an async wrapper for ContextManager (used in DecentralizedAverager and ProgressTracker)
- implement .reset_timer in PerformanceEMA (used when progress is reset, e.g. on an fp16 gradient overflow, which should not affect the samples-per-second estimate)
- move PerformanceEMA to hivemind.utils (rationale: it will be used in hivemind.moe in @mryab 's pipelining experiments)
- move TrainingAverager to hivemind.optim (for compliance with hivemind.Optimizer and future deprecation in favour of TrainingStateAverager)
- fix process-wide RSA keys in the validator
async def enter_asynchronously(context: AbstractContextManager):
    """Wrap a non-async context so that it can be entered asynchronously"""
    async with _AsyncContextWrapper(context) as ret_value:
        yield ret_value
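For context, a hedged usage sketch (the call site and names are assumed for illustration, not quoted from this PR): the wrapper lets async code enter a blocking, lock-guarded context manager without stalling the event loop.

    async def read_averaged_tensors(averager):
        # get_tensors() returns a regular (blocking) context manager; entering
        # it through enter_asynchronously runs __enter__ in an executor thread
        async with enter_asynchronously(averager.get_tensors()) as tensors:
            ...  # work with the tensors while the underlying lock is held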
note: we can't simply do

    try:
        yield await loop.run_in_executor(None, context.__enter__)
    finally:
        context.__exit__(None, None, None)

because this option does not correctly propagate exceptions into the inner context manager
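For completeness, a minimal sketch of a wrapper that does propagate exceptions (an assumption about what _AsyncContextWrapper roughly looks like, not the verbatim implementation): __aexit__ forwards the exception triple to the wrapped manager's __exit__, so the inner context manager can observe, suppress, or translate exceptions raised inside the async block.

    import asyncio
    from contextlib import AbstractContextManager

    class _AsyncContextWrapper:
        def __init__(self, context: AbstractContextManager):
            self._context = context

        async def __aenter__(self):
            loop = asyncio.get_event_loop()
            # run the potentially blocking __enter__ in the default executor
            return await loop.run_in_executor(None, self._context.__enter__)

        async def __aexit__(self, exc_type, exc_value, exc_tb):
            # forward exception info so the inner manager sees what happened
            return self._context.__exit__(exc_type, exc_value, exc_tb)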
Codecov Report
@@            Coverage Diff             @@
##           master     #405      +/-   ##
==========================================
- Coverage   83.73%   83.50%   -0.24%
==========================================
  Files          73       73
  Lines        6678     6687       +9
==========================================
- Hits         5592     5584       -8
- Misses       1086     1103      +17
@@ -453,7 +461,7 @@ async def _run_allreduce(self, group_info: GroupInfo, min_vector_size: int, **kw
             None, load_balance_peers, self.total_size, download_bandwidths, min_vector_size
         )

-        async with self.get_tensors_async() as local_tensors:
+        async with enter_asynchronously(self.get_tensors()) as local_tensors:
This no longer seems to acquire lock_averaged_tensors (which should really be named averaged_tensors_lock, BTW). Is this intended?
It does, inside get_tensors.
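To illustrate the point (a hedged sketch of the presumed locking behavior, not verbatim code from this PR): get_tensors is itself a context manager that takes the lock, so wrapping the manager it returns in enter_asynchronously still acquires the lock, just from an executor thread.

    from contextlib import contextmanager

    @contextmanager
    def get_tensors(self):
        # the lock is acquired on __enter__ and released on __exit__, whether
        # the manager is entered directly or via enter_asynchronously
        with self.lock_averaged_tensors:
            yield self._averaged_tensors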
@@ -37,15 +37,19 @@ def update(self, task_size: float, interval: Optional[float] = None) -> float:
         self.samples_per_second = 1 / max(adjusted_seconds_per_sample, self.eps)
         return self.samples_per_second

+    def reset_timer(self):
Right now it appears this method has only one usage, in the same class. Is it going to have more usages outside of the class? If not, maybe you can make it a private method or simply keep it inlined.
Yes, it will :)
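For reference, a hedged sketch of what reset_timer might look like and how a caller could use it (the details and call site are assumed, not quoted from the PR): it restarts the interval measurement so that time spent on a discarded step does not drag down the samples-per-second estimate.

    import time

    def reset_timer(self):
        # forget the elapsed interval; the next update() measures from now
        self.timestamp = time.perf_counter()

    # hypothetical call site: an fp16 gradient overflow voids the current step
    if grad_overflow_detected:
        performance_ema.reset_timer()  # the wasted time should not count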
@@ -33,6 +33,9 @@ def event_loop():
 def cleanup_children():
     yield

+    with RSAPrivateKey._process_wide_key_lock:
+        RSAPrivateKey._process_wide_key = None
I didn't see this change in the description, and right now it seems to have no effect. Do we need this?
This has no effect now, but it will if you run the existing tests in a different order.
TL;DR, here is how it breaks things:
- you create something that instantiates RSAPrivateKey.process_wide()
- all subsequent tests then use the same key
- crucially, if you create two DHT instances with an RSA validator AFTER instantiating the process-wide key, everything breaks because they both inherit your key

Future PRs introduce tests that instantiate validators before the sensitive tests, breaking everything in bizarre ways.
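A hedged sketch of the underlying caching behavior (assuming process_wide memoizes one key per process; the exact hivemind implementation may differ): once the cached key exists, every later caller shares it, which is why the test fixture above resets it.

    import threading

    class RSAPrivateKey:
        _process_wide_key = None
        _process_wide_key_lock = threading.RLock()

        @classmethod
        def process_wide(cls):
            # double-checked locking: generate one key per process, then reuse it
            if cls._process_wide_key is None:
                with cls._process_wide_key_lock:
                    if cls._process_wide_key is None:
                        cls._process_wide_key = cls()
            return cls._process_wide_key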