
move PerformanceEMA to utils, TrainingAverager to optim, update utils #405

Merged
justheuristic merged 4 commits into master from optimizer_utils_update on Nov 7, 2021

Conversation

justheuristic (Member) commented on Nov 7, 2021:

  • implement and test an async wrapper for ContextManager (used in DecentralizedAverager and ProgressTracker)
  • implement .reset_timer in PerformanceEMA (used when progress is reset, e.g. on an fp16 gradient overflow, which should not affect the samples-per-second estimate)
  • move PerformanceEMA to hivemind.utils (rationale: it will be used in hivemind.moe for @mryab's pipelining experiments)
  • move TrainingAverager to hivemind.optim (for compliance with hivemind.Optimizer and future deprecation in favour of TrainingStateAverager); the new import paths are sketched right after this list
  • fix process-wide RSA keys in the validator
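
As a quick reference, here is a sketch of the new import paths after the move (the module paths match the impacted-files list below; whether the classes are also re-exported from the parent packages is an assumption):

# new module locations after this PR (paths per the diff)
from hivemind.utils.performance_ema import PerformanceEMA
from hivemind.optim.training_averager import TrainingAverager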

The new async wrapper from the diff:

from contextlib import AbstractContextManager, asynccontextmanager

@asynccontextmanager
async def enter_asynchronously(context: AbstractContextManager):
    """Wrap a non-async context so that it can be entered asynchronously"""
    async with _AsyncContextWrapper(context) as ret_value:
        yield ret_value
justheuristic (Member, Author) commented on the diff:

note: we can't simply do

try:
    yield await loop.run_in_executor(None, context.__enter__)
finally:
    context.__exit__(None, None, None)

because this naive version does not correctly propagate exceptions into the inner context manager
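
For context, here is a minimal sketch of a wrapper that does preserve exception propagation, by running the blocking __enter__ in the executor and forwarding the exception details to __exit__; it illustrates the approach rather than the exact code merged in this PR:

import asyncio
from contextlib import AbstractAsyncContextManager, AbstractContextManager

class _AsyncContextWrapper(AbstractAsyncContextManager):
    """Run a blocking context manager's enter/exit without blocking the event loop."""

    def __init__(self, context: AbstractContextManager):
        self._context = context

    async def __aenter__(self):
        loop = asyncio.get_running_loop()
        # the potentially blocking __enter__ runs in the default executor
        return await loop.run_in_executor(None, self._context.__enter__)

    async def __aexit__(self, exc_type, exc_value, exc_tb):
        # exceptions raised inside the async block are passed on to the wrapped manager
        return self._context.__exit__(exc_type, exc_value, exc_tb)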

justheuristic requested a review from mryab on November 7, 2021 11:06
codecov bot commented on Nov 7, 2021:

Codecov Report

Merging #405 (1a0e20a) into master (7c4d13f) will decrease coverage by 0.23%.
The diff coverage is 92.00%.

@@            Coverage Diff             @@
##           master     #405      +/-   ##
==========================================
- Coverage   83.73%   83.50%   -0.24%     
==========================================
  Files          73       73              
  Lines        6678     6687       +9     
==========================================
- Hits         5592     5584       -8     
- Misses       1086     1103      +17     
Impacted Files Coverage Δ
hivemind/averaging/__init__.py 100.00% <ø> (ø)
hivemind/optim/training_averager.py 95.83% <ø> (ø)
hivemind/utils/performance_ema.py 79.48% <33.33%> (ø)
hivemind/__init__.py 100.00% <100.00%> (ø)
hivemind/averaging/averager.py 85.75% <100.00%> (-0.23%) ⬇️
hivemind/optim/__init__.py 100.00% <100.00%> (ø)
hivemind/optim/adaptive.py 77.77% <100.00%> (ø)
hivemind/optim/collaborative.py 23.80% <100.00%> (ø)
hivemind/optim/simple.py 81.42% <100.00%> (ø)
hivemind/utils/asyncio.py 99.01% <100.00%> (+0.14%) ⬆️
... and 2 more

@@ -453,7 +461,7 @@ async def _run_allreduce(self, group_info: GroupInfo, min_vector_size: int, **kw
None, load_balance_peers, self.total_size, download_bandwidths, min_vector_size
)

-        async with self.get_tensors_async() as local_tensors:
+        async with enter_asynchronously(self.get_tensors()) as local_tensors:
mryab (Member) commented:
This no longer seems to acquire lock_averaged_tensors (which should really be averaged_tensors_lock BTW), is this intended?

justheuristic (Member, Author) replied:

It does: the lock is acquired inside get_tensors itself.
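
For illustration, a hedged sketch of how a get_tensors context manager can take the lock on entry (the class and attribute layout below are a simplified reconstruction, not the merged averager code):

import threading
from contextlib import contextmanager

class AveragerSketch:
    def __init__(self, tensors):
        self._averaged_tensors = tensors
        self.lock_averaged_tensors = threading.Lock()

    @contextmanager
    def get_tensors(self):
        # the lock is taken when the context is entered, so wrapping this manager
        # with enter_asynchronously() still serializes access to the tensors
        with self.lock_averaged_tensors:
            yield self._averaged_tensors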

@@ -37,15 +37,19 @@ def update(self, task_size: float, interval: Optional[float] = None) -> float:
        self.samples_per_second = 1 / max(adjusted_seconds_per_sample, self.eps)
        return self.samples_per_second

+    def reset_timer(self):
mryab (Member) commented:

Right now it appears this method has only one usage, in the same class. Is it going to have more usages outside of the class? If not, maybe you can make it a private method or simply keep it inlined.

justheuristic (Member, Author) replied:

Yes, it will :)
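
For illustration, a hedged usage sketch of reset_timer in the fp16-overflow scenario from the PR description (the training hook and constructor arguments below are assumptions, not part of this diff):

from hivemind.utils.performance_ema import PerformanceEMA  # new location per this PR

ema = PerformanceEMA(alpha=0.1)

def on_training_step(batch_size: int, fp16_overflow: bool):
    if fp16_overflow:
        # the step was discarded, so it should not drag down the samples/sec estimate
        ema.reset_timer()
        return None
    return ema.update(task_size=batch_size)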

@@ -33,6 +33,9 @@ def event_loop():
def cleanup_children():
    yield

+    with RSAPrivateKey._process_wide_key_lock:
+        RSAPrivateKey._process_wide_key = None
mryab (Member) commented:

I didn't see this change mentioned in the description, and right now it seems to have no effect. Do we need it?

justheuristic (Member, Author) replied:

This has no effect right now, but it will if the existing tests run in a different order.

TL;DR, here is how it breaks things:

  • a test creates something that instantiates RSAPrivateKey.process_wide
  • all subsequent tests then reuse the same key
  • crucially, if you create two DHT instances with an RSA validator AFTER the private key was instantiated, everything breaks because they both inherit your key

Future PRs introduce tests that instantiate validators before the sensitive tests and break everything in bizarre ways.
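
A minimal sketch of the process-wide caching pattern that the fixture resets (the class below is a simplified stand-in, not hivemind's actual RSAPrivateKey):

import threading

class ProcessWideKeySketch:
    """Simplified stand-in for a key class with a process-wide cached instance."""

    _process_wide_key = None
    _process_wide_key_lock = threading.Lock()

    @classmethod
    def process_wide(cls):
        # the first caller creates the key; every later caller silently reuses it,
        # which is why the conftest fixture clears the cache between tests
        with cls._process_wide_key_lock:
            if cls._process_wide_key is None:
                cls._process_wide_key = cls()
            return cls._process_wide_key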

justheuristic changed the title from "[hivemind.Optimizer] update utilities, move modules around" to "update utils for hivemind.Optmizer; move PerformanceEMA to utils, move TrainingAverager to optim" on Nov 7, 2021
justheuristic changed the title to "move PerformanceEMA to utils, move TrainingAverager to optim, update utils" on Nov 7, 2021
justheuristic changed the title to "PerformanceEMA -> utils, TrainingAverager -> optim, update utils" on Nov 7, 2021
justheuristic changed the title to "move PerformanceEMA to utils, TrainingAverager to optim, update utils" on Nov 7, 2021
justheuristic merged commit ed42040 into master on Nov 7, 2021
justheuristic deleted the optimizer_utils_update branch on November 7, 2021 13:43