Refactor MPFuture to use a single pipe/thread per process #298

Merged
justheuristic merged 80 commits into master from a-better-future
Jul 3, 2021

Member

@justheuristic justheuristic commented Jun 30, 2021

TODO list:

  • implement the new MPFuture backend
    • MPFuture no longer spawns pipe + thread for each unique future
    • MPFuture is now a single object instead of a linked pair
    • MPFuture now uses the same exception types as a regular future (and as asyncio.Future in __await__)
    • MPFuture.result/exception can now only be awaited from the process that created it
    • MPFuture can no longer be used in disconnected processes (i.e. it only works in forked/forkserver-ed processes)
  • Actually remove finished MPFuture instances from ACTIVE_FUTURES once they are assigned a terminal state
  • update MPFuture tests
  • update MPFuture usage in averager, dht, server
  • remove hivemind.run_in_background and HIVEMIND_THREADS from core
  • remove hivemind.run_in_background and HIVEMIND_THREADS from examples
  • remove HIVEMIND_THREADS from readme
  • create a separate executor for compression
  • Extra tests:
    • send MPFuture over pipe
    • ensure MPFutures are deleted from ACTIVE_FUTURES
    • test add_done_callback
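
For readers unfamiliar with the idea, here is a minimal sketch of "one pipe + one thread per process" using only the standard library. It is not the hivemind implementation: `ResultBackend`, `square_in_child` and `demo` are hypothetical names, and a plain `concurrent.futures.Future` stands in for MPFuture. Every future created in a process registers under a UID, children write `(uid, is_error, payload)` into the single shared pipe, and one background thread dispatches updates and drops finished futures from the registry.

```python
import multiprocessing as mp
import threading
import uuid
from concurrent.futures import Future
from typing import Dict, List, Tuple


class ResultBackend:
    """Hypothetical helper: one result pipe and one receiver thread per process."""

    def __init__(self) -> None:
        self.receiver, self.sender = mp.Pipe(duplex=False)
        self.active_futures: Dict[str, Future] = {}
        self.lock = threading.Lock()
        self.thread = threading.Thread(target=self._receive_updates, daemon=True)
        self.thread.start()

    def create_future(self) -> Tuple[str, Future]:
        uid, future = uuid.uuid4().hex, Future()
        with self.lock:
            self.active_futures[uid] = future
        return uid, future

    def shutdown(self) -> None:
        self.sender.send(None)  # sentinel that stops the receiver thread
        self.thread.join()

    def _receive_updates(self) -> None:
        while True:
            update = self.receiver.recv()
            if update is None:
                return
            uid, is_error, payload = update
            with self.lock:
                # remove the future as soon as it reaches a terminal state
                future = self.active_futures.pop(uid, None)
            if future is None:
                continue
            if is_error:
                future.set_exception(payload)
            else:
                future.set_result(payload)


def square_in_child(sender, uid: str, x: int) -> None:
    # Runs in a forked child; all children share the same sender end.
    sender.send((uid, False, x * x))


def demo() -> Tuple[List[int], int]:
    backend = ResultBackend()
    ctx = mp.get_context("fork")  # POSIX-only start method
    futures, children = [], []
    for x in range(3):
        uid, future = backend.create_future()
        child = ctx.Process(target=square_in_child, args=(backend.sender, uid, x))
        child.start()
        futures.append(future)
        children.append(child)
    results = [future.result(timeout=10) for future in futures]
    for child in children:
        child.join()
    leftover = len(backend.active_futures)
    backend.shutdown()
    return results, leftover


if __name__ == "__main__":
    print(demo())
```

The sketch relies on messages being small enough that concurrent writers do not interleave; a real backend would likely need to guard the sender with a lock.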

MPFuture no longer uses SyncManager because SyncManager is not fault-tolerant: if one process fails (e.g. is terminated) while interacting with SyncManager, other processes that access the same object can hang at random.

Note 1:
Forking a process results in the child not having the parent's background threads:

image

Moreover, if one explicitly prints the background thread inside a forked process, it will be displayed as stopped (while in the parent process it is displayed as started).

image
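
The behaviour in Note 1 can be reproduced with a short standard-library script. This is only an illustrative sketch (the function names are made up here): the parent starts a background thread, forks, and the child reports whether it sees that thread as alive.

```python
import multiprocessing as mp
import threading


def _report_from_child(conn, thread) -> None:
    # Runs in the forked child: the Thread object was copied, the OS thread was not.
    conn.send((thread.is_alive(), threading.active_count()))
    conn.close()


def compare_parent_and_child():
    stop = threading.Event()
    thread = threading.Thread(target=stop.wait, daemon=True)
    thread.start()

    ctx = mp.get_context("fork")  # POSIX-only start method
    receiver, sender = ctx.Pipe(duplex=False)
    child = ctx.Process(target=_report_from_child, args=(sender, thread))
    child.start()
    child_alive, child_thread_count = receiver.recv()
    child.join()

    parent_alive = thread.is_alive()
    stop.set()
    thread.join()
    return parent_alive, child_alive, child_thread_count


if __name__ == "__main__":
    print(compare_parent_and_child())
```

This is why a per-process receiver thread has to be re-created in each forked process rather than inherited.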

Note 2:
Sending pipes over pipes is not free:
image

This is not due to lengthy serialization or large size, but due to the need to open new files on de-serialization.

In contrast, creating a shared value is much faster:
image

image
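
To make the "new files on de-serialization" point concrete, here is a small sketch (not a benchmark, and not taken from the PR) that sends one end of a pipe through another pipe inside a single process. It assumes a POSIX system where multiprocessing can pass file descriptors; the function name is hypothetical. The connection that comes out of `recv()` is backed by a different file descriptor than the one that went in, although both refer to the same underlying pipe.

```python
import multiprocessing as mp


def send_pipe_over_pipe():
    outer_receiver, outer_sender = mp.Pipe(duplex=False)
    inner_receiver, inner_sender = mp.Pipe(duplex=False)

    outer_sender.send(inner_sender)   # pickles a handle to the descriptor
    received = outer_receiver.recv()  # un-pickling opens a new descriptor

    original_fd, new_fd = inner_sender.fileno(), received.fileno()
    received.send("hello")            # still writes into the same inner pipe
    message = inner_receiver.recv()

    for conn in (received, inner_sender, inner_receiver, outer_sender, outer_receiver):
        conn.close()
    return original_fd, new_fd, message


if __name__ == "__main__":
    print(send_pipe_over_pipe())
```

With one pipe per future, this extra descriptor is opened every time a future crosses a process boundary, which is the cost the note above refers to.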

codecov bot commented Jun 30, 2021

Codecov Report

Merging #298 (cc113c9) into master (2e1bb9c) will increase coverage by 0.09%.
The diff coverage is 90.81%.

@@            Coverage Diff             @@
##           master     #298      +/-   ##
==========================================
+ Coverage   81.62%   81.72%   +0.09%     
==========================================
  Files          63       63              
  Lines        5813     5866      +53     
==========================================
+ Hits         4745     4794      +49     
- Misses       1068     1072       +4     
Impacted Files Coverage Δ
hivemind/hivemind_cli/run_server.py 0.00% <0.00%> (ø)
hivemind/utils/limits.py 25.00% <ø> (ø)
hivemind/server/task_pool.py 43.16% <28.57%> (+0.71%) ⬆️
hivemind/client/averaging/__init__.py 72.49% <80.00%> (ø)
hivemind/utils/mpfuture.py 94.68% <93.78%> (-0.71%) ⬇️
hivemind/client/averaging/training.py 62.50% <100.00%> (+0.33%) ⬆️
hivemind/dht/__init__.py 77.62% <100.00%> (ø)
hivemind/utils/__init__.py 100.00% <100.00%> (ø)
hivemind/utils/compression.py 96.18% <100.00%> (+0.05%) ⬆️
hivemind/client/averaging/key_manager.py 95.45% <0.00%> (-2.28%) ⬇️
... and 2 more

@borzunov borzunov self-requested a review June 30, 2021 13:14
@justheuristic justheuristic requested review from mryab and yhn112 June 30, 2021 21:11

from hivemind.utils.threading import run_in_background
import torch
Member Author

@justheuristic justheuristic Jun 30, 2021

Note on torch: it is indeed weird, but so far we're still not sure how else to implement a shared value for py3.7.

Options considered:

  • current version (with torch.empty)
  • mp.Value or mp.Event - cannot send to other processes (cannot serialize)
  • using multiprocessing.shared_memory - incompatible with py3.7 (and thus colab & kaggle kernels)
  • using _posixshmem (extra dependency to requirements.txt)
  • using mp.Pipe - back to where we started, will need an extra pipe per each future; too many open files
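
For context on what "shared value" means here, the sketch below uses the `multiprocessing.shared_memory` option from the list above. It needs Python 3.8+, which is exactly why it was ruled out for py3.7, so it is shown only to illustrate the idea and is not what the PR does. The function name is hypothetical. A second handle attaches to the block by name and writes one byte that the first handle then reads; only the short name string would need to travel to another process.

```python
from multiprocessing import shared_memory


def share_one_byte() -> int:
    owner = shared_memory.SharedMemory(create=True, size=1)
    try:
        owner.buf[0] = 0
        # Another process would attach the same way, given only owner.name.
        peer = shared_memory.SharedMemory(name=owner.name)
        try:
            peer.buf[0] = 1
            return owner.buf[0]
        finally:
            peer.close()
    finally:
        owner.close()
        owner.unlink()


if __name__ == "__main__":
    print(share_one_byte())
```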

Member

I guess that's alright for now, but please import the two methods/attributes explicitly (probably with a short comment that explains why they are needed, like "needed for python 3.7-compatible shared memory")

@justheuristic justheuristic changed the title A brighter MPFuture for hivemind Refactor MPFuture to use a single pipe per process Jul 3, 2021
@justheuristic justheuristic changed the title Refactor MPFuture to use a single pipe per process Refactor MPFuture to use a single pipe/thread per process Jul 3, 2021

Comment on lines +62 to +65
_global_sender_pipe: Optional[PipeEnd] = None # a pipe that is used to send results/exceptions to this process
_pipe_waiter_thread: Optional[threading.Thread] = None # process-specific thread that receives results/exceptions
_active_futures: Optional[Dict[UID, MPFuture]] = None # pending or running futures originated from current process
_active_pid: Optional[PID] = None # pid of currently active process; used to handle forks natively
Member

Is it really good practice to use None as a default value if it's never used, or even checked for?
At least for _active_futures, it seems really simple to have {} as a default.

Member

Though mutable defaults are generally frowned upon in function arguments, perhaps in this case it's OK, since we use the global class field anyway.

And for non-mutable ones, maybe just a type annotation would suffice (unless we need to explicitly check that it was not initialized).

Member Author

We have a similar pattern/issue in other parts of the repo; let's stick to None for now and discuss the broader issue at the next meeting.
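
One possible reading of the trade-off in this thread, as a hedged sketch: with None defaults plus a PID check, per-process state is created lazily and rebuilt after a fork, whereas a `{}` class default would be inherited by forked children along with the parent's entries. `FutureRegistry` and `_ensure_initialized` are hypothetical names and not the code under review.

```python
import os
from typing import Dict, Optional


class FutureRegistry:
    """Hypothetical stand-in for the class-level state discussed above."""

    _active_futures: Optional[Dict[int, object]] = None
    _active_pid: Optional[int] = None

    @classmethod
    def _ensure_initialized(cls) -> Dict[int, object]:
        pid = os.getpid()
        if cls._active_pid != pid:
            # First use in this process (or first use after a fork): start clean.
            cls._active_futures = {}
            cls._active_pid = pid
        return cls._active_futures
```

Calling `_ensure_initialized()` repeatedly in one process returns the same dict; once the recorded PID no longer matches `os.getpid()`, the next call returns a fresh one.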

except (BrokenPipeError, EOFError):
    logger.debug(f"MPFuture backend was shut down (pid={pid})")
except Exception as e:
    logger.exception(f"MPFuture: could not retrieve update: caught {repr(e)} (pid={pid})")
Member

You don't need to specify the class that the log entry refers to, since our logging includes class and method names — in essence, you're saying the same thing twice

Member Author

fixed
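
As an illustration of the reviewer's point, here is a sketch with the standard `logging` module. The format string and the logger name `demo.mpfuture` are assumptions made for this example; hivemind's actual logger configuration may differ. When the formatter already prints where a message came from, the message itself does not need an "MPFuture:" prefix.

```python
import io
import logging


def make_logger(stream) -> logging.Logger:
    # Hypothetical format: level, then logger name and calling function.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("[%(levelname)s] [%(name)s.%(funcName)s] %(message)s")
    )
    logger = logging.getLogger("demo.mpfuture")
    logger.handlers = [handler]
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


def receive_update(logger: logging.Logger, pid: int) -> None:
    # No class prefix in the message: the formatter adds the origin.
    logger.debug(f"backend was shut down (pid={pid})")


if __name__ == "__main__":
    buffer = io.StringIO()
    receive_update(make_logger(buffer), pid=123)
    print(buffer.getvalue(), end="")
```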

@justheuristic justheuristic merged commit 200fbec into master Jul 3, 2021
@justheuristic justheuristic deleted the a-better-future branch July 3, 2021 14:58
4 participants