v0.9 refactoring concerns #98

Closed · 16 of 39 tasks
justheuristic opened this issue Sep 8, 2020 · 1 comment

justheuristic (Member) commented Sep 8, 2020

As before (#64), we have a number of things we may want to do right before the v0.9 release:

  • RemoteMixtureOfExperts uses grid_size while the server side uses expert_pattern (e.g. ffn.[0:256].[0:256]); should we switch to expert_pattern everywhere? (@justheuristic)

  • Should we use __slots__ for data structures such as _IntermediateResult? (see the __slots__ sketch after this list)

  • Should we switch to declaring objects bottom-up vs top-down? For instance, _IntermediateResult is used before it is declared (@mryab)

  • rename theguyshetoldyounottoworryabout? (@mryab)

  • Can we get rid of the return_futures=True option in node.get_*?

  • rename _IntermediateResult to SearchState; rationale: it is not necessarily a result, and it also stores search data other than the result (@justheuristic)

    • await SearchState.future -> await SearchState? (see the awaitable-state sketch after this list)
    • SearchState.future.cancel() -> SearchState.finish_search()?
  • rename LocalStorage to TimedStorage (or similar)? Reason: it is not always used as the node's local storage (@justheuristic)

  • make alias DHTID.from(x) = DHTID.generate(source=x)? (@justheuristic )

  • why are we using umsgpack instead of just msgpack? (@justheuristic )

  • we force the same vector compression rules during the forward & backward pass (see "Vector compressing only for forward pass" #99) (@Vsevolod-pl)

  • update documentation on hivemind.dht: subkeys, beam_search, no first_k_active

  • update documentation on hivemind.server: compression

  • consider setting a maxsize on DHTValueType, e.g. 10k

  • similarly: consider enforcing a maxsize on any DHT record

  • rename hivemind.utils.grpc -> hivemind.utils.protobuf?

  • tests still raise a warning about a non-copyable ndarray in (de)serialize_torch_tensor

  • remove receiver_threads in DHT and DecentralizedAverager, hard-code 1 thread (@justheuristic)

  • update the grpc version, replace grpc.experimental.aio with grpc.aio (see "[Aio] Graduation from experimental folder" grpc/grpc#23240)

  • in hivemind.dht.DHT._get(endpoint) we're currently not caching channels at all. Should we?

  • Investigate the segfault in our CircleCI environment (@mryab, @justheuristic)

  • perhaps we should move all heavy class definitions out of __init__.py modules (@mryab)

  • Allow different tensor compression schemes for the backward pass? (see "Vector compressing only for forward pass" #99)

  • rewrite test_dht_protocol, test_empty_table and test_dht_node to use await instead of loop.run_until_complete (see the pytest-asyncio sketch after this list)

  • what will the server do if it is given an input with an incorrect requires_grad (i.e. one that does not match the schema)? (see the requires_grad sketch after this list)

    • it will actually ignore requires_grad completely and override it in expert_backend.backward.
    • Should we respect the schema's requires_grad in expert_backend.backward()?
  • proper channel caching: we could implement a centralized process-local channel cache (see the channel-cache sketch after this list)

  • since we're using gRPC for all requests, we may be able to share one port for everything

  • should we replace prints in tests/benchmarks with logger.info? Do we need as many prints in dht tests? (@mryab, see the comments in "added torch1.7 support, switch to grpc 1.33, grpc bump, improved tests & logging" #116)

  • currently test_utils is a package containing a single function used on just one occasion; should we just move that function to where it is used?

  • should we make our daemons resistant to KeyboardInterrupt when they are running in the background? (see the SIGINT sketch after this list)

    • problem symptom: you're running in Jupyter, you press Ctrl+C, and it kills all DHT daemons
    • applies to: DHT, DecentralizedAverager, Server
  • naming collision: hivemind.utils.grpc actually points to the external package, not hivemind/utils/grpc.py. This is due to wildcard imports (from hivemind.utils.grpc import *). Should we switch away from them? (see the __all__ sketch after this list)

  • DHT.get_my_endpoint() implemented by pinging k random peers

  • currently we have no limit on gRPC message size in ConnectionHandler; this is a potential vulnerability for open infrastructure. Should we impose an upper limit and send large tensors in chunks? (see the message-size sketch after this list)

  • TimedStorage: add .pop() method

  • DHT/DecentralizedAverager - rename return_future to something intuitive, e.g. sync=False

  • MPFuture: make sure await asyncio.create_task(mpfuture) works without extra wrappers

  • In DecentralizedAverager there is a lock_averaged_tensor, which is not an asyncio-friendly lock. If we acquire it for long, the averager will be paralyzed! We should run it in an executor or make a special mp+aio lock (see the executor-lock sketch after this list).
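
For the __slots__ item: a minimal sketch of a slotted state class (the field names here are hypothetical, not the actual _IntermediateResult attributes):

```python
class _IntermediateResult:
    # __slots__ replaces the per-instance __dict__ with fixed attribute slots,
    # cutting memory per object and turning typo'd attributes into errors.
    __slots__ = ("key_id", "future", "nearest_nodes")  # hypothetical fields

    def __init__(self, key_id, future, nearest_nodes):
        self.key_id = key_id
        self.future = future
        self.nearest_nodes = nearest_nodes
```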
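
For the awaitable-state sub-item: one way to let callers write await state instead of await state.future is to delegate __await__ to the internal future; finish_search below is a hypothetical wrapper around cancellation:

```python
import asyncio


class SearchState:
    """Sketch: awaitable search state that delegates to an internal future."""

    def __init__(self):
        self.future = asyncio.get_running_loop().create_future()

    def __await__(self):
        # `await state` now behaves exactly like `await state.future`
        return self.future.__await__()

    def finish_search(self):
        # hypothetical replacement for calling state.future.cancel() directly
        if not self.future.done():
            self.future.cancel()


async def demo():
    state = SearchState()
    state.future.set_result("nearest peers")
    print(await state)  # -> nearest peers


asyncio.run(demo())
```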
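
For the test-rewrite item, a pytest-asyncio sketch: assuming the pytest-asyncio plugin is acceptable as a test dependency, each test becomes a native coroutine and awaits directly; setup_node below is a hypothetical stand-in for the real fixtures:

```python
import pytest


async def setup_node():
    """Hypothetical stand-in for spawning a DHT node in the real tests."""
    class FakeNode:
        async def ping(self):
            return True
    return FakeNode()


# before: node = loop.run_until_complete(setup_node()), driven from a sync test
@pytest.mark.asyncio
async def test_dht_node():
    node = await setup_node()
    assert await node.ping() is True
```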
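
For the requires_grad question, a sketch of what "respect the schema" could look like: normalize incoming tensors against the declared flags before backward (everything except torch here is hypothetical):

```python
import torch


def enforce_schema_requires_grad(inputs, schema_requires_grad):
    """Sketch: force each incoming tensor's requires_grad to match the
    declared schema instead of trusting the flags the client sent."""
    return [
        tensor.detach().requires_grad_(needs_grad)
        for tensor, needs_grad in zip(inputs, schema_requires_grad)
    ]


batch = [torch.randn(4, 8), torch.randn(4, 8, requires_grad=True)]
fixed = enforce_schema_requires_grad(batch, [True, False])
assert fixed[0].requires_grad and not fixed[1].requires_grad
```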
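
For the channel-cache items: a minimal process-local cache keyed by endpoint, using grpc.aio (per the migration item above). A production version would also need eviction and fork-safety, which this sketch omits:

```python
import grpc

_channels: dict = {}  # endpoint -> cached channel, local to this process


def get_channel(endpoint: str) -> "grpc.aio.Channel":
    """Sketch: reuse one grpc.aio channel per endpoint instead of redialing."""
    channel = _channels.get(endpoint)
    if channel is None:
        channel = _channels[endpoint] = grpc.aio.insecure_channel(endpoint)
    return channel
```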
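
For the KeyboardInterrupt item, a SIGINT sketch: the daemon process ignores Ctrl+C, so an interrupt in the interactive parent (e.g. Jupyter) only stops the foreground, and the parent shuts the daemon down explicitly:

```python
import multiprocessing as mp
import signal
import time


def daemon_main():
    # Ignore Ctrl+C in the daemon; the parent remains responsible for
    # stopping it explicitly (terminate(), a shutdown pipe, etc.).
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while True:
        time.sleep(1)  # stand-in for the DHT / averager event loop


if __name__ == "__main__":
    proc = mp.Process(target=daemon_main, daemon=True)
    proc.start()
    # ... interactive work; Ctrl+C here no longer kills the daemon ...
    proc.terminate()
```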
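
For the naming-collision item: short of dropping wildcard imports entirely, an explicit __all__ in hivemind/utils/grpc.py would stop `from ... import *` from re-exporting the external grpc module (the exported names below are placeholders):

```python
# hivemind/utils/grpc.py (sketch)
import grpc  # without __all__, a wildcard import leaks this name as well

__all__ = ["serialize_torch_tensor", "deserialize_torch_tensor"]  # placeholders
```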
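
For the message-size item: gRPC servers accept per-channel options that cap inbound and outbound message sizes, so oversized requests get rejected up front. The 64 MiB figure is an arbitrary example, not a proposed default:

```python
from concurrent import futures

import grpc

MAX_MESSAGE_SIZE = 64 * 1024 * 1024  # example cap only

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
    ],
)
```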
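
For the last item, an executor-lock sketch: until a dedicated mp+aio lock exists, the blocking acquire can run in the default executor so the event loop keeps serving other coroutines while the averager waits:

```python
import asyncio
import multiprocessing as mp
from contextlib import asynccontextmanager

lock_averaged_tensor = mp.Lock()  # the process-shared, blocking lock


@asynccontextmanager
async def acquire_in_executor(lock):
    """Sketch: wait on a blocking mp.Lock in a worker thread so that the
    asyncio event loop is not paralyzed while the lock is contended."""
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, lock.acquire)
    try:
        yield
    finally:
        lock.release()  # mp.Lock may be released from any thread


async def demo():
    async with acquire_in_executor(lock_averaged_tensor):
        ...  # read or update the averaged tensors here


asyncio.run(demo())
```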

justheuristic (Member, Author) commented

Sifted through all the issues with @yhn112 @borzunov @mryab and split the remaining ones into:
