v0.9 refactoring concerns #98

Closed · 16 of 39 tasks
justheuristic opened this issue Sep 8, 2020 · 1 comment

justheuristic (Member) commented Sep 8, 2020

As before (#64), we have a number of things we may want to do right before the v0.9 release:

  • RemoteMixtureOfExperts uses grid_size while the server side uses expert_pattern (e.g. ffn.[0:256].[0:256]); should we switch to expert_pattern everywhere? (@justheuristic)

  • Should we use __slots__ for data structures such as _IntermediateResult? (see the __slots__ sketch after this list)

  • Should we switch to declaring objects bottom-up vs top-down? For instance, _IntermediateResult is used before it is declared (@mryab)

  • rename theguyshetoldyounottoworryabout? (@mryab)

  • Can we get rid of the return_futures=True option in node.get_*?

  • rename _IntermediateResult to SearchState; rationale: it is not necessarily a result, and it also stores search data other than the result (@justheuristic)

    • await SearchState.future -> await SearchState? (see the awaitable-state sketch after this list)
    • SearchState.future.cancel() -> SearchState.finish_search()?
  • rename LocalStorage to TimedStorage (or similar)? Reason: it is not always used as the node's local storage (@justheuristic)

  • make alias DHTID.from(x) = DHTID.generate(source=x)? (@justheuristic )

  • why are we using umsgpack instead of just msgpack? (@justheuristic )

  • we force the same vector compression rules during the forward & backward pass (see "Vector compressing only for forward pass" #99) (@Vsevolod-pl)

  • update documentation on hivemind.dht: subkeys, beam_search, no first_k_active

  • update documentation on hivemind.server: compression

  • consider setting a maxsize on DHTValueType, e.g. 10k

  • similarly: consider enforcing a maxsize on any DHT record

  • rename hivemind.utils.grpc -> hivemind.utils.protobuf?

  • tests still raise a warning about a non-copyable ndarray in (de)serialize_torch_tensor

  • remove receiver_threads in DHT and DecentralizedAverager, hard-code 1 thread (@justheuristic)

  • update the grpc version, replace grpc.experimental.aio with grpc.aio (see "[Aio] Graduation from experimental folder" grpc/grpc#23240)

  • in hivemind.dht.DHT._get(endpoint) we're currently not caching channels at all. Should we?

  • Investigate the segfault in our CircleCI environment (@mryab, @justheuristic)

  • perhaps we should move all heavy class definitions out of __init__.py modules (@mryab)

  • Allow different tensor compression schemes for the backward pass? (see "Vector compressing only for forward pass" #99)

  • rewrite test_dht_protocol, test_empty_table and test_dht_node to use await instead of loop.run_until_complete (see the pytest-asyncio sketch after this list)

  • what will the server do if it is given an input with an incorrect requires_grad (i.e. one that does not match the schema)? (see the requires_grad sketch after this list)

    • it will actually ignore requires_grad completely and override it in expert_backend.backward.
    • Should we respect the schema's requires_grad in expert_backend.backward()?
  • proper channel caching: we could implement a centralized process-local channel cache (see the channel-cache sketch after this list)

  • since we're using gRPC for all requests, we may be able to share one port for everything

  • should we replace prints in tests/benchmarks with logger.info? Do we need as many prints in dht tests? (@mryab, see the comments in "added torch1.7 support, switch to grpc 1.33, grpc bump, improved tests & logging" #116)

  • currently test_utils is a package containing a single function used on just one occasion; should we just move that function to where it is used?

  • should we make our daemons resistant to KeyboardInterrupt when they are running in the background? (see the SIGINT sketch after this list)

    • problem symptom: you're running in Jupyter, you press Ctrl+C, and it kills all DHT daemons
    • applies to: DHT, DecentralizedAverager, Server
  • naming collision: hivemind.utils.grpc actually points to the external package, not hivemind/utils/grpc.py. This is due to wildcard imports (from hivemind.utils.grpc import *). Should we switch away from them? (see the __all__ sketch after this list)

  • DHT.get_my_endpoint() implemented by pinging k random peers

  • currently we have no limit on gRPC message size in ConnectionHandler; this is a potential vulnerability for open infrastructure. Should we impose an upper limit and send large tensors in chunks? (see the message-size sketch after this list)

  • TimedStorage: add .pop() method

  • DHT/DecentralizedAverager - rename return_future to something intuitive, e.g. sync=False

  • MPFuture: make sure await asyncio.create_task(mpfuture) works without extra wrappers

  • In DecentralizedAverager there is a lock_averaged_tensor, which is not an asyncio-friendly lock. If we acquire it for long, the averager will be paralyzed! We should run it in an executor or make a special mp+aio lock (see the executor-lock sketch after this list).
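
For the __slots__ item: a minimal sketch of a slotted state class (the field names here are hypothetical, not the actual _IntermediateResult attributes):

```python
class _IntermediateResult:
    # __slots__ replaces the per-instance __dict__ with fixed attribute slots,
    # cutting memory per object and turning typo'd attributes into errors.
    __slots__ = ("key_id", "future", "nearest_nodes")  # hypothetical fields

    def __init__(self, key_id, future, nearest_nodes):
        self.key_id = key_id
        self.future = future
        self.nearest_nodes = nearest_nodes
```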
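
For the awaitable-state sub-item: one way to let callers write await state instead of await state.future is to delegate __await__ to the internal future; finish_search below is a hypothetical wrapper around cancellation:

```python
import asyncio


class SearchState:
    """Sketch: awaitable search state that delegates to an internal future."""

    def __init__(self):
        self.future = asyncio.get_running_loop().create_future()

    def __await__(self):
        # `await state` now behaves exactly like `await state.future`
        return self.future.__await__()

    def finish_search(self):
        # hypothetical replacement for calling state.future.cancel() directly
        if not self.future.done():
            self.future.cancel()


async def demo():
    state = SearchState()
    state.future.set_result("nearest peers")
    print(await state)  # -> nearest peers


asyncio.run(demo())
```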
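
For the test-rewrite item, a pytest-asyncio sketch: assuming the pytest-asyncio plugin is acceptable as a test dependency, each test becomes a native coroutine and awaits directly; setup_node below is a hypothetical stand-in for the real fixtures:

```python
import pytest


async def setup_node():
    """Hypothetical stand-in for spawning a DHT node in the real tests."""
    class FakeNode:
        async def ping(self):
            return True
    return FakeNode()


# before: node = loop.run_until_complete(setup_node()), driven from a sync test
@pytest.mark.asyncio
async def test_dht_node():
    node = await setup_node()
    assert await node.ping() is True
```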
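
For the requires_grad question, a sketch of what "respect the schema" could look like: normalize incoming tensors against the declared flags before backward (everything except torch here is hypothetical):

```python
import torch


def enforce_schema_requires_grad(inputs, schema_requires_grad):
    """Sketch: force each incoming tensor's requires_grad to match the
    declared schema instead of trusting the flags the client sent."""
    return [
        tensor.detach().requires_grad_(needs_grad)
        for tensor, needs_grad in zip(inputs, schema_requires_grad)
    ]


batch = [torch.randn(4, 8), torch.randn(4, 8, requires_grad=True)]
fixed = enforce_schema_requires_grad(batch, [True, False])
assert fixed[0].requires_grad and not fixed[1].requires_grad
```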
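
For the channel-cache items: a minimal process-local cache keyed by endpoint, using grpc.aio (per the migration item above). A production version would also need eviction and fork-safety, which this sketch omits:

```python
import grpc

_channels: dict = {}  # endpoint -> cached channel, local to this process


def get_channel(endpoint: str) -> "grpc.aio.Channel":
    """Sketch: reuse one grpc.aio channel per endpoint instead of redialing."""
    channel = _channels.get(endpoint)
    if channel is None:
        channel = _channels[endpoint] = grpc.aio.insecure_channel(endpoint)
    return channel
```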
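
For the KeyboardInterrupt item, a SIGINT sketch: the daemon process ignores Ctrl+C, so an interrupt in the interactive parent (e.g. Jupyter) only stops the foreground, and the parent shuts the daemon down explicitly:

```python
import multiprocessing as mp
import signal
import time


def daemon_main():
    # Ignore Ctrl+C in the daemon; the parent remains responsible for
    # stopping it explicitly (terminate(), a shutdown pipe, etc.).
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while True:
        time.sleep(1)  # stand-in for the DHT / averager event loop


if __name__ == "__main__":
    proc = mp.Process(target=daemon_main, daemon=True)
    proc.start()
    # ... interactive work; Ctrl+C here no longer kills the daemon ...
    proc.terminate()
```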
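
For the naming-collision item: short of dropping wildcard imports entirely, an explicit __all__ in hivemind/utils/grpc.py would stop `from ... import *` from re-exporting the external grpc module (the exported names below are placeholders):

```python
# hivemind/utils/grpc.py (sketch)
import grpc  # without __all__, a wildcard import leaks this name as well

__all__ = ["serialize_torch_tensor", "deserialize_torch_tensor"]  # placeholders
```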
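
For the message-size item: gRPC servers accept per-channel options that cap inbound and outbound message sizes, so oversized requests get rejected up front. The 64 MiB figure is an arbitrary example, not a proposed default:

```python
from concurrent import futures

import grpc

MAX_MESSAGE_SIZE = 64 * 1024 * 1024  # example cap only

server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        ("grpc.max_receive_message_length", MAX_MESSAGE_SIZE),
        ("grpc.max_send_message_length", MAX_MESSAGE_SIZE),
    ],
)
```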
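
For the last item, an executor-lock sketch: until a dedicated mp+aio lock exists, the blocking acquire can run in the default executor so the event loop keeps serving other coroutines while the averager waits:

```python
import asyncio
import multiprocessing as mp
from contextlib import asynccontextmanager

lock_averaged_tensor = mp.Lock()  # the process-shared, blocking lock


@asynccontextmanager
async def acquire_in_executor(lock):
    """Sketch: wait on a blocking mp.Lock in a worker thread so that the
    asyncio event loop is not paralyzed while the lock is contended."""
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, lock.acquire)
    try:
        yield
    finally:
        lock.release()  # mp.Lock may be released from any thread


async def demo():
    async with acquire_in_executor(lock_averaged_tensor):
        ...  # read or update the averaged tensors here


asyncio.run(demo())
```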

justheuristic (Member, Author) commented

Sifted through all the issues with @yhn112 @borzunov @mryab and split the remaining ones into:
