[this PR should be merged into libp2p branch]

Right now, the single heaviest communication pattern in hivemind is averaging model parameters with peers. This is currently done via AllreduceRunner in this RPC:
https://github.com/learning-at-home/hivemind/blob/master/hivemind/client/averaging/allreduce.py#L204

... called via this request method:
https://github.com/learning-at-home/hivemind/blob/master/hivemind/client/averaging/allreduce.py#L137

The main quest is to overhaul this method to work via P2P:

- create a test where two AllreduceRunners are manually created with two connected hivemind.P2P instances
- call rpc_aggregate_part in an allreduce pattern, similarly to this test
- test that it still works with large tensors (>10k values)

Since gRPC is not optimized for large messages, we slice these tensors into parts using split_for_streaming/combine_from_streaming. That said, P2P transport may not need this partitioning.
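For reference, the aggregation pattern that rpc_aggregate_part implements over the network can be sketched without any networking at all, using in-memory "peers". This is only an illustration of the butterfly allreduce idea (each peer averages one partition and shares it back), not hivemind's actual API; all names and shapes below are made up for the example.

```python
# Sketch of the butterfly all-reduce pattern, with in-memory "peers"
# standing in for networked AllreduceRunner instances.
import numpy as np

def butterfly_allreduce(local_tensors):
    """Average a list of equal-shape 1-d arrays, one per peer.

    Step 1: peer i gathers everyone's i-th partition and averages it.
    Step 2: peer i shares the averaged partition back to all peers.
    """
    n_peers = len(local_tensors)
    # each peer splits its tensor into n_peers contiguous partitions
    parts = [np.array_split(t, n_peers) for t in local_tensors]
    # step 1: peer i aggregates partition i from every peer
    averaged = [np.mean([parts[j][i] for j in range(n_peers)], axis=0)
                for i in range(n_peers)]
    # step 2: every peer reassembles the full averaged tensor
    return [np.concatenate(averaged) for _ in range(n_peers)]

# four peers holding constant tensors 0, 1, 2, 3 -> everyone ends with 1.5
peers = [np.full(10_000, fill_value=float(rank)) for rank in range(4)]
results = butterfly_allreduce(peers)
assert all(np.allclose(r, 1.5) for r in results)
```

A P2P version of the test would replace the in-memory gather/share steps with rpc_aggregate_part calls between the two connected hivemind.P2P instances, but the partition-average-reassemble structure stays the same.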
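The partitioning step can likewise be sketched generically: slice a large flat buffer into bounded chunks for streaming and reassemble it on the other side. The functions below are illustrative stand-ins for split_for_streaming/combine_from_streaming (hivemind's real helpers operate on serialized tensor messages, and the chunk size here is an arbitrary choice for the example).

```python
# Stand-ins for split_for_streaming / combine_from_streaming:
# slice a large tensor into bounded chunks and reassemble it.
import numpy as np

CHUNK_SIZE = 2 ** 16  # elements per chunk; illustrative only

def split_for_streaming_sketch(tensor, chunk_size=CHUNK_SIZE):
    """Yield contiguous chunks of the flattened tensor."""
    flat = tensor.ravel()
    for start in range(0, flat.size, chunk_size):
        yield flat[start:start + chunk_size]

def combine_from_streaming_sketch(chunks, shape):
    """Reassemble chunks into a tensor of the original shape."""
    return np.concatenate(list(chunks)).reshape(shape)

big = np.random.rand(300, 1000)  # 300k values, well over 10k
chunks = list(split_for_streaming_sketch(big))
restored = combine_from_streaming_sketch(chunks, big.shape)
assert restored.shape == big.shape and np.array_equal(restored, big)
```

If the P2P transport handles large messages natively, the split/combine round-trip above may simply become unnecessary, which is exactly what the large-tensor test should reveal.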