
Improve All-Reduce fault-tolerance #423

Merged · 61 commits · Dec 14, 2021

Conversation

@justheuristic (Member) commented Dec 12, 2021

  • allow AllreduceRunner to tolerate clients that
    • do not send some of their local tensors
    • do not show up at all after matchmaking is over
  • allow AllreduceRunner to tolerate full/aux peers that do not send some or all results
  • introduce timeout after which sender/reducer is considered failed
  • AllreduceRunner & DecentralizedAverager will no longer _send_error_to_peer
    • log spam is gone!
  • report allreduce integrity
    • TensorPartReducer will report the fraction of expected parts received if that fraction is not 1
    • TensorPartContainer will report the fraction of parts that did not fail if that fraction is not 1
  • miscellaneous improvements to Optimizer
    • set good default sender/reducer timeouts
    • pre-schedule state averaging ahead of time
    • no longer block the entire peer if it is time to pre-schedule gradients but background state averaging is still underway
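
A rough sketch of the timeout mechanism described above (class and method names here are illustrative, not the actual hivemind API): a watchdog marks every sender that has not delivered its part within sender_timeout as failed, so the reduction proceeds over the remaining peers instead of blocking forever.

    import asyncio

    class PartReducerSketch:
        def __init__(self, num_senders: int, sender_timeout: float):
            self.num_senders, self.sender_timeout = num_senders, sender_timeout
            self.parts_received, self.failed_senders = set(), set()

        def on_part_received(self, sender_index: int):
            self.parts_received.add(sender_index)

        def on_sender_failed(self, sender_index: int):
            # a failed sender is excluded from the set of expected contributors
            self.failed_senders.add(sender_index)

        async def watchdog(self):
            # after sender_timeout, treat every still-silent sender as failed
            await asyncio.sleep(self.sender_timeout)
            for i in range(self.num_senders):
                if i not in self.parts_received and i not in self.failed_senders:
                    self.on_sender_failed(i)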

Test cases:

  • test with peers that fail early
  • test with peers that fail to send a certain part
  • test with peers that fail to reduce their part
  • test cancelling

Sanity checks:

  • run tests 100 times
  • benchmark_optimizer
  • test env 64+ nodes 4+ hours

codecov bot commented Dec 12, 2021

Codecov Report

Merging #423 (a803aa8) into master (896885a) will increase coverage by 0.81%.
The diff coverage is 93.22%.

@@            Coverage Diff             @@
##           master     #423      +/-   ##
==========================================
+ Coverage   83.40%   84.22%   +0.81%     
==========================================
  Files          77       77              
  Lines        7809     7891      +82     
==========================================
+ Hits         6513     6646     +133     
+ Misses       1296     1245      -51     
Impacted Files Coverage Δ
hivemind/optim/grad_scaler.py 30.98% <0.00%> (+0.84%) ⬆️
hivemind/averaging/averager.py 87.65% <75.00%> (+2.10%) ⬆️
hivemind/averaging/allreduce.py 92.55% <92.47%> (+14.84%) ⬆️
hivemind/averaging/partition.py 98.87% <100.00%> (+0.85%) ⬆️
hivemind/optim/experimental/optimizer.py 62.53% <100.00%> (+0.67%) ⬆️
hivemind/utils/asyncio.py 100.00% <100.00%> (+0.98%) ⬆️
hivemind/optim/experimental/state_averager.py 86.19% <0.00%> (+0.24%) ⬆️
... and 5 more

@borzunov (Member) left a comment

Thanks for the PR! I've gone through everything but the tests and left my comments.

await queue.put(loop.run_in_executor(executor, func, *args))
await queue.put(None)
except BaseException as e:
await queue.put(e) # note: there is no chance that iterables
Member

Unfinished comment

Member Author

fixed

task.cancel()
task.cancel()
if task.done() and not task.cancelled():
task.exception()
Member

Did you mean raise task.exception()? If yes, it does not seem necessary since await task will already raise this exception.

Member Author

I specifically mean: if the task did not send a result or throw an exception but we died anyway, silence the "task ... was never retrieved" warning.

@borzunov (Member) commented Dec 13, 2021

Let's remove L136-137 because .cancel() already suppresses this message.

Indeed, the message is printed only when task.__log_traceback == True (source), and task.cancel() sets it to False (source) exactly as task.exception() does (source).
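
To illustrate the agreed-upon cleanup (a sketch, not the PR's exact code): cancelling each task is enough to silence the warning even for tasks that already failed, and a gather with return_exceptions=True consumes any remaining results or exceptions.

    import asyncio

    async def finalize(pending_tasks: set):
        for task in pending_tasks:
            task.cancel()  # also clears the "log traceback" flag, even if the task is already done
        # retrieve results/exceptions so nothing is reported as "never retrieved"
        await asyncio.gather(*pending_tasks, return_exceptions=True)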

self.current_senders -= 1
if self.current_part_accumulated_from == self.current_senders:
self.current_part_future.set_result(self.accumulator.div_(self.denominator))
self.reset_accumulators()
Member

Please extract these two lines into a function; this is an extremely error-prone code duplication.

For instance, imagine that someone writes a CenteredClip reducer and only changes L229-230 (not L239-240): since the latter are executed only during a failure, the unit tests are likely to miss that mistake.

Member Author

done
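
For reference, the extracted helper might look roughly like this (the method name is assumed; only the two duplicated lines come from the diff above):

    def check_current_part_finished(self):
        # single place that publishes the averaged part once every remaining sender has contributed
        if self.current_part_accumulated_from == self.current_senders:
            self.current_part_future.set_result(self.accumulator.div_(self.denominator))
            self.reset_accumulators()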

@@ -132,6 +148,8 @@ def should_delay_results(self, peer_id: PeerID) -> bool:
async def run(self) -> AsyncIterator[torch.Tensor]:
"""Run all-reduce, return differences between averaged and original tensors as they are computed"""
pending_tasks = set()
if self.sender_timeout is not None:
asyncio.create_task(self._handle_missing_senders())
Member

Such fire-and-forget calls always lead to "Task was destroyed but it is pending". Please fix that, e.g., save the task to a class field and await it when finalizing.

Member Author

Great catch: I meant to add it to pending_tasks (and I just did so).
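
In other words (a sketch of the fix, not verbatim): the watchdog task is kept in pending_tasks, so it is cancelled and awaited together with the other background tasks when the runner finalizes, instead of being garbage-collected while still pending.

    if self.sender_timeout is not None:
        # track the watchdog so finalization can cancel/await it later
        pending_tasks.add(asyncio.create_task(self._handle_missing_senders()))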

@@ -205,28 +262,32 @@ async def rpc_aggregate_part(
) -> AsyncIterator[averaging_pb2.AveragingData]:
"""a peer sends us a part of his tensor; we should average it with other peers and return the difference"""
request: averaging_pb2.AveragingData = await anext(stream)
Member

Please handle self.sender_timeout for this first part as well.

Member Author

done
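
One possible shape of that change (a sketch assuming asyncio.wait_for wraps the first read and that the unified _ban_sender helper mentioned below is used; the merged code may differ):

    try:
        request: averaging_pb2.AveragingData = await asyncio.wait_for(
            anext(stream), timeout=self.sender_timeout
        )
    except asyncio.TimeoutError:
        # a sender that never starts its stream is banned instead of stalling the RPC
        await self._ban_sender(context.remote_id)
        return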

async with self.banlock:
if context.remote_id not in self.banned_senders:
self.banned_senders.add(context.remote_id)
self.tensor_part_reducer.on_sender_failed(sender_index)
Member

Please fix this code duplication (L305-308, L314-317, L352-355); it is likely to lead to a bug if someone decides to change the ban procedure but changes only 1-2 of the 3 copies.

Member Author

replaced with a unified await self._ban_sender(peer_id) call
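
For context, the unified helper presumably looks something like this (a sketch; the exact signature and the sender_peer_ids lookup are assumptions, not verbatim from the PR):

    async def _ban_sender(self, peer_id: PeerID) -> None:
        async with self.banlock:
            if peer_id not in self.banned_senders:
                self.banned_senders.add(peer_id)
                # map the peer back to its sender index before notifying the reducer
                self.tensor_part_reducer.on_sender_failed(self.sender_peer_ids.index(peer_id))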

@borzunov borzunov changed the title Allreduce Fault Tolerance Improve All-Reduce fault-tolerance Dec 13, 2021
@justheuristic justheuristic merged commit 6da8683 into master Dec 14, 2021
@justheuristic justheuristic deleted the fault_tolerant_allreduce branch December 14, 2021 09:05