Add version-consistent result rounding to load_balance_peers #230

justheuristic · 2021-04-18T01:24:31Z

Problem:
In the current version, calling load_balance_peers can trigger a nasty bug: two peers with different numpy/scipy versions (or even different builds of the same version, such as mkl/nomkl) can make incompatible decisions on how to round load balancing outputs.

Here's an example from my local laptop (numpy+atlas)

Here's a colab instance with the same version of numpy+scipy, but different build and python version

When training collaboratively, all AWS peers would split vector parts as:
(4461869, 4461869, 4461868, 4461868, 0, 0, 0, 0)

... while some desktop peers decided:
(4461869, 4461868, 4461868, 4461869, 0, 0, 0, 0)

As a result, AllReduce failed with INTERNAL_ERROR due to incompatible part sizes.

Solution:

rounded load balancing results to a precision of 10^-9
hard-coded the solver method in case scipy ever changes the default solver (it has done so in the past)
added a callback that lets the leader solve linprog for the group
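To illustrate why the rounding step helps, here is a minimal sketch in pure Python. It is not the code from this PR: `split_parts` is a hypothetical helper, the noise values are made up for illustration, and it assumes the fractional shares sum to 1 after rounding. Two peers receive solver outputs that differ only by floating-point noise; rounding to 9 decimals before converting to integer part sizes makes both arrive at the same split.

```python
from typing import Sequence, Tuple


def split_parts(total_size: int, fractions: Sequence[float], decimals: int = 9) -> Tuple[int, ...]:
    """Hypothetical helper (not hivemind's API): convert fractional shares to integer part sizes."""
    # Round away build-dependent solver noise so every peer starts from identical inputs.
    rounded = [round(f, decimals) for f in fractions]
    exact = [total_size * f for f in rounded]
    sizes = [int(x) for x in exact]  # truncate to whole elements
    remainder = total_size - sum(sizes)
    if not 0 <= remainder <= len(sizes):
        raise ValueError("fractions are assumed to sum to 1 after rounding")
    # Give leftover elements to the largest fractional remainders; break ties by peer index.
    order = sorted(range(len(exact)), key=lambda i: (-(exact[i] - sizes[i]), i))
    for i in order[:remainder]:
        sizes[i] += 1
    return tuple(sizes)


total = 17847474  # 4461869 + 4461869 + 4461868 + 4461868, as in the report above

# Illustrative solver outputs from two builds: equal up to noise far below 1e-9.
peer_a = [0.25 + 1e-13, 0.25 - 1e-13, 0.25, 0.25, 0.0, 0.0, 0.0, 0.0]
peer_b = [0.25 - 1e-13, 0.25, 0.25, 0.25 + 1e-13, -1e-15, 0.0, 0.0, 0.0]

parts_a = split_parts(total, peer_a)
parts_b = split_parts(total, peer_b)
print(parts_a == parts_b, parts_a)
```

Without the `round` call, the noise alone would decide which peers get the two leftover elements, so the two builds could disagree on part sizes exactly as described in the problem statement.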


justheuristic added 2 commits April 18, 2021 04:17

rount floating point errors

aa4ba47

rount floating point errors

027c2c8

justheuristic changed the title ~~Ensure stable rounding behavior of load_balance_peers~~ Version-consistent rounding in load_balance_peers Apr 18, 2021

Merge branch 'master' into stable_load_balancing

f387e4e

justheuristic marked this pull request as draft April 18, 2021 01:47

justheuristic closed this Apr 18, 2021

justheuristic reopened this Apr 18, 2021

justheuristic linked an issue Apr 18, 2021 that may be closed by this pull request

[found in YSDA run2] load_balance_peers is not version-consistent #231

Closed

justheuristic marked this pull request as ready for review April 18, 2021 23:58

Merge branch 'master' into stable_load_balancing

772881a

justheuristic requested a review from mryab April 19, 2021 00:01

justheuristic added 3 commits April 19, 2021 03:10

enforce LP dtype

d58f4ff

Merge remote-tracking branch 'origin/stable_load_balancing' into stable_load_balancing

8efc232

redundant dtypes

7d81ab6

mryab changed the title ~~Version-consistent rounding in load_balance_peers~~ Add version-consistent result rounding to load_balance_peers Apr 19, 2021

mryab and others added 2 commits April 19, 2021 03:14

review

7bb2034

kwarg

085b563

mryab approved these changes Apr 19, 2021

justheuristic merged commit 3d6a242 into master Apr 19, 2021

justheuristic deleted the stable_load_balancing branch April 19, 2021 00:26
