Add version-consistent result rounding to load_balance_peers #230

Merged
merged 9 commits into from
Apr 19, 2021


justheuristic
Member

@justheuristic justheuristic commented Apr 18, 2021

Problem:
In the current version, calling load_balance_peers can trigger a nasty bug: two peers with different numpy/scipy versions (or even different builds of the same version, such as mkl/nomkl) can make incompatible decisions on how to round load balancing outputs.

Here's an example from my local laptop (numpy+atlas):
image

Here's a Colab instance with the same versions of numpy and scipy, but a different build and Python version:
image

When training collaboratively, all AWS peers would split vector parts as:
(4461869, 4461869, 4461868, 4461868, 0, 0, 0, 0)

... while some desktop peers decided:
(4461869, 4461868, 4461868, 4461869, 0, 0, 0, 0)

As a result, AllReduce failed with INTERNAL_ERROR due to incompatible part sizes.
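
To illustrate how such a mismatch can arise, here is a minimal sketch. It is not the actual load_balance_peers code: `split_into_parts` is a hypothetical largest-remainder split, the total size is simply the sum of the part sizes reported above, and the 1e-12 perturbation is an assumed stand-in for build-dependent floating-point noise in the solver output.

```python
import math

def split_into_parts(total_size, fractions):
    """Hypothetical largest-remainder split (an assumption, not the hivemind routine)."""
    exact = [total_size * f for f in fractions]
    parts = [math.floor(x) for x in exact]
    leftover = total_size - sum(parts)
    # Hand the leftover elements to the peers with the largest fractional
    # remainders; exact ties are broken by peer index.
    order = sorted(range(len(fractions)), key=lambda i: (-(exact[i] - parts[i]), i))
    for i in order[:leftover]:
        parts[i] += 1
    return tuple(parts)

TOTAL = 4461869 + 4461869 + 4461868 + 4461868  # sum of the part sizes above
noise = 1e-12  # assumed magnitude of build-dependent solver noise

# Two peers obtain "the same" solution, but the noise lands on different entries.
peer_a = [0.25 + noise, 0.25 + noise, 0.25 - noise, 0.25 - noise, 0.0, 0.0, 0.0, 0.0]
peer_b = [0.25 + noise, 0.25 - noise, 0.25 - noise, 0.25 + noise, 0.0, 0.0, 0.0, 0.0]

parts_a = split_into_parts(TOTAL, peer_a)
parts_b = split_into_parts(TOTAL, peer_b)
print(parts_a)
print(parts_b)
```

Under these assumptions, both peers assign the same total number of elements, but they disagree on which peers receive the two extra elements, which is the kind of disagreement that makes part sizes incompatible.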

Solution:

  • round load balancing results to a precision of 10^-9
  • hard-code the solver method in case scipy ever changes the default solver (it has done so in the past)
  • add a callback that lets the group leader solve linprog for the whole group
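
The rounding step can be sketched as follows. This is a minimal illustration under assumptions, not the code from this PR: `round_fractions` is a hypothetical helper, and the 1e-12 perturbation stands in for build-dependent noise in the solver output.

```python
def round_fractions(fractions, decimals=9):
    """Round solver output so that noise below 10^-decimals cannot affect later decisions.

    Hypothetical helper for illustration only.
    """
    return [round(float(f), decimals) for f in fractions]

noise = 1e-12  # assumed magnitude of build-dependent solver noise

# Two peers receive slightly different floating-point solutions.
peer_a = [0.25 + noise, 0.25 + noise, 0.25 - noise, 0.25 - noise, 0.0, 0.0, 0.0, 0.0]
peer_b = [0.25 + noise, 0.25 - noise, 0.25 - noise, 0.25 + noise, 0.0, 0.0, 0.0, 0.0]

stable_a = round_fractions(peer_a)
stable_b = round_fractions(peer_b)
print(stable_a == stable_b)
```

After rounding, both peers hold identical fractions, so any deterministic integer split applied afterwards yields the same part sizes on every peer. Pinning the solver is complementary: scipy.optimize.linprog accepts a `method` argument, so passing it explicitly avoids depending on the library default.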

@justheuristic justheuristic changed the title Ensure stable rounding behavior of load_balance_peers Version-consistent rounding in load_balance_peers Apr 18, 2021
@justheuristic justheuristic marked this pull request as draft April 18, 2021 01:47
@justheuristic justheuristic reopened this Apr 18, 2021
@justheuristic justheuristic linked an issue Apr 18, 2021 that may be closed by this pull request
@justheuristic justheuristic marked this pull request as ready for review April 18, 2021 23:58
@justheuristic justheuristic requested a review from mryab April 19, 2021 00:01
@mryab mryab changed the title Version-consistent rounding in load_balance_peers Add version-consistent result rounding to load_balance_peers Apr 19, 2021
@justheuristic justheuristic merged commit 3d6a242 into master Apr 19, 2021
@justheuristic justheuristic deleted the stable_load_balancing branch April 19, 2021 00:26
Development

Successfully merging this pull request may close these issues.

[found in YSDA run2] load_balance_peers is not version-consistent
2 participants