Fix deadlocks in DecentralizedAverager and MPFuture #331
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   82.25%   82.22%   -0.03%
==========================================
  Files          66       66
  Lines        5979     5987       +8
==========================================
+ Hits         4918     4923       +5
- Misses       1061     1064       +3
```
```diff
@@ -5,6 +5,7 @@
 import pytest

+from hivemind.utils import get_logger
 from hivemind.utils.mpfuture import MPFuture
```
Import from utils as well?
In general, I'd weakly argue for avoiding importing from `hivemind.utils` directly, to avoid having one module with a lot of mixed-purpose stuff and the corresponding long import line. To follow this, I've changed `from hivemind.utils import get_logger` above to `from hivemind.utils.logging import get_logger`.
This PR does the following:
Fix a bug in `DecentralizedAverager.rpc_join_group()`. Previously, if a peer didn't manage to update its group key in the DHT in time (or didn't manage to propagate the updates), another peer could have chosen it as a leader based on the group key from the previous stage. As a result, groups were assembled incorrectly, which led to occasional deadlocks in tests (where the intervals between sequential averager steps are short).
The fix forces leaders to compare followers' group keys before allowing them to join.
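For illustration, here is a minimal sketch of that leader-side check. The names (`JoinRequest`, `group_key`, `leader_accepts`) are hypothetical and do not mirror hivemind's actual protobuf schema or averager internals; they only demonstrate the comparison logic:

```python
from dataclasses import dataclass


@dataclass
class JoinRequest:
    """Hypothetical stand-in for the join request a follower sends to a leader."""
    peer_id: str
    group_key: str  # the group key the follower currently advertises


def leader_accepts(request: JoinRequest, leader_group_key: str) -> bool:
    """Leader-side check: only admit followers whose group key matches the leader's,
    so a peer holding a stale key from the previous stage cannot join the group."""
    return request.group_key == leader_group_key


# A follower that refreshed its key in time is admitted; a stale one is rejected.
assert leader_accepts(JoinRequest("peerA", "step_7"), leader_group_key="step_7")
assert not leader_accepts(JoinRequest("peerB", "step_6"), leader_group_key="step_7")
```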
Fix a bug with MPFuture state after killing child processes. Many tests don't perform graceful shutdowns, so we use a fixture that kills leftover child processes (if they don't respond to `terminate()` within 1 second). Such kills may leave the global MPFuture state broken (e.g. with global locks still acquired and the background thread blocked). The fix resets the MPFuture backend (including its global state) after killing the child processes between tests.
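A hedged sketch of such a cleanup fixture is shown below. `reset_mpfuture_backend()` is a hypothetical placeholder for the reset hook this PR adds to MPFuture; the `psutil` and `pytest` calls are real APIs:

```python
import psutil
import pytest


def reset_mpfuture_backend():
    """Hypothetical placeholder for the MPFuture reset added by this PR:
    release the shared locks and restart the background communication thread."""
    ...


@pytest.fixture(autouse=True)
def cleanup_children():
    yield  # run the test body first
    for child in psutil.Process().children(recursive=True):
        try:
            child.terminate()
            child.wait(timeout=1)  # give the child 1 second to exit gracefully
        except psutil.TimeoutExpired:
            child.kill()  # force-kill processes that ignored terminate()
        except psutil.NoSuchProcess:
            pass
    # Killing children can leave MPFuture's global state broken (locks held,
    # background thread blocked), so reset it before the next test runs.
    reset_mpfuture_backend()
```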
Add the `-v` flag to pytest in CI. Without this flag, we had no way to determine which particular test caused a deadlock, since pytest used one output line per file (and GitHub CI erases the last output line when the timeout is exceeded). The fix makes pytest use one output line per test.