Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix deadlocks in DecentralizedAverager and MPFuture #331

Merged
merged 7 commits into from
Jul 22, 2021

Conversation

borzunov
Copy link
Member

@borzunov borzunov commented Jul 21, 2021

This PR does the following:

  1. Fix a bug in DecentralizedAverager.rpc_join_group().

    Previously, if a peer didn't manage to update its group key in the DHT in time (or didn't manage to propagate the updates), another peer could have chosen it as a leader based on the group key from the previous stage. The groups have been assembling incorrectly, which led to occasional deadlocks in tests (where the time periods between sequential averager steps are low).

    The fix forces leaders to compare followers' group keys before allowing them to join.

  2. Fix a bug with MPFuture state after killing child processes.

    Many tests don't perform graceful shutdowns, so we are using a fixture that kills leftover child processes (if they don't respond to terminate() in 1 sec). Such killings may leave the global MPFuture state broken (e.g. the global locks acquired and the background thread blocked).

    The fix resets MPFuture backend (including its global state) after killing the child processes between tests.

  3. Add -v flag to pytest in CI.

    Without this flag, we had no opportunity to determine a particular test that caused a deadlock since pytest used one output line per file (and Github CI erases the last output line when timeout was exceeded).

    The fix makes pytest to use one output line per test.

@codecov
Copy link

codecov bot commented Jul 21, 2021

Codecov Report

Merging #331 (7c49501) into master (407c201) will decrease coverage by 0.02%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##           master     #331      +/-   ##
==========================================
- Coverage   82.25%   82.22%   -0.03%     
==========================================
  Files          66       66              
  Lines        5979     5987       +8     
==========================================
+ Hits         4918     4923       +5     
- Misses       1061     1064       +3     
Impacted Files Coverage Δ
hivemind/averaging/matchmaking.py 76.75% <50.00%> (-0.17%) ⬇️
hivemind/utils/mpfuture.py 89.66% <85.71%> (-0.17%) ⬇️
hivemind/dht/node.py 92.02% <0.00%> (-1.21%) ⬇️
hivemind/p2p/p2p_daemon.py 95.00% <0.00%> (+0.35%) ⬆️
hivemind/utils/timed_storage.py 96.93% <0.00%> (+1.02%) ⬆️
hivemind/averaging/key_manager.py 97.72% <0.00%> (+2.27%) ⬆️

@borzunov borzunov requested a review from justheuristic July 21, 2021 23:07
@borzunov borzunov marked this pull request as ready for review July 21, 2021 23:11
@borzunov borzunov requested a review from mryab July 21, 2021 23:14
@@ -5,6 +5,7 @@
import pytest

from hivemind.utils import get_logger
from hivemind.utils.mpfuture import MPFuture
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import from utils as well?

Copy link
Member Author

@borzunov borzunov Jul 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I'd weakly argue for avoiding importing from hivemind.utils directly to avoid having one module with a lot of mixed-purpose stuff and the corresponding long import line.

To follow this, I've changed from hivemind.utils import get_logger above to from hivemind.utils.logging import get_logger.

@borzunov borzunov merged commit 0d67284 into master Jul 22, 2021
@borzunov borzunov deleted the ensure-equal-group-keys branch July 22, 2021 12:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants