
Set default DHT num_workers = 4 #342

Merged
merged 10 commits into from
Jul 31, 2021

Conversation

borzunov
Member

@borzunov borzunov commented Jul 28, 2021

This change seems to speed up (a) DHT get requests by 3.6x and (b) DHT creation by 1.2x (probably because communication with the initial nodes also becomes faster).

benchmark_dht.py

nora, this PR, max_workers = 8:

[screenshot: benchmark_dht.py results]

nora, master (fb48133, see #318), max_workers = 1:

[screenshot: benchmark_dht.py results]

Co-authored-by: justheuristic <[email protected]>
Co-authored-by: Denis Mazur <[email protected]>
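For intuition about what num_workers controls, here is a minimal, self-contained sketch of capping in-flight requests with an asyncio.Semaphore. query_peer, get_many's signature, and the timing are illustrative stand-ins, not hivemind APIs; the actual DHT traversal logic is more involved.

```python
import asyncio

async def query_peer(peer_id: int) -> int:
    # Stand-in for one DHT RPC: sleep briefly, then echo the peer id.
    await asyncio.sleep(0.01)
    return peer_id

async def get_many(peer_ids, num_workers: int) -> list:
    # Allow at most num_workers queries to be in flight at any moment.
    semaphore = asyncio.Semaphore(num_workers)

    async def bounded_query(pid: int) -> int:
        async with semaphore:
            return await query_peer(pid)

    # gather preserves argument order, so results line up with peer_ids.
    return await asyncio.gather(*(bounded_query(pid) for pid in peer_ids))

results = asyncio.run(get_many(range(8), num_workers=4))
```

With num_workers=4, the eight queries complete in roughly two waves instead of eight sequential round-trips, which is the kind of effect the benchmarks above measure.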
@codecov

codecov bot commented Jul 28, 2021

Codecov Report

Merging #342 (10ad6e8) into master (bedfa6e) will decrease coverage by 0.14%.
The diff coverage is 88.88%.

@@            Coverage Diff             @@
##           master     #342      +/-   ##
==========================================
- Coverage   83.27%   83.13%   -0.15%     
==========================================
  Files          66       66              
  Lines        5974     5976       +2     
==========================================
- Hits         4975     4968       -7     
- Misses        999     1008       +9     
Impacted Files                          Coverage          Δ
hivemind/moe/client/beam_search.py      69.49% <66.66%>   (ø)
hivemind/dht/__init__.py                90.14% <100.00%>  (ø)
hivemind/dht/node.py                    93.26% <100.00%>  (+0.03%) ⬆️
hivemind/moe/server/dht_handler.py      98.03% <100.00%>  (ø)
hivemind/averaging/matchmaking.py       79.87% <0.00%>    (-3.01%) ⬇️
hivemind/dht/traverse.py                75.39% <0.00%>    (+0.79%) ⬆️

@dvmazur
Collaborator

dvmazur commented Jul 29, 2021

WARNING: this is not master code
Here are the nora DHT benchmarks for max_workers=1, 2, 4, 8, 16

[screenshots: DHT benchmark results for max_workers = 1, 2, 4, 8, 16]

The optimal value for max_workers seems to be 8.

@dvmazur
Collaborator

dvmazur commented Jul 29, 2021

Aside from the wrong branch, I've noticed one more problem with the previous results: the benchmark actually performed all operations in batches and used min(len(uids), dht.max_workers) as the default number of workers for both get and store. To make the benchmarks more realistic, I've changed get to run with batch size = 1 and to use exactly the specified number of workers.

[screenshots: DHT benchmark results for max_workers = 1, 2, 4, 8, 16]

For reference, this is how I modified get_experts:

async def _get_experts(
    dht: DHT, node: DHTNode, uids: List[ExpertUID], expiration_time: Optional[DHTExpiration]
) -> List[Optional[RemoteExpert]]:
    if expiration_time is None:
        expiration_time = get_dht_time()
    # previous default:
    # num_workers = len(uids) if dht.max_workers is None else min(len(uids), dht.max_workers)
    num_workers = dht.max_workers
    found: Dict[ExpertUID, DHTValue] = await node.get_many(
        uids, expiration_time, num_workers=num_workers, sufficient_expiration_time=float('inf')
    )

    experts: List[Optional[RemoteExpert]] = [None] * len(uids)
    for i, uid in enumerate(uids):
        if found[uid] is not None and isinstance(found[uid].value, Endpoint):
            experts[i] = RemoteExpert(uid, found[uid].value)
    return experts
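The old default (the commented-out line above) can be reproduced in isolation to show why the batched benchmark inflated the effective worker count. default_num_workers below is an illustrative helper, not a hivemind function:

```python
from typing import Optional

def default_num_workers(num_uids: int, max_workers: Optional[int]) -> int:
    # Old behavior: scale the worker count with the batch size,
    # capped by dht.max_workers when it is set.
    return num_uids if max_workers is None else min(num_uids, max_workers)

# A batched call with 16 uids quietly ran 8 workers...
assert default_num_workers(16, 8) == 8
# ...while a single get (batch size 1) always collapsed to one worker,
# regardless of the max_workers setting under test.
assert default_num_workers(1, 8) == 1
```

This is why passing dht.max_workers directly, as in the modified _get_experts, gives a cleaner comparison across worker counts.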

@dvmazur
Collaborator

dvmazur commented Jul 29, 2021

Here are the same benchmarks with expiration_time=float('inf') for get_experts:

[screenshots: DHT benchmark results for max_workers = 1, 4, 8]

@borzunov
Member Author

borzunov commented Jul 30, 2021

@deniskamazur Thanks for the measurements! Can I ask you to:

  • Remove the comment with the measurements from the wrong branch to avoid confusion.

  • Run benchmarks for num_workers = 4, 5, 6, ..., 16 to find the optimal value (they can run in the background while you're busy with other tasks). A Bash loop may be helpful for that:

for i in {4..16}; do pkill -f p2pd && HIVEMIND_DHT_NUM_WORKERS="$i" python benchmark_dht.py <params>; done
  • Format the results (store/get time) as a table or a plot (example) because they are hard to compare across screenshots.
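The loop above configures each run through the HIVEMIND_DHT_NUM_WORKERS environment variable. How benchmark_dht.py consumes it is not shown here, but a plausible minimal reader, assuming the variable holds an integer and falling back to the default of 4 that this PR settles on, looks like:

```python
import os

def num_workers_from_env(default: int = 4) -> int:
    # Read the worker count from the environment, falling back to the
    # given default when the variable is unset.
    return int(os.environ.get("HIVEMIND_DHT_NUM_WORKERS", default))
```
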

Already discussed with @deniskamazur
Collaborator

@dvmazur dvmazur left a comment


😎

@borzunov borzunov changed the title Set default DHTNode.num_workers = 8 Set default DHTNode.num_workers = 4 Jul 30, 2021
@borzunov borzunov changed the title Set default DHTNode.num_workers = 4 Set default DHT num_workers = 4 Jul 30, 2021
@borzunov borzunov merged commit 0b3dcf4 into master Jul 31, 2021
@borzunov borzunov deleted the num-workers branch July 31, 2021 20:11
4 participants