Performance: Allowed forced inline update_index (no thread pool) #31455
Conversation
Biggest outstanding concerns w/ this change:
Codecov Report
@@            Coverage Diff            @@
##           master   #31455     +/-   ##
=========================================
- Coverage    81.4%    81.4%    -0.1%
=========================================
  Files         731      731
  Lines      208741   208757     +16
=========================================
+ Hits       170028   170029      +1
- Misses      38713    38728     +15
Could always feature gate if necessary to avoid nodes without the change being unable to keep up.
runtime/src/accounts_db.rs (outdated)
@@ -8007,7 +8010,7 @@ impl AccountsDb {
            });
            reclaims
        };
-       if len > threshold {
+       if !force_inline_update_index && len > threshold {
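For reference, a minimal sketch of the dispatch pattern this diff implements: stay on the calling thread when inline execution is forced or the batch is small, otherwise fan out to rayon. The names here (update_one_entry, the u64 item type) are placeholders, not the actual accounts_db.rs types.

```rust
use rayon::prelude::*;

/// Placeholder for the real per-pubkey index update work.
fn update_one_entry(item: &u64) {
    // ... insert/update the account index entry for `item` ...
    let _ = item;
}

/// Sketch of the dispatch pattern: no thread handoff when the caller forces
/// inline execution or the batch is below the threshold; otherwise hand the
/// batch to the global rayon pool.
fn update_index(items: &[u64], force_inline_update_index: bool, threshold: usize) {
    if force_inline_update_index || items.len() <= threshold {
        items.iter().for_each(update_one_entry);
    } else {
        items.par_iter().for_each(update_one_entry);
    }
}
```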
way better proper fix than mine, lol: ryoqun@d64184f#diff-1090394420d51617f3233275c2b65ed706b35b53b115fe65f82c682af8134a6fL7582
This is fine with me. We have tweaked this at least once previously, and the dynamics of perf in master on the disk index have also changed. Since most index entries should be in the in-memory index during tx stores, all accesses should be fast. But they are also using independent locks (in theory). I imagine the thread switching to the par pool is costing us more than we're saving.
Force-pushed from bc4916f to 7e14c32 (Compare)
This still seems to be a good change - in all my testing thus far it has either had no effect or improved throughput. As a final sanity check, I want to start up 2 validators w/ equivalent hardware, settings, snapshots, etc. and check replay-time stats w/ and w/o the patch.
datapoint for improvement in a benchmark (a bit easier to reproduce compared to bench-tps, but essentially the same thing): #31625 (comment)
I have no objections.
I ran ledger-tool verify on the same mnb snapshot w/ and w/o the change. No difference in the time. Ledger-tool is aware of all entries in the slot, so replay batches should be as large as expected, i.e. the worst case for not parallelizing. @ryoqun and/or @steviez thoughts on merging this? I don't think it needs to be feature-gated - it helps throughput in the worst case where all txs conflict, but in realistic load there's no significant difference.
I removed the comment on the update_index function, as I think @jeffwashington's suggestion to use an enum makes the argument self-documenting (at least to a degree I'm happy with) |
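A minimal sketch of what that enum-driven selection might look like; the definition here is hypothetical and may not match the one that actually landed in accounts_db.rs (only the PoolWithThreshold variant is mentioned in this thread).

```rust
use rayon::prelude::*;

/// Hypothetical definition mirroring the UpdateIndexThreadSelection idea
/// discussed here; the real enum in accounts_db.rs may differ.
#[derive(Clone, Copy)]
enum UpdateIndexThreadSelection {
    /// Always run on the calling thread (e.g. the tx-commit hot path).
    Inline,
    /// Use the rayon pool only when the batch size exceeds the threshold.
    PoolWithThreshold,
}

fn update_index(items: &[u64], selection: UpdateIndexThreadSelection, threshold: usize) {
    let use_pool = matches!(selection, UpdateIndexThreadSelection::PoolWithThreshold)
        && items.len() > threshold;
    if use_pool {
        items.par_iter().for_each(|item| {
            // ... update the index entry for `item` on a rayon worker ...
            let _ = item;
        });
    } else {
        items.iter().for_each(|item| {
            // ... update the index entry for `item` inline ...
            let _ = item;
        });
    }
}
```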
Agreed on the validity of this experiment as it relates to ledger-tool being the best case for parallelization.
Agreed that I don't think this needs a feature gate.
Some notes here for my own sake as well:
- The path that is being updated to force inline is what is called by Bank::commit_transactions().
- remove par_iter on update index below threshold #25699 previously added the change to skip the threadpool when the threshold (1) was not exceeded (i.e. a single account update).
- In that PR, @jeffwashington commented:
"Note that any individual pubkey update index call could be a page fault currently. With a threshold of 1, we are clearly better not adding parallelism. It gets more complicated to know with len > 1."
I'm not 100% sure if that comment is still accurate, but I'm wondering if we should consider raising the threshold for using the pool for the benefit of the code paths that use UpdateIndexThreadSelection::PoolWithThreshold.
I think a larger question, probably out of scope for this PR, that I have wondered about is what the overhead of using the global rayon thread pool is. Knowing this would allow us to make more informed decisions about when we should use a pool vs. skip it. Using the global pool here (into_par_iter()) instead of a dedicated one could be hitting contention issues as well; see the sketch below for a rough way to compare the two.
If the change is observed to help, maybe we table any further optimizations around adjusting the threshold until a point when we have a better understanding of the overhead of using rayon?
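Not part of the PR, just an illustration of the kind of comparison mentioned above: a toy example contrasting the global rayon pool with a dedicated pool sized to roughly a quarter of the cores. The workload and sizing here are made up for the sketch.

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    let items: Vec<u64> = (0..1_000_000).collect();

    // Global pool: par_iter() schedules onto rayon's shared worker threads,
    // so this work competes with anything else using the global pool.
    let global_sum: u64 = items.par_iter().copied().sum();

    // Dedicated pool: work submitted via install() runs only on this pool's
    // threads (here ~1/4 of the cores, mirroring the cleaning path's sizing).
    let quarter = std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    );
    let pool = ThreadPoolBuilder::new()
        .num_threads(quarter)
        .build()
        .expect("failed to build dedicated rayon pool");
    let dedicated_sum: u64 = pool.install(|| items.par_iter().copied().sum());

    assert_eq!(global_sum, dedicated_sum);
}
```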
Comment is still accurate.
I think the comment is still accurate. W/ len == 1 it's certainly better to not parallelize, and above that it's more complicated. I might be wrong, but I think that comment applies to the general case for this fn, and not with the additional context that the calls we're interested in here are specifically coming from committing transactions. In committing transactions we'll have very recently loaded these accounts, and presumably found them by looking in the index. That's not necessarily the case for other "pure database" operations like cleaning accounts.
I think this could also have a significant impact. As far as I'm aware, this fn is otherwise used during cleaning accounts, where we have a dedicated threadpool of 1/4 of the cores. This is kind of reflected in the code as well: let chunk_size = std::cmp::max(1, len / quarter_thread_count()); // # pubkeys/thread - there's an implied assumption that we have 1/4 of the cores rather than all of them (as in the committing case).
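To make that sizing assumption concrete, a small sketch of how the chunking plays out; quarter_thread_count() is re-implemented locally as a stand-in for the real helper, and the pubkey type is a placeholder.

```rust
use rayon::prelude::*;

/// Local stand-in for the real quarter_thread_count() helper referenced above.
fn quarter_thread_count() -> usize {
    std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    )
}

/// Chunk the batch so that there is roughly one chunk per "quarter pool"
/// thread. If the caller actually has the whole global pool available (as in
/// the tx-commit case), these chunks are coarser than they need to be.
fn update_index_parallel(pubkeys: &[u64]) {
    let len = pubkeys.len();
    let chunk_size = std::cmp::max(1, len / quarter_thread_count()); // # pubkeys/thread
    pubkeys.par_chunks(chunk_size).for_each(|chunk| {
        for pubkey in chunk {
            // ... update the index entry for `pubkey` ...
            let _ = pubkey;
        }
    });
}
```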
Caller may have more threads than 1/4 of the pool, but the goal was not to stall the entire fg processing to wait on the i/o of the disk index.
lgtm
Thanks for providing the extra context there. Given that, I have no concerns and LGTM!
Problem
Farming work out to a thread pool works well for a lot of accounts-wide operations but is a bottleneck in the banking-stage and replay-stage hot loops (see the svg in the details below).
Summary of Changes
An additional bool argument is passed down to allow callers to force inline execution of the update_index function.
Still need to do some additional benching; however, this change shows significant improvement in bench-tps numbers when we have a lot of conflict:
Numbers are displayed as "MAX TPS / AVG TPS"
Command details - table columns correspond to:
- NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1 --keypair-multiplier 2
- NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1
- NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 4
- NDEBUG=1 ./multinode-demo/bench-tps.sh
(You can download the above .svg and work your way into "solTxWorker-2", and see a large chunk of time is spent in store_accounts_custom, about half of that being a rayon operation.)

Fixes #