This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Performance: Allowed forced inline update_index (no thread pool) #31455

Merged: 5 commits merged into solana-labs:master from the perf/force_inlined_update_index branch on May 18, 2023

Conversation

@apfitzge (Contributor) commented May 2, 2023

Problem

Farming work out to a thread pool works well for many accounts-wide operations, but it is a bottleneck in the banking-stage and replay-stage hot loops (see the .svg in the details below).

Summary of Changes

An additional bool argument is passed down to allow callers to force inline execution of the update_index function.
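
(For illustration only, not the actual code in accounts_db.rs: a minimal standalone sketch of the control flow being described. The names, signature, and threshold value here are simplified assumptions.)

```rust
use rayon::prelude::*;

// Stand-in types for illustration only.
struct AccountInfo;
struct Reclaim;

fn update_one_pubkey(_info: AccountInfo) -> Vec<Reclaim> {
    // ...update the accounts-index entry for one pubkey, collecting reclaims...
    Vec::new()
}

// Hypothetical, simplified shape of the change: hot-path callers (banking
// stage, replay) pass `force_inline_update_index = true` to skip the rayon
// hand-off and run the index updates on the calling thread.
fn update_index(infos: Vec<AccountInfo>, force_inline_update_index: bool) -> Vec<Reclaim> {
    let threshold = 1; // illustrative value for the existing length heuristic
    if !force_inline_update_index && infos.len() > threshold {
        // Farm the per-pubkey updates out to the global rayon pool.
        infos.into_par_iter().flat_map(update_one_pubkey).collect()
    } else {
        // Run inline: no thread-pool hand-off, no cross-thread scheduling cost.
        infos.into_iter().flat_map(update_one_pubkey).collect()
    }
}
```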

Some additional benchmarking is still needed; however, this change shows a significant improvement in bench-tps numbers when there is a lot of conflict:

Numbers are displayed as "MAX TPS / AVG TPS"

| number of serializable tx-chains | 1 | 4 | 16 | 50000 |
| --- | --- | --- | --- | --- |
| master | 6715 / 4048 | 13397 / 7541 | 23603 / 12765 | 75637 / 52698 |
| PR | 11699 / 8208 | 24159 / 18601 | 40706 / 31451 | 80685 / 51630 |
Command details. The table columns correspond to:
  1. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1 --keypair-multiplier 2
  2. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1
  3. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 4
  4. NDEBUG=1 ./multinode-demo/bench-tps.sh

[Attached flamegraph: flamegraph-single-threaded-bench-tps.svg]

(You can download the above .svg, work your way into "solTxWorker-2", and see that a large chunk of time is spent in store_accounts_custom, with about half of that being a rayon operation.)

Fixes #

@apfitzge (Contributor, Author) commented May 2, 2023

Biggest outstanding concerns w/ this change:

  1. It will worsen performance w/ large-sized accounts
  2. Upgrading some nodes -> more packed blocks and replay stage can't keep up?

@apfitzge added the "work in progress" label on May 2, 2023
@codecov bot commented May 3, 2023

Codecov Report

Merging #31455 (adc36d4) into master (b3d5c0d) will decrease coverage by 0.1%.
The diff coverage is 95.0%.

@@            Coverage Diff            @@
##           master   #31455     +/-   ##
=========================================
- Coverage    81.4%    81.4%   -0.1%     
=========================================
  Files         731      731             
  Lines      208741   208757     +16     
=========================================
+ Hits       170028   170029      +1     
- Misses      38713    38728     +15     

@steviez (Contributor) commented May 3, 2023

  1. Upgrading some nodes -> more packed blocks and replay stage can't keep up?

We could always feature gate this if necessary, so that nodes without the change aren't left unable to keep up.

@@ -8007,7 +8010,7 @@ impl AccountsDb {
 });
 reclaims
 };
-if len > threshold {
+if !force_inline_update_index && len > threshold {

@ryoqun referenced this pull request in ryoqun/solana on May 3, 2023
@jeffwashington (Contributor) commented:

This is fine to me. We have tweaked this at least once previously. The dynamics of perf in master on the disk index have also changed. Since most index entries should be in the in-memory index during tx stores, all accesses should be fast. But they are also using independent locks (in theory). I imagine the thread switching to the par pool is costing us more than we're saving.

@apfitzge force-pushed the perf/force_inlined_update_index branch from bc4916f to 7e14c32 on May 3, 2023 18:42
@apfitzge marked this pull request as ready for review on May 3, 2023 19:53
@apfitzge (Contributor, Author) commented May 9, 2023

This still seems to be a good change; in all my testing thus far it has either had no effect or improved throughput.

As a final sanity check, I want to start up 2 validators w/ equivalent hardware, settings, snapshots, etc., and check replay-time stats w/ and w/o the patch.
I'm out most of this week so this branch will be stale until I have a chance to do that when I get back.

@apfitzge (Contributor, Author) commented:

A datapoint for the improvement in a benchmark (a bit easier to reproduce than bench-tps, but essentially the same thing): #31625 (comment)

@jeffwashington (Contributor) commented:

I have no objections.

@apfitzge (Contributor, Author) commented:

I ran ledger-tool verify on the same mnb snapshot w/ and w/o the change. No difference in the time. Ledger-tool is aware of all entries in the slot, so replay batches should be as large as expected, i.e. the worst case for not parallelizing.

@ryoqun and/or @steviez, thoughts on merging this? I don't think it needs to be feature-gated; it helps throughput in the worst case where all txs conflict, but under realistic load there's no significant difference.

@apfitzge (Contributor, Author) commented:

I removed the comment on the update_index function, as I think @jeffwashington's suggestion to use an enum makes the argument self-documenting (at least to a degree I'm happy with).
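
(For illustration: a sketch of what such a self-documenting argument could look like. `UpdateIndexThreadSelection::PoolWithThreshold` is the only variant named later in this thread; the `Inline` variant name and the call-site helper below are assumptions.)

```rust
/// Sketch only; variant names other than `PoolWithThreshold` are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UpdateIndexThreadSelection {
    /// Run the index updates on the calling thread.
    Inline,
    /// Use the thread pool when the batch is larger than the threshold.
    PoolWithThreshold,
}

// The bool-and-length check from the diff above, expressed against the enum.
fn should_use_pool(selection: UpdateIndexThreadSelection, len: usize, threshold: usize) -> bool {
    matches!(selection, UpdateIndexThreadSelection::PoolWithThreshold) && len > threshold
}
```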

@steviez removed the "work in progress" label on May 18, 2023
@steviez (Contributor) commented May 18, 2023

I ran ledger-tool verify on the same mnb snapshot w/ and w/o the change. No difference in the time. Ledger-tool is aware of all entries in the slot, so replay batches should be as large as expected, i.e. the worst case for not parallelizing.

Agreed on the validity of this experiment, as it relates to ledger-tool being the best case for parallelization.

I don't think it needs to be feature-gated; it helps throughput in the worst case where all txs conflict, but under realistic load there's no significant difference.

Agreed that I don't think this needs a feature gate.

@steviez (Contributor) left a comment:

Some notes here for my own sake as well:

Note that any individual pubkey update index call could be a page fault currently. With a threshold of 1, we are clearly better not adding parallelism. It gets more complicated to know with len > 1.

I'm not 100% sure if that comment is still accurate, but I'm wondering if we should consider raising the threshold for using the pool for the benefit of the code paths that use UpdateIndexThreadSelection::PoolWithThreshold.

A larger question, probably out of scope for this PR, that I have wondered about is: what is the overhead of using the global rayon thread pool? Knowing this would allow us to make more informed decisions about when we should use a pool vs. skip it. Using the global pool here (into_par_iter()) instead of a dedicated one could be hitting contention issues as well.

If the change is observed to help, maybe we table any further optimizations around adjusting the threshold until we have a better understanding of the overhead of using rayon?
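
(As a standalone illustration of the distinction raised above, not Solana code: `par_iter()`/`into_par_iter()` schedules work on rayon's process-wide global pool, while `ThreadPool::install` scopes the same work to a dedicated pool.)

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    let items: Vec<u64> = (0..1_000).collect();

    // Global pool: this parallel iterator runs on rayon's default pool, which
    // is shared with every other rayon user in the process (possible contention).
    let sum_global: u64 = items.par_iter().sum();

    // Dedicated pool: the same work scoped to its own pool, sized here to a
    // quarter of the available cores purely as an example.
    let quarter = std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    );
    let pool = ThreadPoolBuilder::new()
        .num_threads(quarter)
        .build()
        .expect("failed to build dedicated rayon pool");
    let sum_dedicated: u64 = pool.install(|| items.par_iter().sum());

    assert_eq!(sum_global, sum_dedicated);
}
```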

@jeffwashington (Contributor) commented May 18, 2023

I'm not 100% sure if that comment is still accurate

Comment is still accurate.

@jeffwashington self-requested a review on May 18, 2023 18:11
@apfitzge (Contributor, Author) commented:

@steviez

I'm not 100% sure if that comment is still accurate, but I'm wondering if we should consider raising the threshold for using the pool for the benefit of the code paths that use UpdateIndexThreadSelection::PoolWithThreshold.

I think the comment is still accurate. W/ len == 1 it's certainly better not to parallelize, and above that it's more complicated.

I might be wrong, but I think that comment is about the general case for this fn, without the additional context that the calls we're interested in here are specifically coming from committing transactions. When committing transactions we'll have very recently loaded these accounts, and presumably found them by looking in the index. That's not necessarily the case for other "pure database" operations like cleaning accounts.

Using the global pool here (into_par_iter()) instead of a dedicated one could be hitting contention issues as well.

I think this could also have a significant impact. As far as I'm aware, this fn is otherwise used during account cleaning, where we have a dedicated threadpool of 1/4 of the cores. This is kind of reflected in the code as well:

            let chunk_size = std::cmp::max(1, len / quarter_thread_count()); // # pubkeys/thread

in that there's an implied assumption that we have 1/4 of the cores rather than all of them (as in the case of committing).
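
(For illustration, a standalone sketch of that chunking heuristic; `quarter_thread_count` is re-implemented here only for the example and is not the actual helper from the codebase.)

```rust
// Not the real helper; an illustrative re-implementation only.
fn quarter_thread_count() -> usize {
    std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    )
}

fn main() {
    let len = 64; // number of pubkeys whose index entries need updating
    // # pubkeys/thread, under the implied assumption that only a quarter of
    // the cores are available to this work.
    let chunk_size = std::cmp::max(1, len / quarter_thread_count());
    println!("{len} pubkeys -> chunks of {chunk_size}");
}
```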

@jeffwashington (Contributor) commented May 18, 2023

in that there's an implied assumption that we have 1/4 of the cores rather than all of them (as in the case of committing).

The caller may have more threads than 1/4 of the pool, but the goal was to not stall the entire fg processing waiting on the I/O of the disk index.
We run txs in parallel, right, so one group could be storing results while another is also loading, executing, etc.
This was just a heuristic to allow us to avoid serializing N (used to be N*2) page faults per store/upsert of N accounts.

@jeffwashington (Contributor) left a comment:

lgtm

@steviez (Contributor) left a comment:

Thanks for providing the extra context there. Given that, I have no concerns and LGTM!

@apfitzge merged commit d391e75 into solana-labs:master on May 18, 2023
@apfitzge deleted the perf/force_inlined_update_index branch on May 18, 2023 20:32