This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

Performance: Allowed forced inline update_index (no thread pool) #31455

Merged: 5 commits merged into solana-labs:master from the perf/force_inlined_update_index branch on May 18, 2023

Conversation

@apfitzge (Contributor) commented May 2, 2023

Problem

Farming work out to a thread pool works well for many accounts-wide operations, but it is a bottleneck in the banking-stage and replay-stage hot loops (see the .svg in the details below).

Summary of Changes

An additional bool argument is passed down to allow callers to force inline execution of the update_index function.
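
(For illustration only, not the actual code in accounts_db.rs: a minimal standalone sketch of the control flow being described. The names, signature, and threshold value here are simplified assumptions.)

```rust
use rayon::prelude::*;

// Stand-in types for illustration only.
struct AccountInfo;
struct Reclaim;

fn update_one_pubkey(_info: AccountInfo) -> Vec<Reclaim> {
    // ...update the accounts-index entry for one pubkey, collecting reclaims...
    Vec::new()
}

// Hypothetical, simplified shape of the change: hot-path callers (banking
// stage, replay) pass `force_inline_update_index = true` to skip the rayon
// hand-off and run the index updates on the calling thread.
fn update_index(infos: Vec<AccountInfo>, force_inline_update_index: bool) -> Vec<Reclaim> {
    let threshold = 1; // illustrative value for the existing length heuristic
    if !force_inline_update_index && infos.len() > threshold {
        // Farm the per-pubkey updates out to the global rayon pool.
        infos.into_par_iter().flat_map(update_one_pubkey).collect()
    } else {
        // Run inline: no thread-pool hand-off, no cross-thread scheduling cost.
        infos.into_iter().flat_map(update_one_pubkey).collect()
    }
}
```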

Some additional benchmarking is still needed; however, this change shows a significant improvement in bench-tps numbers when there is a lot of conflict:

Numbers are displayed as "MAX TPS / AVG TPS"

| number of serializable tx-chains | 1 | 4 | 16 | 50000 |
| --- | --- | --- | --- | --- |
| master | 6715 / 4048 | 13397 / 7541 | 23603 / 12765 | 75637 / 52698 |
| PR | 11699 / 8208 | 24159 / 18601 | 40706 / 31451 | 80685 / 51630 |
Command details. The table columns correspond to:
  1. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1 --keypair-multiplier 2
  2. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 1
  3. NDEBUG=1 ./multinode-demo/bench-tps.sh --num-conflict-groups 4
  4. NDEBUG=1 ./multinode-demo/bench-tps.sh

[Attached flamegraph: flamegraph-single-threaded-bench-tps.svg]

(You can download the above .svg, work your way into "solTxWorker-2", and see that a large chunk of time is spent in store_accounts_custom, with about half of that being a rayon operation.)

Fixes #

@apfitzge (Contributor, Author) commented May 2, 2023

Biggest outstanding concerns w/ this change:

  1. It will worsen performance w/ large-sized accounts
  2. Upgrading some nodes -> more packed blocks and replay stage can't keep up?

@apfitzge added the "work in progress" label on May 2, 2023
@codecov bot commented May 3, 2023

Codecov Report

Merging #31455 (adc36d4) into master (b3d5c0d) will decrease coverage by 0.1%.
The diff coverage is 95.0%.

@@            Coverage Diff            @@
##           master   #31455     +/-   ##
=========================================
- Coverage    81.4%    81.4%   -0.1%     
=========================================
  Files         731      731             
  Lines      208741   208757     +16     
=========================================
+ Hits       170028   170029      +1     
- Misses      38713    38728     +15     

@steviez (Contributor) commented May 3, 2023

  1. Upgrading some nodes -> more packed blocks and replay stage can't keep up?

We could always feature gate this if necessary, so that nodes without the change aren't left unable to keep up.

@@ -8007,7 +8010,7 @@ impl AccountsDb {
 });
 reclaims
 };
-if len > threshold {
+if !force_inline_update_index && len > threshold {

@ryoqun referenced this pull request in ryoqun/solana on May 3, 2023
@jeffwashington (Contributor) commented:

This is fine to me. We have tweaked this at least once previously. The dynamics of perf in master on the disk index have also changed. Since most index entries should be in the in-memory index during tx stores, all accesses should be fast. But they are also using independent locks (in theory). I imagine the thread switching to the par pool is costing us more than we're saving.

@apfitzge force-pushed the perf/force_inlined_update_index branch from bc4916f to 7e14c32 on May 3, 2023 18:42
@apfitzge marked this pull request as ready for review on May 3, 2023 19:53
@apfitzge (Contributor, Author) commented May 9, 2023

This still seems to be a good change; in all my testing thus far it has either had no effect or improved throughput.

As a final sanity check, I want to start up 2 validators w/ equivalent hardware, settings, snapshots, etc., and check replay-time stats w/ and w/o the patch.
I'm out most of this week so this branch will be stale until I have a chance to do that when I get back.

@apfitzge (Contributor, Author) commented:

A datapoint for the improvement in a benchmark (a bit easier to reproduce than bench-tps, but essentially the same thing): #31625 (comment)

@jeffwashington (Contributor) commented:

I have no objections.

@apfitzge (Contributor, Author) commented:

I ran ledger-tool verify on the same mnb snapshot w/ and w/o the change. No difference in the time. Ledger-tool is aware of all entries in the slot, so replay batches should be as large as expected, i.e. the worst case for not parallelizing.

@ryoqun and/or @steviez, thoughts on merging this? I don't think it needs to be feature-gated; it helps throughput in the worst case where all txs conflict, but under realistic load there's no significant difference.

@apfitzge (Contributor, Author) commented:

I removed the comment on the update_index function, as I think @jeffwashington's suggestion to use an enum makes the argument self-documenting (at least to a degree I'm happy with).
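
(For illustration: a sketch of what such a self-documenting argument could look like. `UpdateIndexThreadSelection::PoolWithThreshold` is the only variant named later in this thread; the `Inline` variant name and the call-site helper below are assumptions.)

```rust
/// Sketch only; variant names other than `PoolWithThreshold` are assumed.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum UpdateIndexThreadSelection {
    /// Run the index updates on the calling thread.
    Inline,
    /// Use the thread pool when the batch is larger than the threshold.
    PoolWithThreshold,
}

// The bool-and-length check from the diff above, expressed against the enum.
fn should_use_pool(selection: UpdateIndexThreadSelection, len: usize, threshold: usize) -> bool {
    matches!(selection, UpdateIndexThreadSelection::PoolWithThreshold) && len > threshold
}
```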

@steviez removed the "work in progress" label on May 18, 2023
@steviez (Contributor) commented May 18, 2023

I ran ledger-tool verify on the same mnb snapshot w/ and w/o the change. No difference in the time. Ledger-tool is aware of all entries in the slot, so replay batches should be as large as expected, i.e. the worst case for not parallelizing.

Agreed on the validity of this experiment, as it relates to ledger-tool being the best case for parallelization.

I don't think it needs to be feature-gated; it helps throughput in the worst case where all txs conflict, but under realistic load there's no significant difference.

Agreed that I don't think this needs a feature gate.

@steviez (Contributor) left a comment:

Some notes here for my own sake as well:

Note that any individual pubkey update index call could be a page fault currently. With a threshold of 1, we are clearly better not adding parallelism. It gets more complicated to know with len > 1.

I'm not 100% sure if that comment is still accurate, but I'm wondering if we should consider raising the threshold for using the pool for the benefit of the code paths that use UpdateIndexThreadSelection::PoolWithThreshold.

A larger question, probably out of scope for this PR, that I have wondered about is: what is the overhead of using the global rayon thread pool? Knowing this would allow us to make more informed decisions about when we should use a pool vs. skip it. Using the global pool here (into_par_iter()) instead of a dedicated one could be hitting contention issues as well.

If the change is observed to help, maybe we table any further optimizations around adjusting the threshold until we have a better understanding of the overhead of using rayon?
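
(As a standalone illustration of the distinction raised above, not Solana code: `par_iter()`/`into_par_iter()` schedules work on rayon's process-wide global pool, while `ThreadPool::install` scopes the same work to a dedicated pool.)

```rust
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

fn main() {
    let items: Vec<u64> = (0..1_000).collect();

    // Global pool: this parallel iterator runs on rayon's default pool, which
    // is shared with every other rayon user in the process (possible contention).
    let sum_global: u64 = items.par_iter().sum();

    // Dedicated pool: the same work scoped to its own pool, sized here to a
    // quarter of the available cores purely as an example.
    let quarter = std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    );
    let pool = ThreadPoolBuilder::new()
        .num_threads(quarter)
        .build()
        .expect("failed to build dedicated rayon pool");
    let sum_dedicated: u64 = pool.install(|| items.par_iter().sum());

    assert_eq!(sum_global, sum_dedicated);
}
```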

@jeffwashington (Contributor) commented May 18, 2023

I'm not 100% sure if that comment is still accurate

Comment is still accurate.

@jeffwashington self-requested a review on May 18, 2023 18:11
@apfitzge (Contributor, Author) commented:

@steviez

I'm not 100% sure if that comment is still accurate, but I'm wondering if we should consider raising the threshold for using the pool for the benefit of the code paths that use UpdateIndexThreadSelection::PoolWithThreshold.

I think the comment is still accurate. W/ len == 1 it's certainly better not to parallelize, and above that it's more complicated.

I might be wrong, but I think that comment is about the general case for this fn, without the additional context that the calls we're interested in here are specifically coming from committing transactions. When committing transactions we'll have very recently loaded these accounts, and presumably found them by looking in the index. That's not necessarily the case for other "pure database" operations like cleaning accounts.

Using the global pool here (into_par_iter()) instead of a dedicated one could be hitting contention issues as well.

I think this could also have a significant impact. As far as I'm aware, this fn is otherwise used during account cleaning, where we have a dedicated threadpool of 1/4 of the cores. This is kind of reflected in the code as well:

            let chunk_size = std::cmp::max(1, len / quarter_thread_count()); // # pubkeys/thread

in that there's an implied assumption that we have 1/4 of the cores rather than all of them (as in the case of committing).
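
(For illustration, a standalone sketch of that chunking heuristic; `quarter_thread_count` is re-implemented here only for the example and is not the actual helper from the codebase.)

```rust
// Not the real helper; an illustrative re-implementation only.
fn quarter_thread_count() -> usize {
    std::cmp::max(
        1,
        std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1) / 4,
    )
}

fn main() {
    let len = 64; // number of pubkeys whose index entries need updating
    // # pubkeys/thread, under the implied assumption that only a quarter of
    // the cores are available to this work.
    let chunk_size = std::cmp::max(1, len / quarter_thread_count());
    println!("{len} pubkeys -> chunks of {chunk_size}");
}
```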

@jeffwashington (Contributor) commented May 18, 2023

in that there's an implied assumption that we have 1/4 of the cores rather than all of them (as in the case of committing).

The caller may have more threads than 1/4 of the pool, but the goal was to not stall the entire fg processing waiting on the I/O of the disk index.
We run txs in parallel, right, so one group could be storing results while another is also loading, executing, etc.
This was just a heuristic to allow us to avoid serializing N (used to be N*2) page faults per store/upsert of N accounts.

@jeffwashington (Contributor) left a comment:

lgtm

@steviez (Contributor) left a comment:

Thanks for providing the extra context there. Given that, I have no concerns and LGTM!

@apfitzge merged commit d391e75 into solana-labs:master on May 18, 2023
@apfitzge deleted the perf/force_inlined_update_index branch on May 18, 2023 20:32