This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

AccountsDb: Don't use threads for update_index #24692

Closed

Conversation

ckamm
Contributor

@ckamm ckamm commented Apr 26, 2022

Problem

I was profiling banking-bench runs with same-batch-only contention and noticed a lot of time was spent in update_index(). That makes sense, because len is only around 3 there, and splitting such a small batch across multiple threads is mostly overhead.

But when I played with adding a min_chunk_size I realized that dropping the parallelism also sped up the none-contention case (where len > 256). Hence this patch just removes it completely.

banking-bench results:

           none      same-batch-only       full
before    96451          7967              3098
after    133738         15197             10504

(--packets-per-batch 128 --batches-per-iteration 6
--iterations=1200 for none, 200 otherwise)

Summary of Changes

Drop parallelism from update_index() in accountsdb.
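The shape of the change, as a minimal self-contained sketch (stand-in types and functions, not the actual accounts_db code):

```rust
use rayon::prelude::*;

// Stand-in for upserting one (pubkey, account-info) pair into the index.
fn upsert_one(_pubkey: &u64) {}

// Before (sketch): every call pays rayon scheduling overhead, even when the
// batch only holds a handful of accounts.
fn update_index_parallel(pubkeys: &[u64]) {
    pubkeys.par_iter().for_each(upsert_one);
}

// After (sketch): a plain sequential pass over the batch.
fn update_index_sequential(pubkeys: &[u64]) {
    pubkeys.iter().for_each(upsert_one);
}

fn main() {
    let pubkeys: Vec<u64> = (0..768).collect(); // e.g. 128 packets * 6 batches
    update_index_parallel(&pubkeys);
    update_index_sequential(&pubkeys);
}
```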

banking-bench results:

           none      same-batch-only       full
before    96451          7967              3098
after    133738         15197             10504

(--packets-per-batch 128 --batches-per-iteration 6
 --iterations=1200 for none, 200 otherwise)
@mergify mergify bot added the community Community contribution label Apr 26, 2022
@mergify mergify bot requested a review from a team April 26, 2022 13:52
@ckamm
Contributor Author

ckamm commented Apr 26, 2022

@taozhu-chicago, is this in your area?

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

#20601 suggests this was done for the case where the indexes are on disk. @jeffwashington

I wonder if there's a way to do it that doesn't sacrifice as much.

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

Switch based on AccountsIndex::is_disk_index_enabled()?

@tao-stones
Contributor

@taozhu-chicago, is this in your area?

Interesting finding! @jeffwashington knows better about accountsdb indexing.

@jeffwashington
Contributor

Switch based on AccountsIndex::is_disk_index_enabled()?

Disk index will be enabled by default beginning with 1.11.

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

Ok, sounds like we should update banking-bench to use a disk index too then.

@jeffwashington
Contributor

Ok, sounds like we should update banking-bench to use a disk index too then.

Not a bad idea. I'm not familiar with banking-bench. I'm trying to think through a few dimensions:

  1. Having 10B accounts (or even 'just' 1B).
  2. Cache flushing and shrink will call this with an entire slot of changes at a time.
  3. I imagine individual account 'store' calls will call this with a single account.
  4. We currently have 8k 'bins' of accounts by default (this is a CLI option). The locking contention is very different from what it was 9 months ago; in theory, locking contention should be negligible at runtime.

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

As far as I can tell banking-bench already uses the default disk index.

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

@jeffwashington Can you give me a very brief pointer as to what hits the disk in AccountsIndex::upsert()?

@jeffwashington
Contributor

As far as I can tell banking-bench already uses the default disk index.

Ha. That's good to know. I had expected/hoped that my change to default would have caused disk index to be used for benches and tests. It may only be using 2 bins by default for tests/benches (to avoid 16k+ files per test launch).

We could put the guts of this function in a closure and call it within threads or not depending on some heuristic (like # accounts to update). Then we could go parallel at high counts and stay in the same thread at low counts.
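Something like the following, perhaps (a sketch only; the threshold value and the names are placeholders that would need benchmarking):

```rust
use rayon::prelude::*;

// Hypothetical cutoff below which the thread-pool overhead outweighs the win.
const PARALLEL_THRESHOLD: usize = 256;

fn upsert_one(_pubkey: &u64) { /* index update for one account */ }

fn update_index(pubkeys: &[u64]) {
    // Put the per-account work in one closure...
    let body = |pubkey: &u64| upsert_one(pubkey);
    // ...and decide whether to run it inline or on the thread pool.
    if pubkeys.len() < PARALLEL_THRESHOLD {
        pubkeys.iter().for_each(body);
    } else {
        pubkeys.par_iter().for_each(body);
    }
}

fn main() {
    update_index(&(0..3u64).collect::<Vec<_>>());    // small batch: stays inline
    update_index(&(0..4096u64).collect::<Vec<_>>()); // large batch: goes parallel
}
```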

@jeffwashington
Contributor

@jeffwashington Can you give me a very brief pointer as to what hits the disk in AccountsIndex::upsert()?

If you upsert account A and it isn't in the in-mem index (an LRU cache of active or dirty pubkeys), then we have to look up account A on disk and load it into the in-mem index. If the account exists on disk, this could be 2 page faults; if it does not exist, it could be 1 page fault.
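In toy form, the lookup path described here is roughly this (illustrative stand-ins, not the real AccountsIndex types):

```rust
use std::collections::HashMap;

// The in-mem index acts as a cache over the disk index; an upsert that misses
// the in-mem index has to consult disk first, which is where the page faults
// on the mmap'd index files come from.
struct ToyIndex {
    in_mem: HashMap<u64, Vec<u64>>,  // pubkey -> slot list (stand-in for the in-mem index)
    on_disk: HashMap<u64, Vec<u64>>, // stand-in for the mmap'd disk index
}

impl ToyIndex {
    fn upsert(&mut self, pubkey: u64, slot: u64) {
        if !self.in_mem.contains_key(&pubkey) {
            // Miss: look the pubkey up on disk to learn whether this is an
            // update to an existing account (~2 page faults) or a brand-new
            // account (~1 page fault just to confirm it isn't there).
            let existing = self.on_disk.get(&pubkey).cloned().unwrap_or_default();
            self.in_mem.insert(pubkey, existing);
        }
        // The entry is now resident in memory; apply the update there.
        self.in_mem.get_mut(&pubkey).unwrap().push(slot);
    }
}
```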

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

We could put the guts of this function in a closure and call it within threads or not depending on some heuristic (like # accounts to update). Then we could go parallel at high counts and stay in the same thread at low counts.

Agreed, that's what I did initially. But then it looked like there was also a significant gain from not going parallel even when N > 256. I'll investigate more tomorrow.

@jeffwashington
Contributor

So how accurate the accounts bench is for the real world may depend on what it is doing.
Part of the issue may be a trade-off between limiting worst-case behavior and optimizing best-case behavior.
If we don't parallelize this, then we could have 2*N page faults in serial in the worst case, where N could be quite large.
But if the accounts are all in memory already, then perhaps threads are unnecessary. For example, we keep accounts index entries in memory if any slots are cached slots, so flushing the cache should never cause a page fault from update_index. In the shrink case, the per-item account index entry will possibly have been loaded in the 'gathering data' part of shrink; in that case, the items will already be in the cache due to the LRU. So the call in shrink that rewrites the whole append vec using update_index should only be working on index items that are already in memory.

We could pass a bool specifying whether the caller thinks it is likely to hit disk or not. Any update from tx processing would have had to load the account recently, which should cause it to be in the LRU. Any newly created account will very likely be a page fault. There are a lot of things to think about when tuning this.

@jeffwashington
Contributor

I'll investigate more tomorrow.

Thanks! Look forward to seeing what happens!

@ckamm
Contributor Author

ckamm commented Apr 26, 2022

@taozhu-chicago fyi, here are my flamegraphs from a 5s run of a single thread of banking-bench, manually instrumented with firestorm:

[flamegraph image: merged, contention=none]

[flamegraph image: merged, contention=same-batch-only]

Notice how update_index takes 20% of total time in the latter case. Download and re-open them to get interactivity.

@codecov

codecov bot commented Apr 26, 2022

Codecov Report

Merging #24692 (8b7c738) into master (adcfdfa) will increase coverage by 12.1%.
The diff coverage is 100.0%.

@@             Coverage Diff             @@
##           master   #24692       +/-   ##
===========================================
+ Coverage    69.8%    82.0%    +12.1%     
===========================================
  Files          36      596      +560     
  Lines        2268   165224   +162956     
  Branches      322        0      -322     
===========================================
+ Hits         1584   135550   +133966     
- Misses        571    29674    +29103     
+ Partials      113        0      -113     

@ckamm
Contributor Author

ckamm commented Apr 29, 2022

@jeffwashington Could you check that this logic makes sense? (this is partially based on what you said above, and me getting familiar with the accounts_index code)

  • When the banking stage wants to execute transactions, it loads all accounts.
  • For existing accounts, this creates in-memory account index entries where storage was only on disk before.
  • The in-memory index entries are guaranteed to survive until at least the next index age due to the LRU bumps in get() (the next age is >= SLOT_MS into the future; the default is 5*SLOT_MS).
  • That means all involved accounts have either an in-memory index entry (existing accounts), or no index entry at all (new accounts).
  • During leader slots there is no other parallel process that could create on-disk index entries for new accounts.
  • Thus the stores coming from commit_transactions will never need to read from disk, even for accounts that are new and currently don't have an in-memory account index entry.
  • We can introduce a banking-stage fast-path for upsert() that skips the disk read (see the sketch after this list).
  • And then we can probably drop the parallelism in update_index().
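One hypothetical shape for that fast path (names and signature are illustrative, not the real AccountsIndex API): the banking stage asserts that any existing account is already resident in the in-mem index, so a miss can be treated as a new account without consulting the disk index.

```rust
enum UpsertDiskPolicy {
    /// Normal path: on an in-mem miss, look the pubkey up in the disk index.
    CheckDisk,
    /// Banking-stage path: on an in-mem miss, assume the account is new.
    AssumeNewIfMissing,
}

fn upsert(pubkey: u64, policy: UpsertDiskPolicy) {
    if !in_mem_contains(pubkey) {
        match policy {
            UpsertDiskPolicy::CheckDisk => { /* load the entry from the disk index */ }
            UpsertDiskPolicy::AssumeNewIfMissing => { /* insert a fresh in-mem entry, no disk read */ }
        }
    }
    // ... apply the update to the in-mem entry ...
}

fn in_mem_contains(_pubkey: u64) -> bool { false } // stand-in
```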

@jeffwashington
Contributor

I believe you are correct in all your assertions.

It is unclear to me what all the ways are that an account can be created. Can we guarantee that, for everything that would call this modified AccountsIndex::upsert(), we have already tried to load every account that gets created? I cannot personally make this assertion. I would like to believe it is true.

@jeffwashington
Contributor

jeffwashington commented May 2, 2022

A mode of the acct idx synchronization I cut from the current version in master can have items exist in the index where we have not confirmed whether:

  1. this is an update to an item in the disk index
  2. or a newly created item that does not exist on disk.

We can defer looking it up on disk until:

  1. someone tries to look up this pubkey in the in-mem idx
  2. it is time to flush the in-mem idx to the disk idx and we're dealing with disk anyway. Note this happens in the bg.

Deferring this lookup improves the speed of upsert in cases where the item is not in the in-mem index. So, effectively this affects only creation. Naturally, this is the exact use case of accounts-bench, which is where I've been doing my 10B account testing. Certainly that is not representative of mnb traffic.
I implemented this mode and tested it. But, it adds complexity and potential bugs, so I did not harden this change yet. It was unclear what the perf change was. There were a lot of things in flux at the time.

@jeffwashington
Contributor

An alternative impl, for brainstorming purposes, could be to allow upsert to return an enum that says 'item not in in-mem idx, deferred'. If the item is in the in-mem idx, then the upsert is performed immediately and the fn returns. Otherwise, the upserts that couldn't occur immediately could be broken into threads by the caller.
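For example, something along these lines (hypothetical types, not the real API); the caller applies cheap in-mem updates inline and only fans the deferred, possibly disk-touching items out to threads:

```rust
enum UpsertResult {
    /// The entry was in the in-mem index; the update was applied immediately.
    Done,
    /// The entry was not in the in-mem index; the remaining (possibly
    /// disk-touching) work is handed back to the caller to finish later.
    Deferred(DeferredUpsert),
}

struct DeferredUpsert {
    pubkey: u64,
    slot: u64,
    // ... whatever else is needed to finish the upsert ...
}
```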

@jeffwashington
Contributor

We do have a metric for how many disk 'misses' we encountered.
accounts_index.load_disk_missing_count
There is also accounts_index.inserts
These things are a little tricky to count. Hopefully these metrics are accurate.

@ckamm
Contributor Author

ckamm commented May 3, 2022

It is unclear to me what all the ways are that an account can be created. Can we guarantee that, for everything that would call this modified AccountsIndex::upsert(), we have already tried to load every account that gets created? I cannot personally make this assertion. I would like to believe it is true.

I think it's true, but would need to read and think more to be more confident about it. And even if it's true now, we need to be able to guard against some future change that makes the assumption incorrect.

On the other hand, never needing to touch the disk for the banking_stage's commit_transactions may be worth quite a bit.

The call chain looks like this:

banking_stage.execute_and_commit_transactions_locked()
- bank.load_and_execute_transactions()
  - self.rc.accounts.load_accounts()
    - self.load_transaction()
      - self.accounts_db.load_with_fixed_root()
        - self.load()
          - self.do_load()
            - self.read_index_for_accessor_or_load_slow()
              - self.accounts_index.get()
- bank.commit_transactions()
  - self.rc.accounts.store_cached()
    - self.accounts_db.store_cached()
      - self.store()
        - self.store_accounts_unfrozen()
          - self.store_accounts_custom()
            - self.update_index()

I've also been wondering: Does the account_index shrinking operation happen fully in parallel, based only on time? What happens when there's a machine hiccup and more than 2s pass between the load_accounts() and store_cached()? Could the shrink pass have evicted data from the in-memory index in between the two calls?

@sakridge
Contributor

sakridge commented May 3, 2022

Some previous ideas in this area:
#17765
#17761
#17760

I think experimenting with io_uring and issuing multiple outstanding async IO requests for each account from the same thread might be the way to go, but of course who knows until someone tries it. That would give you multiple outstanding IO requests but hopefully less synchronizing overhead and also allow for out-of-order execution.

@jeffwashington
Contributor

I've also been wondering: Does the account_index shrinking operation happen fully in parallel, based only on time? What happens when there's a machine hiccup and more than 2s pass between the load_accounts() and store_cached()? Could the shrink pass have evicted data from the in-memory index in between the two calls?

yes, this could happen

@ckamm
Contributor Author

ckamm commented May 4, 2022

A mode of the acct idx synchronization I cut from the current version in master can have items exist in the index where we have not confirmed whether:

1. this is an update to an item in the disk index
2. or a newly created item that does not exist on disk.

We can defer looking it up on disk until:

1. someone tries to look up this pubkey in the in-mem idx
2. it is time to flush the in-mem idx to the disk idx and we're dealing with disk anyway. Note this happens in the bg.

It sounds to me like this would be the smoothest solution for avoiding disk access during update_index/upsert. Is there a way I can help stabilize that? Review?

@jeffwashington
Contributor

@ckamm #24996
work in progress, but you can follow there or contribute

@jeffwashington
Contributor

note that @behzadnouri fixed the thread pool usage in the function you were looking at in this pr.

@jeffwashington
Contributor

@ckamm we also have random evictions from the in-mem disk idx! I forgot about that. So, there is no guarantee that something is in the in-mem idx if we tried to load it!

@ckamm
Contributor Author

ckamm commented May 5, 2022

note that @behzadnouri fixed the thread pool usage in the function you were looking at in this pr.

I've rerun the (admittedly highly synthetic) banking-bench runs:

                       none     same-batch-only       full
into_par_iter (old)   96451          7967             3098
into_par_iter (new)   98342          8502             6914
into_iter (old)      133738         15197            10504

new is 9b8ce395a6d8bec787905f79b28f5fba066216d5
args: --packets-per-batch 128 --batches-per-iteration 6 --iterations=(1200 for none, 200 otherwise)

So there's a clear improvement, particularly on inter-batch contention, but removing the parallelism in update_index() still has a huge effect on the benchmark.

Let's shelve this PR until there's a way to never need to access the disk in update_index.

@jeffwashington
Contributor

#25017
New info: I realized that Bank::new_for_benches was using only 2 bins in the account index, same as for tests. This is not representative of a running validator, which uses 8k bins. It would mean that all the upserts were contending for the same 2 locks in the bench you were running.
I get much more similar numbers with and without threads when I set the bench bins to 8k. Can you please confirm this?

@ckamm
Contributor Author

ckamm commented May 6, 2022

Great find! Here are new measurements:

                       none     same-batch-only       full
into_par_iter (old)   96451          7967             3098
into_par_iter (new)   98342          8502             6914
into_par_iter (new2)  90626         10488             7181
into_iter (new2)     117029         14234            10013
into_iter (old)      133738         15197            10504
uncertainty guess    +-6000         +-500            +-500

new is 9b8ce395a6d8bec787905f79b28f5fba066216d5
new2 is 9b8ce395a6d8bec787905f79b28f5fba066216d5 + #25017 cherrypick
args: --packets-per-batch 128 --batches-per-iteration 6 --iterations=(1200 for none, 200 otherwise)

@jeffwashington
Contributor

I don't understand none, same-batch-only, full. Can you please give me the args you're using for each column?

@ckamm
Contributor Author

ckamm commented May 7, 2022

I don't understand none, same-batch-only, full. Can you please give me the args you're using for each column?

The full command lines are:

solana-banking-bench --write-lock-contention=none --iterations 1200 --packets-per-batch 128 --batches-per-iteration 6
solana-banking-bench --write-lock-contention=same-batch-only --iterations 200 --packets-per-batch 128 --batches-per-iteration 6
solana-banking-bench --write-lock-contention=full --iterations 200 --packets-per-batch 128 --batches-per-iteration 6

For none there's no write-lock contention. same-batch-only has contention only within a batch, not between threads. And full means every tx write-locks the same account.

@HaoranYi
Contributor

HaoranYi commented May 12, 2022

I collected logs from the sequential and parallel update_index runs. The parallel update did show higher times.

seq:
update_index=6876i
update_index=11690i
update_index=37071i
update_index=70248i
update_index=80240i
update_index=75405i
update_index=90615i
update_index=77577i
update_index=81027i
update_index=91071i
update_index=76332i
update_index=79510i
update_index=79585i
update_index=84019i
update_index=82869i
update_index=97253i
update_index=99108i
update_index=95302i
update_index=90375i
update_index=96806i
update_index=97653i
update_index=91434i
update_index=98709i
update_index=91080i
update_index=87820i
update_index=91234i

par:
update_index=234595i
update_index=260550i
update_index=432631i
update_index=567223i
update_index=111420i
update_index=234540i
update_index=232079i
update_index=238629i
update_index=222410i
update_index=234959i
update_index=204163i
update_index=243421i
update_index=189590i
update_index=229200i
update_index=204157i
update_index=217074i
update_index=186890i
update_index=239063i
update_index=150286i
update_index=188116i
update_index=159818i
update_index=158262i
update_index=164990i
update_index=155094i
update_index=171549i
update_index=149854i
update_index=178359i
update_index=153083i
update_index=178077i
update_index=151449i
update_index=202511i
update_index=184801i
update_index=165769i

@HaoranYi
Contributor

HaoranYi commented May 12, 2022

Also, the parallel version shows more cache eviction than the sequential run. @jeffwashington is that expected?

seq: 
[2022-05-12T16:35:44.045076929Z INFO  solana_metrics::metrics] datapoint: accounts_index estimate_mem_bytes=1280344i flush_should_evict_us=0i count_in_mem=24627i count=24689i bg_waiting_percent=97.09029388427734 bg_throttling_wait_percent=96.370361328125 held_in_mem_slot_list_len=37i held_in_mem_slot_list_cached=0i min_in_bin_mem=0i max_in_bin_mem=11i count_from_bins_mem=24627i median_from_bins_mem=3i min_in_bin_disk=0i max_in_bin_disk=9i count_from_bins_disk=606i median_from_bins_disk=0i gets_from_mem=1600067i get_mem_us=28077i gets_missing=49911i get_missing_us=1169i entries_from_mem=0i entry_mem_us=0i load_disk_found_count=495i load_disk_found_us=1257i load_disk_missing_count=49416i load_disk_missing_us=1396i entries_missing=24693i entry_missing_us=113i failed_to_evict=0i updates_in_mem=641335i get_range_us=3435i inserts=24689i deletes=0i active_threads=1i items=12544i keys=0i ms_per_age=416i flush_scan_us=30536i flush_update_us=77836i flush_grow_us=14175i flush_evict_us=641i disk_index_resizes=0i disk_index_max_size=0i disk_index_new_file_us=22948i disk_index_resize_us=0i disk_index_flush_file_us=0i disk_index_flush_mmap_us=3663i disk_data_resizes=0i disk_data_max_size=32i disk_data_new_file_us=32796i disk_data_resize_us=0i disk_data_flush_file_us=0i disk_data_flush_mmap_us=4384i flush_entries_updated_on_disk=624i flush_entries_evicted_from_mem=624i

par: 
[2022-05-12T17:24:25.786169107Z INFO  solana_metrics::metrics] datapoint: accounts_index estimate_mem_bytes=1279928i flush_should_evict_us=14i count_in_mem=24613i count=24689i bg_waiting_percent=95.34046936035156 bg_throttling_wait_percent=89.16108703613281 held_in_mem_slot_list_len=28i held_in_mem_slot_list_cached=0i min_in_bin_mem=0i max_in_bin_mem=12i count_from_bins_mem=24613i median_from_bins_mem=3i min_in_bin_disk=0i max_in_bin_disk=9i count_from_bins_disk=6843i median_from_bins_disk=1i gets_from_mem=1075347i get_mem_us=34961i gets_missing=56252i get_missing_us=3246i entries_from_mem=0i entry_mem_us=0i load_disk_found_count=6840i load_disk_found_us=6905i load_disk_missing_count=49412i load_disk_missing_us=12750i entries_missing=24691i entry_missing_us=6560i failed_to_evict=3i updates_in_mem=432502i get_range_us=8224i inserts=24689i deletes=0i active_threads=1i items=11520i keys=0i ms_per_age=416i flush_scan_us=28037i flush_update_us=254416i flush_grow_us=96352i flush_evict_us=1107i disk_index_resizes=0i disk_index_max_size=0i disk_index_new_file_us=83867i disk_index_resize_us=0i disk_index_flush_file_us=0i disk_index_flush_mmap_us=11288i disk_data_resizes=0i disk_data_max_size=32i disk_data_new_file_us=93271i disk_data_resize_us=0i disk_data_flush_file_us=0i disk_data_flush_mmap_us=10856i flush_entries_updated_on_disk=6975i flush_entries_evicted_from_mem=6972i

@HaoranYi
Contributor

HaoranYi commented May 12, 2022

Some previous ideas in this area: #17765 #17761 #17760

I think experimenting with io_uring and issuing multiple outstanding async IO requests for each account from the same thread might be the way to go, but of course who knows until someone tries it. That would give you multiple outstanding IO requests but hopefully less synchronizing overhead and also allow for out-of-order execution.

Yes, io_uring sounds interesting and worth a try. I will take a look and play with it ...

@jeffwashington
Contributor

Collecting rent calls this on one account at a time, so par_iter is unnecessary overhead there.
I am still worried about calls with high account counts.
I will work up a version of this that uses a threshold and parallelizes only above it. That should address the 1.10 slowdown due to this.

@ckamm
Contributor Author

ckamm commented Jun 1, 2022

@jeffwashington Since this has come up in the 1.10 upgrade context: how about I make a PR for a fast path in update_index() when there's just a single account? In particular, Bank::collect_rent_eagerly() causes a lot of one-account writes.

It also looks to me like collect_rent_eagerly() could batch the writes into one call?
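Very roughly, the batching could look like this (stand-in types; the real path ends in store_cached() and update_index() as in the call chain posted earlier in this thread):

```rust
// Stand-in account type and store call.
struct ToyAccount {
    lamports: u64,
    rent_epoch: u64,
}

fn store_batch(_slot: u64, _batch: &[(u64, ToyAccount)]) {
    // one store_cached() -> one update_index() for the whole partition
}

// Instead of one store per rent-collected account, accumulate the partition's
// updated accounts and hand them over in a single call.
fn collect_rent_in_partition(slot: u64, accounts: Vec<(u64, ToyAccount)>) {
    let mut batch = Vec::with_capacity(accounts.len());
    for (pubkey, mut account) in accounts {
        // ... charge rent, bump the rent epoch, etc. (details elided) ...
        account.lamports = account.lamports.saturating_sub(1);
        account.rent_epoch += 1;
        batch.push((pubkey, account));
    }
    store_batch(slot, &batch);
}
```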

@jeffwashington
Contributor

jeffwashington commented Jun 1, 2022

@jeffwashington Since this has come up in the 1.10 upgrade context: how about I make a PR for a fast path in update_index() when there's just a single account? In particular, Bank::collect_rent_eagerly() causes a lot of one-account writes.

It also looks to me like collect_rent_eagerly() could batch the writes into one call?

Great minds think alike. Please do this.

@ckamm
Contributor Author

ckamm commented Jun 1, 2022

great minds think alike

Haha :)

So you do the update_index() changes, and I see if I can batch stores in collect_rent_eagerly()?

@jeffwashington
Contributor

update_index() changes

#25699

@jeffwashington
Contributor

@ckamm :
#25713
#25714

@jeffwashington
Contributor

jeffwashington commented Jun 1, 2022

Note that the project to eliminate rewrites is in 1.11/master. It is not enabled by default yet. This would eliminate almost all of the stores that occur during rent collection. However, we would then add a hash call. @xiangzhu70 is looking into hashing lazily as we do on stores to the write cache already.

@jeffwashington
Contributor

jeffwashington commented Jun 1, 2022

Considering the rent collection case: since we are only dealing with accounts we can load, by definition they exist in the disk index or the in-mem idx. And since we loaded them and held the range in the in-mem cache while collecting this partition, ALL update_index calls due to rent collection/rewrites will always use the in-mem index only, with no need to go to disk. This is a special case to consider: it means we won't ever page fault on the disk index files.

@stale

stale bot commented Jun 12, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 12, 2022
@ckamm
Contributor Author

ckamm commented Jun 13, 2022

It still seems promising to me to make update_index never hit the disk and then not need a thread pool. But that will be a different PR.

@ckamm ckamm closed this Jun 13, 2022