AccountsDb: Don't use threads for update_index #24692
Conversation
banking-bench results:

         none     some-batch-only   full
before   96451    7967              3098
after    133738   15197             10504

(--packets-per-batch 128 --batches-per-iteration 6 --iterations=1200 for none, 200 otherwise)
@taozhu-chicago, is this in your area?
#20601 suggests this was done for the case where the indexes are on disk. @jeffwashington I wonder if there's a way to do it that doesn't sacrifice as much.
Switch based on
Interesting finding! @jeffwashington knows better about accountsdb indexing.
Disk index will be enabled by default beginning with 1.11.
Ok, sounds like we should update banking-bench to use a disk index too then.
Not a bad idea. I'm not familiar with banking-bench. I'm trying to think through a few dimensions:
As far as I can tell banking-bench already uses the default disk index.
@jeffwashington Can you give me a very brief pointer as to what hits the disk in
Ha. That's good to know. I had expected/hoped that my change to the default would have caused the disk index to be used for benches and tests. It may only be using 2 bins by default for tests/benches (to avoid 16k+ files per test launch). We could put the guts of this function in a closure and call it within threads or not, depending on some heuristic (like the number of accounts to update). Then we could go parallel at high counts and stay on the same thread at low counts.
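A minimal sketch of that closure idea (all names and thresholds here are invented for illustration; the real code would dispatch onto the existing accounts-db thread pool rather than spawning scoped threads):

```rust
use std::thread;

// Hypothetical cutoff: below this, thread overhead dominates the work.
const PARALLEL_THRESHOLD: usize = 256;

// Run `update_one` over `items`: sequentially for small batches, on a
// handful of scoped threads for large ones.
fn update_index<T: Sync>(items: &[T], update_one: impl Fn(&T) + Sync) {
    if items.len() < PARALLEL_THRESHOLD {
        // Small batch: stay on the calling thread.
        items.iter().for_each(|item| update_one(item));
    } else {
        // Large batch: split into up to 4 chunks, one thread each.
        let chunk_size = (items.len() + 3) / 4;
        let f = &update_one;
        thread::scope(|s| {
            for chunk in items.chunks(chunk_size) {
                s.spawn(move || chunk.iter().for_each(|item| f(item)));
            }
        });
    }
}
```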
if you upsert account A and account A isn't in the in-mem index (an lru cache of active or dirty pubkeys), then we have to look up account A on disk and load it into the in-mem index. If the account exists on disk, this could be 2 page faults. If it does not exist, it could be 1 page fault.
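The miss path described here can be modeled with a toy index (hypothetical names and types; the real in-mem index is an eviction-driven cache over mmap'd bucket files and far more involved):

```rust
use std::collections::HashMap;

// Toy model of the upsert path: a small in-mem map backed by a "disk" map.
struct ToyIndex {
    in_mem: HashMap<u64, u64>,  // stand-in for the lru of active/dirty pubkeys
    on_disk: HashMap<u64, u64>, // stand-in for the mmap'd disk index
    disk_lookups: usize,        // each of these may cost 1-2 page faults
}

impl ToyIndex {
    fn new() -> Self {
        ToyIndex {
            in_mem: HashMap::new(),
            on_disk: HashMap::new(),
            disk_lookups: 0,
        }
    }

    fn upsert(&mut self, pubkey: u64, slot: u64) {
        if !self.in_mem.contains_key(&pubkey) {
            // Miss: consult the disk index before updating.
            self.disk_lookups += 1;
            if let Some(&existing) = self.on_disk.get(&pubkey) {
                self.in_mem.insert(pubkey, existing);
            }
        }
        self.in_mem.insert(pubkey, slot);
    }
}
```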
Agreed, that's what I did initially. But then it looked like there was also a significant gain by not going parallel even when N>256. I'll investigate more tomorrow.
so how accurate the accounts bench is for the real world may depend on what it is doing. We could pass a bool specifying whether the caller thinks it is likely to hit disk or not. Any update from tx processing would have had to load the account recently; that should cause it to be in the lru. Any newly created account will very likely be a page fault. Tuning this has a lot of things to think about.
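The caller-hint idea could be as simple as this (entirely hypothetical; `DiskHint`, `use_parallel`, and the threshold are all made up for illustration):

```rust
/// Hypothetical hint a caller could pass along with an upsert batch.
#[derive(Clone, Copy, PartialEq, Debug)]
enum DiskHint {
    /// Tx processing just loaded this account, so it should be in the lru.
    LikelyInMem,
    /// Newly created account: a disk-index page fault is likely.
    LikelyOnDisk,
}

/// Decide whether parallel index updates are worth the thread overhead.
fn use_parallel(n_accounts: usize, hint: DiskHint) -> bool {
    // Parallelism only pays off when the per-item work is disk-bound
    // and the batch is large; the threshold here is invented.
    hint == DiskHint::LikelyOnDisk && n_accounts > 256
}
```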
Thanks! Looking forward to seeing what happens!
Codecov Report
@@ Coverage Diff @@
## master #24692 +/- ##
===========================================
+ Coverage 69.8% 82.0% +12.1%
===========================================
Files 36 596 +560
Lines 2268 165224 +162956
Branches 322 0 -322
===========================================
+ Hits 1584 135550 +133966
- Misses 571 29674 +29103
+ Partials 113 0 -113
@jeffwashington Could you check that this logic makes sense? (this is partially based on what you said above, and on me getting familiar with the accounts_index code)
I believe you are correct in all your assertions. The various ways an account can be created are unclear to me. Can we guarantee that we have tried to load every account that gets created by anyone who would call this modified
A mode of the acct idx synchronization I cut from the current version in master can have items exist in the index where we have not confirmed whether:
We can defer looking it up on disk until:
Deferring this lookup improves the speed of upsert in cases where the item is not in the in-mem index. So, effectively this affects only creation. Naturally, this is the exact use case of accounts-bench, which is where I've been doing my 10B account testing. Certainly that is not representative of mnb traffic.
An alternative impl, for brainstorming purposes, could be:
We do have a metric for how many disk 'misses' we encountered.
I think it's true, but I would need to read and think more to be more confident about it. And even if it's true now, we need to be able to guard against some future change that makes the assumption incorrect. On the other hand, never needing to touch the disk for the banking_stage's commit_transactions may be worth quite a bit. The call chain looks like this
I've also been wondering: does the account_index shrinking operation happen fully in parallel, based only on time? What happens when there's a machine hiccup and more than 2s pass between the
Some previous ideas in this area: I think experimenting with io_uring and issuing multiple outstanding async IO requests for each account from the same thread might be the way to go, but of course who knows until someone tries it. That would give you multiple outstanding IO requests but hopefully less synchronization overhead, and it would also allow for out-of-order execution.
yes, this could happen
It sounds to me like this would be the smoothest solution for avoiding disk access during update_index/upsert. Is there a way I can help stabilize that? Review?
Note that @behzadnouri fixed the thread pool usage in the function you were looking at in this PR.
@ckamm we also have random evictions from the in-mem idx! I forgot about that. So, there is no guarantee that something is in the in-mem idx even if we tried to load it!
I've rerun the (admittedly highly synthetic) banking-bench runs:
So there's clear improvement, particularly on inter-batch contention, but removing the parallelism in update_index() still has a huge effect on the benchmark. Let's shelve this PR until there's a way to never need to access the disk in update_index.
#25017
Great find! Here are new measurements:
I don't understand
The full lines are
For
I collected the log from the sequential and parallel update_index. The parallel update did show higher time.
Also, the parallel version shows more cache eviction than the sequential run. @jeffwashington, is that expected?
Yes. io_uring sounds interesting, worth giving it a try. I will take a look and play with it.
Collecting rent calls this on one account at a time. So, par_iter is unnecessary overhead.
@jeffwashington Since this has come up in the 1.10 upgrade context: how about I make a PR for a fast pass in
It also looks to me like
Great minds think alike. Please do this.
Haha :) So you do the
Note that the project to eliminate rewrites is in 1.11/master. It is not enabled by default yet. This would eliminate almost all of the stores that occur during rent collection. However, we would then add a hash call. @xiangzhu70 is looking into hashing lazily, as we already do on stores to the write cache.
As we consider the rent collection case, we should find that, since we are only dealing with accounts we can load, by definition they exist in the disk index or the in-mem idx. And since we loaded them and held the range in the in-mem cache while collecting this partition, ALL update_index calls due to rent collection/rewrites will always use the in-mem index, with no need to go to disk. This is a special case to consider. It means we won't ever page fault on the disk index files.
It still seems promising to me to make update_index never hit the disk and then not need a thread pool. But that will be a different PR.
Problem

I was profiling banking-bench runs with some-batch-only contention and noticed a lot of time was spent in update_index(). That makes sense, because len is 3 or something and splitting that up into multiple threads is primarily overhead. But when I played with having a min_chunk_size I realized that dropping the parallelism also sped up the none contention case (where len > 256). Hence this patch just removes it completely.

banking-bench results:

         none     some-batch-only   full
before   96451    7967              3098
after    133738   15197             10504

(--packets-per-batch 128 --batches-per-iteration 6 --iterations=1200 for none, 200 otherwise)

Summary of Changes

Drop parallelism from update_index() in accountsdb.
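In spirit, the change boils down to the following (a simplified sketch, not the actual diff; the real function does much more per item):

```rust
// Before (roughly): chunk the items and update each chunk on the
// accounts-db rayon thread pool, e.g.
//     self.thread_pool.install(|| {
//         items.par_chunks(chunk_size).for_each(update_chunk)
//     });
//
// After: a plain loop on the calling thread.
fn update_index_sequential<T>(items: &[T], mut update_one: impl FnMut(&T)) {
    for item in items {
        update_one(item);
    }
}
```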