Shared counters optimization #101
Conversation
`HashMap` now has a `counter_cells` pointer that potentially holds counter cells to relieve `count` of contention. Under the guard of `cells_busy`, it resizes when one of the counter cells experiences contention. All of the counter cells are dropped when the hash map gets deallocated.
Hi @jonhoo, would you mind giving this a review? Thanks!
This is on my radar, I just haven't had the spare time to look at it yet! Thanks for giving it a shot. Once I have some time I'll give it a review :) One thing off the top of my head is that I'd like to see this implemented as a separate type in its own module rather than all inlined into the implementation. Will make it much easier to review, and possibly reusable in and of itself!
I had some questions, I'm sure you'll get to them naturally when you're reviewing but wanted to list them out:
Take all the time you need, I'm eager to learn something and potentially get this merged so we'll take a look at it whenever you're ready!
I didn't get to as many things as I wanted today, so this remains on my backlog, but figured I'd at least answer your questions!
Oof,
We definitely do not want spurious tests, but it sounds like those tests should be racy already if they're seeing a race when using this new concurrent counter... Can you point to a specific test that fails?
I wonder — could you just implement the `AddAssign` trait for it?
Oh, interesting. Yeah, definitely try for a separate PR that adds those!
❤️
Every test that asserts the map's length might be racy. Before the counter optimization the length of the map was always correctly synced. Now, however, when high contention occurs the length might be a bit behind the actual number of items in the map. If a test asserts the length while counting is still in flight, the length might be 1 behind the actual length (if that makes sense). I did try to reproduce tests that might be spurious but couldn't find any specific ones as of yet.
I tried to implement the `AddAssign` trait.
Will do. I'll also include #98 since it's a similar problem.
I was finally able to give this a look, sorry it took so so so long!
@ibraheemdev Any chance you have some spare cycles to try benchmarking this to see if it meaningfully improves multi-core performance? Or better yet, help @jimvdl do the benchmarks themselves!
Oh, and you'll probably want to merge your changes with master, since some things have changed there in the intervening time. As for `AddAssign` …
I may have some time next weekend. It also occurs to me that this data structure may be generally useful as a crate.
I think your hunch might be correct, it would mean you could increment the counter like this:

```rust
let mut c = &self.counter;
c += 5;
```

Which does feel a bit unergonomic, but I'm still struggling with some of Rust's basics so maybe there is a nicer way.
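For context, a minimal sketch of the operator overload that snippet implies; `Counter` is a hypothetical stand-in for the PR's counter type, and the key detail is that `AddAssign` has to be implemented for a shared reference, so the receiver becomes `&mut &Counter`:

```rust
use std::ops::AddAssign;
use std::sync::atomic::{AtomicIsize, Ordering};

// Hypothetical counter type standing in for the one in this PR.
struct Counter {
    count: AtomicIsize,
}

// Implementing AddAssign for `&Counter` is what lets `c += 5` compile when
// `c` is a mutable binding to a shared reference to the counter.
impl<'a> AddAssign<isize> for &'a Counter {
    fn add_assign(&mut self, rhs: isize) {
        self.count.fetch_add(rhs, Ordering::Relaxed);
    }
}

fn main() {
    let counter = Counter { count: AtomicIsize::new(0) };
    let mut c = &counter;
    c += 5;
    assert_eq!(counter.count.load(Ordering::Relaxed), 5);
}
```

So the sugar is possible, but the extra `let mut c = &self.counter` dance is exactly the unergonomic part.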
Yeah, it's not super pretty. I think it's okay just leaving that out then — it's not like calling `.add` is that onerous!
I'm also not sure how to accurately test the map's length, because it might not always be in sync with the actual number of items. It will sync up eventually once there are no more concurrent updates, but that still causes some tests to occasionally fail. See the Java docs. I glanced at their tests to see if they have a special way of asserting the length but couldn't find anything; maybe I missed something? What do you guys think?
- Made the resize hint check more concise.
- Moved the TODO about the CounterCell implementation to the counter module.
- Reverted the counter declaration back to the original line.
Hmm, how about something like

```rust
loop {
    let n = map.len();
    assert!(n <= EXPECTED, "n <= {}", EXPECTED);
    if n == EXPECTED { break; }
    std::thread::yield_now();
}
```

Maybe even with a max limit for the number of iterations? Probably worth sticking that in some kind of helper function for the tests too.
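For illustration, one possible shape for such a helper (hypothetical name and iteration cap, not code from this PR):

```rust
// Spin until the reported length reaches the expected value, failing
// immediately if it ever overshoots and giving up after `max_iters` tries.
fn assert_len_eventually(len: impl Fn() -> usize, expected: usize, max_iters: usize) {
    for _ in 0..max_iters {
        let n = len();
        assert!(n <= expected, "length overshot: {} > {}", n, expected);
        if n == expected {
            return;
        }
        std::thread::yield_now();
    }
    panic!("length never reached {} within {} iterations", expected, max_iters);
}

// e.g. in a test: assert_len_eventually(|| map.len(), EXPECTED, 10_000);
```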
Sorry that I keep bothering you with this but I'm seriously confused. I've been re-running the tests over and over to try and catch a test that fails due to the length being off by 1 (like I saw way before) but can't find one that fails now... Before it would fail pretty often but ever since we changed the cells the problem seems to have gone away?? Since I can't reproduce the problem anymore at all we probably won't need that helper function. Also, when I was testing I tried a sample size of 1 million parallel inserts and got this distribution:
Huh, how weird. Maybe the old implementation had a bug somehow? It's a good sign it's not happening frequently now though! That distribution looks pretty good. I think benchmarks will definitely be helpful here though, and hopefully we'll see an impact at higher thread counts.
I'll give adding benchmarks for this a try soon, when I have a bit more time. I did open a PR for the benchmarks, as they currently do not compile; once that's resolved I'll get back to this PR.
I've never really implemented benchmarks before, but looking at the existing benchmarks I figured a good place to start would be comparing just a single `AtomicIsize` against the new counter cells. I'm just going to include some results (I don't know if this is useful, lmk):
I didn't really know the best way to expose the private counter type to the benchmarks. I hope this was at least a step in the right direction, let me know what I can improve.
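For readers following along, a counter-only benchmark in that spirit might look roughly like the criterion sketch below; the names, thread counts, and iteration counts here are made up, and only the single-`AtomicIsize` baseline is shown, so it is not the PR's actual benchmark code.

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Arc;

// Crude multi-threaded increment benchmark: every iteration spawns `threads`
// workers that each bump a shared AtomicIsize 10_000 times.
fn bench_single_atomic(c: &mut Criterion) {
    for threads in [1usize, 2, 4, 8] {
        c.bench_function(&format!("single_atomic/{threads}"), |b| {
            b.iter(|| {
                let counter = Arc::new(AtomicIsize::new(0));
                let handles: Vec<_> = (0..threads)
                    .map(|_| {
                        let counter = Arc::clone(&counter);
                        std::thread::spawn(move || {
                            for _ in 0..10_000 {
                                counter.fetch_add(1, Ordering::Relaxed);
                            }
                        })
                    })
                    .collect();
                for handle in handles {
                    handle.join().unwrap();
                }
            })
        });
    }
}

criterion_group!(benches, bench_single_atomic);
criterion_main!(benches);
```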
Hmm, that's interesting. If I'm reading your data correctly, it seems like the counter cells are actually slower than the single `AtomicIsize`?
I ran this on an Intel Core i7-7700K, which has 4 cores and 8 threads. These benchmarks don't even take the actual insert operation into account, which makes me suspect that the impact of having just one …
Yeah, it's tricky, because where I think this would matter is when you have, say, 16, or 32 real cores, which is harder to test for. I wonder what results you'd get on your box if you ran with 4 threads though — hyperthreads tend to only really add noise for these kinds of measurements.
I will run the benchmarks again using only 4 cores to see if that makes a difference; I'll get back to you on that one. I might be able to test this on a CPU with 10 cores and see if that changes anything. Although let's say for a moment that having 16 or 32 cores makes a substantial difference, would the …
I've tried to disable hyperthreading, but apparently my CPU doesn't have an option to disable it (as far as I can find). I did try another route: in the task manager you can set the affinity of a running process, which I then set to 4 cores instead of the 8 it had before. I'm not sure if this gave me accurate results, but here they are:
If you compare these to the 4-thread versions of the previous benchmark with hyperthreading enabled, you can see a slight difference:
Hmm, that still looks like it's slower and less scalable, which is surprising. I'd be super curious to see on a higher-core-count machine, and ideally as a plot with error bars!
I tried to get my hands on that 10 core machine but sadly couldn't use it. Anything else I can try/do?
@ibraheemdev Any chance you have a box with more cores available? I might be able to spin something up, but my schedule is pretty packed for a while 😞
My machine has 8 cores, I can run the benchmarks but not sure we'll see any benefit at a low-to-medium core count.
Hey, I was interested by this PR and had a play around. I found that using the nightly-only `#[thread_local]` attribute performed noticeably better compared to the `AtomicIsize`.
This was using the benchmarking setup from this PR and I have the code here if you want to have a look: https://github.com/JackThomson2/fast-counter/blob/master/src/lib.rs
Those results are promising indeed! I like the solution you went with, definitely worth looking into more. I'm wondering why the thread-local approach is more performant compared to my solution. If you eventually publish this as a crate, it could be used in Flurry as a dependency instead.
I'll have a look at getting this published and at comparing the performance to the non-nightly `thread_local!` macro. I originally had a look at optimising your approach; the few findings which helped speed it up were: …
I think the reason it's not as fast is that you just moved the contention to the next cell. I experimented with using wyrand as a pseudo-random generator to pick a random cell, and this also helped slightly.
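For illustration, a rough sketch of that idea (not the actual fast-counter code): derive a per-thread cell index instead of probing linearly to the neighbouring cell, so busy threads tend to land on different cells.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::atomic::{AtomicIsize, Ordering};

struct ShardedCounter {
    cells: Vec<AtomicIsize>,
}

impl ShardedCounter {
    fn new(shards: usize) -> Self {
        Self {
            cells: (0..shards).map(|_| AtomicIsize::new(0)).collect(),
        }
    }

    fn add(&self, n: isize) {
        // Derive a per-thread index from the ThreadId; a real implementation
        // would cache this in thread-local storage (the nightly #[thread_local]
        // attribute or the stable thread_local! macro) rather than hashing on
        // every call.
        let mut hasher = DefaultHasher::new();
        std::thread::current().id().hash(&mut hasher);
        let idx = (hasher.finish() as usize) % self.cells.len();
        self.cells[idx].fetch_add(n, Ordering::Relaxed);
    }
}
```

Because the index depends on the thread rather than on which cell happened to be contended last, two busy threads stop leap-frogging each other onto the same neighbouring cell.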
Here are the results for 2-16 cores with the different approaches:
I will look at adding inline attributes to the methods; I don't think these are inlining at the moment. When I manually copied them into the test file, the 2 and 4 core results were much closer.
All the results look way better than a single atomic counter. It has some additional overhead for 2-core CPUs, but honestly I don't think that would be an issue since everyone has at least 4 cores anyway. Do you want to open a PR for this instead? If you do, lmk and I'll close this one.
Even better news: my suspicion around the inlining was correct, when I added `#[inline]` …
I'll see if I can get time to make the PR to change to this!
Closing in favour of #109.
Great work folks, thanks for picking this up and driving it on your own!
Hello, I've made an attempt at implementing the shared counters optimization.

Overview: once `count` experiences contention, the compare-exchange will fail, triggering the allocation of `counter_cells`. It allocates a `Vec` with 2 cells, initializing one of them to `n`. Any future contention on `count` will randomly select a counter cell and increment its count. If this also fails (and the number of counter cells doesn't exceed the number of CPUs), a new counter cell array of size `n << 1` is allocated; all of the old values are copied over and the new counters are initialized to 0. If `counter_cells` is busy for whatever reason, it falls back on simply attempting to increment `count` again. When the hash map is dropped, the `counter_cells` array gets dropped as well.

Fixes: #11
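Purely as a reading aid, here is a condensed, hypothetical sketch of that increment path. The `Mutex` stands in for the `cells_busy` guard and the raw `counter_cells` pointer, and `random_index`/`num_cpus` are placeholder parameters, so treat this as pseudocode with types rather than the real implementation.

```rust
use std::sync::atomic::{AtomicIsize, Ordering};
use std::sync::Mutex;

struct Counter {
    count: AtomicIsize,
    // Stand-in for `counter_cells`; empty until `count` first sees contention.
    cells: Mutex<Vec<AtomicIsize>>,
}

impl Counter {
    fn add(&self, n: isize, num_cpus: usize, random_index: impl Fn(usize) -> usize) {
        // Fast path: an uncontended compare-exchange on the base counter.
        let cur = self.count.load(Ordering::Relaxed);
        if self
            .count
            .compare_exchange(cur, cur + n, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }

        // Contended path. If the cells are busy, fall back to `count` again.
        let mut cells = match self.cells.try_lock() {
            Ok(cells) => cells,
            Err(_) => {
                self.count.fetch_add(n, Ordering::AcqRel);
                return;
            }
        };

        if cells.is_empty() {
            // First contention: allocate two cells and park `n` in one of them.
            cells.push(AtomicIsize::new(n));
            cells.push(AtomicIsize::new(0));
            return;
        }

        // Otherwise add to a randomly chosen cell. In the real scheme the array
        // only grows when that cell's own update is contended; here the growth
        // (doubling, old values kept, new slots zeroed) is shown unconditionally
        // for brevity, capped at the number of CPUs.
        let idx = random_index(cells.len());
        cells[idx].fetch_add(n, Ordering::Relaxed);
        if cells.len() < num_cpus {
            let doubled = cells.len() << 1;
            cells.resize_with(doubled, || AtomicIsize::new(0));
        }
    }

    // The total is `count` plus the sum of all cells, which is why `len()`
    // can briefly lag behind concurrent inserts.
    fn sum(&self) -> isize {
        let cells = self.cells.lock().unwrap();
        self.count.load(Ordering::Relaxed)
            + cells.iter().map(|c| c.load(Ordering::Relaxed)).sum::<isize>()
    }
}
```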
This is my first open-source contribution so any advice/feedback is greatly appreciated.