
Shared counters optimization #101

Closed
wants to merge 24 commits

Conversation

jimvdl
Contributor

@jimvdl jimvdl commented Nov 27, 2021

Hello, I've made an attempt at implementing the shared counters optimization.

Overview:
Once `count` experiences contention, the failed compare-exchange triggers the allocation of `counter_cells`. It allocates a `Vec` with 2 cells, initializing one of them to `n`. Any future contention on `count` randomly selects a counter cell and increments its count. If this also fails (and the number of counter cells doesn't exceed the number of CPUs), a new counter-cell array is allocated with `n << 1` cells; all of the old values are copied over and the new counters are initialized to 0. If `counter_cells` is busy for whatever reason, it falls back to simply attempting to increment `count` again. When the hash map is dropped, the `counter_cells` array is dropped along with it.
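For illustration, here is a hedged, minimal sketch of the scheme described above. The names (`StripedCounter`, `thread_index`) are made up for the sketch, and the PR's resizing and `cells_busy` logic are omitted:

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

// Hypothetical simplified version of the scheme: a base counter plus
// a fixed set of cells that absorb contention. The PR's resizing and
// `cells_busy` guard are omitted.
pub struct StripedCounter {
    base: AtomicIsize,
    cells: Vec<AtomicIsize>,
}

impl StripedCounter {
    pub fn new(n_cells: usize) -> Self {
        StripedCounter {
            base: AtomicIsize::new(0),
            cells: (0..n_cells.max(1)).map(|_| AtomicIsize::new(0)).collect(),
        }
    }

    pub fn add(&self, n: isize) {
        // Fast path: an uncontended compare-exchange on `base`.
        let cur = self.base.load(Ordering::Relaxed);
        if self
            .base
            .compare_exchange(cur, cur + n, Ordering::AcqRel, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: fall back to a cell. A real implementation would
        // use a randomized per-thread probe; hashing the thread id
        // here is purely illustrative.
        let idx = thread_index() % self.cells.len();
        self.cells[idx].fetch_add(n, Ordering::Relaxed);
    }

    // The total is the base plus every cell. Note this is not an
    // atomic snapshot, so it can lag behind concurrent updates.
    pub fn sum(&self) -> isize {
        let base = self.base.load(Ordering::Relaxed);
        base + self.cells.iter().map(|c| c.load(Ordering::Relaxed)).sum::<isize>()
    }
}

fn thread_index() -> usize {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut h = DefaultHasher::new();
    std::thread::current().id().hash(&mut h);
    h.finish() as usize
}
```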

Fixes: #11

This is my first open-source contribution so any advice/feedback is greatly appreciated.



`HashMap` now has a `counter_cells` pointer that potentially holds
counter cells to relieve `count` of contention. Under the guard of
`cells_busy`, it resizes when one of the counter cells experiences
contention. All of the counter cells are dropped when the hash map
gets deallocated.
@jimvdl
Contributor Author

jimvdl commented Nov 29, 2021

Hi @jonhoo, would you mind giving this a review? Thanks!

@jonhoo
Owner

jonhoo commented Dec 9, 2021

This is on my radar, I just haven't had the spare time to look at it yet! Thanks for giving it a shot. Once I have some time I'll give it a review :)

One thing off the top of my head is that I'd like to see this implemented as a separate type in its own module rather than all inlined into the implementation. Will make it much easier to review, and possibly reusable in and of itself!

@jimvdl
Contributor Author

jimvdl commented Dec 10, 2021

I had some questions, I'm sure you'll get to them naturally when you're reviewing but wanted to list them out:

  • The struct name is currently LongAdder, I'm not sure if that is a fitting name since we are not adding longs like in Java. (maybe IsizeAdder, but I'll leave that up to you).
  • A bunch of the tests already verify that the length matches up, but since the counter is not an atomic snapshot, some tests might occasionally fail because the sum is still being calculated while the length is asserted.
  • Should I directly mirror LongAdder and implement functions like increment(), decrement() etc?
  • The flurry style linting beta clippy test warns about unsoundness on K & V on BinEntry because they do not implement Send and Sync, should I PR a fix for these before this PR gets reviewed?

Take all the time you need, I'm eager to learn something and potentially get this merged so we'll take a look at it whenever you're ready!

@jonhoo
Owner

jonhoo commented Dec 16, 2021

I had some questions, I'm sure you'll get to them naturally when you're reviewing but wanted to list them out:

I didn't get to as many things as I wanted today, so this remains on my backlog, but figured I'd at least answer your questions!

* The struct name is currently `LongAdder`, I'm not sure if that is a fitting name since we are not adding longs like in Java. (maybe IsizeAdder, but I'll leave that up to you).

Oof, Isize just looks really weird. How about ConcurrentCounter? That's exactly what it is. I suppose we could throw "signed" in there, but I don't think that's super necessary.

* A bunch of the tests already verify that the length matches up but since the counter is not an atomic snapshot some tests might occasionally fail due to the sum being calculated while asserting the length.

We definitely do not want spurious tests, but it sounds like those tests should be racy already if they're seeing a race when using this new concurrent counter... Can you point to a specific test that fails?

* Should I directly mirror LongAdder and implement functions like increment(), decrement() etc?

I wonder — could you just implement the AddAssign trait instead? I think your add only takes a Guard at the moment because it uses Atomic, but maybe it'd be an idea to just allocate a Vec with num_cpu counters from the very beginning and keep a separate AtomicUsize that just tracks the highest index in the Vec we should be using? That means no resizing, which means no Atomic, which means no Guard. Want to give that a shot?
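For illustration, a rough sketch of that fixed-allocation variant: all cells are allocated up front and a separate `AtomicUsize` only ever grows, so there is no resizing and hence no `Atomic` or `Guard`. All names here are hypothetical, and `max_cells` would presumably come from `num_cpus`:

```rust
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};

// Hypothetical sketch: a fixed Vec of cells plus a high-water mark.
pub struct ConcurrentCounter {
    cells: Vec<AtomicIsize>,
    in_use: AtomicUsize, // number of cells currently eligible for use
}

impl ConcurrentCounter {
    pub fn new(max_cells: usize) -> Self {
        ConcurrentCounter {
            cells: (0..max_cells.max(1)).map(|_| AtomicIsize::new(0)).collect(),
            in_use: AtomicUsize::new(1),
        }
    }

    pub fn add(&self, n: isize) {
        let limit = self.in_use.load(Ordering::Relaxed);
        let cell = &self.cells[probe() % limit];
        let cur = cell.load(Ordering::Relaxed);
        if cell
            .compare_exchange(cur, cur + n, Ordering::AcqRel, Ordering::Relaxed)
            .is_err()
        {
            // Contention: widen the active range (it never shrinks and
            // never reallocates), then fall back to an unconditional add.
            let _ = self.in_use.fetch_update(Ordering::Relaxed, Ordering::Relaxed, |u| {
                if u < self.cells.len() { Some(u + 1) } else { None }
            });
            cell.fetch_add(n, Ordering::Relaxed);
        }
    }

    // Not an atomic snapshot; may lag behind concurrent adds.
    pub fn sum(&self) -> isize {
        self.cells.iter().map(|c| c.load(Ordering::Relaxed)).sum()
    }
}

// Illustrative probe derived from the thread id; a real version
// might use a randomized or rotating per-thread value.
fn probe() -> usize {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};
    let mut h = DefaultHasher::new();
    std::thread::current().id().hash(&mut h);
    h.finish() as usize
}
```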

* The flurry style linting beta clippy test warns about unsoundness on `K` & `V` on `BinEntry` because they do not implement `Send` and `Sync`, should I PR a fix for these before this PR gets reviewed?

Oh, interesting. Yeah, definitely try for a separate PR that adds those!

Take all the time you need, I'm eager to learn something and potentially get this merged so we'll take a look at it whenever you're ready!

❤️

@jimvdl
Contributor Author

jimvdl commented Dec 17, 2021

We _definitely_ do not want spurious tests, but it sounds like those tests should be racy already if they're seeing a race when using this new concurrent counter... Can you point to a specific test that fails?

Every test that asserts the map's length might be racy. Before the counter optimization the length of the map would always be correctly synced. Now, however, when high contention occurs the length might be a bit behind the actual number of items in the map. If a test asserts the length while the counter is still being updated, the length might be 1 behind the actual length (if that makes sense). I did try to reproduce tests that might be spurious but couldn't find any specific ones as of yet.

I wonder — could you just implement the `AddAssign` trait instead? I think your `add` only takes a `Guard` at the moment because it uses `Atomic`, but maybe it'd be an idea to just allocate a `Vec` with `num_cpu` counters from the very beginning and keep a _separate_ `AtomicUsize` that just tracks the highest index in the `Vec` we should be using? That means no resizing, which means no `Atomic`, which means no `Guard`. Want to give that a shot?

I tried to implement the AddAssign trait but got slightly stuck due to the trait requiring &mut self. add_count only has &self and something like RefCell probably wouldn't work here due to it not being Sync. Neither would Arc<Mutex<ConcurrentCounter>> because that would defeat the purpose of making the counter concurrent. If we somehow can get around the exclusive borrow requirement AddAssign should work. Any tips/thoughts on how I might go about this?

Oh, interesting. Yeah, definitely try for a separate PR that adds those!

Will do. I'll also include #98 since it's a similar problem.

Owner

@jonhoo jonhoo left a comment


I was finally able to give this a look, sorry it took so so so long!

@ibraheemdev Any chance you have some spare cycles to try benchmarking this to see if it meaningfully improves multi-core performance? Or better yet, help @jimvdl do the benchmarks themselves!

@jonhoo
Owner

jonhoo commented Mar 7, 2022

Oh, and you'll probably want to merge your changes with master, since some things have changed there in the intervening time.

As for AddAssign, I think you could probably implement it for &Counter, but it's not terribly important, and may honestly just end up proving unexpected and unergonomic in the end 😅
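A minimal sketch of that idea, using a hypothetical stand-in `Counter` type rather than the PR's actual one: implementing the trait for the *reference* type sidesteps the `&mut self` requirement, since `+=` then only needs a mutable binding to the reference while the counter itself stays behind a shared borrow.

```rust
use std::ops::AddAssign;
use std::sync::atomic::{AtomicIsize, Ordering};

// Hypothetical stand-in for the PR's counter type.
pub struct Counter(AtomicIsize);

impl Counter {
    pub fn new() -> Self {
        Counter(AtomicIsize::new(0))
    }
    pub fn get(&self) -> isize {
        self.0.load(Ordering::Relaxed)
    }
}

// The impl target is `&Counter`, not `Counter`, so `add_assign`
// receives `&mut &Counter` and only a shared borrow of the counter
// is ever required.
impl AddAssign<isize> for &Counter {
    fn add_assign(&mut self, n: isize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
}
```

Usage then looks like `let mut c = &counter; c += 5;`, which is exactly the slightly odd shape that makes the ergonomics questionable.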

@ibraheemdev
Collaborator

Any chance you have some spare cycles to try benchmarking this to see if it meaningfully improves multi-core performance?

I may have some time next weekend. It also occurs to me that this data structure may be generally useful as a crate.

@jimvdl
Contributor Author

jimvdl commented Mar 8, 2022

As for AddAssign, I think you could probably implement it for &Counter and may honestly just end up proving unexpected and unergonomic

I think your hunch might be correct; it would mean you could increment the counter like this:

let mut c = &self.counter;
c += 5;

Which does feel a bit unergonomic, but I'm still struggling with some of Rust's basics so maybe there is a nicer way.

@codecov

codecov bot commented Mar 8, 2022

Codecov Report

Merging #101 (1bd51f5) into master (85ac469) will decrease coverage by 0.26%.
The diff coverage is 78.57%.

Impacted Files Coverage Δ
src/counter.rs 75.00% <75.00%> (ø)
src/map.rs 81.00% <85.71%> (+0.41%) ⬆️
src/node.rs 77.22% <100.00%> (-1.31%) ⬇️

Owner

@jonhoo jonhoo left a comment


Yeah, it's not super pretty. I think it's okay just leaving that out then — it's not like calling .add is that onerous!

@jimvdl
Contributor Author

jimvdl commented Mar 10, 2022

I'm also not sure how to accurately test the map's length, because it might not always be in sync with the actual number of items. It will sync up eventually once there are no more concurrent updates, but asserting the length in the meantime still causes some tests to occasionally fail. See the Java docs.

I glanced at their tests to see if they have a special way of asserting the length but couldn't find anything; maybe I missed something? What do you guys think?

- Made the resize hint check more concise.
- Moved the TODO about the CounterCell implementation to the counter module.
- Reverted the counter declaration back to the original line.
@jonhoo
Owner

jonhoo commented Mar 11, 2022

I'm also not sure how to accurately test the map's length

Hmm, how about something like

loop {
  let n = map.len();
  assert!(n <= EXPECTED, "n <= {}", EXPECTED);
  if n == EXPECTED { break; }
  std::thread::yield_now();
}

Maybe even with a max limit for the number of iterations? Probably worth sticking that in some kind of helper function for the tests too.
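That suggestion, with a max iteration limit, might look roughly like this as a test helper (names are illustrative, and the map's length is assumed to be reachable through a closure):

```rust
// Hypothetical test helper: spin until the reported length reaches
// the expected value, with an iteration cap so a genuinely wrong
// count still fails instead of hanging forever.
fn assert_len_eventually(len: impl Fn() -> usize, expected: usize) {
    const MAX_SPINS: usize = 10_000;
    for _ in 0..MAX_SPINS {
        let n = len();
        // The length may lag, but it should never overshoot.
        assert!(n <= expected, "n = {} > expected {}", n, expected);
        if n == expected {
            return;
        }
        std::thread::yield_now();
    }
    panic!("length never reached {} within {} spins", expected, MAX_SPINS);
}
```

A test would then call `assert_len_eventually(|| map.len(), EXPECTED)` after the concurrent inserts finish.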

@jimvdl
Contributor Author

jimvdl commented Mar 13, 2022

Sorry that I keep bothering you with this but I'm seriously confused. I've been re-running the tests over and over to try and catch a test that fails due to the length being off by 1 (like I saw way before) but can't find one that fails now... Before it would fail pretty often but ever since we changed the cells the problem seems to have gone away??

Since I can't reproduce the problem anymore at all we probably won't need that helper function.

Also, when I was testing I tried a sample size of 1 million parallel inserts and got this distribution: ConcurrentCounter { base: 990579, cells: [2384, 2308, 2360, 2369] }. On the one hand, summing base plus each cell value (adding += cv on every iteration of the loop) works pretty well; on the other hand, it seems you need a ridiculous number of parallel inserts to make use of the counter cells. We can still add some benchmarks to test this a little better, although it seems consistent.

@jonhoo
Owner

jonhoo commented Mar 19, 2022

Huh, how weird. Maybe the old implementation had a bug somehow? It's a good sign it's not happening frequently now though!

That distribution looks pretty good. I think benchmarks will definitely be helpful here though, and hopefully we'll see an impact at higher thread counts.

@jimvdl
Contributor Author

jimvdl commented Mar 25, 2022

I'll give adding benchmarks for this a try soon, when I have a little more time. I did open a PR for the benchmarks, as they currently do not compile; once that's resolved I'll get back to this PR.

@jimvdl
Contributor Author

jimvdl commented Apr 5, 2022

I've never really implemented benchmarks before but looking at the existing benchmarks I figured a good place to start would be comparing just a single AtomicIsize and the ConcurrentCounter.

I'm just going to include some results (I don't know if this is useful, lmk):

AtomicIsize at 1 thread:

  Lower bound Estimate Upper bound
Slope 136.70 us 136.89 us 137.11 us
Throughput 238.99 Melem/s 239.38 Melem/s 239.71 Melem/s
R² 0.9918508 0.9921105 0.9917720
Mean 137.41 us 137.87 us 138.41 us
Std. Dev. 1.6814 us 2.5705 us 3.4837 us
Median 136.69 us 136.91 us 137.26 us
MAD 573.91 ns 863.95 ns 1.3428 us

ConcurrentCounter at 1 thread:

  Lower bound Estimate Upper bound
Slope 277.78 us 278.57 us 279.64 us
Throughput 117.18 Melem/s 117.63 Melem/s 117.96 Melem/s
R² 0.9637630 0.9648086 0.9628616
Mean 277.87 us 278.61 us 279.61 us
Std. Dev. 1.7306 us 4.4850 us 6.9734 us
Median 277.45 us 277.67 us 277.84 us
MAD 740.74 ns 1.1073 us 1.3220 us

AtomicIsize at 8 threads:

  Lower bound Estimate Upper bound
Slope 528.59 us 536.34 us 544.36 us
Throughput 60.196 Melem/s 61.096 Melem/s 61.992 Melem/s
R² 0.6309893 0.6458310 0.6299712
Mean 531.82 us 537.97 us 544.15 us
Std. Dev. 29.949 us 31.549 us 32.805 us
Median 512.77 us 520.99 us 565.18 us
MAD 12.804 us 26.802 us 46.733 us

ConcurrentCounter at 8 threads:

  Lower bound Estimate Upper bound
Slope 1.0086 ms 1.0208 ms 1.0313 ms
Throughput 31.775 Melem/s 32.101 Melem/s 32.490 Melem/s
R² 0.7435502 0.7552155 0.7466318
Mean 1.0148 ms 1.0246 ms 1.0327 ms
Std. Dev. 25.988 us 45.767 us 67.823 us
Median 1.0210 ms 1.0288 ms 1.0363 ms
MAD 23.335 us 31.836 us 39.822 us

I didn't really know the best way to expose the private counter module, so for now I copied and pasted it in; that is already a point of improvement.

I hope this was at least a step in the right direction, let me know what I can improve.

@jonhoo
Owner

jonhoo commented Apr 10, 2022

Hmm, that's interesting. If I'm reading your data correctly, it seems like ConcurrentCounter is slower under contention, not faster as we'd expect 🤔 How many (real) cores do you have on the computer you ran this on?

@jimvdl
Contributor Author

jimvdl commented Apr 10, 2022

I ran this on an Intel Core i7-7700K, which has 4 cores and 8 threads. These benchmarks don't even take the actual insert operation into account, which makes me suspect that the impact of having just one AtomicIsize would be even less noticeable in practice.

@jonhoo
Owner

jonhoo commented Apr 10, 2022

Yeah, it's tricky, because where I think this would matter is when you have, say, 16, or 32 real cores, which is harder to test for. I wonder what results you'd get on your box if you ran with 4 threads though — hyperthreads tend to only really add noise for these kinds of measurements.

@jimvdl
Contributor Author

jimvdl commented Apr 10, 2022

I will run the benchmarks again using only 4 cores to see if that makes a difference; I'll get back to you on that one.

I might be able to test this on a CPU with 10 cores and see if that makes a difference. Although, let's say for a moment that having 16 or 32 cores makes a substantial difference: would the ConcurrentCounter still be useful knowing that with fewer than 16 cores you would get a performance decrease?

@jimvdl
Contributor Author

jimvdl commented Apr 14, 2022

I've tried to disable hyperthreading, but apparently my CPU doesn't have an option to disable it (as far as I can find). I did try another route: in Task Manager you can set the affinity of a running process, which I set to 4 cores instead of the 8 it had before. I'm not sure if this gave me accurate results, but here they are:

AtomicIsize at 1 thread:

  Lower bound Estimate Upper bound
Slope 129.12 us 129.23 us 129.36 us
Throughput 253.31 Melem/s 253.56 Melem/s 253.78 Melem/s
R² 0.9981254 0.9982359 0.9981086
Mean 129.14 us 129.26 us 129.40 us
Std. Dev. 379.87 ns 682.82 ns 953.36 ns
Median 128.98 us 129.03 us 129.13 us
MAD 207.45 ns 279.07 ns 362.17 ns

ConcurrentCounter at 1 thread:

  Lower bound Estimate Upper bound
Slope 231.46 us 231.60 us 231.77 us
Throughput 141.38 Melem/s 141.49 Melem/s 141.57 Melem/s
R² 0.9991705 0.9992210 0.9991435
Mean 231.51 us 231.66 us 231.85 us
Std. Dev. 409.77 ns 891.91 ns 1.2775 us
Median 231.42 us 231.47 us 231.53 us
MAD 212.07 ns 293.64 ns 343.09 ns

AtomicIsize at 4 threads:

  Lower bound Estimate Upper bound
Slope 455.77 us 477.68 us 498.68 us
Throughput 65.709 Melem/s 68.598 Melem/s 71.896 Melem/s
R² 0.2564080 0.2696591 0.2574293
Mean 449.16 us 464.91 us 480.76 us
Std. Dev. 76.443 us 80.814 us 84.013 us
Median 407.94 us 413.40 us 528.11 us
MAD 55.867 us 67.139 us 128.89 us

ConcurrentCounter at 4 threads:

  Lower bound Estimate Upper bound
Slope 742.61 us 756.28 us 769.56 us
Throughput 42.580 Melem/s 43.328 Melem/s 44.125 Melem/s
R² 0.5678183 0.5859128 0.5687718
Mean 745.44 us 753.67 us 761.84 us
Std. Dev. 35.724 us 42.078 us 48.248 us
Median 746.76 us 751.40 us 759.00 us
MAD 24.176 us 41.755 us 54.862 us

If you compare these to the 4 thread versions of the previous benchmark with hyperthreading enabled you can see a slight difference:

AtomicIsize with 4 threads: (note: hyperthreading enabled)

  Lower bound Estimate Upper bound
Slope 485.85 us 496.67 us 506.66 us
Throughput 64.674 Melem/s 65.975 Melem/s 67.444 Melem/s
R² 0.5423258 0.5583073 0.5446181
Mean 481.40 us 490.08 us 498.65 us
Std. Dev. 40.887 us 44.328 us 46.723 us
Median 470.72 us 520.55 us 523.34 us
MAD 11.299 us 16.970 us 67.557 us

ConcurrentCounter with 4 threads: (note: hyperthreading enabled)

  Lower bound Estimate Upper bound
Slope 1.0255 ms 1.0359 ms 1.0465 ms
Throughput 31.313 Melem/s 31.632 Melem/s 31.954 Melem/s
R² 0.8311427 0.8397268 0.8309221
Mean 1.0211 ms 1.0318 ms 1.0422 ms
Std. Dev. 44.870 us 54.107 us 63.172 us
Median 1.0214 ms 1.0305 ms 1.0466 ms
MAD 37.691 us 49.023 us 59.846 us

@jonhoo
Owner

jonhoo commented Apr 16, 2022

Hmm, that still looks like it's slower and less scalable, which is surprising. I'd be super curious to see on a higher-core-count machine, and ideally as a plot with error bars!

@jimvdl
Contributor Author

jimvdl commented Apr 24, 2022

I tried to get my hands on that 10 core machine but sadly couldn't use it. Anything else I can try/do?

@jonhoo
Owner

jonhoo commented Apr 24, 2022

@ibraheemdev Any chance you have a box with more cores available? I might be able to spin something up, but my schedule is pretty packed for a while 😞

@ibraheemdev
Collaborator

My machine has 8 cores; I can run the benchmarks, but I'm not sure we'll see any benefit at a low-to-medium core count.

@JackThomson2
Collaborator

Hey, I was interested in this PR and had a play around; I found that using the nightly-only #[thread_local] attribute gave some promising results:

jack_counter/1          time:   [158.47 us 159.09 us 160.13 us]
                        thrpt:  [204.63 Melem/s 205.97 Melem/s 206.77 Melem/s]
jack_counter/4          time:   [519.68 us 526.90 us 533.70 us]
                        thrpt:  [61.398 Melem/s 62.190 Melem/s 63.054 Melem/s]
jack_counter/8          time:   [421.63 us 427.01 us 432.60 us]
                        thrpt:  [75.746 Melem/s 76.738 Melem/s 77.718 Melem/s]

Compared to the AtomicIsize

atomic_counter/1        time:   [156.52 us 156.65 us 156.83 us]
                        thrpt:  [208.95 Melem/s 209.18 Melem/s 209.35 Melem/s]
atomic_counter/4        time:   [575.31 us 576.61 us 577.85 us]
                        thrpt:  [56.707 Melem/s 56.828 Melem/s 56.958 Melem/s]
atomic_counter/8        time:   [592.14 us 592.63 us 593.18 us]
                        thrpt:  [55.241 Melem/s 55.293 Melem/s 55.338 Melem/s]

This was using the benchmarking setup from this PR and I have the code here if you want to have a look: https://github.com/JackThomson2/fast-counter/blob/master/src/lib.rs
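For readers following along, a rough sketch of the thread-local idea on stable Rust (the linked crate uses the nightly `#[thread_local]` attribute; the stable `thread_local!` macro is the analogue used here, and all names are illustrative): each thread caches its own cell index, so repeated increments from one thread keep hitting the same cell without contending with other threads.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicIsize, AtomicUsize, Ordering};

// Global allocator of per-thread cell indices (illustrative).
static NEXT_INDEX: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread lazily claims an index the first time it adds.
    static MY_INDEX: Cell<Option<usize>> = Cell::new(None);
}

pub struct ThreadLocalCounter {
    cells: Vec<AtomicIsize>,
}

impl ThreadLocalCounter {
    pub fn new(n_cells: usize) -> Self {
        ThreadLocalCounter {
            cells: (0..n_cells.max(1)).map(|_| AtomicIsize::new(0)).collect(),
        }
    }

    pub fn add(&self, n: isize) {
        let idx = MY_INDEX.with(|i| match i.get() {
            Some(idx) => idx,
            None => {
                let idx = NEXT_INDEX.fetch_add(1, Ordering::Relaxed);
                i.set(Some(idx));
                idx
            }
        });
        // Threads wrap around if there are more threads than cells.
        self.cells[idx % self.cells.len()].fetch_add(n, Ordering::Relaxed);
    }

    // Not an atomic snapshot; may lag behind concurrent adds.
    pub fn sum(&self) -> isize {
        self.cells.iter().map(|c| c.load(Ordering::Relaxed)).sum()
    }
}
```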

@jimvdl
Contributor Author

jimvdl commented Jun 29, 2022

Those results are promising indeed! I like the solution you went with; definitely worth looking into more. I'm wondering why the thread-local approach is more performant compared to my solution. If you eventually publish this as a crate, it could be used in Flurry as a dependency instead.

@JackThomson2
Collaborator

I'll have a look at getting this published and at comparing the performance to the non-nightly thread_local! macro.

I originally had a look at optimising your approach; the few findings that helped speed it up were:

  • Ensuring the number of CPUs was rounded up with next_power_of_two(), eliminating the divide in let c = &self.cells[index as usize % self.cells.len()];
  • Adding another base = self.base.load(Ordering::SeqCst); at the end of the loop (line 48), which made it less likely to fail; I assume that under higher contention this value will have been updated by the time we come back around.
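The first point can be sketched like this (hypothetical helper; assumes the index comes from a hash or probe value):

```rust
// If the cell count is a power of two, the `%` in
// `index % self.cells.len()` can be replaced with a bitwise AND,
// avoiding an integer division on the hot path.
fn cell_index(probe: usize, n_cells: usize) -> usize {
    debug_assert!(n_cells.is_power_of_two());
    probe & (n_cells - 1) // equivalent to probe % n_cells here
}
```

The cell count would then be sized with something like `num_cpus::get().next_power_of_two()` (the `num_cpus` crate being an assumption here).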

I think the reason it's not as fast is that you just moved the contention to the next cell. I ran an experiment using wyrand as a pseudo-random generator to pick a random cell, and this also helped slightly.

@JackThomson2
Collaborator

JackThomson2 commented Jul 26, 2022

Here are the results for 2-16 cores with the different approaches:

atomic_counter/2        time:   [282.18 us 285.66 us 289.18 us]
                        thrpt:  [113.31 Melem/s 114.71 Melem/s 116.12 Melem/s]

atomic_counter/4        time:   [324.25 us 326.41 us 328.51 us]
                        thrpt:  [99.749 Melem/s 100.39 Melem/s 101.06 Melem/s]

atomic_counter/8        time:   [345.57 us 346.09 us 346.61 us]
                        thrpt:  [94.539 Melem/s 94.681 Melem/s 94.824 Melem/s]

atomic_counter/16       time:   [414.53 us 415.65 us 416.83 us]
                        thrpt:  [78.612 Melem/s 78.836 Melem/s 79.048 Melem/s]


==============================================
==============================================


fast_counter/2          time:   [370.83 us 377.43 us 383.15 us]
                        thrpt:  [85.522 Melem/s 86.818 Melem/s 88.364 Melem/s]

fast_counter/4          time:   [338.49 us 345.35 us 351.70 us]
                        thrpt:  [93.171 Melem/s 94.882 Melem/s 96.807 Melem/s]

fast_counter/8          time:   [249.25 us 254.46 us 259.47 us]
                        thrpt:  [126.29 Melem/s 128.78 Melem/s 131.47 Melem/s]

fast_counter/16         time:   [163.34 us 169.76 us 176.39 us]
                        thrpt:  [185.77 Melem/s 193.03 Melem/s 200.61 Melem/s]


==============================================
==============================================


fast_counter thread local macro/2
                        time:   [388.31 us 392.67 us 396.95 us]
                        thrpt:  [82.549 Melem/s 83.449 Melem/s 84.387 Melem/s]

fast_counter thread local macro/4
                        time:   [364.32 us 369.14 us 373.44 us]
                        thrpt:  [87.746 Melem/s 88.769 Melem/s 89.943 Melem/s]

fast_counter thread local macro/8
                        time:   [254.32 us 259.57 us 265.15 us]
                        thrpt:  [123.58 Melem/s 126.24 Melem/s 128.84 Melem/s]

fast_counter thread local macro/16
                        time:   [172.06 us 175.66 us 179.73 us]
                        thrpt:  [182.32 Melem/s 186.54 Melem/s 190.44 Melem/s]

I will look at adding inline attributes to the methods; I don't think these are being inlined at the moment. When I manually copied them into the test file, the 2- and 4-core results were much closer.

@jimvdl
Contributor Author

jimvdl commented Aug 3, 2022

All the results look way better than a single atomic counter. It has some additional overhead on 2-core CPUs, but honestly I don't think that would be an issue since everyone has at least 4 cores anyway.

Do you want to open a PR for this instead? If you do, lmk and I'll close this one.

@JackThomson2
Collaborator

Even better news: my suspicion around the inlining was correct. When I added #[inline] we're much closer on 2 cores and faster on 4 cores!

atomic_counter/2        time:   [290.27 us 293.65 us 297.26 us]
                        thrpt:  [110.23 Melem/s 111.59 Melem/s 112.89 Melem/s]

atomic_counter/4        time:   [320.62 us 323.01 us 325.27 us]
                        thrpt:  [100.74 Melem/s 101.45 Melem/s 102.20 Melem/s]

atomic_counter/8        time:   [343.33 us 344.14 us 344.98 us]
                        thrpt:  [94.985 Melem/s 95.217 Melem/s 95.442 Melem/s]

atomic_counter/16       time:   [410.49 us 411.71 us 412.99 us]
                        thrpt:  [79.344 Melem/s 79.590 Melem/s 79.827 Melem/s]

------------------------------------------------------------------------------

fast_counter_nightly/2  time:   [314.05 us 315.63 us 317.16 us]
                        thrpt:  [103.32 Melem/s 103.82 Melem/s 104.34 Melem/s]

fast_counter_nightly/4  time:   [292.82 us 294.93 us 296.72 us]
                        thrpt:  [110.44 Melem/s 111.10 Melem/s 111.91 Melem/s]

fast_counter_nightly/8  time:   [209.61 us 215.30 us 221.28 us]
                        thrpt:  [148.08 Melem/s 152.20 Melem/s 156.33 Melem/s]

fast_counter_nightly/16 time:   [157.28 us 160.06 us 163.00 us]
                        thrpt:  [201.04 Melem/s 204.72 Melem/s 208.34 Melem/s]

------------------------------------------------------------------------------


fast_counter_stable/2   time:   [400.89 us 407.77 us 413.33 us]
                        thrpt:  [79.277 Melem/s 80.360 Melem/s 81.739 Melem/s]

fast_counter_stable/4   time:   [369.10 us 372.90 us 376.90 us]
                        thrpt:  [86.942 Melem/s 87.873 Melem/s 88.778 Melem/s]

fast_counter_stable/8   time:   [247.36 us 253.10 us 258.51 us]
                        thrpt:  [126.76 Melem/s 129.47 Melem/s 132.47 Melem/s]

fast_counter_stable/16  time:   [162.17 us 166.01 us 170.13 us]
                        thrpt:  [192.60 Melem/s 197.39 Melem/s 202.06 Melem/s]

I'll see if I can get time to make the PR to change to this!

@jimvdl
Contributor Author

jimvdl commented Aug 15, 2022

Closing in favour of #109.

@jimvdl jimvdl closed this Aug 15, 2022
@jonhoo
Owner

jonhoo commented Aug 20, 2022

Great work folks, thanks for picking this up and driving it on your own!

Successfully merging this pull request may close these issues.

Implement the sharded counters optimization
4 participants