perf(bitmap): optimize bitmap for all zeros and all ones #11090

wangrunji0408 · 2023-07-20T08:51:52Z

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

In our scenarios, it is very common to have a bitmap with all bits set to 1. Examples are the visibility of a compacted chunk and the null bitmap of an array whose values are all non-null. In this case, the dynamically allocated buffer can be avoided to save both memory and time. A similar optimization is used in SmallVec.

We already have such an optimization for visibilities (Vis). But this enum is not a perfect abstraction over bitmap and leaks its internal structure (bitmap or len). This leads to our current mixed use of Vis, Bitmap and Option<Bitmap> for visibilities.

This PR includes this optimization as a built-in feature of Bitmap. Since we have maintained the bit count in Bitmap, we can identify whether the bitmap is all-zeros or all-ones by checking count_ones and num_bits only.

- If `count_ones == 0`, the bitmap must be all-zeros.
- If `count_ones == num_bits`, the bitmap must be all-ones.

In both cases, the bits buffer can be `None`.
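A minimal sketch of the representation described above, assuming a simplified stand-in type rather than the actual `Bitmap` in `src/common/src/buffer/bitmap.rs`. The field names follow the PR text; `WORD`, `from_bools`, and the method bodies are hypothetical and only illustrate how `count_ones` and `num_bits` can replace the buffer in the two uniform cases.

```rust
/// Simplified stand-in for illustration only (not the RisingWave type).
#[derive(Debug, Clone, PartialEq)]
struct Bitmap {
    num_bits: usize,
    count_ones: usize,
    /// `None` when every bit is 0 or every bit is 1.
    bits: Option<Box<[usize]>>,
}

/// Number of bits stored per word (hypothetical helper constant).
const WORD: usize = usize::BITS as usize;

impl Bitmap {
    /// All-zeros bitmap: no heap allocation.
    fn zeros(num_bits: usize) -> Self {
        Bitmap { num_bits, count_ones: 0, bits: None }
    }

    /// All-ones bitmap: no heap allocation.
    fn ones(num_bits: usize) -> Self {
        Bitmap { num_bits, count_ones: num_bits, bits: None }
    }

    /// Builds a bitmap and drops the buffer when the bits are uniform.
    fn from_bools(values: &[bool]) -> Self {
        let num_bits = values.len();
        let count_ones = values.iter().filter(|&&b| b).count();
        if count_ones == 0 || count_ones == num_bits {
            return Bitmap { num_bits, count_ones, bits: None };
        }
        let mut words = vec![0usize; (num_bits + WORD - 1) / WORD];
        for (i, &b) in values.iter().enumerate() {
            if b {
                words[i / WORD] |= 1usize << (i % WORD);
            }
        }
        Bitmap { num_bits, count_ones, bits: Some(words.into_boxed_slice()) }
    }

    /// Reads one bit; the uniform cases are answered from the counters.
    fn get(&self, idx: usize) -> bool {
        assert!(idx < self.num_bits);
        match &self.bits {
            Some(words) => (words[idx / WORD] >> (idx % WORD)) & 1 == 1,
            // No buffer: the bitmap is uniform, so any one bit implies all ones.
            None => self.count_ones != 0,
        }
    }
}
```

In this sketch, `Bitmap::ones(1024)` and `Bitmap::zeros(1024)` only write three machine words, which is in line with the sub-nanosecond `zeros` and `ones` timings reported in the benchmark.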

With this optimization, we are free to use Bitmap everywhere without worrying about the extra cost in all-ones cases. As a next step, we can remove Vis and use only Bitmap for visibilities.

Benchmark results on bitmap of size 1024:

| bench | before | after | change |
| --- | --- | --- | --- |
| zeros | 20.075 ns | 480.24 ps | -98% |
| ones | 20.506 ns | 480.21 ps | -98% |
| from_bytes | 26.216 ns | 28.430 ns | 8% |
| get_from_ones | 320.24 ps | 320.31 ps | 0% |
| get | 318.48 ps | 320.31 ps | 1% |
| and | 29.640 ns | 31.000 ns | 5% |
| or | 29.224 ns | 31.020 ns | 6% |
| not | 23.268 ns | 21.869 ns | -6% |
| eq | 317.18 ps | 318.11 ps | 0% |
| iter | 477.25 ns | 479.22 ns | 0% |
| iter_on_ones | 485.85 ns | 333.99 ns | -31% |
| iter_ones_on_zeros | 6.0423 ns | 317.61 ps | -95% |
| iter_ones_on_ones | 790.78 ns | 334.26 ns | -58% |
| iter_ones_on_sparse | 92.240 ns | 90.978 ns | -1% |
| iter_ones_on_dense | 696.99 ns | 698.80 ns | 0% |
| iter_range | 333.91 ns | 333.42 ns | -0% |
| zeros_iter | 5.0219 ms | 3.3781 ms | -33% |
| ones_iter | 5.0097 ms | 3.3936 ms | -32% |
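The change column appears to be the relative difference between the two timings once both are expressed in the same unit. As a small worked check (the helper name `percent_change` is hypothetical, not part of the benchmark code), the `zeros` row compares 20.075 ns = 20075 ps against 480.24 ps:

```rust
/// Relative change from `before` to `after`, in percent.
/// Both arguments must use the same time unit.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

Rounded to whole percent, this gives -98 for `zeros` (20075 ps to 480.24 ps), -31 for `iter_on_ones` (485.85 ns to 333.99 ns), and 8 for `from_bytes` (26.216 ns to 28.430 ns), matching the table.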

Checklist

I have written necessary rustdoc comments
I have added necessary unit tests and integration tests
I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
All checks passed in ./risedev check (or alias, ./risedev c)
My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)

My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

My PR contains user-facing changes.

Signed-off-by: Runji Wang <[email protected]>

BugenZhao

LGTM!

codecov · 2023-07-20T09:57:36Z

Codecov Report

Merging #11090 (3f406a5) into main (481f778) will decrease coverage by 0.01%.
The diff coverage is 89.76%.

@@            Coverage Diff             @@
##             main   #11090      +/-   ##
==========================================
- Coverage   69.78%   69.78%   -0.01%     
==========================================
  Files        1313     1313              
  Lines      223930   223958      +28     
==========================================
+ Hits       156279   156292      +13     
- Misses      67651    67666      +15

| Flag | Coverage Δ | |
| --- | --- | --- |
| rust | `69.78% <89.76%> (-0.01%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
| --- | --- | --- |
| src/common/src/buffer/bitmap.rs | `95.01% <89.60%> (-1.32%)` | ⬇️ |
| src/common/src/array/data_chunk.rs | `87.88% <100.00%> (ø)` | |

... and 5 files with indirect coverage changes

kwannoel · 2023-07-20T13:52:01Z

I will take a look tmr. I suggest running q17 again btw. I remember it is performance sensitive with bitmap, and the last time some optimizations with bitmap ~~actually caused regressions~~ had minimal benefit on e2e performance, although micro-bench shows improvement.

xxchan · 2023-07-20T14:04:56Z

> the last time some optimizations with bitmap actually caused regressions, although micro-bench shows improvement.

Just out of curiosity, which one is that?

kwannoel · 2023-07-20T14:50:48Z

> > the last time some optimizations with bitmap actually caused regressions, although micro-bench shows improvement.
>
> Just out of curiosity, which one is that?

Sorry, it should be "potentially caused regressions"; I don't have such strong evidence.

I think it was #8848. It was likely just my own observation of a performance regression and could be just variance.

I have rephrased the above comment to be more accurate: #11090 (comment).

xxchan · 2023-07-20T15:33:48Z

src/common/src/buffer/bitmap.rs

Although the benchmark shows the overhead is seemingly negligible, I will ask just in case: does it introduce overhead for other cases, since this introduces a match everywhere? And what's the worst case? 👀

Yes. You can see that the time for AND and OR increased by about 2 ns (5%) in normal cases. I think the overhead should be acceptable compared to the benefit.

Agreed, matching could be costly. We could try https://doc.rust-lang.org/std/intrinsics/fn.unlikely.html to see if it improves performance.

(Just a thought I had, not actually asking for it to be implemented here).

https://users.rust-lang.org/t/compiler-hint-for-unlikely-likely-for-if-branches/62102/4

Well, I think it’s not an unlikely path. Otherwise why do we need to optimize it? 🤣 idk

Agree with @xxchan. I cannot decide which branch is more likely. 🤣
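To make the trade-off in this thread concrete, here is a rough sketch of where the extra `match` sits in an AND operation. It assumes a simplified stand-in `Bitmap` (not the code under review), and the helper `from_words` is hypothetical. It also assumes that padding bits beyond `num_bits` in the last word are zero.

```rust
/// Simplified stand-in for illustration only (not the RisingWave type).
#[derive(Debug, Clone, PartialEq)]
struct Bitmap {
    num_bits: usize,
    count_ones: usize,
    /// `None` when every bit is 0 or every bit is 1.
    bits: Option<Box<[usize]>>,
}

impl Bitmap {
    /// Hypothetical helper: keeps the invariant that uniform bitmaps have no buffer.
    fn from_words(num_bits: usize, count_ones: usize, words: Box<[usize]>) -> Self {
        let uniform = count_ones == 0 || count_ones == num_bits;
        Bitmap { num_bits, count_ones, bits: if uniform { None } else { Some(words) } }
    }

    fn and(&self, rhs: &Bitmap) -> Bitmap {
        assert_eq!(self.num_bits, rhs.num_bits);
        match (&self.bits, &rhs.bits) {
            // General case: the match is the added cost in front of the word loop.
            (Some(a), Some(b)) => {
                let words: Box<[usize]> = a.iter().zip(b.iter()).map(|(x, y)| x & y).collect();
                let count_ones: usize = words.iter().map(|w| w.count_ones() as usize).sum();
                Bitmap::from_words(self.num_bits, count_ones, words)
            }
            // zeros & x == zeros
            (None, _) if self.count_ones == 0 => self.clone(),
            // ones & x == x
            (None, _) => rhs.clone(),
            // x & zeros == zeros
            (_, None) if rhs.count_ones == 0 => rhs.clone(),
            // x & ones == x
            (_, None) => self.clone(),
        }
    }
}
```

In this sketch the general case pays for one pattern match before the loop, while the uniform cases skip the loop entirely. Which arm is hit more often depends on the workload, which is why a likely/unlikely hint is hard to choose here.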

wangrunji0408 · 2023-07-21T04:39:57Z

> I will take a look tmr. I suggest running q17 again btw. I remember it is performance sensitive with bitmap, and the last time some optimizations with bitmap ~~actually caused regressions~~ had minimal benefit on e2e performance, although micro-bench shows improvement.

Okay. I'm running q17 now. It's expected that bitmap has little impact on e2e performance. The actual purpose of this PR is to encourage the use of bitmap and avoid workarounds out of performance concerns (such as Vis). 😄

wangrunji0408 · 2023-07-21T06:40:10Z

The performance of Q17 does not seem to have changed significantly (compared to the last nightly version).

kwannoel

LGTM, thanks for this!

kwannoel · 2023-07-21T07:02:04Z

src/common/src/buffer/bitmap.rs

-    bits: Box<[usize]>,
+    ///
+    /// Optimization: If all bits are set to 0 or 1, this field is `None`.
+    bits: Option<Box<[usize]>>,


I think we can optimize further: we could just use `count_ones == num_bits` or `count_ones == 0` to check for the all-ones and all-zeros cases.
Then it seems we could avoid branching?

Not needed for now I guess. Just my own thoughts if we need to optimize in the future.

Do you mean that we first check count_ones and then bits.unwrap_unchecked()?
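One possible reading of this exchange, sketched on a simplified stand-in type (not the code under review): test the counters first, then read the buffer without the `Option` check. The constant `WORD` is hypothetical. The `unsafe` block is sound only if the type guarantees that `bits` is `Some` whenever `0 < count_ones < num_bits`; that invariant is an assumption of this sketch.

```rust
/// Simplified stand-in for illustration only (not the RisingWave type).
struct Bitmap {
    num_bits: usize,
    count_ones: usize,
    /// Assumed invariant: `Some` whenever `0 < count_ones < num_bits`.
    bits: Option<Box<[usize]>>,
}

/// Number of bits stored per word (hypothetical helper constant).
const WORD: usize = usize::BITS as usize;

impl Bitmap {
    fn get(&self, idx: usize) -> bool {
        assert!(idx < self.num_bits);
        if self.count_ones == 0 {
            return false; // all zeros
        }
        if self.count_ones == self.num_bits {
            return true; // all ones
        }
        // SAFETY: both uniform cases returned above, so by the assumed
        // invariant the buffer must be present here.
        let words = unsafe { self.bits.as_ref().unwrap_unchecked() };
        (words[idx / WORD] >> (idx % WORD)) & 1 == 1
    }
}
```

Note that this version still branches on `count_ones`; it only replaces the check on the `Option` discriminant, so whether it is faster would need to be measured.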

Co-authored-by: Noel Kwan <[email protected]>

Signed-off-by: Runji Wang <[email protected]>

wangrunji0408 added 3 commits July 19, 2023 23:54

optimize all 0s and all 1s for bitmap

7be4a5a

Signed-off-by: Runji Wang <[email protected]>

add more benches for bitmap

81834d6

Signed-off-by: Runji Wang <[email protected]>

restore &= and |=

7cb0c64

Signed-off-by: Runji Wang <[email protected]>

wangrunji0408 requested review from stdrc, BugenZhao, xxchan and kwannoel July 20, 2023 08:51

github-actions bot added the type/perf label Jul 20, 2023

wangrunji0408 added 2 commits July 20, 2023 17:23

fix clippy

4fde6ca

Signed-off-by: Runji Wang <[email protected]>

update estimated size for chunk

42f90ae

Signed-off-by: Runji Wang <[email protected]>

BugenZhao approved these changes Jul 20, 2023

xxchan reviewed Jul 20, 2023
