feat(storage): change the prefix_hint to dist_key_hint for bloom_filter #6575
I found it a little hard to verify whether these changes are correct, so I opened this PR hoping CI would catch any problems. Sorry, I forgot to mark it as a draft...
Codecov Report
@@ Coverage Diff @@
## main #6575 +/- ##
==========================================
+ Coverage 73.22% 73.23% +0.01%
==========================================
Files 1024 1024
Lines 163823 163891 +68
==========================================
+ Hits 119960 120033 +73
+ Misses 43863 43858 -5
@@ -80,6 +80,7 @@ tokio-stream = "0.1"
tonic = { version = "0.2", package = "madsim-tonic" }
tracing = "0.1"
twox-hash = "1"
xxhash-rust = { version = "0.8.5", features = ["xxh32"] }
`twox-hash` and `xxhash-rust` are both used to calculate xxhash. Let's pick one and only use one lib for xxhash.
Let's replace `twox-hash` with `xxhash-rust` in the next PR.
/// or continuous.
/// Note that `dist_key_in_pk_indices` may be shuffled, the start index should be the
/// minimum value.
pub fn get_dist_key_start_index_in_pk(dist_key_in_pk_indices: &[usize]) -> Option<usize> {
`get_min_dist_key_start_index_if_continuous`?
start_index is always min
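For readers following this thread, the helper's documented contract (return the start index only when the distribution key positions are continuous, taking the minimum because the positions may be shuffled) can be sketched as below. This is a minimal illustration written from the doc comment alone, not the code merged in this PR; the handling of an empty slice (returning `None`) is an assumption.

```rust
/// Sketch only: returns the smallest pk position covered by the distribution
/// key if the positions form one continuous run, otherwise `None`.
/// Assumption: an empty input yields `None`.
pub fn get_dist_key_start_index_in_pk(dist_key_in_pk_indices: &[usize]) -> Option<usize> {
    // The indices may be shuffled, so sort a copy before checking continuity.
    let mut sorted = dist_key_in_pk_indices.to_vec();
    sorted.sort_unstable();

    // After sorting, the first element is the minimum, i.e. the start index.
    let start = *sorted.first()?;

    // Continuous means every position is exactly one past the previous one.
    let continuous = sorted.windows(2).all(|w| w[1] == w[0] + 1);
    continuous.then_some(start)
}
```

With this sketch, a shuffled but continuous input such as `[2, 1, 3]` gives `Some(1)`, while a discontinuous input such as `[0, 2]` gives `None`.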
Rest LGTM. Good job! Thanks for the PR!
…_key (#6871) Previously, we used `distribution_key` as our bloom filter key, and `distribution_key` was always the prefix of pk. After the watermark design, we cannot ensure that the distribution key is the prefix of pk (#6288), so #6575 changed the `prefix_hint` to `dist_key_hint` for bloom_filter. However, using the distribution key as the bloom filter key makes things more complex: we need to handle many corner cases, such as shuffled and discontinuous distribution keys, and we need to extract the distribution key from pk and do many checks in `StateTable`/`StorageTable`/`FilterKeyExtract`. After some discussion, we decided to decouple the bloom filter key from the distribution key and just use the pk prefix as the bloom filter key. This makes things easier, and the bloom filter can be used in more places.
Approved-By: hzxa21
Co-Authored-By: congyi <[email protected]>
I hereby agree to the terms of the Singularity Data, Inc. Contributor License Agreement.
Background
After the watermark design:
1. [...] `pk`, and the state store needs a `prefix_hint`, which is supposed to be the distribution key.
2. [...] `distribution_key`, because it might be a hotspot.
3. [...] `pk` prefix to do fast state cleaning.
Points 2 and 3 mean the distribution key will not be the prefix of `pk`, which conflicts with point 1.
Proposal
The bloom filter always uses the distribution key, so the `distribution key` does not need to be the prefix of `pk`.
Once we break this assumption and the `distribution key` may not be the prefix of pk, there will be the following changes:
- [...] `key_indices==dist_key_indices`; now we only need to judge whether `distribution_key` is a subset of `pk`.
- [...] `distribution_key` is a prefix, the bloom filter key is also a prefix of `FullKey`, which means the bloom filter key contains `TablePrefix` and `VnodePrefix` (for convenience when slicing). In the new design, the bloom filter key does not need to contain `TablePrefix` and `VnodePrefix`. In a word, the bloom filter key is `distribution_key`. Note that we may implement a per-table bloom filter after "storage: improve SST builder with FullKey struct" (#6391).
- [...] `prefix_hint`, which is named `dist_key_hint` in this PR. The vnode is connected in the relational table layer, and the table_id is connected in hummock. See risingwave/src/storage/src/hummock/state_store.rs, lines 140 to 182 in ad03b2e.
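The first two changes above can be illustrated with a small sketch: checking that the distribution key is a subset of the pk, then building a bloom filter key from the distribution key columns alone, with no table or vnode prefix. Both function names and the byte-vector column representation are hypothetical and chosen for illustration; they are not the functions or the key encoding used in this PR.

```rust
/// Hypothetical helper: maps each distribution key column to its position in
/// the pk. Returns `None` if any distribution key column is missing from the
/// pk, i.e. the distribution key is not a subset of the pk.
pub fn dist_key_in_pk_indices(
    dist_key_indices: &[usize],
    pk_indices: &[usize],
) -> Option<Vec<usize>> {
    dist_key_indices
        .iter()
        .map(|d| pk_indices.iter().position(|p| p == d))
        .collect()
}

/// Hypothetical helper: concatenates the already-serialized pk columns that
/// belong to the distribution key, in distribution key order. No table prefix
/// or vnode prefix is included. Panics if a position is out of range.
pub fn bloom_filter_key(pk_columns: &[Vec<u8>], dist_key_in_pk: &[usize]) -> Vec<u8> {
    dist_key_in_pk
        .iter()
        .flat_map(|&i| pk_columns[i].iter().copied())
        .collect()
}
```

For example, with pk columns `[1, 2, 3]` and distribution key columns `[3, 1]`, the first helper yields positions `[2, 0]`, so the key is built from the third and first serialized pk columns, in that order.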
Checklist
- [...] `./risedev check` (or alias, `./risedev c`)
Documentation
If your pull request contains user-facing changes, please specify the types of the changes, and create a release note. Otherwise, please feel free to remove this section.
Types of user-facing changes
Please keep the types that apply to your changes, and remove those that do not apply.
Release note
Please create a release note for your changes. In the release note, focus on the impact on users, and mention the environment or conditions where the impact may occur.
Refer to a related PR or issue link (optional)
close #6288