-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use u64 hash in buffer index instead of str literal #25883
Conversation
47d7831
to
732874a
Compare
I made a couple of QoL improvements in 732874a in addition to moving the call to hash the row value after the check to see if it is an indexed column. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good - I can generally follow the code, I guess because we need to keep a tab on the actual hashes (Hashset<u64>
?) outside the HashMap
, we are not passing the hasher function through Hashmap::with_hasher
?
By The XX hasher is only for taking the values, which are originally string literals, and converting them to a The previous structure of
This changes it to
Consequently, by using XX Hash, there is a chance of hash collisions (though unlikely), but that is acceptable in this case, because the result would be that the buffer produces excess rows, which DataFusion will filter out. |
I see - got it. Happy for it to be merged. |
We shouldn't be hashing the row indexes, they're already in the optimal type for doing set union and intersections on them, as long as they're sorted. |
@pauldix Ah, my changing to a If that is important, we could switch to use an |
Yeah, since the rows always arrive in order (i.e. every row added is always > than any row before) and they're never added to an entry more than once, it's wasteful to use a set rather than just appending to a vec. And doing set operations on two sorted vecs is generally as fast as it gets. No reason to use a hash set. |
Closes #25875