Add initial ZeroHashMap #2579
Praise: This looks like the type of hash map impl that will work well in ICU4X. It looks algorithmically similar to the approach rkyv is taking.
I would like to see benches against ZeroMap before merging, both data size and lookup performance, and code size for bonus points.
Yes, I based my implementation on rkyv's hashmap, with a few changes. Both use the CHD algorithm with the random hash function approach. Appendix A of the paper describes a practical approach that uses a simpler hash but computes the key hash only once, compared to twice for the random hash function approach.
Changes from rkyv impl
rkyv does have one additional optimization for the case when the chain has only one bucket.
Sounds good. When optimizing, please focus on (1) data size and (2) lookup speed. I don't care too much about building speed since it is done offline.
I haven't reviewed the construction algorithm but wanted to leave my comments so far
utils/zerovec/src/map/hashmap.rs
Outdated
fn compute_hash<K: Hash>(seed: u32, k: K, m: usize) -> u32 {
    let mut hasher = create_hasher_with_seed(seed.into());
    k.hash(&mut hasher);
    (hasher.finish() % m as u64) as u32
Suggested change:
- (hasher.finish() % m as u64) as u32
+ (hasher.finish() as usize % m) as u32
This avoids doing a `u64` mod instruction on 32-bit architectures.
Done.
I have to take this back, this does not give the same results on 32 and 64 bit. Assume the hash is 2^32 + 1, then on 32 bit that gets truncated to 1, which is not congruent to 2^32+1 under arbitrary moduli.
I don't think we need 64 bits here, so let's do the arithmetic in 32 bits and then widen to usize.
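The mismatch described above can be reproduced with a small sketch. The function names `reduce_u64` and `reduce_u32` are hypothetical, and truncation to `u32` is used here to stand in for what `as usize` does on a 32-bit target. This is an illustration of the concern, not the code that landed in the PR.

```rust
/// Original form: reduce the full 64-bit hash. Precondition: `m != 0`.
fn reduce_u64(hash: u64, m: usize) -> usize {
    (hash % m as u64) as usize
}

/// Portable form: truncate to 32 bits on every target, reduce, then widen.
/// On a 32-bit target, `hash as usize % m` computes this same value.
/// Precondition: `m != 0`.
fn reduce_u32(hash: u64, m: u32) -> usize {
    (hash as u32 % m) as usize
}

fn main() {
    let hash: u64 = (1u64 << 32) + 1;
    // 2^32 is congruent to 1 (mod 3), so 2^32 + 1 is congruent to 2 (mod 3).
    assert_eq!(reduce_u64(hash, 3), 2);
    // Truncating first leaves 1, and 1 mod 3 is 1: a different bucket.
    assert_eq!(reduce_u32(hash, 3), 1);
}
```

Always truncating to 32 bits before the modulo gives the same bucket on both 32-bit and 64-bit targets, which matters because the table is built offline and read on other architectures.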
rename iter -> keys
change return type of compute hash
change container to FlexZeroVec
add #[zerovec::derive(Hash)] which derives byte hash of the ule
fix some function signatures
Added some benchmarks reusing the data from the existing ones.
A few tradeoffs and observations
@pdogr Good data on the large map. How does the performance compare on the small version of the map?
For smaller keys, hash computation time really comes into the picture.
FlexZeroVec vs ZeroVec Using
It seems that another way to architect this could be to make this a standalone type that maps from keys to an index, and then keep a proper ZeroMap as the second stage of lookup.
remove duplicate casts
support only one `build_from_exact_iter` function
utils/zerovec/src/map/hashmap.rs
Outdated
where
    K: Hash,
{
    let l1 = compute_hash(0, k, self.displacements.len());
This will panic if the map is empty, as `compute_hash` will do `% 0`. Add that as a precondition to `compute_hash` and guard against it here.
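One way to express that guard is sketched below. It assumes `std`'s `DefaultHasher` as a stand-in for the PR's `create_hasher_with_seed`, and `first_level` is a hypothetical helper rather than the PR's actual lookup method.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Seeded hash reduced modulo `m`. Returns `None` when `m == 0`
/// instead of panicking on `% 0`.
fn compute_hash<K: Hash>(seed: u32, k: K, m: usize) -> Option<usize> {
    if m == 0 {
        return None;
    }
    // Assumption: DefaultHasher replaces the PR's seeded hasher.
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    k.hash(&mut hasher);
    Some((hasher.finish() % m as u64) as usize)
}

/// Hypothetical first-level lookup: an empty map yields `None`.
fn first_level<K: Hash>(displacements: &[(u32, u32)], k: K) -> Option<(u32, u32)> {
    let l1 = compute_hash(0, k, displacements.len())?;
    displacements.get(l1).copied()
}

fn main() {
    assert_eq!(first_level(&[], "key"), None);
    assert_eq!(first_level(&[(7, 9)], "key"), Some((7, 9)));
}
```

Returning `Option` pushes the empty-map case into the type, so callers cannot forget the check.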
Change benches to read data from `large_zerohashmap.postcard`
Switched to the practical approach mentioned in that paper, using a single 64-bit hash (16 bits for g, 24 bits each for f0 and f1).
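A minimal sketch of that split follows. The thread gives only the widths (16/24/24), so which bits feed g, f0 and f1 is an assumption here. `split_hash64` and `compute_index` borrow names from the commit messages, but their bodies are illustrative and not taken from the PR.

```rust
/// Split one 64-bit hash into (g, f0, f1).
/// Assumed layout: top 16 bits -> g, middle 24 -> f0, low 24 -> f1.
/// Precondition: `m != 0`.
fn split_hash64(hash: u64, m: usize) -> (usize, u32, u32) {
    assert!(m > 0, "cannot reduce modulo an empty table");
    let g = (hash >> 48) as usize % m;
    let f0 = ((hash >> 24) & 0xff_ffff) as u32;
    let f1 = (hash & 0xff_ffff) as u32;
    (g, f0, f1)
}

/// CHD-style slot: (f0 + f1 * d0 + d1) mod m, computed in u64 so the
/// product cannot overflow. Returns `None` for an empty table.
fn compute_index(f: (u32, u32), d: (u32, u32), m: u32) -> Option<usize> {
    if m == 0 {
        return None;
    }
    let slot = (f.1 as u64 * d.0 as u64 + f.0 as u64 + d.1 as u64) % m as u64;
    Some(slot as usize)
}

fn main() {
    let (g, f0, f1) = split_hash64(0xABCD_1234_5678_9ABC, 1 << 16);
    assert_eq!((g, f0, f1), (0xABCD, 0x12_3456, 0x78_9ABC));
    // 2 * 3 + 1 + 4 = 11, and 11 mod 5 = 1.
    assert_eq!(compute_index((1, 2), (3, 4), 5), Some(1));
}
```

The key is hashed once and all three values are carved out of that single result, which is the saving over the random hash function approach.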
Benches
/// placeholder.
#[cfg_attr(feature = "serde", serde(borrow))]
displacements: ZeroVec<'a, (u32, u32)>,
keys: K::Container,
for the next pr: this doesn't borrow keys and values. You'll need a `ZeroHashMapBorrowed` and all the boilerplate like in `ZeroMap`...
utils/zerovec/src/hashmap/mod.rs
Outdated
/// assert_eq!(hashmap.get(&2), Some("c"));
/// assert_eq!(hashmap.get(&4), None);
/// ```
#[cfg_attr(feature = "serde", derive(serde::Serialize, serde::Deserialize))]
for the next pr: `ZeroMap` has a special implementation to have readable JSON, you can probably steal that by just pointing it at `keys` and `values`.
Co-authored-by: Robert Bastian <[email protected]>
move key equality check to index method
where
    K: ZeroMapKV<'a> + ?Sized,
    V: ZeroMapKV<'a> + ?Sized,
{
not blocking: The iterator code is duplicated across `ZeroMap` and `ZeroHashMap` now. Move the logic to `ZeroVecLike` to deduplicate.
Will do this in a follow-up.
compute_index in u64
shift back to (usize, u32, u32) from split_hash64
where
    S: serde::Serializer,
{
    (&self.displacements, &self.keys, &self.values).serialize(serializer)
Nit: this isn't great for self-describing formats like JSON, but it's fine for now.
#2532
Adds core functionality of ZeroHashMap.
Perfect hashes are computed using the CHD algorithm, with aHash as the hashing function.
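For readers unfamiliar with CHD, here is a toy sketch of the construction and lookup. It makes several assumptions: `std`'s `DefaultHasher` replaces aHash, there is one bucket per key, displacement pairs are searched by brute force, and the builder retries with a new seed on failure. Every name in it (`PerfectHash`, `build`, `index_of`, and so on) is hypothetical, and it is not the implementation from this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash `key` once and split the result into (bucket, f0, f1).
fn split_hash<K: Hash + ?Sized>(seed: u64, key: &K, m: usize) -> (usize, u64, u64) {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    key.hash(&mut hasher);
    let h = hasher.finish();
    ((h >> 48) as usize % m, (h >> 24) & 0xff_ffff, h & 0xff_ffff)
}

/// Slot formula: (f0 + f1 * d0 + d1) mod m.
fn slot(f0: u64, f1: u64, d: (u32, u32), m: usize) -> usize {
    ((f0 + f1 * d.0 as u64 + d.1 as u64) % m as u64) as usize
}

struct PerfectHash {
    seed: u64,
    displacements: Vec<(u32, u32)>,
    slots: Vec<usize>, // slots[s] = index of the key stored in slot s
}

/// Place buckets largest-first, searching for a collision-free (d0, d1).
fn try_build(hashes: &[(usize, u64, u64)]) -> Option<(Vec<(u32, u32)>, Vec<usize>)> {
    let m = hashes.len();
    let mut buckets: Vec<Vec<usize>> = vec![Vec::new(); m];
    for (i, h) in hashes.iter().enumerate() {
        buckets[h.0].push(i);
    }
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by_key(|&b| std::cmp::Reverse(buckets[b].len()));
    let mut displacements = vec![(0u32, 0u32); m];
    let mut slots: Vec<Option<usize>> = vec![None; m];
    for b in order {
        let found = (0..m as u32)
            .flat_map(|d0| (0..m as u32).map(move |d1| (d0, d1)))
            .find(|&d| {
                let mut seen = Vec::new();
                buckets[b].iter().all(|&i| {
                    let s = slot(hashes[i].1, hashes[i].2, d, m);
                    let free = slots[s].is_none() && !seen.contains(&s);
                    seen.push(s);
                    free
                })
            })?;
        displacements[b] = found;
        for &i in &buckets[b] {
            slots[slot(hashes[i].1, hashes[i].2, found, m)] = Some(i);
        }
    }
    let slots: Option<Vec<usize>> = slots.into_iter().collect();
    Some((displacements, slots?))
}

/// Try seeds until one yields a perfect hash for `keys`.
fn build<K: Hash>(keys: &[K]) -> Option<PerfectHash> {
    (0..1000u64).find_map(|seed| {
        let hashes: Vec<_> = keys.iter().map(|k| split_hash(seed, k, keys.len())).collect();
        let (displacements, slots) = try_build(&hashes)?;
        Some(PerfectHash { seed, displacements, slots })
    })
}

/// Candidate index for `key`; the caller must still compare the stored key.
fn index_of<K: Hash + ?Sized>(ph: &PerfectHash, key: &K) -> Option<usize> {
    let m = ph.displacements.len();
    if m == 0 {
        return None;
    }
    let (g, f0, f1) = split_hash(ph.seed, key, m);
    Some(ph.slots[slot(f0, f1, ph.displacements[g], m)])
}

fn main() {
    let keys = ["en", "fr", "de", "ja", "hi", "ar", "sw", "pt"];
    let ph = build(&keys).expect("some seed should work for a small key set");
    for (i, k) in keys.iter().enumerate() {
        assert_eq!(index_of(&ph, k), Some(i));
    }
}
```

A key that was never inserted still maps to some slot, so a real map has to compare the stored key before returning a value.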