feat: use u64 hash in buffer index instead of str literal #25883

hiltontj · 2025-01-21T02:13:10Z

hiltontj · 2025-01-21T13:27:33Z

I made a couple of QoL improvements in 732874a, in addition to moving the call that hashes the row value so that it runs only after the check that the column is indexed.

praveen-influx

Looks good - I can generally follow the code. I guess we are not passing the hasher function through HashMap::with_hasher because we need to keep tabs on the actual hashes (HashSet<u64>?) outside the HashMap?

hiltontj · 2025-01-21T13:46:58Z

> Looks good - I can generally follow the code. I guess we are not passing the hasher function through HashMap::with_hasher because we need to keep tabs on the actual hashes (HashSet<u64>?) outside the HashMap?

By HashSet<u64>, are you referring to the HashSet<usize>? If so, that is the set of row indexes, so we don't use the XX hasher there. I think we need to use a cryptographically secure hash on that, because the row indices that a given value falls into need to be correct.

The XX hasher is only for taking the values, which are originally string literals, and converting them to a u64, so that they are moved around, stored, and compared more cheaply.

The previous structure of BufferIndex was:

BufferIndex {
    Column ID -> string literal -> Set of Indexes from the table buffer
}

This PR changes it to:

BufferIndex {
    Column ID -> u64 hash of string literal -> Set of Indexes from the table buffer
}

Consequently, using XX Hash introduces a small chance of hash collisions, but that is acceptable in this case: a collision only causes the buffer to produce excess rows, which DataFusion will filter out.
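A rough sketch of the layout described above, not the code in this PR. The names `ColumnId`, `hash_value`, `add_column`, `add_row`, and `candidate_rows` are hypothetical, and std's `DefaultHasher` is swapped in for XX Hash only so that the sketch needs no external crate:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for the real column identifier type.
type ColumnId = u32;

/// Reduce a string value to a u64 key.
/// Assumption: the PR uses XX Hash here; DefaultHasher is only a placeholder.
fn hash_value(value: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Column ID -> u64 hash of string value -> set of row indexes in the table buffer.
#[derive(Default)]
struct BufferIndex {
    columns: HashMap<ColumnId, HashMap<u64, HashSet<usize>>>,
}

impl BufferIndex {
    /// Mark a column as indexed.
    fn add_column(&mut self, column_id: ColumnId) {
        self.columns.entry(column_id).or_default();
    }

    /// Record a row for a value. The value is hashed only after the
    /// lookup confirms that the column is indexed.
    fn add_row(&mut self, column_id: ColumnId, value: &str, row: usize) {
        if let Some(values) = self.columns.get_mut(&column_id) {
            values.entry(hash_value(value)).or_default().insert(row);
        }
    }

    /// Rows that may hold `value`. On a hash collision this can include
    /// rows for a different value, which the query engine filters out later.
    fn candidate_rows(&self, column_id: ColumnId, value: &str) -> Option<&HashSet<usize>> {
        self.columns.get(&column_id)?.get(&hash_value(value))
    }
}
```

The index only narrows the search: a lookup returns candidates, and correctness still depends on the downstream filter comparing the actual string values.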

praveen-influx · 2025-01-21T14:06:47Z

> The XX hasher is only for taking the values, which are originally string literals, and converting them to a u64

I see - got it. Happy for it to be merged.

pauldix · 2025-01-21T14:16:32Z

We shouldn't be hashing the row indexes, they're already in the optimal type for doing set union and intersections on them, as long as they're sorted.

hiltontj · 2025-01-21T14:30:14Z

> We shouldn't be hashing the row indexes, they're already in the optimal type for doing set union and intersections on them, as long as they're sorted.

@pauldix Ah, my change to a HashSet<usize> to hold row indices in #25866 could be an issue if keeping them in sorted order is important. Previously it was a Vec<usize>, which keeps the rows in the order in which they are added, and that is sorted order.

If that is important, we could switch to use an IndexSet.

pauldix · 2025-01-21T14:42:13Z

Yeah, since the rows always arrive in order (i.e. every row added is always greater than any row before) and they're never added to an entry more than once, it's wasteful to use a set rather than just appending to a vec. And doing set operations on two sorted vecs is generally as fast as it gets. No reason to use a hash set.
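To illustrate the point about sorted vecs, here is a minimal sketch of a two-pointer merge. It assumes both inputs are strictly increasing, as the row indexes are described above. The function names `intersect_sorted` and `union_sorted` are hypothetical and are not taken from the codebase:

```rust
use std::cmp::Ordering;

/// Intersection of two strictly increasing slices in O(a.len() + b.len()).
fn intersect_sorted(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len().min(b.len()));
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => i += 1,
            Ordering::Greater => j += 1,
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    out
}

/// Union of two strictly increasing slices; the output is also sorted.
fn union_sorted(a: &[usize], b: &[usize]) -> Vec<usize> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::with_capacity(a.len() + b.len());
    while i < a.len() && j < b.len() {
        match a[i].cmp(&b[j]) {
            Ordering::Less => {
                out.push(a[i]);
                i += 1;
            }
            Ordering::Greater => {
                out.push(b[j]);
                j += 1;
            }
            Ordering::Equal => {
                out.push(a[i]);
                i += 1;
                j += 1;
            }
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&a[i..]);
    out.extend_from_slice(&b[j..]);
    out
}
```

Both functions make a single pass with no hashing, and their output stays sorted, so results can be fed into further set operations.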

feat: use u64 hash in buffer index instead of str literal

4f9b031

hiltontj requested a review from a team January 21, 2025 02:13

hiltontj self-assigned this Jan 21, 2025

refactor: move hash of column after if branch and add docs

732874a

hiltontj force-pushed the hiltontj/xxhash-buffer-index branch from 47d7831 to 732874a Compare January 21, 2025 13:26

praveen-influx approved these changes Jan 21, 2025

hiltontj merged commit d1fd155 into main Jan 21, 2025
13 checks passed

hiltontj deleted the hiltontj/xxhash-buffer-index branch January 21, 2025 14:09
