Specialized zerovec collections for stringy types #2721
Comments
I feel like those might be cleaner if done as separate ZeroMap-style collections? I'm wary of stuffing everything into ZeroMap since the traits get really icky really quickly. At least for the trie based ones; I think we can almost certainly have zero-copy tries as a separate type.
This would be cool. I think the way to go here is similar to #2312, where we have:
struct StringySearch<K: VarULE + RadixSearchable>(pub K);
struct StringySearchVec<K: VarULE>(VarZeroVec<K>);
impl<K: VarULE + RadixSearchable> ZeroMapKV for StringySearch<K> {
    type Container = StringySearchVec<K>;
}
The annoying thing is that the lookup pattern is a property of the vector, not of the key, so we will need a small explosion of StringySearchVec wrappers (though we might be able to get away with a single generic wrapper type).
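To illustrate the "single generic wrapper type" idea, here is a minimal std-only sketch in which the lookup pattern is a type parameter of the vector rather than of the key. Everything in it is hypothetical: `SearchStrategy`, `BinarySearch`, `FirstByteNarrowing`, and `SearchVec` are made-up names, and a plain `Vec<&[u8]>` stands in for `VarZeroVec`. It is not how zerovec is structured.

```rust
use std::marker::PhantomData;

/// Hypothetical trait: one lookup strategy over lexicographically sorted keys.
trait SearchStrategy {
    fn find(keys: &[&[u8]], needle: &[u8]) -> Option<usize>;
}

/// Plain binary search that compares whole keys at each step.
struct BinarySearch;
impl SearchStrategy for BinarySearch {
    fn find(keys: &[&[u8]], needle: &[u8]) -> Option<usize> {
        keys.binary_search(&needle).ok()
    }
}

/// Narrow to the keys sharing the needle's first byte, then binary search there.
struct FirstByteNarrowing;
impl SearchStrategy for FirstByteNarrowing {
    fn find(keys: &[&[u8]], needle: &[u8]) -> Option<usize> {
        // `Option<u8>` orders `None` before `Some(_)`, so an empty key sorts first.
        let first = needle.first().copied();
        let lo = keys.partition_point(|k| k.first().copied() < first);
        let hi = keys.partition_point(|k| k.first().copied() <= first);
        keys[lo..hi].binary_search(&needle).ok().map(|i| lo + i)
    }
}

/// A single generic wrapper: the strategy is a property of the vector type.
struct SearchVec<'a, S> {
    keys: Vec<&'a [u8]>,
    _strategy: PhantomData<S>,
}

impl<'a, S: SearchStrategy> SearchVec<'a, S> {
    /// Assumes `keys` is already sorted and free of duplicates.
    fn from_sorted(keys: Vec<&'a [u8]>) -> Self {
        Self { keys, _strategy: PhantomData }
    }

    fn position(&self, needle: &[u8]) -> Option<usize> {
        S::find(&self.keys, needle)
    }
}
```

With this shape, `SearchVec<BinarySearch>` and `SearchVec<FirstByteNarrowing>` share the same stored data and differ only in the type-level strategy, which is roughly what a single generic wrapper would buy over one wrapper type per lookup pattern.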
Sketch of a potential AsciiTrie. I realize this is fairly similar to @markusicu's BytesTrie but it's a fun exercise nonetheless. Input: a trie byte slice (a pointer and a length) and an input string byte slice. Output: Some(x) if the trie contains a value for the string or None. Lookup algorithm:
The varint algorithm can be similar to the one used in Postcard. If at any point the varint requires reading a byte that is beyond the end of the slice (bad data), return None.
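A minimal sketch of such a varint reader, assuming an LEB128-style encoding (seven payload bits per byte, least significant group first, high bit set on every byte except the last), which is my understanding of Postcard's scheme. The function name `read_varint` and the `u64` value width are assumptions made for illustration.

```rust
/// Hypothetical helper: decode one LEB128-style varint from the front of `bytes`.
/// Returns the decoded value and the remaining slice, or `None` on bad data.
fn read_varint(bytes: &[u8]) -> Option<(u64, &[u8])> {
    let mut value: u64 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            // More groups than fit in a u64: treat as bad data.
            return None;
        }
        let low = (byte & 0x7F) as u64;
        if shift == 63 && low > 1 {
            // The tenth group may only contribute a single bit.
            return None;
        }
        value |= low << shift;
        if byte & 0x80 == 0 {
            // Continuation bit clear: this was the last byte of the varint.
            return Some((value, &bytes[i + 1..]));
        }
    }
    // The slice ended while the continuation bit was still set (bad data).
    None
}
```

Returning the remaining slice lets a trie walker keep reading from where the varint ended, and the final `None` covers the out-of-bounds case described above without any indexing that could panic.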
Currently, if we want to store a string as a ZeroMap key, the best we can do is either a ZeroVec<[u8; 4]> (for fixed-length strings or tinystrs) or a VarZeroVec<[u8]> (for variable-length strings). When performing lookup, we perform a full binary search operation, potentially comparing up to the whole string each time.
However, without changing the serialized format, we may be able to look up items in either of these collections with a more efficient algorithm, something along the lines of a radix search. First we find the range of values with the same first byte; then, within that range, the same second byte; and so on. I'm not actually sure if this is faster (it might not play nicely with cache locality), but it might be an area worth exploring.
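The narrowing described above could be sketched roughly as follows. This is an illustration only: `radix_search` is a hypothetical name, a plain slice of byte slices stands in for the zerovec containers, and the keys are assumed to be sorted lexicographically with no duplicates.

```rust
/// Hypothetical helper: radix-style lookup over lexicographically sorted keys.
/// Returns the index of `needle` in `keys`, or `None` if it is absent.
fn radix_search(keys: &[&[u8]], needle: &[u8]) -> Option<usize> {
    // Invariant: every key in keys[lo..hi] starts with needle[..depth].
    let (mut lo, mut hi) = (0usize, keys.len());
    for (depth, &b) in needle.iter().enumerate() {
        let range = &keys[lo..hi];
        // A key with no byte at `depth` equals the current prefix and sorts first,
        // so it counts as "less than" any byte value.
        let start = range.partition_point(|k| k.get(depth).map_or(true, |&x| x < b));
        let end = range.partition_point(|k| k.get(depth).map_or(true, |&x| x <= b));
        if start == end {
            // No key continues the prefix with byte `b`.
            return None;
        }
        hi = lo + end;
        lo += start;
    }
    // All remaining keys start with `needle`; an exact match is the shortest
    // of them and therefore sits at the front of the range.
    if lo < hi && keys[lo].len() == needle.len() {
        Some(lo)
    } else {
        None
    }
}
```

Each step compares a single byte per probe instead of a whole key, at the cost of two binary searches per byte of the needle. Whether that beats one plain binary search would need benchmarking, for the cache-locality reason already noted.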
An extension would be to explore other data structures optimized for storing ASCII strings. One is BytesTrie (#1155) and another is Char16Trie, which we already have implemented. Should we explore the implications of optionally using these structures as the key store in a ZeroMap?
Since BytesTrie is not yet implemented, we could make a version specialized for ICU4X that stores either variable-length or 16-bit values and is optimized for data and code size. If we make it ASCII-only, we get an extra bit that we could use in interesting ways, too.
CC @pdogr @robertbastian @Manishearth @markusicu