Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specialized zerovec collections for stringy types #2721

Closed
sffc opened this issue Oct 5, 2022 · 2 comments · Fixed by #2722
Closed

Specialized zerovec collections for stringy types #2721

sffc opened this issue Oct 5, 2022 · 2 comments · Fixed by #2722
Assignees
Labels
A-design Area: Architecture or design A-performance Area: Performance (CPU, Memory) C-zerovec Component: Yoke, ZeroVec, DataBake question Unresolved questions; type unclear T-enhancement Type: Nice-to-have but not required

Comments

@sffc
Copy link
Member

sffc commented Oct 5, 2022

Currently if we want to store a string as a ZeroMap key, the best we can do is either a ZeroVec<[u8; 4]> (for fixed-length strings or tinystrs) or VarZeroVec<[u8]> (for variable-length strings). When performing lookup, we perform a full binary search operation, potentially comparing up to the whole string each time.

However, without changing the serialized format, we may be able to look up items in either of these collections with a more efficient algorithm, something along the lines of a radix search. First we find the range of values with the same first byte; then, within that range, the same second byte; and so on. I'm not actually sure if this is faster (it might not play nicely with cache locality), but it might be an area worth exploring.

An extension would be to explore other data structures optimized for storing ASCII strings. One is BytesTrie (#1155) and another is Char16Trie, which we already have implemented. Should we explore the implications of optionally using these structures as the key store in a ZeroMap?

Since BytesTrie is still not yet implemented, we could make a version specialized for ICU4X that stores either variable-length or 16-bit values and is optimized for data and code size. If we make it ASCII-only, we get an extra bit that we could use in interesting ways, too.

CC @pdogr @robertbastian @Manishearth @markusicu

@sffc sffc added T-enhancement Type: Nice-to-have but not required question Unresolved questions; type unclear A-performance Area: Performance (CPU, Memory) A-design Area: Architecture or design labels Oct 5, 2022
@Manishearth
Copy link
Member

An extension would be to explore other data structures optimized for storing ASCII strings.

I feel like those might be cleaner if done as separate ZeroMap-style collections? I'm wary of stuffing everything into ZeroMap since the traits get really icky really quickly.

At least for the trie based ones; I think we can almost certainly have zero-copy tries as a separate type.

However, without changing the serialized format, we may be able to look up items in either of these collections with a more efficient algorithm, something along the lines of a radix search. First we find the range of values with the same first byte; then, within that range, the same second byte; and so on. I'm not actually sure if this is faster (it might not play nicely with cache locality), but it might be an area worth exploring.

This would be cool. I think the way to go here is similar to #2312 where we have

struct StringySearch<K: VarULE + RadixSearchable>(pub K);
struct StringySearchVec<K: VarULE>(VarZeroVec<K>)

impl ZeroMapKV for StringySearch<K> {
type Container = StringySearchVec<K>;
}

The annoying thing is that the lookup pattern is a property of the vector not of the key so we will need small explosion of StringySearchVec wrappers (we might be able to get away with a single generic wrapper type though)

@sffc
Copy link
Member Author

sffc commented Oct 5, 2022

Sketch of a potential AsciiTrie. I realize this is fairly similar to @markusicu's BytesTrie but it's a fun exercise nonetheless.

Input: a trie byte slice (a pointer and a length) and an input string byte slice.

Output: Some(x) if the trie contains a value for the string or None.

Lookup algorithm:

  1. If the trie slice is empty, return None.
  2. Read the first byte of the trie slice.
    1. If the byte is 0xxxxxxx, this is an ASCII byte. If the string is empty, return None. Else, match the byte against the first byte of the string.
      1. If it doesn't match, return None.
        • Note: if the input string has 1 as its high bit, it will never match; this is OK since we are defining this structure for ASCII bytes only.
      2. If it matches, increment the start of both the trie slice and the string slice by 1 and return to step 1.
    2. If the byte is 10xxxxxx, this is a value node. Read the varint to x (using the varint algorithm) and increment the start of the trie slice to the first byte after the varint. Then,
      1. If the string is empty, return Some(x).
      2. Else, return to step 1.
    3. Otherwise, the byte is 11xxxxxx and this is a branch node. If the string is empty, return None. Else, continue to step 3.
  3. Read the varint to x (using the varint algorithm) and increment the start of the trie slice to the first byte after the varint. Then, interpret the remainder of the slice as a ZeroMap<u8, [u8]>, encoded as a ZeroVec<u8> of length x followed by a VarZeroVec<[u8]>, with the following caveats:
    1. The VarZeroVec does not contain a length; the length is presumed to be x.
    2. The index width of the VarZeroVec is determined by the following formula:
      1. If the length of the VZV slice is 256 or shorter, the index width is 1.
      2. If the length of the VZV slice is between 256 and 65536, the index width is 2.
      3. Otherwise, either return None (unlikely that we need larger indices), or set the index width to the general formulaceil(log2(length) / 8).
  4. Look up the first byte of the string in the ZeroMap (note: this performs a binary search over the first x bytes of the current trie slice).
    1. If not found, return None.
    2. Else, increment the start of the string slice by 1, set the trie slice to the slice from the data section of the ZeroMap, and return to step 1.

The varint algorithm can be similar to the one used in Postcard. If at any point the varint requires reading a byte that is beyond the end of the slice (bad data), return None.

@sffc sffc self-assigned this Oct 17, 2022
@sffc sffc added this to the ICU4X 1.1 milestone Oct 17, 2022
@sffc sffc modified the milestones: ICU4X 1.1, ICU4X 1.x Dec 22, 2022
@sffc sffc added the C-zerovec Component: Yoke, ZeroVec, DataBake label Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-design Area: Architecture or design A-performance Area: Performance (CPU, Memory) C-zerovec Component: Yoke, ZeroVec, DataBake question Unresolved questions; type unclear T-enhancement Type: Nice-to-have but not required
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants