measure performance of other variable-length integer encoding schemes #601
@tterribe points out that we could reverse the prefix bits so that …
@lukewagner Now that you mention it, wouldn't this encoding be nice? This is a combination of two different encoding styles, depending on the top 4 bits, and I'm just proposing it on the guess that an optimized implementation of this would be simpler than the ….
Note: whether to use sign-extension or zigzag is an orthogonal decision. P.S. If one wants to reduce the number of possible encodings for numbers <= 28 bits, use …
Good input, thanks! I wasn't intending to drive this to resolution now (anyone else feel free to, though), just filing this as something to ask and measure before MVP.
@mbebenita Yay for solutions I don't think of because, after 15-40 years, C++/C#/Java still don't (officially) support them.
I ran some performance numbers for a modified version of the prefix-encoded variable-length integers described here. TL;DR: decoding PrefixVarint is about 3x faster than decoding LEB128.

The prefix encoding I used limits the integers to 64 bits so that the length of the encoded integer can be determined from the first byte. This has the advantage that the total encoded length is limited to 9 bytes maximum, and it can be computed by a branch-free function:

```c
static inline size_t
prefix_length(const uint8_t* p)
{
    return 1 + count_trailing_zeros_32(*p | 0x100);
}
```

Decoding the integer has a single, highly predictable branch:

```c
static inline uint64_t
prefix_decode(const uint8_t* p, size_t length)
{
    if (length < 9) {
        size_t unused = 64 - 8 * length;
        return unaligned_load_64(p) << unused >> (unused + length);
    } else {
        return unaligned_load_64(p + 1);
    }
}
```

I compared against LLVM's LEB128 implementation, which is pretty straightforward. Both decoders depend on the branch predictor for good performance, LEB128 more so than PrefixVarint, so it makes a difference how the numbers to be decoded are distributed. I used log-uniform random 64-bit numbers, which means the number of significant bits is uniformly distributed. Since WebAssembly currently restricts its LEB128 numbers to 32 bits, I also measured with 32-bit log-uniform numbers. In this range, the PrefixVarint decoder is essentially branch-free since its single branch is 100% predictable. I measured on a Broadwell laptop (i7-5557U @ 3.10GHz) and a Nehalem iMac (i7-870 @ 2.93GHz). The table below shows how much faster PrefixVarint decodes 100,000 integers compared to LEB128.
LEB128 performs better when the range of the log-uniform numbers is reduced because the decoder loop has fewer iterations and fewer branch mispredictions. PrefixVarint is only affected by branch mispredictions when the log-uniform numbers become larger than 56 bits, so it compares most favorably at 56-bit log-uniform numbers. Test program: varint.zip
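For readers reconstructing the format from `prefix_length` and `prefix_decode` above: a length-N encoding stores the tag `1 << (N - 1)` in the low bits of a little-endian word and the value in the remaining 7·N bits. A minimal encoder sketch consistent with that reading (an illustration, not code from the test program; `prefix_encode` is a made-up name, and it assumes a little-endian target with at least 8 writable bytes at `p`):

```c
#include <stdint.h>
#include <string.h>

/* Illustrative encoder matching prefix_decode above (not stoklund's code):
 * a length-N encoding stores the tag 1 << (N-1) in the low bits and the
 * value in the remaining 7*N bits of a little-endian word. */
static inline size_t
prefix_encode(uint64_t value, uint8_t* p)
{
    for (size_t length = 1; length < 9; length++) {
        if (value < ((uint64_t)1 << (7 * length))) {
            uint64_t word = (value << length) | ((uint64_t)1 << (length - 1));
            memcpy(p, &word, 8);  /* assumes little-endian and 8 writable bytes */
            return length;
        }
    }
    p[0] = 0;                     /* all-zero first byte marks the 9-byte form */
    memcpy(p + 1, &value, 8);
    return 9;
}
```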
@stoklund Thanks for posting these! I'm a member of the Protocol Buffers team at Google, so I have some background and interest in all of this. I've been a fan of PrefixVarint for a while. A few comments:
Sorry, noob here. Does this make wasm generic between 32/64 bit, or help? It seems like you guys throw around solutions sometimes, but I can't keep track.
Here are some real-world data on LEB lengths, using the examples from https://github.com/WebAssembly/build-suite: Please note that these are generated via Binaryen's asm2wasm, so they are not representative of what the LLVM backend will produce (it won't lower 64-bit operations to 32-bit operations). Also note that the 5-byte values are slightly more common than required, since the …
Thanks, @binji! With such a heavy bias towards the 1-byte integers, I think we should also consider something like the SQLite variable-length integers. It might compress better. |
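For reference, SQLite's unmodified varint is big-endian base-128: the high bit of each of the first eight bytes is a continuation flag, and a ninth byte, when present, contributes all 8 of its bits. A decoding sketch (illustrative only; `sqlite_varint_decode` is a made-up name):

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch of SQLite's varint decode: big-endian base-128, MSB is the
 * continuation bit, and a 9th byte (if present) contributes a full 8 bits
 * so that 64-bit values fit in at most 9 bytes. */
static size_t sqlite_varint_decode(const uint8_t *p, uint64_t *out) {
  uint64_t v = 0;
  for (size_t i = 0; i < 8; i++) {
    v = (v << 7) | (p[i] & 0x7f);
    if ((p[i] & 0x80) == 0) {  /* continuation bit clear: value ends here */
      *out = v;
      return i + 1;
    }
  }
  *out = (v << 8) | p[8];      /* 9th byte uses all 8 bits */
  return 9;
}
```

Unlike LEB128, the big-endian layout means the most significant payload bits come first, which is part of what lets small values sort correctly as raw bytes.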
Out of curiosity, the number of significant bits of each value encoded with LEB128 in AngryBots:
@mbebenita strange, the sum of your counts is much less than my counts for the number of LEB128s in AngryBots. Are you including the signed i32.consts too?
Note: the opcode-table-with-fixed-immediates change would likely significantly change the distribution of indices and, I expect, produce a more even distribution among the byte lengths.
@binji I instrumented this function: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h#L107. It looks like Binaryen doesn't use LEB128 for i32.const, which is probably why the counts are off: https://github.com/WebAssembly/binaryen/blob/master/src/wasm-binary.h#L785
I updated my benchmark to read a test vector from a file instead of generating random numbers. I used … to generate 465936 integers with a more realistic distribution. These numbers encode to 1.059 bytes/integer. With such a heavy bias to single-byte encoded values, the PrefixVarint decoder is actually faster with an extra branch:

```c
void prefix_decode(const uint8_t *in, uint64_t *out, size_t count) {
  while (count-- > 0) {
    if (LIKELY(*in & 1)) {
      *out++ = *in++ >> 1;
    } else {
      size_t length = prefix_length(in);
      *out++ = prefix_get(in, length);
      in += length;
    }
  }
}
```

This makes me sad, but what can you do? I made a similar change to my LEB128 decoder for fairness. The PrefixVarint decoder is still 1.11x faster with this data:
Updated code in my repo.
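The "similar change" to the LEB128 decoder might look something like the following sketch (an illustration, not the code from the repo; `LIKELY` is assumed to wrap `__builtin_expect`):

```c
#include <stdint.h>
#include <stddef.h>

#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Sketch of a LEB128 decoder with the same single-byte fast path
 * (illustrative only, not stoklund's actual benchmark code). */
void leb128_decode(const uint8_t *in, uint64_t *out, size_t count) {
  while (count-- > 0) {
    if (LIKELY(*in < 0x80)) {   /* no continuation bit: the byte is the value */
      *out++ = *in++;
    } else {
      uint64_t value = 0;
      unsigned shift = 0;
      uint8_t byte;
      do {                      /* accumulate 7 payload bits per byte */
        byte = *in++;
        value |= (uint64_t)(byte & 0x7f) << shift;
        shift += 7;
      } while (byte & 0x80);    /* no bounds checks: a real decoder caps shift */
      *out++ = value;
    }
  }
}
```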
A modified version of the SQLite encoding also performs well on the BananaBread numbers:
FWIW, if you make PrefixVarint big-endian and special-case the one-byte encoding, then you can remove the shift (and, on x86, the bit test) from the fast path, so the fast path is simply …. I know that's a micro-optimisation, but it gives parity between the LEB128 and prefix fast paths.
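One plausible reading of that suggestion, as a sketch (the exact big-endian layout here is an assumption, not the commenter's specification): leading one bits in the first byte count the extra bytes, so a one-byte value (top bit clear) is the byte itself and needs no shift at all:

```c
#include <stdint.h>
#include <stddef.h>

/* One plausible big-endian PrefixVarint layout (an assumption): the number
 * of leading one bits in the first byte counts the extra bytes, and payload
 * bytes follow in big-endian order. */
static size_t be_prefix_decode(const uint8_t *p, uint64_t *out) {
  uint8_t b = *p;
  if (b < 0x80) {   /* fast path: no shift, no mask -- the byte is the value */
    *out = b;
    return 1;
  }
  size_t extra = 0; /* count leading one bits = number of extra bytes */
  while (b & 0x80) { extra++; b = (uint8_t)(b << 1); }
  uint64_t v = b >> extra;            /* payload bits left in the first byte */
  for (size_t i = 1; i <= extra; i++) /* append big-endian payload bytes */
    v = (v << 8) | p[i];
  *out = v;
  return 1 + extra;
}
```

With this layout the hot path compiles to a compare and a move, matching LEB128's single-byte case exactly.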
Implementors have discussed and concluded that while alternative encodings might offer improved size, the intention to support layer 1 (allowing format-specific compression in an outer wrapper) and the degree to which LEBs are well known outweigh this advantage for the MVP.
@haberman do you have any implementations/examples of a Protocol Buffers decoder in wasm? This is something I'm interested in.
I know WebAssembly picked LEB128, but I was just curious why you didn't benchmark PrefixVarint with the tag ones in the most significant bits? The difference is that with the one-byte encoding (values < 128), which is also the most frequent, no shifts are required.
@chmike I would've liked to see that too. The whole process of the MVP's design was really disheartening. @stoklund says "look, this solution is 2x-3x as fast as LEB128!" and the final one-sentence rationale doesn't even mention performance, instead making a nonsensical reference to "improved size" (PrefixVarint and LEB128 are the same size) and "layer 1" (never mind that layer 1 by its very nature cannot make decoding performance better). It's a lot like my text format proposal, which was closed with no rationale. I would have been happy to work full-time pro bono on that text format, but oh well.
The only argument in favor of LEB128 is that it allows locating the end of a value with random access; that is impossible with prefix varints. The a-priori choice between slower and impossible may lean toward slower, to avoid the impossible. It makes sense for some applications, like UTF-8 encoding, to have picked an encoding of this kind, because random access is very likely to happen. Would one use random access with WebAssembly? What might be the use cases?
Edit: fixed a link in the details and expanded a bit more on some things. I just wanted to chime in and note that this could be significantly accelerated through the use of bit-field manipulation and some limited overfetching to parallelize the loads and stores. I'm not aware of any implementations that do this, though. But in any case, I'm not persuaded that the PrefixVarint format has nearly the claimed performance benefits, not when you can overfetch and avoid the load -> load dependency bottleneck entirely, even if the data doesn't fall neatly onto a single cache line. (Here's an archive.org link, as the repo appears to have since been deleted.) Unless the data is all somehow in L1 cache (not something you can rely on), most of the time spent is going to be waiting on memory loads and stores, so there's little PrefixVarint offers beyond eliminating that final load, which is what I've done here. The general intuition for my algorithm is to treat values post-normalization as essentially a SWAR of 1-bit values, and to leverage the implicit modulo you get from ignoring extra bits both to do more things in parallel and to get away with overfetching and overstoring:
Here's the general algorithm itself
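The linked gist is gone, so as a rough illustration only (not the commenter's actual algorithm): a single-value LEB128 decode built on the same overfetch idea, using BMI2 PEXT to gather all payload bits at once (compile with -mbmi2; this sketch handles encodings up to 8 bytes and assumes 8 readable bytes at `p`):

```c
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

/* Illustrative overfetch + PEXT LEB128 decode: load 8 bytes at once, find
 * the first byte with a clear continuation bit, then gather every 7-bit
 * payload group in a single PEXT. Encodings longer than 8 bytes need a
 * fallback path not shown here. */
static size_t leb128_decode_pext(const uint8_t *p, uint64_t *out) {
  uint64_t word;
  memcpy(&word, p, 8);                            /* overfetch 8 bytes */
  uint64_t term = ~word & 0x8080808080808080ull;  /* terminator-byte MSBs */
  size_t len = (__builtin_ctzll(term) >> 3) + 1;  /* index of first terminator */
  uint64_t mask = 0x7f7f7f7f7f7f7f7full >> (64 - 8 * len); /* payload bits */
  *out = _pext_u64(word, mask);                   /* pack the 7-bit groups */
  return len;
}
```

The point of the exercise is that the value is assembled from one wide load with no per-byte loop, so there is no load -> load dependency chain to stall on.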
You should benchmark it. I'm working with the Go programming language and have explored many different variable-length integer encodings, and I experimented with various implementations. The most efficient code is available here: https://github.com/chmike/varint. This code minimizes the number of machine instructions to execute. I provide benchmarks comparing it with the LEB128 encoding provided in the Go standard library.

Through my experiments I have seen that any bit fiddling makes it less efficient. I thus found that postfix encoding is less efficient, which was unexpected, as I thought that writing the 64-bit value in little-endian order would be faster. LEB128 encoding for small values is faster because the function is inlined. But even though my function is not inlined, my encoding function beats LEB128 encoding when the values have more than 14 significant bits. The LEB128 decoding is significantly worse. My code could beat LEB128 if the encoding of small values (up to 14 significant bits) could be inlined; that is currently not possible with the Go compiler, but it may change in the future. Note that I only benchmark the encoding and decoding operations. There are other operations around them that also affect overall performance, like testing the return value or handling the panic (Go's exceptions). Note also that my benchmarks may have been advantaged by branch prediction. A better test would use a realistic random distribution of value sizes.
Note that the scatter and gather steps in my algorithm above are fairly special-cased to LEB128, but only because the tag bit and the data bits are in the same positions in every byte, so I can move the value bits into place without any dependency on the length. For PrefixVarint, moving the bits into place is a simple shift (two shifts with an OR if the whole value exceeds pointer width), but I would have to do a ….

@chmike If I can find time, I might be able to give you hard numbers for 32- and 64-bit encode and decode. I find it unlikely the 6-μop, 18-cycle PEXT in AMD processors all the way up through Zen 2 would give me much of a speed-up, but I may try anyway if I can (as I'm running a Zen 2 processor that has the slow microcoded version). I won't have ARM numbers for you, though. At least for now, here's a code comparison between the two formats for x86-64 and AArch64.
As suggested by this HN comment we should measure the comparative decode-time differences between LEB128 and the described PrefixVarint. Some of the advantages won't apply since we're mostly dealing with 1- or 2-byte varints, but, still, having the tag bits in the LSB might allow an optimized decoder to reduce branching.