Openness to significant contributions? #25

Closed
marcusdarmstrong opened this issue Dec 30, 2021 · 8 comments

@marcusdarmstrong
Contributor

Hi there,

To start, thanks a ton for this project! It's been the starting point for a huge performance improvement in some of my employer's build tooling (which I'm working on open-sourcing). As you may recall, a while back (#9) I mentioned I'd forked this library to add some streaming xxhash64 functionality, with the (timeline-naive) hope of upstreaming those additions last spring.

At this point, our internal fork has accumulated a few significant differences (performance-focused) that I'd also be interested in contributing, but given some of the implications of both the streaming additions and the performance improvements I figured I should reach out and confirm your interest in them...

To start, I'll share the relevant benchmarking runs (my fork is locally named xxhash64-wasm, as I had no need for the xxhash32 implementation):

xxhash-wasm#h32 x 1,821,445 ops/sec ±0.36% (97 runs sampled)
xxhash-wasm#h64 x 1,694,388 ops/sec ±0.28% (102 runs sampled)
xxhash64-wasm#hash x 5,320,306 ops/sec ±0.13% (99 runs sampled)
Benchmark 1 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 1,801,677 ops/sec ±0.72% (94 runs sampled)
xxhash-wasm#h64 x 1,698,874 ops/sec ±0.48% (99 runs sampled)
xxhash64-wasm#hash x 5,192,886 ops/sec ±0.15% (98 runs sampled)
Benchmark 10 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 2,462,734 ops/sec ±0.63% (99 runs sampled)
xxhash-wasm#h64 x 1,993,989 ops/sec ±0.53% (94 runs sampled)
xxhash64-wasm#hash x 3,868,241 ops/sec ±0.29% (98 runs sampled)
Benchmark 100 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 469,974 ops/sec ±0.72% (99 runs sampled)
xxhash-wasm#h64 x 481,806 ops/sec ±0.43% (99 runs sampled)
xxhash64-wasm#hash x 1,076,227 ops/sec ±0.66% (91 runs sampled)
Benchmark 1000 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 65,342 ops/sec ±0.54% (94 runs sampled)
xxhash-wasm#h64 x 67,771 ops/sec ±0.48% (99 runs sampled)
xxhash64-wasm#hash x 150,943 ops/sec ±0.51% (95 runs sampled)
Benchmark 10000 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 6,782 ops/sec ±0.31% (98 runs sampled)
xxhash-wasm#h64 x 7,085 ops/sec ±0.33% (96 runs sampled)
xxhash64-wasm#hash x 17,440 ops/sec ±0.48% (96 runs sampled)
Benchmark 100000 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 684 ops/sec ±0.13% (98 runs sampled)
xxhash-wasm#h64 x 717 ops/sec ±0.18% (97 runs sampled)
xxhash64-wasm#hash x 1,790 ops/sec ±0.61% (95 runs sampled)
Benchmark 1000000 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 69.13 ops/sec ±0.21% (72 runs sampled)
xxhash-wasm#h64 x 72.10 ops/sec ±0.26% (75 runs sampled)
xxhash64-wasm#hash x 158 ops/sec ±0.20% (81 runs sampled)
Benchmark 10000000 bytes - Fastest is xxhash64-wasm#hash
xxhash-wasm#h32 x 6.75 ops/sec ±0.31% (21 runs sampled)
xxhash-wasm#h64 x 7.03 ops/sec ±0.26% (22 runs sampled)
xxhash64-wasm#hash x 16.25 ops/sec ±0.68% (45 runs sampled)
Benchmark 100000000 bytes - Fastest is xxhash64-wasm#hash

These improvements are largely due to two changes:

1. Replacement of the String => Uint8Array mechanism

While this isn't generalizable to the browser, Buffer.from significantly outperforms Node's TextEncoder implementation:

Buffer.from (UTF-16LE) x 3,621,691 ops/sec ±0.30% (143 runs sampled)
Buffer.from x 3,063,490 ops/sec ±0.67% (142 runs sampled)
TextEncoder.encode x 1,218,150 ops/sec ±3.42% (118 runs sampled)

This is an important consideration for my use case, but if you'd prefer to leave this type of choice to the h64Raw API, that certainly would seem fair to me.

2. Replacement of the u64 => Hex String mechanism

It's a bit uglier, but precomputing a byte-to-two-digit-hex-string mapping is substantially faster than the current DataView-based approach, and that's particularly important in those small-input benchmarks. There's potentially a question about how this interacts with the API, given that the shared codepath in master uses a DataView for the raw API as well as the string one, but either way the improvement proved worthwhile for my use case.
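
For illustration, a rough sketch of the precomputed-table idea described above (the names are mine, not the fork's actual code):

// Build the byte -> two-hex-digit lookup once at module load.
const BYTE_TO_HEX = [];
for (let i = 0; i < 256; i++) {
  BYTE_TO_HEX.push(i.toString(16).padStart(2, "0"));
}

// Format the 8 output bytes of the hash by table lookup instead of a DataView.
function bytesToHex(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i++) out += BYTE_TO_HEX[bytes[i]];
  return out;
}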

There are two other, more minor improvements also contributing to the above results, that I expect to be less controversial:

  1. Reuse of the TypedArray wrapping the WebAssembly.Memory instance (see the sketch after this list)
  2. Passing the xx64 seed halves into wasm via arguments rather than memory operations
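
A minimal sketch of item 1, assuming the usual caveat that growing the memory detaches existing views, so the cached view has to be rebuilt when that happens (names are illustrative, not the fork's code):

const memory = new WebAssembly.Memory({ initial: 1 });
let memoryView = new Uint8Array(memory.buffer);

function getMemoryView() {
  // memory.grow() replaces the underlying ArrayBuffer, so only rebuild the
  // view when that has happened; otherwise reuse the cached one.
  if (memoryView.buffer !== memory.buffer) {
    memoryView = new Uint8Array(memory.buffer);
  }
  return memoryView;
}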

Lastly, there's the matter of the streaming implementation. One key item there is that the streaming algorithm relies on memcpy, and to make that performant my fork enables bulk memory operations and uses the memory.copy instruction. Obviously use of that new target would be a significant breaking change for the library, so I wanted to make sure there was interest in changing such things before jumping through the relevant hoops. Implementation-wise, the addition of the streaming API also requires extracting finalize into its own function, but I don't expect that you'll care very much about that. The only other thing to mention about the streaming implementation is that I've only implemented the 64-bit algorithm, so we'd be talking about some additional API asymmetry.

In terms of the streaming API, it's currently:

const { update } = create(seedHigh, seedLow);
update(input).update(input).digest();

Certainly open to feedback there, but it's proven pretty functional to mirror the crypto API for my use case.
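
For the curious, a hypothetical closure-based shape for that kind of API (illustrative only; a plain array stands in for the wasm-backed state, and the seed arguments would be forwarded to the wasm init in the real version). Returning the same object from update is what lets the destructured function chain the way the crypto hashes do:

function create(seedHigh, seedLow) {
  // Placeholder state so the sketch runs on its own; the real state lives in wasm memory.
  const chunks = [];
  const hasher = {
    update(input) {
      chunks.push(input);   // real version: copy bytes into wasm memory and call the wasm update
      return hasher;        // returning the same object makes the destructured `update` chainable
    },
    digest() {
      return chunks.length; // real version: run finalize in wasm and format the result
    },
  };
  return hasher;
}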

So all that said, please let me know what you think and I'll get to work on a PR (or just publishing my fork wholesale if you're not interested).

Thanks,
Marcus

@jungomi
Owner

jungomi commented Jan 10, 2022

Hello,

I'm generally very open to all kinds of contributions, especially when they improve performance or usability. I don't have any plans to actively implement new features myself, but I'm happy to discuss and review any new features anyone wants to contribute.
Your improvements look very interesting and seem to be particularly effective for smaller inputs. I do have a few questions, and maybe concerns, that need to be addressed.

1. Replacement of the String => Uint8Array mechanism

It's interesting that Node's Buffer.from is that much faster. I'm assuming that you still need to convert it to a Uint8Array in order to use it in WebAssembly, something like:

const buf = Buffer.from(input);
const strBuffer = new Uint8Array(buf.buffer, buf.byteOffset, buf.byteLength);

On my machine running Node 17.3.0, Buffer.from is also about 3 times faster for small inputs, but for larger inputs (from around 10,000 bytes in the run below) it becomes slower than TextEncoder, not dramatically, but consistently.

TextEncoder x 2,427,883 ops/sec ±0.70% (87 runs sampled)
Buffer.from x 6,813,888 ops/sec ±0.89% (95 runs sampled)
Buffer.from -> Uint8Array x 5,553,867 ops/sec ±0.44% (95 runs sampled)
Benchmark 1 bytes - Fastest is Buffer.from
TextEncoder x 2,357,626 ops/sec ±1.83% (79 runs sampled)
Buffer.from x 6,059,737 ops/sec ±0.23% (99 runs sampled)
Buffer.from -> Uint8Array x 5,001,263 ops/sec ±0.27% (96 runs sampled)
Benchmark 10 bytes - Fastest is Buffer.from
TextEncoder x 2,214,023 ops/sec ±1.96% (67 runs sampled)
Buffer.from x 4,890,564 ops/sec ±0.86% (95 runs sampled)
Buffer.from -> Uint8Array x 4,175,874 ops/sec ±0.47% (99 runs sampled)
Benchmark 100 bytes - Fastest is Buffer.from
TextEncoder x 1,432,891 ops/sec ±1.34% (64 runs sampled)
Buffer.from x 2,237,759 ops/sec ±1.92% (91 runs sampled)
Buffer.from -> Uint8Array x 2,168,851 ops/sec ±0.47% (92 runs sampled)
Benchmark 1000 bytes - Fastest is Buffer.from
TextEncoder x 429,057 ops/sec ±1.24% (90 runs sampled)
Buffer.from x 307,961 ops/sec ±0.93% (92 runs sampled)
Buffer.from -> Uint8Array x 299,445 ops/sec ±0.95% (94 runs sampled)
Benchmark 10000 bytes - Fastest is TextEncoder
TextEncoder x 57,403 ops/sec ±1.79% (89 runs sampled)
Buffer.from x 37,743 ops/sec ±0.88% (88 runs sampled)
Buffer.from -> Uint8Array x 35,774 ops/sec ±1.31% (88 runs sampled)
Benchmark 100000 bytes - Fastest is TextEncoder
TextEncoder x 5,162 ops/sec ±4.60% (76 runs sampled)
Buffer.from x 3,215 ops/sec ±2.84% (81 runs sampled)
Buffer.from -> Uint8Array x 3,428 ops/sec ±1.76% (87 runs sampled)
Benchmark 1000000 bytes - Fastest is TextEncoder
TextEncoder x 327 ops/sec ±4.94% (64 runs sampled)
Buffer.from x 186 ops/sec ±2.59% (69 runs sampled)
Buffer.from -> Uint8Array x 227 ops/sec ±2.87% (77 runs sampled)
Benchmark 10000000 bytes - Fastest is TextEncoder
TextEncoder x 19.33 ops/sec ±1.57% (37 runs sampled)
Buffer.from x 16.26 ops/sec ±1.00% (45 runs sampled)
Buffer.from -> Uint8Array x 16.00 ops/sec ±1.08% (44 runs sampled)
Benchmark 100000000 bytes - Fastest is TextEncoder

Regarding compatibility with the browser, that remains an important aspect, but in this case it's quite simple: we only need to check whether Buffer exists, which means it's running in Node, and use it in that case. That would be just a check like const HAS_BUFFER = typeof Buffer !== "undefined".
I don't think it would be worth checking the number of bytes a string requires, because getting the accurate byte count needs something like Buffer.byteLength, which would hurt the smaller inputs quite hard, so it's probably better to just take the performance hit for larger inputs, as they are probably used much less often.
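
A minimal sketch of that check, assuming the simple existence test described above (the function name is illustrative):

const HAS_BUFFER = typeof Buffer !== "undefined";
const encoder = new TextEncoder();

function stringToUtf8(str) {
  // Buffer is a Uint8Array subclass in Node, so both branches return
  // something the wasm copy can consume directly.
  return HAS_BUFFER ? Buffer.from(str) : encoder.encode(str);
}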

2. Replacement of the u64 => Hex String mechanism

That is definitely a clever workaround for the limitation of u64. Even so, I think it could be replaced entirely by using JavaScript's BigInt, which was still a proposal when I first created this library but has now been widely supported for quite a while.
It would not only make it faster, because all the string workarounds can be avoided, but also allow returning the h64 as a number (BigInt) instead of having to resort to a string. That would be a breaking change, so to keep backwards compatibility it could perhaps just default to returning a string.
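
As a sketch of what that could look like, assuming the wasm export returns the hash as a single u64 BigInt via the BigInt integration (not the library's actual API):

function h64ToHex(hash) {
  // hash is a BigInt in the range [0, 2^64); pad to the full 16 hex digits.
  return hash.toString(16).padStart(16, "0");
}

h64ToHex(0xdeadbeefn); // "00000000deadbeef"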

Reuse of the TypedArray wrapping the WebAssembly.Memory instance

Sure, no issues there, as long as the correctness is ensured.

Passing the xx64 seed halves into wasm via arguments rather than memory operations

Absolutely, I don't even know why I didn't do that; probably just because I was already thinking of how to return the result, and wanted to reuse the same mechanism.
Again, using BigInt could replace them with a single argument rather than the two halves.
Oh, I guess that would then be a breaking change for the h64 API as well. At this point it might be worth either making all these breaking changes or adding a separate API to keep it backwards compatible.
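
For illustration, combining the two u32 halves into the single u64 BigInt that such an export could take (the signature is assumed, not the actual one):

function combineSeed(seedHigh, seedLow) {
  // >>> 0 forces both halves to be treated as unsigned 32-bit values.
  return (BigInt(seedHigh >>> 0) << 32n) | BigInt(seedLow >>> 0);
}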

Streaming API

That all sounds rather good, but not having the 32-bit version is definitely a major concern: as long as there is no technical limitation as to why only the 64-bit variant can have a streaming API, it should be provided for both.

Obviously use of that new target would be a significant breaking change for the library [...]

Is it? I'm not familiar with the streaming implementation of xxhash, but isn't it an entirely separate API? As far as I was aware, it's suggested to use the regular API for small inputs.

In terms of the API, that looks fine to me; it will never be as elegant as the regular versions, and keeping it in line with the crypto API certainly is a good idea.

Thank you for the interest and your proposals.

@marcusdarmstrong
Contributor Author

Buffer actually is a Uint8Array subclass, so there's no need for the additional wrapper. I've also personally been using UTF-16LE encoding, since from my measurements the overhead of the UTF-8 transformation exceeds the cost of hashing the additional bytes, and as long as I'm internally consistent with my encodings there's no correctness problem there.
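
To make those two points concrete (illustrative only):

const utf8Bytes = Buffer.from("hello");             // Buffer extends Uint8Array
console.log(utf8Bytes instanceof Uint8Array);       // true, no wrapper needed

const utf16Bytes = Buffer.from("hello", "utf16le"); // raw UTF-16 code units,
                                                    // skipping the UTF-8 transform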

Re: BigInt, I'm happy to do the analysis... Given that BigInt is arbitrary precision, I was assuming there'd be a performance hit vs u32 math.

Is it? I'm not familiar with the streaming implementation of xxhash, but isn't it an entirely separate API? As far as I was aware, it's suggested to use the regular API for small inputs.

Yes, it's a separate API. The thing I was trying to describe is that the wasm will now be using a new instruction (memory.copy), and thus the binary will presumably fail to validate on wasm engines that don't have bulk memory operations available.

@marcusdarmstrong
Contributor Author

I was surprised to see the large-byte-size degradation of the Buffer.from(utf8) mechanism, since that's not what the large benchmarks show overall, and indeed it does seem to come down to the encoding (also Node 17.3.0):

TextEncoder x 7.58 ops/sec ±0.24% (23 runs sampled)
Buffer UTF-8 x 5.15 ops/sec ±0.20% (17 runs sampled)
Buffer UTF-16 x 23.49 ops/sec ±0.93% (45 runs sampled)
Benchmark 100000000 bytes - Fastest is Buffer UTF-16

@marcusdarmstrong
Contributor Author

And I'm pleasantly surprised to see that BigInt should do nicely here. I benchmarked toString performance and there's minimal difference between my precomputed version and the BigInt toString call (with BigInt even coming out ahead frequently), so presumably removing the memory operations involved in the other approaches will yield a net win.

Bigint x 17,357,372 ops/sec ±0.44% (96 runs sampled)
Concatenated Ints x 12,257,712 ops/sec ±0.24% (99 runs sampled)
Precomputed Bytes x 16,643,248 ops/sec ±1.19% (92 runs sampled)

I'm happy to do the work to replace the relevant calls with BigInts and can probably swing doing a 32-bit streaming implementation... Though at this point it feels like releasing a major version is likely the best approach for all of this, if you've got no concerns about that?

@jungomi
Owner

jungomi commented Jan 10, 2022

Yes, it's a separate API, the thing I was trying to describe is that the wasm will now be using a new wasm instruction (memory.copy) and thus the binary will presumably fail to validate against wasm engines that don't have bulk memory instructions available.

Ah, I see. From what I've seen, memory.copy is supported by all major browsers and Node, and it seems to have been implemented/proposed at around the same time as the BigInt JS-to-Wasm integration, which landed roughly two years ago.
Hence it's not really a concern in my eyes, unless you know of a specific target that is still frequently used and would not support it at this time.

I'm happy to do the work to replace the relevant calls with BigInts and can probably swing doing a 32 bit streaming implementation... Though at this point it feels like releasing a major is likely the best approach for all of this, if you've got no concerns about that?

Absolutely, that would be great. With the addition of the streaming API alongside these performance improvements, I think it's the perfect opportunity to release v1.0.0, so having the breaking changes with the BigInts seems fine.

@marcusdarmstrong
Contributor Author

Great! I certainly don't know of an implementation that doesn't support bulk memory, just wanted to be explicit about changes.

One last question and I can get started: my current version uses top-level await rather than a promise-factory export. I'm assuming that for CJS compatibility you'd prefer to keep the promise factory pattern, but figured I should check.

@jungomi
Owner

jungomi commented Jan 10, 2022

I don't think I fully understand what you mean by using top-level await in this particular case. I'm assuming you mean that you initialise it immediately in the module, without having to call a function first; is that right?

If yes, I'm definitely very apprehensive about that. Having it initialised when the module is first loaded makes it unpredictable when the work effectively happens, and makes it impossible to avoid initialising it on code paths where it's not used, short of code splitting and dynamic imports, which not only isn't trivial but is also very inefficient for such a small library.
Even if the cost is quite insignificant, and even though it's async, it still takes time away from something else; it can add up, and I'm all for avoiding unnecessary side effects.
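
For reference, the two module shapes being discussed, sketched with placeholder names (wasmBytes stands in for the embedded module; neither snippet is the library's actual code):

// (a) Top-level await: instantiation runs as a side effect of importing the module.
const { instance } = await WebAssembly.instantiate(wasmBytes);
export const h64 = (input) => instance.exports.h64(input);

// (b) Promise factory (the current pattern): nothing is instantiated until the
// consumer calls the exported factory, so unused code paths pay no cost.
export default async function xxhash() {
  const { instance } = await WebAssembly.instantiate(wasmBytes);
  return { h64: (input) => instance.exports.h64(input) };
}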

@marcusdarmstrong
Contributor Author

marcusdarmstrong commented Jan 10, 2022

I don't think I fully understand what you mean by using top-level await in this particular case. I'm assuming you mean that you initialise it immediately in the module, without having to call a function first, is that what you meant?

Yep!

Sounds good, I'll maintain the current pattern.
