Migrate h64 to BigInts, add h32 and h64 streaming APIs, plus performance improvements #26
Conversation
Thank you for your work. The CI is failing because it still includes Node 12 and 14, but with the requirement now being Node 15, these need to be removed. While they are still on Maintenance LTS, it is okay to move on and only support the active LTS (Node 16) and the current release. At least locally, I also had an issue with installing the dependencies on Node 17.
For the test coverage, there is one branch that is not covered: the streaming API with raw inputs. Also, for the streaming API it would be nice to have a short usage example in the README. Regarding your open questions:
About the downsides:
Besides that, it looks good to me and the improvements are fantastic, particularly for the 64-bit version.
Passing a number as a BigInt seed is a runtime error. I've added the relevant test cases, removed the no-longer-compatible CI targets, and updated the README with 1. a streaming example, 2. a mention of the limitations of the string API, and 3. a discussion of engine requirements. After some thought, I've also added two new API methods here; feedback is of course welcome.
Ah okay, then it's fine as it is. Otherwise the rest looks fine to me.
I completely agree. I think the risk of such a structural change going unnoticed on a major version update is quite minor, particularly if it's highlighted in the release notes. As far as naming goes, how about...
Works for me, I'll go with `h64ToString()`.
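For illustration only, the zero-padded hex conversion such a method would perform on a `BigInt` hash can be sketched like this (the function name `toHex64` is hypothetical, not the library's actual export):

```javascript
// Sketch: render a 64-bit BigInt hash as a zero-padded, 16-character
// hex string. The function name is illustrative only.
function toHex64(hash) {
  // Normalize to an unsigned 64-bit value, then pad to 16 hex digits.
  return BigInt.asUintN(64, hash).toString(16).padStart(16, "0");
}

console.log(toHex64(255n)); // "00000000000000ff"
console.log(toHex64(-1n));  // "ffffffffffffffff"
```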
Last couple of remarks, just for the README.
Thank you very much for your contributions. I'll be looking to prepare the release.
Regarding the switch from two numbers to BigInt for 64-bit seeds: my understanding is that BigInts are much slower for almost all operations performed on them, so is it safe to assume that the performance impact of the call overhead here has been tested to be negligible?
In short, yes. I did detailed performance evaluation for every change included in this PR. In practice, there's very little in the way of actual operations that need to be performed on a BigInt in the relevant code paths: the math is all done on i64 types inside of wasm and then boxed to a BigInt on the way out. The only actual BigInt operation is the u64-izing, which, while it has overhead, is significantly cheaper than the hi/low munging previously in use.
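Assuming the "u64-izing" refers to normalizing the signed i64 that wasm hands back, `BigInt.asUintN` does it in one call; for comparison, this sketch (not the library's actual code) also shows the hi/low reassembly the previous approach required:

```javascript
// wasm boxes its i64 result as a *signed* BigInt; a negative value just
// means the top bit is set. Normalizing to unsigned is a single call:
const signed = -6261826837668904525n; // example i64 result from wasm
const u64 = BigInt.asUintN(64, signed);

// The old approach had to reassemble two 32-bit halves instead:
const hi = Number(u64 >> 32n) >>> 0;
const lo = Number(u64 & 0xffffffffn) >>> 0;
const reassembled = (BigInt(hi) << 32n) | BigInt(lo);

console.log(u64 === reassembled); // true
```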
Resolves #9 and #25.
This PR includes changes to significantly improve the performance of the existing APIs (mostly as discussed in #25), as well as changes to migrate the h64 APIs to use BigInts and add new streaming APIs for both h32 and h64.
## Open Questions

I'll start with my "open questions". I've made choices here for purposes of posting the PR, but I'd describe them as the shakiest portion of the changeset, so they accordingly warrant some discussion.
- `string` return behavior: I've maintained the original API's type contract of returning strings from the `h32` and `h64` functions; however, I don't feel great about that. The fact that these APIs take strings is somewhat orthogonal to the idea that they should return them, and doing so incurs a performance cost in the case where the raw numeric hash values would be an appropriate result of a string hash. That said, there are certainly some ergonomics to be concerned with, as most users would expect hex string return values for hashes. I've also zero-padded in the general case here, again to align with my perception of user expectation.
- `BigInt`/`number` asymmetry between the 32-bit and 64-bit APIs: since the 32-bit API can express its seed/value via a `number`, I've retained the existing API contracts there; however, there's an argument to be made that we can/should use `BigInt`s for both, to provide a completely consistent API between the algorithms.
- `Hash<T>.update` accepts `string | Uint8Array`, rather than separate `update`/`updateRaw` APIs: this matches the Node `crypto` API, but diverges from the `h*`/`h*Raw` dichotomy. Removing the branching might provide a performance improvement, so I'm certainly open to either option.

## Downsides

These changes use some new APIs and change the target of the library accordingly:

- Bulk Memory Operations are available in Chrome >= 75, Firefox >= 79, Safari >= 15, and Node >= 12.5 (behind a flag).
- WASM BigInt support is available in Chrome >= 85, Firefox >= 78, Safari >= 14.1, and Node >= 15.
- `TextEncoder.encodeInto` is available in Chrome >= 74, Firefox >= 66, Safari >= 14.1, and Node >= 12.11.

The end result is that the practical library target is now the intersection of the versions listed above.
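Given those requirements, a consumer wanting to guard against older engines could do a coarse presence check before loading the library; this is an illustrative sketch, not something the PR itself adds:

```javascript
// Consumer-side sketch: coarse feature detection for the APIs this PR
// relies on. These check presence only; proving full wasm i64/BigInt
// integration would require instantiating a small test module.
const hasBigInt = typeof BigInt === "function";
const hasWasm = typeof WebAssembly === "object";
const hasEncodeInto =
  typeof TextEncoder !== "undefined" &&
  typeof TextEncoder.prototype.encodeInto === "function";

console.log(hasBigInt && hasWasm && hasEncodeInto); // true on Node >= 15
```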
## Benchmarks
Benchmarks run on a 2019 x64 MacBook Pro running Node 17.3.0.
`xxhash-wasm-1` is the version from this PR, while `xxhash-wasm` is v0.4.2.

## Change Overview
- `h64` now uses `BigInt`s for both seeds and return values from wasm. The seed is now communicated via an argument rather than memory writes. The Raw variant now directly returns a `BigInt`, rather than a `Uint8Array`.
- The `Uint8Array` providing access to the wasm memory buffer is now reused and only reallocated on memory growth.
- `string` encoding has been moved to `TextEncoder.encodeInto`, after a bunch of benchmarking. In #25 I had noted large performance differences between `Buffer.from` and `TextEncoder`; it turns out those differences are mostly ARM-specific (my cited benchmark runs in that discussion are from an M1 MacBook Pro). While `Buffer.from` does still outperform `TextEncoder.encode` on small inputs, eliminating the second memcpy those APIs entail, by using `TextEncoder.encodeInto` to encode directly into the wasm linear memory, yields substantial improvements, as documented above. The downside is that we have to overprovision the wasm memory to account for the risk that we're encoding high-byte-size code points; should that extra memory use be a concern for a particular use case, the Raw APIs provide a reasonable means of handling it.
- Added `create32` and `create64`, which allow creation of stateful hashes that can be used for streaming hash applications. These hashers are closed over an xxh state object, which we push into the wasm memory whenever we need to update or digest the hash, and read out afterwards. The WAT is a direct port of the C implementation of xxh, done very manually.

In general, the JS code is extremely divergent from the previous version, so it's probably easiest to review as if it were entirely new code.
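The overprovisioning mentioned above follows from `encodeInto`'s contract: a JS string of length n can need up to 3n bytes of UTF-8 output, so the destination slice must be sized for the worst case. A standalone sketch (the 3x factor is the general UTF-16 worst case, not necessarily the exact factor the library uses):

```javascript
// TextEncoder.encodeInto writes UTF-8 directly into a caller-provided
// buffer (e.g. wasm linear memory) and reports { read, written }.
const encoder = new TextEncoder();
const input = "héllo"; // 5 UTF-16 code units

// Worst case: 3 UTF-8 bytes per UTF-16 code unit, so overprovision 3x.
const buf = new Uint8Array(input.length * 3);
const { read, written } = encoder.encodeInto(input, buf);

console.log(read);    // 5  (all code units consumed)
console.log(written); // 6  ("é" takes two UTF-8 bytes)
```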
Please let me know what adjustments you'd like to see and I'll happily get on them ASAP.