Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate h64 to BigInts, add h32 and h64 streaming APIs, plus performance improvements #26

Merged
merged 12 commits into from
Jan 14, 2022
Merged
1 change: 1 addition & 0 deletions .eslintrc.js
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ module.exports = {
}
},
env: {
es2020: true,
es6: true,
browser: true,
node: true
Expand Down
2 changes: 1 addition & 1 deletion .github/workflows/nodejs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ jobs:

strategy:
matrix:
node-version: [12.x, 14.x, 16.x]
node-version: [16.x]

steps:
- uses: actions/checkout@v2
Expand Down
175 changes: 115 additions & 60 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,9 @@ setup needed.
* [ES Modules](#es-modules)
* [UMD build](#umd-build)
* [Usage](#usage)
* [Streaming](#streaming)
* [Node](#node)
* [Performance]
* [API](#api)
* [Comparison to xxhashjs](#comparison-to-xxhashjs)
* [Benchmarks](#benchmarks)
Expand Down Expand Up @@ -89,6 +91,26 @@ h32(input); // ee563564
h64(input); // 502b0c5fc4a5704c
```

### Streaming

`xxhash-wasm` supports a `crypto`-like streaming api, useful for avoiding memory consumption when hashing large amounts of data:

```javascript
const { create32, create64 } = await xxhash();

// 32-bit version
create32()
.update("some data")
.update(Uint8Array.from([1, 2, 3]))
jungomi marked this conversation as resolved.
Show resolved Hide resolved
.digest(); // 955607085

// 64-bit version
create64()
.update("some data")
.update(Uint8Array.from([1, 2, 3]))
.digest(); // 883044157688673477n
```

### Node

This was initially meant for the browser, but Node 8 also added support for
Expand Down Expand Up @@ -116,6 +138,12 @@ the `module` field in `package.json`, such as webpack or Rollup, you will need
to explicitly import `xxhash-wasm/cjs/xxhash-wasm` otherwise the browser version
is used.

### Performance

For performance sensitive applications, `xxhash-wasm` provides the `h**String` and `h**Raw` APIs, which return raw numeric hash results rather than zero-padded hex strings. The overhead of the string conversion can be as much as 20% of overall runtime when hashing small byte-size inputs, and the string result is often inconsequential (if one is simply going to compare the results). When necessary, getting a zero-padded hex string from the provided `number` or `BigInt` results is easily achieved via `result.toString(16).padStart(16, "0")`.

The `h**`, `h**String`, and streaming APIs make use of `TextEncoder.encodeInto` to directly encode strings as a stream of UTF-8 bytes into the webassembly memory buffer, meaning that for string-hashing purposes, these APIs will be significantly faster than converting the string to bytes externally and using the `Raw` API. That said, for large strings it may be beneficial to consider the streaming API or another approach to encoding, as `encodeInto` is forced to allocate 3-times the string length to account for the chance the input string contains high-byte-count code units.

## API

`const { h32, h64 } = await xxhash()`
Expand All @@ -124,43 +152,53 @@ Create a WebAssembly instance.

`h32(input: string, [seed: u32]): string`

Generate a 32-bit hash of `input`. The optional `seed` is a `u32` and any number
greater than the maximum (`0xffffffff`) is wrapped, which means that
`0xffffffff + 1 = 0`.
Generate a 32-bit hash of the UTF-8 encoded bytes of `input`. The optional
`seed` is a `u32` and any number greater than the maximum (`0xffffffff`) is
wrapped, which means that `0xffffffff + 1 = 0`.

Returns a string of the hash in hexadecimal.
Returns a string of the hash in hexadecimal, zero padded.

`h32String(input: string, [seed: u32]): number`

Same as `h32`, but returning a `number`. This avoids the overhead of the string formatting of the result.

`h32Raw(input: Uint8Array, [seed: u32]): number`

Same as `h32` but with a Uint8Array as input instead of a string and returns the
hash as a number.
Same as `h32` but with a `Uint8Array` as input instead of a `string` and returns the
hash as a `number`.

`h64(input: string, [seedHigh: u32, seedLow: u32]): string`
`create32([seed: number]): Hash<number>`

Generate a 64-bit hash of `input`. Because JavaScript doesn't support `u64` the
seed is split into two `u32`, where `seedHigh` represents the first 32-bits of
the `u64` and `seedLow` the remaining 32-bits. For example:
Create a 32-bit hash for streaming applications. See `Hash<T>` below.

```javascript
// Hex
seed64: ffffffff22222222
seedhigh: ffffffff
seedLow: 22222222

// Binary
seed64: 1111111111111111111111111111111100100010001000100010001000100010
seedhigh: 11111111111111111111111111111111
seedLow: 00100010001000100010001000100010
```
`h64(input: string, [seed: BigInt]): string`

Generate a 64-bit hash of the UTF-8 encoded bytes of `input`. The optional
`seed` is a `u64` provided as a BigInt.

Returns a zero-padded string of the hash in hexadecimal.

Each individual part of the seed is a `u32` and they are also wrapped
individually for numbers greater than the maximum.
`h64String(input: string, [seed: BigInt]): BigInt`

Returns a string of the hash in hexadecimal.
Same as `h64`, but returning a `BigInt`. This avoids the overhead of the string formatting of the result.

`h64Raw(input: Uint8Array, [seedHigh: u32, seedLow: u32]): Uint8Array`
`h64Raw(input: Uint8Array, [seed: BigInt]): BigInt`

Same as `h64` but with a Uint8Array as input and output.
Same as `h64` but with a `Uint8Array` as input, returning an unformatted `BigInt`
hash value.

`create64([seed: BigInt]): Hash<BigInt>`

Create a 64-bit hash for streaming applications. See `Hash<T>` below.

`type Hash<T> {
update(input: string | Uint8Array): Hash<T>;
digest(): T
}`

The streaming API mirrors Node's built-in `crypto.createHash`, providing
`update` and `digest` methods to add data to the hash and compute the final hash
value, respectively.

## Comparison to [xxhashjs][xxhashjs]

Expand All @@ -178,43 +216,60 @@ further overhead. For the benchmarks below, the instantiation is done before the
benchmark and therefore it's excluded from the results, since it wouldn't make
sense to always create a new WebAssembly instance.

There is still the problem that JavaScript can't represent 64-bit integers and
both the seed and the result of the 64-bit algorithm are `u64`. To work around
this, the seed and the result are split into two `u32`, which are assembled and
disassembled into/from a `u64`. Splitting the seed into two `u32` isn't a big
deal, but the result is more problematic because to assemble the 64-bit hash in
JavaScript, 3 strings have to be created: The hex representation of the first
`u32` and the hex representation of the second `u32`, which are then
concatenated to a 64-bit hex representation. That are 2 additional strings and
this is a notable overhead when the input is small.

### Benchmarks

Benchmarks are using [Benchmark.js][benchmarkjs] with random strings of
different lengths. *Higher is better*

| String length | xxhashjs 32-bit | xxhashjs 64-bit | xxhash-wasm 32-bit | xxhash-wasm 64-bit |
| ------------------------: | ------------------ | ------------------ | ------------------------- | ---------------------- |
| 1 byte | 683,014 ops/sec | 12,048 ops/sec | ***1,475,214 ops/sec*** | 979,656 ops/sec |
| 10 bytes | 577,761 ops/sec | 12,073 ops/sec | ***1,427,115 ops/sec*** | 960,567 ops/sec |
| 100 bytes | 379,348 ops/sec | 10,242 ops/sec | ***1,186,211 ops/sec*** | 682,422 ops/sec |
| 1,000 bytes | 88,732 ops/sec | 7,755 ops/sec | ***522,107 ops/sec*** | 504,409 ops/sec |
| 10,000 bytes | 11,754 ops/sec | 1,694 ops/sec | 93,817 ops/sec | ***97,087 ops/sec*** |
| 100,000 bytes | 721 ops/sec | 174 ops/sec | 10,247 ops/sec | ***11,069 ops/sec*** |
| 1,000,000 bytes | 55.38 ops/sec | 15.98 ops/sec | 1,019 ops/sec | ***1,101 ops/sec*** |
| 10,000,000 bytes | 5.98 ops/sec | 1.77 ops/sec | 98.92 ops/sec | ***107 ops/sec*** |
| 100,000,000 bytes | 0.63 ops/sec* | 0.19 ops/sec* | 9.95 ops/sec | ***10.80 ops/sec*** |

`*` = Runs out of memory with the default heap size.

`xxhash-wasm` outperforms `xxhashjs` significantly, the 32-bit is up to 18 times
faster (increases as the size of the input grows), and the 64-bit is up to 81
times faster (decreases as the size of the input grows).

The 64-bit version is the faster algorithm, but it only starts to become faster
at a little over 1kB because of the above mentioned limitations of JavaScript
numbers. After that the 64-bit version is roughly 10% faster than the 32-bit
version. For `xxhashjs` the 64-bit is strictly worse.
| String length | xxhashjs 32-bit | xxhashjs 64-bit | xxhash-wasm 32-bit | xxhash-wasm 64-bit |
| ------------------------: | ------------------ | ------------------ | ----------------------- | ----------------------- |
| 1 byte | 513,517 ops/sec | 11,896 ops/sec | 2,122,322 ops/sec | ***3,556,437 ops/sec*** |
| 10 bytes | 552,133 ops/sec | 12,953 ops/sec | 2,075,154 ops/sec | ***3,196,204 ops/sec*** |
| 100 bytes | 425,277 ops/sec | 10,838 ops/sec | 1,867,002 ops/sec | ***3,169,921 ops/sec*** |
| 1,000 bytes | 102,165 ops/sec | 6,697 ops/sec | 1,468,336 ops/sec | ***2,393,906 ops/sec*** |
| 10,000 bytes | 13,010 ops/sec | 1,452 ops/sec | 446,415 ops/sec | ***787,430 ops/sec*** |
| 100,000 bytes | 477 ops/sec | 146 ops/sec | 58,339 ops/sec | ***91,895 ops/sec*** |
| 1,000,000 bytes | 36.40 ops/sec | 12.93 ops/sec | 5,443 ops/sec | ***8,379 ops/sec*** |
| 10,000,000 bytes | 3.12 ops/sec | 1.19 ops/sec | 307 ops/sec | ***417 ops/sec*** |
| 100,000,000 bytes | 0.31 ops/sec | 0.13 ops/sec | 28.70 ops/sec | ***35.63 ops/sec*** |

`xxhash-wasm` outperforms `xxhashjs` significantly, the 32-bit is up to 98 times
faster (generally increases as the size of the input grows), and the 64-bit is
up to 350 times faster (generally increases as the size of the input grows).

The 64-bit version is the faster algorithm, and retains a performance advantage
over all lengths over xxhashjs and the 32-bit algorithm.

`xxhash-wasm` also significantly outperforms Node's built-in hash algorithms,
making it suitable for use in a wide variety of situations. Benchmarks from
an x64 MacBook Pro running Node 17.3:

| String length | Node `crypto` md5 | Node `crypto` sha1 | xxhash-wasm 64-bit |
| ------------------------: | ------------------ | ------------------ | ----------------------- |
| 1 byte | 342,924 ops/sec | 352,825 ops/sec | ***3,556,437 ops/sec*** |
| 10 bytes | 356,596 ops/sec | 352,209 ops/sec | ***3,196,204 ops/sec*** |
| 100 bytes | 354,898 ops/sec | 355,024 ops/sec | ***3,169,921 ops/sec*** |
| 1,000 bytes | 249,242 ops/sec | 271,383 ops/sec | ***2,393,906 ops/sec*** |
| 10,000 bytes | 62,896 ops/sec | 80,986 ops/sec | ***787,430 ops/sec*** |
| 100,000 bytes | 7,316 ops/sec | 10,198 ops/sec | ***91,895 ops/sec*** |
| 1,000,000 bytes | 698 ops/sec | 966 ops/sec | ***8,379 ops/sec*** |
| 10,000,000 bytes | 58.98 ops/sec | 79.78 ops/sec | ***417 ops/sec*** |
| 100,000,000 bytes | 6.30 ops/sec | 8.20 ops/sec | ***35.63 ops/sec*** |

If suitable for your use case, the `Raw` API offers significant throughput
improvements over the string-hashing API, particularly for smaller inputs:

| String length | xxhash-wasm 64-bit Raw | xxhash-wasm 64-bit |
| ------------------------: | ----------------------- | ------------------- |
| 1 byte | ***9,342,811 ops/sec*** | 3,556,437 ops/sec |
| 10 bytes | ***9,668,989 ops/sec*** | 3,196,204 ops/sec |
| 100 bytes | ***8,775,845 ops/sec*** | 3,169,921 ops/sec |
| 1,000 bytes | ***5,541,403 ops/sec*** | 2,393,906 ops/sec |
| 10,000 bytes | ***1,079,866 ops/sec*** | 787,430 ops/sec |
| 100,000 bytes | ***113,350 ops/sec*** | 91,895 ops/sec |
| 1,000,000 bytes | ***9,779 ops/sec*** | 8,379 ops/sec |
| 10,000,000 bytes | ***563 ops/sec*** | 417 ops/sec |
| 100,000,000 bytes | ***43.77 ops/sec*** | 35.63 ops/sec |

### Bundle size

Expand All @@ -225,8 +280,8 @@ minified versions. *Lower is better*.

| | xxhashjs | xxhash-wasm |
| -------------- | ---------- | ------------- |
| Bundle size | 41.5kB | ***3.7kB*** |
| Gzipped Size | 10.3kB | ***1.2kB*** |
| Bundle size | 41.5kB | ***11.4kB*** |
| Gzipped Size | 10.3kB | ***2.3kB*** |

[actions-nodejs-badge]: https://github.com/jungomi/xxhash-wasm/actions/workflows/nodejs.yml/badge.svg
[actions-nodejs-link]: https://github.com/jungomi/xxhash-wasm/actions/workflows/nodejs.yml
Expand Down
8 changes: 4 additions & 4 deletions package.json
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
"build-js": "yarn run build-browser && yarn run build-node",
"build-browser": "rollup -c",
"build-node": "TARGET=node rollup -c",
"build-wasm": "wasm-opt -O3 src/xxhash.wat -o src/xxhash.wasm",
"build-wasm": "wasm-opt --enable-bulk-memory -O4 src/xxhash.wat -o src/xxhash.wasm",
"clean": "rimraf coverage cjs esm umd",
"fix": "eslint . --fix",
"lint": "eslint .",
Expand Down Expand Up @@ -60,15 +60,15 @@
"bundlesize": [
{
"path": "./cjs/xxhash-wasm.js",
"maxSize": "1.5kb"
"maxSize": "2.2kb"
},
{
"path": "./esm/xxhash-wasm.js",
"maxSize": "1.5kb"
"maxSize": "2.2kb"
},
{
"path": "./umd/xxhash-wasm.js",
"maxSize": "1.5kb"
"maxSize": "2.3kb"
}
]
}
Loading