Optimize XxHash3 #77756

xoofx · 2022-11-01T22:26:14Z

Hey there!
This is a PR to optimize XxHash3 for large buffers (> 240 bytes) by avoiding a load and store of the accumulators for every 64-byte stripe and instead doing it once after a whole batch.
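
A minimal sketch of the idea, for illustration only. It is not the code from this PR: it is Python rather than C#, it models only a scalar accumulate round in the style of the xxHash reference implementation, it skips the scramble step, and the `secret` bytes and all function names are made up for the example. The point is only that both shapes produce the same accumulators, while the second touches the shared state once per batch instead of once per 64-byte stripe.

```python
MASK64 = (1 << 64) - 1
STRIPE_LEN = 64  # bytes consumed per accumulate step
LANES = 8        # eight 64-bit accumulators


def read_lanes(buf, offset):
    """Read eight little-endian 64-bit lanes starting at `offset`."""
    return [int.from_bytes(buf[offset + 8 * i:offset + 8 * i + 8], "little")
            for i in range(LANES)]


def accumulate_stripe(acc, data, key):
    """Mix one 64-byte stripe into `acc` in place (scalar XXH3-style round)."""
    for i in range(LANES):
        mixed = data[i] ^ key[i]
        acc[i ^ 1] = (acc[i ^ 1] + data[i]) & MASK64
        acc[i] = (acc[i] + (mixed & 0xFFFFFFFF) * (mixed >> 32)) & MASK64


def accumulate_per_stripe(state, buf, secret):
    """Baseline shape: load and store the shared accumulators for every stripe."""
    for n in range(len(buf) // STRIPE_LEN):
        acc = list(state)                       # load accumulators
        accumulate_stripe(acc, read_lanes(buf, n * STRIPE_LEN),
                          read_lanes(secret, n * 8))
        state[:] = acc                          # store accumulators


def accumulate_per_batch(state, buf, secret):
    """Batched shape: load once, process every stripe, store once."""
    acc = list(state)                           # load accumulators once
    for n in range(len(buf) // STRIPE_LEN):
        accumulate_stripe(acc, read_lanes(buf, n * STRIPE_LEN),
                          read_lanes(secret, n * 8))
    state[:] = acc                              # store accumulators once


if __name__ == "__main__":
    data = bytes(range(256)) * 4                             # 16 stripes
    secret = bytes((i * 7 + 3) % 256 for i in range(192))    # arbitrary, not the real XXH3 secret
    a, b = list(range(1, 9)), list(range(1, 9))
    accumulate_per_stripe(a, data, secret)
    accumulate_per_batch(b, data, secret)
    print(a == b)
```

In a vectorized implementation the local `acc` would correspond to values the JIT can keep in SIMD registers across the batch; Python obviously does not model that part.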

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.1098/21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.1.22431.12
  [Host]     : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2


|        Method |          data |     Mean |    Error |   StdDev | Ratio |
|-------------- |-------------- |---------:|---------:|---------:|------:|
|    XXH3Native | Byte[1048576] | 27.81 us | 0.070 us | 0.062 us |  0.43 |
|          XXH3 | Byte[1048576] | 64.14 us | 0.102 us | 0.085 us |  1.00 |
| XXH3Optimized | Byte[1048576] | 31.89 us | 0.070 us | 0.059 us |  0.50 |

XXH3Native is the C++ version
XXH3 is the current C# version
XXH3Optimized is the C# version from this PR

So overall, this PR brings a 2x performance improvement on large buffers and narrows the performance gap with the native version (~10%).

Small note not covered by this PR: while checking with the native version, I noticed that the hash is different (for both the current and optimized version vs native). Will have to dig further why.

ghost · 2022-11-01T22:26:28Z

Tagging subscribers to this area: @dotnet/area-system-io
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	xoofx
Assignees:	-
Labels:	`area-System.IO`
Milestone:	-

xoofx · 2022-11-01T22:48:21Z

Another note: When I checked the implementation on macOS with an M1 Pro, I noticed that the ARM64 implementation is significantly slower than the x86 version (e.g., 3x). I haven't checked, but I saw in the xxHash repository that they do things slightly differently for the ARM platform (e.g., maybe using aligned loads). Will have to dig into this one as well.

EgorBo · 2022-11-01T22:55:22Z

> I noticed that the ARM64 implementation is significantly slower than the x86 version

I bet it's because of this: #76641 (comment)

`operator *` for `Vector128<ulong>` ends up in a slow fallback because it doesn't exist on ARM. cc @stephentoub
Namely, this line: https://github.com/stephentoub/runtime/blob/c3056f711fe315e71dc6e90ccd4184674bb01a73/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L784

src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs

stephentoub · 2022-11-02T00:51:51Z

> I noticed that the hash is different (for both the current and optimized version vs native). Will have to dig further why.

What native implementation are you calling and what are you comparing it against? I'd validated all of the test data against a locally built xxhash.dll calling XXH3_64bits_withSeed.

src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs

stephentoub · 2022-11-02T00:54:59Z

> > I noticed that the ARM64 implementation is significantly slower than the x86 version
>
> I bet it's because of this: #76641 (comment)
>
> `operator *` for `Vector128<ulong>` ends up in a slow fallback because it doesn't exist on ARM. cc @stephentoub Namely, this line: https://github.com/stephentoub/runtime/blob/c3056f711fe315e71dc6e90ccd4184674bb01a73/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L784

Yeah, you can see the comment I left in the code about it and my question about better ideas for dealing with it:
#76641 (comment)

xoofx · 2022-11-02T07:56:54Z

> What native implementation are you calling and what are you comparing it against? I'd validated all of the test data against a locally built xxhash.dll calling XXH3_64bits_withSeed.

Yes, I validated against a locally built xxhash.
Previously I was only casting to a uint, but generating the full hash gives me the following:

Native: 0xd36c0e13a3df139e
C#:     0x9e13dfa3130e6cd3

The C# bytes are read from the destination span, casting to a Span<ulong>. The native version is what is returned by the API.

So the bytes are actually in reverse order. Is it expected?

xoofx · 2022-11-02T10:28:23Z

Tests are failing, will dig further into these errors tonight.

stephentoub · 2022-11-02T10:43:05Z

> So the bytes are actually in reverse order. Is it expected?

Yes.

runtime/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs

Line 21 in 9a138bf

/// the value is written in the Big Endian byte order.

#76279

#76641 (comment)

FYI, @bartonjs. I'm not alone here ;-)

ghost · 2022-11-02T12:09:27Z

Tagging subscribers to this area: @dotnet/area-system-security, @vcsjones
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	xoofx
Assignees:	-
Labels:	`area-System.IO`, `area-System.Security`, `community-contribution`
Milestone:	-

xoofx · 2022-11-02T18:22:42Z

> Tests are failing, will dig further into these errors tonight.

Commit b72972c should fix the tests failing on the scalar path.

xoofx · 2022-11-02T18:35:13Z

> FYI, @bartonjs. I'm not alone here ;-)

Not sure I understand the discussion about the endianness (intermixed with whether it should be commented on in the XML doc API), but I concur that using Big Endian is really confusing and counterintuitive. Frankly, who is going to expect this? The last time I programmed a big-endian processor was 30 years ago, with the Motorola 68000 😅

XXHash is following the native endian order (except for 128 bits, where they use two 64-bit values), and I would hope that we could keep the same logic so that it doesn't cause such trouble when trying to match the C++ implementation.

stephentoub · 2022-11-02T18:38:57Z

> XXHash is following the native endian order

I don't see any APIs that are outputting bytes, only numbers. Can you elaborate?

xoofx · 2022-11-02T18:47:52Z

> I don't see any APIs that are outputting bytes, only numbers. Can you elaborate?

Sorry, I meant that they output numbers (except for the 128 bits which are 2x64 bit numbers) that thus follow the native endianness. Unless there is something in the .NET HashAlgorithm that mandates that all of them should always emit byte[] in big endian form for any underlying hash algorithm?

stephentoub · 2022-11-02T19:02:11Z

> Sorry, I meant that they output numbers (except for the 128 bits which are 2x64 bit numbers) that thus follow the native endianness

That's where #76279 will come in.

> Unless there is something in the .NET HashAlgorithm that mandates that all of them should always emit byte[] in big endian form for any underlying hash algorithm?

The abstract base type here only has methods which output bytes rather than numbers. The xxhash implementations emit those bytes as big endian because of this:
https://github.com/Cyan4973/xxHash/blob/12d98c60fef0b7b6aee7943cf6ade7aba05e9c48/doc/xxhash_spec.md#step-7-output
"For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows big-endian convention (most significant byte first)."
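
To make the byte-order point concrete, here is a small sketch using the two values posted earlier in this thread. It is Python rather than C#, the helper names are hypothetical, and it assumes the C# value was obtained by reinterpreting the 8 output bytes as a `ulong` on a little-endian machine (which is what casting the destination span to `Span<ulong>` does on x64).

```python
NATIVE_HASH = 0xD36C0E13A3DF139E  # number reported for the native API in this thread


def canonical_bytes(hash_value):
    """Canonical xxHash output: most significant byte first (big endian)."""
    return hash_value.to_bytes(8, "big")


def read_u64(data, byteorder):
    """Interpret 8 bytes as an unsigned 64-bit integer in the given byte order."""
    return int.from_bytes(data, byteorder)


if __name__ == "__main__":
    out = canonical_bytes(NATIVE_HASH)
    print(out.hex())                      # d36c0e13a3df139e
    print(hex(read_u64(out, "little")))   # 0x9e13dfa3130e6cd3 (the "C#" value above)
    print(hex(read_u64(out, "big")))      # 0xd36c0e13a3df139e (matches native)
```

So the two numbers describe the same 8 bytes; only the read differs. On the .NET side, reading the destination span with `BinaryPrimitives.ReadUInt64BigEndian` instead of casting it should recover the native number.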

xoofx · 2022-11-02T19:15:06Z

> "For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows big-endian convention (most significant byte first)."

Oh, right, I can see it mentioned in the header file here

That makes sense then, thanks!

stephentoub · 2022-11-03T20:45:08Z

FWIW, I don't see 50% improvement. On my machine the best I see is ~25%:

| Method | Toolchain         |    Size |         Mean | Ratio |
|------- |------------------ |--------:|-------------:|------:|
|   Hash | \main\corerun.exe |     240 |     20.05 ns |  1.00 |
|   Hash | \pr\corerun.exe   |     240 |     20.10 ns |  1.00 |
|   Hash | \main\corerun.exe |     512 |     31.55 ns |  1.00 |
|   Hash | \pr\corerun.exe   |     512 |     27.77 ns |  0.88 |
|   Hash | \main\corerun.exe |    1024 |     52.49 ns |  1.00 |
|   Hash | \pr\corerun.exe   |    1024 |     41.17 ns |  0.78 |
|   Hash | \main\corerun.exe | 1048576 | 83,650.94 ns |  1.00 |
|   Hash | \pr\corerun.exe   | 1048576 | 61,600.73 ns |  0.74 |

Still, a nice bump. I don't love the manual loop unrolling and associated code increase, but it's probably still worth it.

Thanks.

xoofx · 2022-11-03T21:37:54Z

> FWIW, I don't see 50% improvement. On my machine the best I see is ~25%:

Oh interesting. What's your CPU?

stephentoub · 2022-11-03T21:38:59Z

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

xoofx · 2022-11-04T05:51:48Z

> BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
> Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

Thanks, I don't have a recent Intel CPU, but I will check how it behaves on an older Kaby Lake.

In the meantime, I received a Windows Dev Kit 2023 with ARM64 and the results are:

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.755)
Snapdragon Compute Platform, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 7.0.0 (7.0.22.47203), Arm64 RyuJIT AdvSIMD


|        Method |          data |      Mean |    Error |   StdDev | Ratio |
|-------------- |-------------- |----------:|---------:|---------:|------:|
|    XXH3Native | Byte[1048576] |  70.16 us | 0.092 us | 0.086 us |  0.25 |
|          XXH3 | Byte[1048576] | 284.62 us | 0.882 us | 0.782 us |  1.00 |
| XXH3Optimized | Byte[1048576] | 156.05 us | 0.089 us | 0.074 us |  0.55 |

This gives a similar performance boost of roughly 2x, but it is still about 2x slower than the native version.
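
For reference, the ratios behind the 1 MiB numbers in this thread can be recomputed with a few lines. This is only arithmetic over the mean times posted above; the dictionary layout and function name are made up for the example, and stephentoub's figures are converted from ns to us.

```python
# Mean times in microseconds for the 1 MiB input, copied from the tables in this thread.
RESULTS_US = {
    "Ryzen 9 5950X (x64)": {"current": 64.14, "optimized": 31.89, "native": 27.81},
    "Snapdragon (Arm64)": {"current": 284.62, "optimized": 156.05, "native": 70.16},
    "Core i7-8700 (x64)": {"current": 83.65094, "optimized": 61.60073},
}


def ratio(slower_us, faster_us):
    """How many times longer `slower_us` takes compared to `faster_us`."""
    return slower_us / faster_us


if __name__ == "__main__":
    for machine, t in RESULTS_US.items():
        line = f"{machine}: PR is {ratio(t['current'], t['optimized']):.2f}x faster than current"
        if "native" in t:
            line += f", native is {ratio(t['optimized'], t['native']):.2f}x faster than PR"
        print(line)
```

Note the difference between the machines: the speedup is close to 2x on the Ryzen and somewhat under 2x on the Snapdragon, while the remaining gap to native is small on x64 and still more than 2x on Arm64.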

Optimize XxHash3

941f6ff

dotnet-issue-labeler bot added the area-System.IO label Nov 1, 2022

ghost added the community-contribution label Nov 1, 2022

stephentoub reviewed Nov 2, 2022

src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs (outdated, resolved)

stephentoub reviewed Nov 2, 2022

src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs (outdated, resolved)

Update src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs

d2bd003

xoofx added 2 commits November 2, 2022 08:42

Fix var

2b32e8a

Fix compilation errors for < .NET 7.0

74de82e

build-analysis bot mentioned this pull request Nov 2, 2022

Tracking issue for CI build timeouts #76454

Closed

adamsitnik added the area-System.Security label Nov 2, 2022

adamsitnik added the tenet-performance label and removed the area-System.IO label Nov 2, 2022

Fix issue on scalar code path

b72972c

stephentoub approved these changes Nov 3, 2022

stephentoub merged commit ae1ada9 into dotnet:main Nov 3, 2022

xoofx mentioned this pull request Nov 4, 2022

Optimize XxHash3 on ARM platform #77881

Merged

adamsitnik added this to the 8.0.0 milestone Nov 4, 2022

jeffhandley added area-System.IO.Hashing and removed area-System.Security labels Nov 22, 2022

ghost locked as resolved and limited conversation to collaborators Dec 22, 2022
