Optimize XxHash3 #77756

Merged
merged 5 commits into from
Nov 3, 2022

Conversation

xoofx
Member

@xoofx xoofx commented Nov 1, 2022

Hey there!
This PR optimizes XxHash3 for large buffers (> 240 bytes) by avoiding a load/store of the accumulators for every 64 bytes and instead doing it once after a whole batch.

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22000.1098/21H2)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=7.0.100-rc.1.22431.12
  [Host]     : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2
  DefaultJob : .NET 7.0.0 (7.0.22.42610), X64 RyuJIT AVX2


|        Method |          data |     Mean |    Error |   StdDev | Ratio |
|-------------- |-------------- |---------:|---------:|---------:|------:|
|    XXH3Native | Byte[1048576] | 27.81 us | 0.070 us | 0.062 us |  0.43 |
|          XXH3 | Byte[1048576] | 64.14 us | 0.102 us | 0.085 us |  1.00 |
| XXH3Optimized | Byte[1048576] | 31.89 us | 0.070 us | 0.059 us |  0.50 |
  • XXH3Native is the C++ version
  • XXH3 is the current C# version
  • XXH3Optimized is this PR's C# version

So overall, this PR brings a 2x performance improvement on large buffers and narrows the gap with the native version to roughly 15% (31.89 us vs. 27.81 us).
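To make the idea concrete, here is a minimal scalar sketch of the difference between touching the accumulators in memory on every 64-byte stripe and keeping them in locals for a whole batch. It is not the code from this PR: the class name, the method names, and the `Mix` step are hypothetical, and `Mix` is a toy stand-in rather than the real XXH3 round. Both methods are meant to produce identical accumulators; only the memory traffic differs.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch: names and the mixing step are illustrative only.
public static class AccumulatorBatchingSketch
{
    private const int StripeBytes = 64; // one stripe = 8 lanes of 8 bytes
    private const int Lanes = 8;

    // Toy mixing step (assumption): stands in for the real XXH3 round.
    private static ulong Mix(ulong acc, ulong data, ulong key) =>
        unchecked(acc + (data ^ key) * 0x9E3779B185EBCA87UL);

    // Reads one 8-byte lane of a stripe as a little-endian integer.
    private static ulong Lane(ReadOnlySpan<byte> stripe, int lane) =>
        BinaryPrimitives.ReadUInt64LittleEndian(stripe.Slice(lane * 8, 8));

    // Baseline: every stripe reads and writes the accumulators in memory.
    public static void PerStripe(ReadOnlySpan<byte> input, ReadOnlySpan<ulong> key, Span<ulong> acc)
    {
        for (int offset = 0; offset + StripeBytes <= input.Length; offset += StripeBytes)
        {
            ReadOnlySpan<byte> stripe = input.Slice(offset, StripeBytes);
            for (int lane = 0; lane < Lanes; lane++)
            {
                acc[lane] = Mix(acc[lane], Lane(stripe, lane), key[lane]);
            }
        }
    }

    // Batched: load the accumulators once, keep them in locals, store once at the end.
    public static void Batched(ReadOnlySpan<byte> input, ReadOnlySpan<ulong> key, Span<ulong> acc)
    {
        ulong a0 = acc[0], a1 = acc[1], a2 = acc[2], a3 = acc[3];
        ulong a4 = acc[4], a5 = acc[5], a6 = acc[6], a7 = acc[7];

        for (int offset = 0; offset + StripeBytes <= input.Length; offset += StripeBytes)
        {
            ReadOnlySpan<byte> stripe = input.Slice(offset, StripeBytes);
            a0 = Mix(a0, Lane(stripe, 0), key[0]);
            a1 = Mix(a1, Lane(stripe, 1), key[1]);
            a2 = Mix(a2, Lane(stripe, 2), key[2]);
            a3 = Mix(a3, Lane(stripe, 3), key[3]);
            a4 = Mix(a4, Lane(stripe, 4), key[4]);
            a5 = Mix(a5, Lane(stripe, 5), key[5]);
            a6 = Mix(a6, Lane(stripe, 6), key[6]);
            a7 = Mix(a7, Lane(stripe, 7), key[7]);
        }

        acc[0] = a0; acc[1] = a1; acc[2] = a2; acc[3] = a3;
        acc[4] = a4; acc[5] = a5; acc[6] = a6; acc[7] = a7;
    }
}
```

The batched form needs the loop body unrolled by hand, which is the kind of code-size cost a reviewer may reasonably weigh against the speedup.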

Small note not covered by this PR: while checking with the native version, I noticed that the hash is different (for both the current and optimized version vs native). Will have to dig further why.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 1, 2022

@xoofx
Member Author

xoofx commented Nov 1, 2022

Another note: when I checked the implementation on a macOS M1 Pro, I noticed that the ARM64 implementation is significantly slower than the x86 version (e.g. 3x). I haven't checked, but I saw in the xxHash repository that they do things slightly differently for the ARM platform (e.g. maybe using aligned loads). I'll have to dig into this one as well.

@EgorBo
Member

EgorBo commented Nov 1, 2022

I noticed that the ARM64 implementation is significantly slower than the x86 version

I bet it's because of this: #76641 (comment)

operator * for Vector128<ulong> ends up in a slow fallback because it doesn't exist on ARM. cc @stephentoub
Namely, this line: https://github.com/stephentoub/runtime/blob/c3056f711fe315e71dc6e90ccd4184674bb01a73/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L784
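For readers unfamiliar with the issue: Arm64 Advanced SIMD has no vector multiply for 64-bit lanes, but it does have a widening 32x32 -> 64-bit multiply. Where an algorithm only needs the product of the low and high 32-bit halves of each lane (as XXH3's accumulate step does), that instruction can stand in for the generic operator. The sketch below is one possible shape for such a helper, not what the linked line does or how the runtime ended up handling it; the class and method names are hypothetical.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;

// Hypothetical helper: names are illustrative, not taken from the runtime.
public static class WideningMultiplySketch
{
    // For each 64-bit lane, multiplies its low 32 bits by its high 32 bits.
    public static Vector128<ulong> MultiplyLowByHigh(Vector128<ulong> value)
    {
        if (AdvSimd.IsSupported)
        {
            // Narrow each 64-bit lane to its low and high 32-bit halves...
            Vector64<uint> low = AdvSimd.ExtractNarrowingLower(value);
            Vector64<uint> high = AdvSimd.ShiftRightLogicalNarrowingLower(value, 32);

            // ...then use the widening 32x32 -> 64 multiply.
            return AdvSimd.MultiplyWideningLower(low, high);
        }

        // Portable scalar fallback for other platforms.
        return Vector128.Create(
            MultiplyLane(value.GetElement(0)),
            MultiplyLane(value.GetElement(1)));
    }

    private static ulong MultiplyLane(ulong lane) => (ulong)(uint)lane * (lane >> 32);
}
```

A full 64x64-bit multiply cannot be expressed this way in a single instruction, so this only helps where the operands are known to fit in 32 bits.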

@stephentoub
Member

I noticed that the hash is different (for both the current and optimized version vs native). Will have to dig further why.

What native implementation are you calling and what are you comparing it against? I'd validated all of the test data against a locally built xxhash.dll calling XXH3_64bits_withSeed.

@stephentoub
Member

I noticed that the ARM64 implementation is significantly slower than the x86 version

I bet it's because of this: #76641 (comment)

operator * for Vector128<ulong> ends up in a slow fallback because it doesn't exist on arm cc @stephentoub Namely, this line: https://github.com/stephentoub/runtime/blob/c3056f711fe315e71dc6e90ccd4184674bb01a73/src/libraries/System.IO.Hashing/src/System/IO/Hashing/XxHash3.cs#L784

Yeah, you can see the comment I left in the code about it and my question about better ideas for dealing with it:
#76641 (comment)

@xoofx
Member Author

xoofx commented Nov 2, 2022

What native implementation are you calling and what are you comparing it against? I'd validated all of the test data against a locally built xxhash.dll calling XXH3_64bits_withSeed.

Yes, I validated against a locally built xxhash.
I was just casting to a uint before, but generating the full hash gives me the following:

Native: 0xd36c0e13a3df139e
C#:     0x9e13dfa3130e6cd3

The C# bytes are read from the destination span, casting to a Span<ulong>. The native version is what is returned by the API.

So the bytes are actually in reverse order. Is it expected?
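The two values above can be reproduced without any hashing code. The sketch below assumes only what the thread states: the 64-bit value is written to the destination most significant byte first, and casting the destination to a `Span<ulong>` on a little-endian machine reads it least significant byte first. The class and method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch: names are illustrative only.
public static class HashByteOrderSketch
{
    // Writes a 64-bit hash value most significant byte first (big endian).
    public static byte[] ToBigEndianBytes(ulong hash)
    {
        byte[] destination = new byte[sizeof(ulong)];
        BinaryPrimitives.WriteUInt64BigEndian(destination, hash);
        return destination;
    }

    // What a Span<byte> -> Span<ulong> cast yields on a little-endian host.
    public static ulong ReadAsLittleEndian(ReadOnlySpan<byte> bytes) =>
        BinaryPrimitives.ReadUInt64LittleEndian(bytes);

    // Reads the bytes back in the order they were written.
    public static ulong ReadAsBigEndian(ReadOnlySpan<byte> bytes) =>
        BinaryPrimitives.ReadUInt64BigEndian(bytes);
}
```

Passing the native value `0xd36c0e13a3df139e` through `ToBigEndianBytes` and then `ReadAsLittleEndian` gives `0x9e13dfa3130e6cd3`, the C# value shown above; `ReadAsBigEndian` (or `BinaryPrimitives.ReverseEndianness` on the mismatched value) recovers the native number.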

@xoofx
Member Author

xoofx commented Nov 2, 2022

Tests are failing, will dig further into these errors tonight.

@stephentoub
Member

stephentoub commented Nov 2, 2022

So the bytes are actually in reverse order. Is it expected?

Yes.

/// the value is written in the Big Endian byte order.

#76279

#76641 (comment)

FYI, @bartonjs. I'm not alone here ;-)


@adamsitnik adamsitnik added tenet-performance Performance related issue and removed area-System.IO labels Nov 2, 2022
@xoofx
Member Author

xoofx commented Nov 2, 2022

Tests are failing, will dig further into these errors tonight.

Commit b72972c should fix the tests failing on the scalar path.

@xoofx
Member Author

xoofx commented Nov 2, 2022

FYI, @bartonjs. I'm not alone here ;-)

I'm not sure I understand the discussion about the endianness (intermixed with whether it should be commented on in the XML doc API), but I concur that using big endian is really confusing and counterintuitive. Who is going to expect this, frankly? The last time I programmed a big-endian processor was 30 years ago, with the Motorola 68000 😅

XXHash is following the native endian order (except for 128 bits, where they use two 64-bit values), and I would hope that we could keep the same logic so that it doesn't cause such trouble when trying to match the C++ implementation.

@stephentoub
Member

XXHash is following the native endian order

I don't see any APIs that are outputting bytes, only numbers. Can you elaborate?

@xoofx
Member Author

xoofx commented Nov 2, 2022

I don't see any APIs that are outputting bytes, only numbers. Can you elaborate?

Sorry, I meant that they output numbers (except for the 128 bits which are 2x64 bit numbers) that thus follow the native endianness. Unless there is something in the .NET HashAlgorithm that mandates that all of them should always emit byte[] in big endian form for any underlying hash algorithm?

@stephentoub
Member

Sorry, I meant that they output numbers (except for the 128 bits which are 2x64 bit numbers) that thus follow the native endianness

That's where #76279 will come in.

Unless there is something in the .NET HashAlgorithm that mandates that all of them should always emit byte[] in big endian form for any underlying hash algorithm?

The abstract base type here only has methods which output bytes rather than numbers. The xxhash implementations emit those bytes as big endian because of this:
https://github.com/Cyan4973/xxHash/blob/12d98c60fef0b7b6aee7943cf6ade7aba05e9c48/doc/xxhash_spec.md#step-7-output
"For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows big-endian convention (most significant byte first)."
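The quoted rule can be checked in a few lines: writing the number most significant byte first and hex-encoding those bytes yields the same digits as formatting the number itself. This is only an illustration of the spec sentence, with a hypothetical class and method name, and says nothing about how the library formats its output internally.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch: names are illustrative only.
public static class CanonicalHexSketch
{
    // Hex string of the hash bytes in canonical (big-endian) order.
    public static string CanonicalHex(ulong hash)
    {
        Span<byte> bytes = stackalloc byte[sizeof(ulong)];
        BinaryPrimitives.WriteUInt64BigEndian(bytes, hash);
        return Convert.ToHexString(bytes);
    }
}
```

For any value `h`, `CanonicalHexSketch.CanonicalHex(h)` should equal `h.ToString("X16")`, which is the "same value as the natural format" property the spec describes.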

@xoofx
Member Author

xoofx commented Nov 2, 2022

"For systems which require to store and/or display the result in binary or hexadecimal format, the canonical format is defined to reproduce the same value as the natural decimal format, hence follows big-endian convention (most significant byte first)."

Oh, right, I can see it mentioned in the header file here

That makes sense then, thanks!

@stephentoub
Member

FWIW, I don't see 50% improvement. On my machine the best I see is ~25%:

| Method |         Toolchain |    Size |         Mean | Ratio |
|------- |------------------ |--------:|-------------:|------:|
|   Hash | \main\corerun.exe |     240 |     20.05 ns |  1.00 |
|   Hash |   \pr\corerun.exe |     240 |     20.10 ns |  1.00 |
|   Hash | \main\corerun.exe |     512 |     31.55 ns |  1.00 |
|   Hash |   \pr\corerun.exe |     512 |     27.77 ns |  0.88 |
|   Hash | \main\corerun.exe |    1024 |     52.49 ns |  1.00 |
|   Hash |   \pr\corerun.exe |    1024 |     41.17 ns |  0.78 |
|   Hash | \main\corerun.exe | 1048576 | 83,650.94 ns |  1.00 |
|   Hash |   \pr\corerun.exe | 1048576 | 61,600.73 ns |  0.74 |

Still, a nice bump. I don't love the manual loop unrolling and associated code increase, but it's probably still worth it.

Thanks.

@stephentoub stephentoub merged commit ae1ada9 into dotnet:main Nov 3, 2022
@xoofx
Member Author

xoofx commented Nov 3, 2022

FWIW, I don't see 50% improvement. On my machine the best I see is ~25%:

Oh interesting. What's your CPU?

@stephentoub
Member

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

@xoofx
Member Author

xoofx commented Nov 4, 2022

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.674)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores

Thanks. I don't have a recent Intel CPU, but I will check how it behaves on an older Kaby Lake.

In the meantime, I received a Windows Dev Kit 2023 with ARM64 and the results are:

BenchmarkDotNet=v0.13.2, OS=Windows 11 (10.0.22621.755)
Snapdragon Compute Platform, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-rc.2.22477.23
  [Host]     : .NET 7.0.0 (7.0.22.47203), Arm64 RyuJIT AdvSIMD
  DefaultJob : .NET 7.0.0 (7.0.22.47203), Arm64 RyuJIT AdvSIMD


|        Method |          data |      Mean |    Error |   StdDev | Ratio |
|-------------- |-------------- |----------:|---------:|---------:|------:|
|    XXH3Native | Byte[1048576] |  70.16 us | 0.092 us | 0.086 us |  0.25 |
|          XXH3 | Byte[1048576] | 284.62 us | 0.882 us | 0.782 us |  1.00 |
| XXH3Optimized | Byte[1048576] | 156.05 us | 0.089 us | 0.074 us |  0.55 |

This gives a similar performance boost (about 1.8x here), but it is still roughly 2x slower than the native version.
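When comparing results across machines, it can help to convert the mean times in the tables into throughput and speedup figures. The helper below is a small convenience sketch with hypothetical names; it assumes the `Byte[1048576]` buffer size and the means in microseconds as reported in this thread, and uses GB to mean 10^9 bytes.

```csharp
using System;

// Hypothetical helper for reading the benchmark tables; names are illustrative.
public static class BenchmarkMathSketch
{
    // Throughput in GB/s (10^9 bytes per second) for a buffer hashed in the given mean time.
    public static double GigabytesPerSecond(long bytes, double meanMicroseconds) =>
        bytes / (meanMicroseconds * 1e-6) / 1e9;

    // How many times faster the new mean is than the baseline mean.
    public static double Speedup(double baselineMean, double newMean) =>
        baselineMean / newMean;
}
```

For example, `Speedup(284.62, 156.05)` covers the Arm64 run above and `Speedup(64.14, 31.89)` the x64 run from the PR description, while `GigabytesPerSecond(1048576, 31.89)` puts the optimized x64 path at roughly 33 GB/s.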

@adamsitnik adamsitnik added this to the 8.0.0 milestone Nov 4, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Dec 22, 2022
Labels: area-System.IO.Hashing, community-contribution, tenet-performance