System.IO.Compression.Crc32Helper should use Crc32 intrinsics #40244

Open
benaadams opened this issue Aug 2, 2020 · 7 comments
@benaadams
Member

Rather than interoping to zlib, Crc32Helper should use the Crc32 intrinsics when they are available on Arm and x86/x64:

// Calculate CRC based on the old CRC and the new bytes
public static unsafe uint UpdateCrc32(uint crc32, byte[] buffer, int offset, int length)
{
    Debug.Assert((buffer != null) && (offset >= 0) && (length >= 0) && (offset <= buffer.Length - length));
    fixed (byte* bufferPtr = &buffer[offset])
    {
        return Interop.zlib.crc32(crc32, bufferPtr, length);
    }
}

public static unsafe uint UpdateCrc32(uint crc32, ReadOnlySpan<byte> buffer)
{
    fixed (byte* bufferPtr = &MemoryMarshal.GetReference(buffer))
    {
        return Interop.zlib.crc32(crc32, bufferPtr, buffer.Length);
    }
}
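The incremental pattern above (the previous CRC is fed back in with each call) is the same one zlib itself exposes; a Python sketch using `zlib.crc32`, which wraps the same polynomial, shows that chunked updates match a one-shot computation:

```python
import zlib

data = b"The quick brown fox jumps over the lazy dog"

# One-shot CRC over the whole buffer...
whole = zlib.crc32(data)

# ...equals chunked updates that thread the running CRC through each call,
# which is the contract UpdateCrc32(crc32, buffer, offset, length) relies on.
crc = 0
for i in range(0, len(data), 7):
    crc = zlib.crc32(data[i:i + 7], crc)

assert crc == whole
```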

/cc @tannergooding

@benaadams benaadams added the tenet-performance Performance related issue label Aug 2, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.IO.Compression untriaged New issue has not been triaged by the area owner labels Aug 2, 2020
@danmoseley
Member

According to #2036, the intrinsics use different polynomials. If I understand right, zlib uses 0x04C11DB7/0xEDB88320 and SSE uses 0x1EDC6F41. Wikipedia has a large table.

Is that right?
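The polynomial mismatch is easy to confirm numerically. A bit-at-a-time Python sketch (using the reflected forms: 0xEDB88320 for zlib's CRC-32, 0x82F63B78 for the CRC-32C that the SSE4.2 CRC32 instruction computes) gives different checksums for the standard "123456789" check input:

```python
import binascii

def crc32_bitwise(data: bytes, poly: int) -> int:
    """Reflected bit-at-a-time CRC with the given reversed polynomial."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (poly if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

check = b"123456789"
crc32  = crc32_bitwise(check, 0xEDB88320)  # zlib / gzip / PNG polynomial
crc32c = crc32_bitwise(check, 0x82F63B78)  # Castagnoli, what SSE4.2 CRC32 computes

assert crc32 == binascii.crc32(check) == 0xCBF43926
assert crc32c == 0xE3069283  # different polynomial, different checksum
```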

@benaadams
Member Author

Sigh, might be; I think you might have to use PCLMULQDQ instead
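As a rough illustration of why PCLMULQDQ helps (plain Python, not the actual SIMD code): the instruction performs carry-less multiplication over GF(2), and CRC folding is built from such multiplications reduced modulo the CRC polynomial, so it works with any polynomial, including zlib's 0xEDB88320, unlike the fixed-polynomial SSE4.2 CRC32 instruction:

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2) polynomial) multiplication: the primitive that
    PCLMULQDQ provides in hardware on 64-bit operands."""
    result = 0
    while b:
        if b & 1:
            result ^= a  # addition over GF(2) is XOR, so no carries propagate
        a <<= 1
        b >>= 1
    return result

# (x + 1) * (x + 1) = x^2 + 1 over GF(2): 0b11 * 0b11 = 0b101
assert clmul(0b11, 0b11) == 0b101
# (x^2 + 1) * (x + 1) = x^3 + x^2 + x + 1: 0b101 * 0b11 = 0b1111
assert clmul(0b101, 0b11) == 0b1111
```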

@JimBobSquarePants

Tada! Take what you want and feed back any improvements you can make.

https://github.com/SixLabors/ImageSharp/blob/f4f689ce67ecbcc35cebddba5aacb603e6d1068a/src/ImageSharp/Formats/Png/Zlib/Crc32.cs

@danmoseley danmoseley added this to the Future milestone Aug 2, 2020
@carlossanlop carlossanlop removed the untriaged New issue has not been triaged by the area owner label Aug 6, 2020
@brantburnett
Contributor

I believe this can now be done using the vectorized implementation found in System.IO.Hashing (it uses PCLMULQDQ on Intel and equivalents on ARM).

My concern is that System.IO.Compression appears to be an always included part of the framework, while System.IO.Hashing is distributed on NuGet. We'd need to make System.IO.Hashing always available. I'm not sure what the procedures would be for getting such a change approved, or if it's worth it.

@stephentoub
Member

Yeah, I don't think we're going to want to pull NonCryptographicHashAlgorithm down into corelib. We could consider alternate ways of sharing the code / implementation with System.IO.Compression, though. Note that the checksum is only used there as part of ZipArchive when writing out an entry. It is, however, a meaningful chunk of the time required to do so, something around 30% for typical use if memory serves.
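To illustrate where that checksum shows up (a Python analogue via the `zipfile` module, since both implementations follow the ZIP spec): each entry's header stores a CRC-32 of the uncompressed data, computed while the entry is written out:

```python
import io
import zipfile
import zlib

data = b"hello zip" * 1000
buf = io.BytesIO()

with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("entry.txt", data)  # CRC-32 is computed during this write

with zipfile.ZipFile(buf) as archive:
    stored_crc = archive.getinfo("entry.txt").CRC

# The stored header field matches zlib's CRC-32 of the raw bytes.
assert stored_crc == zlib.crc32(data)
```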

@brantburnett
Contributor

I did some testing to see if this improvement might be worthwhile. Here are the results I'm seeing on my Intel Windows machine. The test was compressing a single file to a ZipArchive using a similar approach to the other compression tests in the microbenchmarks. I may be able to squeeze out a bit more by using the static Update method if it's accessible by internalizing Crc32.

The biggest risk would be old Intel CPUs that don't support the intrinsics required for vectorization. I would suspect that the zlib scalar implementation may be better for those cases. The upside is we avoid a GC pin and transition and gain vectorization when available.
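For reference, the scalar fallback on such CPUs is typically the classic Sarwate table-driven algorithm, one lookup per byte; a Python sketch of it, assuming zlib's reflected polynomial:

```python
import zlib

# Precompute the 256-entry lookup table for the reflected polynomial 0xEDB88320.
TABLE = []
for i in range(256):
    c = i
    for _ in range(8):
        c = (c >> 1) ^ (0xEDB88320 if c & 1 else 0)
    TABLE.append(c)

def update_crc32(crc: int, buffer: bytes) -> int:
    """Scalar, table-driven CRC-32 update: one table lookup per input byte."""
    crc ^= 0xFFFFFFFF
    for byte in buffer:
        crc = (crc >> 8) ^ TABLE[(crc ^ byte) & 0xFF]
    return crc ^ 0xFFFFFFFF

# Matches zlib's result and the well-known check value for "123456789".
assert update_crc32(0, b"123456789") == zlib.crc32(b"123456789") == 0xCBF43926
```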

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22621.1635)
Intel Core i7-10850H CPU 2.70GHz, 1 CPU, 12 logical and 6 physical cores
.NET SDK=8.0.100-preview.3.23178.7
[Host] : .NET 8.0.0 (8.0.23.17408), X64 RyuJIT AVX2
Job-DUHTZA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2
Job-MWSTQA : .NET 8.0.0 (42.42.42.42424), X64 RyuJIT AVX2

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:EnableUnsafeBinaryFormatterSerialization=true IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

| Method   | Job     | file             |       Mean |    Error |   StdDev |     Median |        Min |        Max | Ratio | RatioSD |
|--------- |-------- |----------------- |-----------:|---------:|---------:|-----------:|-----------:|-----------:|------:|--------:|
| Compress | Interop | TestDocument.pdf | 3,102.3 us | 29.86 us | 27.93 us | 3,106.4 us | 3,060.1 us | 3,170.0 us |  1.00 |    0.00 |
| Compress | C#      | TestDocument.pdf | 3,046.6 us | 21.40 us | 20.02 us | 3,045.1 us | 3,016.4 us | 3,080.2 us |  0.98 |    0.01 |
| Compress | Interop | alice29.txt      | 3,978.3 us | 24.40 us | 22.82 us | 3,971.5 us | 3,950.2 us | 4,026.7 us |  1.00 |    0.00 |
| Compress | C#      | alice29.txt      | 3,953.8 us | 33.50 us | 27.98 us | 3,943.8 us | 3,909.5 us | 4,015.5 us |  0.99 |    0.01 |
| Compress | Interop | sum              |   662.0 us | 11.95 us | 11.18 us |   658.1 us |   650.2 us |   684.0 us |  1.00 |    0.00 |
| Compress | C#      | sum              |   639.7 us |  2.26 us |  1.77 us |   640.3 us |   637.1 us |   642.2 us |  0.96 |    0.02 |

@stephentoub
Member

Thanks, @brantburnett. Just a few percent in throughput isn't worth bringing this functionality down from System.IO.Hashing nor including a large amount of code in System.IO.Compression. Based on these numbers, I'd speculate that the zlib being used already does some amount of vectorization in its crc32 calculation, e.g. the zlib from Intel that .NET uses on Windows does:
https://github.com/dotnet/runtime/blob/main/src/native/external/zlib-intel/crc_folding.c
