-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ARM64] Performance regression: Utf8Encoding #41699
Comments
I have changed the tag to jit for now as this is most likely not the UTF8Encoding code itself. if proven otherwise, please re-tag it with encoding label again. |
Just adding a note here: Despite what the name suggests, the 4 regressions listed here are likely from methods in UTF8Utility and/or ASCIIUtility. The |
@echesakovMSFT please look into this. CC @AndyAyersMS |
For the following simple repro robox@DDARM64S-003:~/echesako/Runtime_41699$ cat Program.cs
using System.IO;
using System.Runtime.CompilerServices;
using System.Text;
namespace Runtime_41699
{
public class Program
{
public static void Main()
{
string unicode;
byte[] bytes;
UTF8Encoding utf8Encoding;
unicode = File.ReadAllText("/home/robox/echesako/Runtime_41699/EnglishMostlyAscii.txt");
utf8Encoding = new UTF8Encoding();
bytes = utf8Encoding.GetBytes(unicode);
while (true)
{
Consume(utf8Encoding.GetByteCount(unicode));
}
}
// public int GetByteCount() => _utf8Encoding.GetByteCount(_unicode);
// public byte[] GetBytes() => _utf8Encoding.GetBytes(_unicode);
// public string GetString() => _utf8Encoding.GetString(_bytes);
[MethodImpl(MethodImplOptions.NoInlining)]
private static void Consume<T>(in T _) { }
}
} I am seeing 4 times more cache-misses in net5.0 and twice more stalled-cycles-backend. The following counter stat collections are done for the loop only.
|
Combining diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..73a1ea29bec 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()
Vector128<ushort> charIsThreeByteUtf8Encoded;
uint mask;
+ uint popcnt;
if (AdvSimd.IsSupported)
{
charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
- mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+ popcnt = GetNonAsciiBytesAndPopCount(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
}
else
{
charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+ popcnt = (uint)BitOperations.PopCount(mask);
}
// Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
// unpaired surrogates in our data. (Unpaired surrogates would invalidate
// our computed result and we'd have to throw it away.)
- uint popcnt = (uint)BitOperations.PopCount(mask);
-
// Surrogates need to be special-cased for two reasons: (a) we need
// to account for the fact that we over-counted in the addition above;
// and (b) they require separate validation.
@@ -485,6 +485,22 @@ static Utf16Utility()
return pInputBuffer;
}
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ private static uint GetNonAsciiBytesAndPopCount(Vector128<byte> value, Vector128<byte> bitMask128)
+ {
+ Debug.Assert(AdvSimd.Arm64.IsSupported);
+
+ Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
+ Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitMask128);
+
+ // self-pairwise add until all flags have moved to the first two bytes of the vector
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ Vector128<byte> popcnt = AdvSimd.PopCount(extractedBits);
+ return AdvSimd.Arm64.AddPairwise(popcnt, popcnt).ToScalar();
+ }
+
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
{ seems to help a little bit with stalled-cycles-backend
This avoid moving |
Below measurement are done on
.NET Core 3.1.6 BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
Job-VJGWPE : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 3.1 Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
NET Core 5.0.0 BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-OUMKUS : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
NET Core 5.0.0 (with the suggested change to BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-XEOYJB : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
It's clear from the data for GetByteCount benchmark the issue with stalled cycles due to I am moving this to .NET 6.0. I don't believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding. |
@jeffhandley I am assigning this to you now. |
Thanks @echesakovMSFT for your analysis. |
One more observation - if I remove the code under (AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-IXQCIC : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
(Vector.IsHardwareAccelerated) BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-ECUDLG : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
|
If I understand it correctly, you are saying that you reverted the changes in |
From offline conversation with @pgovind and @carlossanlop , I recall that the benchmarks touched the methods improved in #39506, #39508, #39050, #39041 and #38653 (correct me if I am wrong). While you are there, can you do similar change at other places these PRs touched (specially those methods that mimic SSE logic and can be improved by different algorithm for ARM64) to see if the vectorized implementation was fast enough? |
There are regressions in |
I opened dotnet/performance#1512 to track changing the benchmarks so that each benchmark is testing exactly one worker function. But the best evidence we have right now suggests that |
Re-opening this issue as #42052 was meant to be a temporary workaround and the underlying issue is still open |
@jeffhandley should this still be in the 5.0.0 milestone or be moved to 6.0/7.0? |
Assigning to @TIHan. |
I have some data comparing .NET 5, 7, 8: BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1574/21H2) PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
It looks like |
The original issue reported regression when compared with .NET 3.1 (@adamsitnik do you remember if that is accurate)? If so we might need to compare with .NET 3.1. At that time, we just had linux arm64 though, so you will have to test it on linux arm64 box. |
I don't remember the details, but looking at my old description of the issue I am sure that you are right, it was a 3.1 vs 5.0 regression found on Ubuntu machines (the ones owned by the JIT Team, as back then I had no access to any other arm machines). |
This was on a linux ARM64 box. .NET 3.1 results:
.NET 7 results:
.NET 7 is an all-up improvement over .NET 3.1 results. cc @kunalspathak |
Closing as these are not regressions anymore. |
Thanks @TIHan for checking this. |
After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found few regressions related to
Utf8Encoding
. They are alll reproducible and I've verified that it's not a matter of loop alignment (by running them with--envVars COMPlus_JitAlignLoops:1
).It looks like it's ARM64 specific regression, I was not able to reproduce it for ARM (the 32 bit variant).
Repro
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
[Host] : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
cc @kunalspathak @carlossanlop @pgovind @tannergooding
The text was updated successfully, but these errors were encountered: