-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Conversation
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
For reference, here's the before and after codegen for the public static int Before(Dictionary<int, int> d) {
return d[42];
}
public static int After(MockDictionary<int, int> d)
{
return d[42];
} C.Before(System.Collections.Generic.Dictionary`2<Int32,Int32>)
L0000: push rsi
L0001: sub rsp, 0x20
L0005: mov rsi, rcx
L0008: mov ecx, [rsi]
L000a: mov rcx, rsi
L000d: mov edx, 0x2a
L0012: call System.Collections.Generic.Dictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].FindEntry(Int32)
L0017: test eax, eax
L0019: jl L0035
L001b: mov rcx, [rsi+0x10]
L001f: cmp eax, [rcx+0x8]
L0022: jae L0040
L0024: movsxd rax, eax
L0027: shl rax, 0x4
L002b: mov eax, [rcx+rax+0x1c]
L002f: add rsp, 0x20
L0033: pop rsi
L0034: ret
L0035: mov ecx, 0x2a
L003a: call System.ThrowHelper.ThrowKeyNotFoundException[[System.Int32, System.Private.CoreLib]](Int32)
L003f: int3
L0040: call 0x7ff95393ef00
L0045: int3
C.After(MockDictionary`2<Int32,Int32>)
L0000: sub rsp, 0x28
L0004: mov edx, [rcx]
L0006: mov edx, 0x2a
L000b: call MockDictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].FindValueOrNull(Int32)
L0010: test rax, rax
L0013: jz L001c
L0015: mov eax, [rax]
L0017: add rsp, 0x28
L001b: ret
L001c: mov ecx, 0x2a
L0021: call MockDictionary`2[[System.Int32, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]].ThrowKeyNotFoundException[[System.Int32, System.Private.CoreLib]](Int32)
L0026: int3 |
src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/RuntimeHelpers.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Runtime/CompilerServices/RuntimeHelpers.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
Wasn't happy with the asm this was generating; and the gnarly bool handling logic that it generated whenever it inlined. So trying a different approach The side-effect of the new that I'm not sure about is how it effects uses of - mov rdx, rsi
- call [Dictionary`2:FindEntry(ref):int:this]
- test eax, eax
- jge SHORT G_M9502_IG08
+ lea r8, bword ptr [rsp+20H]
+ mov rdx, rsi
+ call [Dictionary`2:TryGetValue(ref,byref):bool:this]
+ movzx rcx, al
+ xor rax, rax
+ mov gword ptr [rsp+20H], rax
+ test ecx, ecx
+ jne SHORT G_M9502_IG08 Otherwise the diffs look good
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, one minor nit
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Show resolved
Hide resolved
It is not unusual for the JIT to generate sub-optimal code for rarely used or brand new patterns. The JIT typically needs tweaking for best results. |
Shaved off a little more
|
I see this pattern come up a bit; I think its post inlining and in this case from mov ecx, dword ptr [r13+12]
cmp ecx, esi
sete cl
movzx rcx, cl
test ecx, ecx
jne G_M59263_IG12 Would I correct in thinking it could just be: mov ecx, dword ptr [r13+12]
cmp ecx, esi
je G_M59263_IG12 Where the |
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
There is a further optimisation to be had here as the dec r13d
test r13d, r13d
jl SHORT G_M59263_IG15 though doesn't look like the Jit currently does that optimization https://github.com/dotnet/coreclr/issues/7566 |
ContainsKey which I was worried about looks ok System.Collections.ContainsKeyTrue_String, String
System.Collections.ContainsKeyFalse_String, String
System.Collections.ContainsKeyTrue_Int32, Int32
System.Collections.ContainsKeyFalse_Int32, Int32
TryGetValue is happy System.Collections.TryGetValueTrue_String, String
System.Collections.TryGetValueFalse_String, String
System.Collections.TryGetValueTrue_Int32, Int32
System.Collections.TryGetValueFalse_Int32, Int32
|
KNucleotide changes
/cc @danmosemsft |
// more costly idiv instruction. | ||
|
||
const uint CrcReflected = 0xEDB88320; | ||
return (uint)((Sse42.Crc32(CrcReflected, index) * (ulong)range) >> 32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikedn @AndyAyersMS running the hashcode through Crc32 gets around the distribution issue with unknown hashcodes being high entropy in their LSB when using @lemire's fastrange. (int
being an example here)
Though alas we can only do this when hardware accelerated as otherwise Crc32 is too slow; so fall back to modulo when its not supported in hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can confirm from my experience that fastrange is very dangerous when used as a hash reducer. I tried it in the past and it resulted in massive collisions in common cases like sequential numbers, zip codes, etc. - i.e. stuff that has entropy in low bits.
I did not consider hardware-accelerated mixer like Crc32 - cannot comment on that. It was not available at the time. It is an interesting idea.
My guess it may still regress cases like when dictionary is used as a sparse list and keys have clusters. Modulus, for good or bad, happens to preserve the locality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to reverse the bits to reverse the entropy; and then you get something closer to the behaviour of modulus with fastrange. Alas there still isn't a fast way to do that in hardware; and reversing the endianness still leaves the wrong order entropy per byte.
Re: locality #27149 (comment) for Dictionary specifically it has a lesser impact; than for example Hashtable, as its only changing the locality of the smaller int[] array that contains the indexes and the larger Entry[] array would maintain the same very packed locality. But yes it would have an impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One approach could be to use a bucket array that was of the correct type for the capacity (e.g. dictionary size 1-255 could be a byte[]
, 256-65536 could be an ushort[]
, etc)
One wrinkle with this change is it also does Crc32 on top of Marvin32 for strings which is a bit weird; but shouldn't really make it it more predictable? /cc @GrabYourPitchforks |
We'd need to run the CRC32 logic by the crypto board to make sure that we're not improperly compressing the range of the hash function. The final step of CRC32 is a finite field polynomial division, which likely introduces a slight bias toward zero. It's probably safe but needs an extra set of eyes from somebody more steeped in this field. |
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
|
||
const uint CrcReflected = 0xEDB88320; | ||
return (uint)((Sse42.Crc32(CrcReflected, index) * (ulong)range) >> 32); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a neat idea. I can think about two potential problems with this:
- This will make Dictionary slower when tiered JIT is disabled, or during startup until tiered JIT kicks in. @mjsabby Are you concerned about Dictionary being slower with tiered JIT disabled?
- There may be cases today where the
div
gives a good locality for Dictionary access. This locality will disappear withcrc32
and the overall performance will regress. I do not know what is the chance of this happening in real world workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will make Dictionary slower when tiered JIT is disabled, or during startup until tiered JIT kicks in.
Does the support check get cached in a static rather than always being a CPUID? Looking at the R2R asm (call
to jmp
) still have quite a few cycles to spare vs the div
branch:
call [Sse42:get_IsSupported():bool]
test al, al
je SHORT G_M59265_IG10
mov eax, 0xD1FFAB1E
crc32 eax, r12d
mov edx, r13d
imul rax, rdx
shr rax, 32
mov ecx, eax
jmp SHORT G_M59265_IG11
G_M59265_IG10: ;; bbWeight=0.50
mov eax, r12d
xor rdx, rdx
div edx:eax, r13d
mov ecx, edx
G_M59265_IG11: ;; bbWeight=0.50
Actually, can test it. Will COMPlus_TieredCompilation=0
mean it will always use the R2R version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be cases today where the
div
gives a good locality for Dictionary access. This locality will disappear withcrc32
and the overall performance will regress.
This should have a lower impact with Dictionary
; than for example Hashtable
as its only changing the locality of the smaller int[]
array that contains the indexes and the larger Entry[]
array would maintain the same very packed locality. But yes it would have an impact.
Probably couldn't use in in Hashtable
due to this greater effect as it would additionally scatter the data. Although... Hashtable
calls modulo for every collision (whereas Dictionary
only calls it once); so might it might balance?
However; I wouldn't want to do the work to verify that 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Dictionary is one of the hottest pieces of code we run and we run with tiered jitting disabled and would have a pretty negative perf impact. I would love to see this without a regression to Dictionary in the non tiered case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still is an improvement for non-tiered (I assume Sse42 is available, its just tiering that's disabled?)
| Method | Toolchain | Mean | Ratio |
|-------------- |---------- |----------:|------:|
| KNucleotide_1 | base | 375.6 ms | 1.00 |
| KNucleotide_1 | diff | 340.6 ms | 0.91 |
| KNucleotide_9 | base | 100.13 ms | 1.00 |
| KNucleotide_9 | diff | 97.05 ms | 0.97 |
System.Collections.TryGetValueFalse_Int, Int
| Method | Toolchain | Size | Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary | base | 512 | 7.033 us | 1.00 |
| Dictionary | diff | 512 | 3.381 us | 0.48 |
System.Collections.TryGetValueFalse_String, String
| Method | Toolchain | Size | Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary | base | 512 | 16.46 us | 1.00 |
| Dictionary | diff | 512 | 13.72 us | 0.83 |
System.Collections.TryGetValueTrue_Int, Int
| Method | Toolchain | Size | Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary | base | 512 | 5.965 us | 1.00 |
| Dictionary | diff | 512 | 3.755 us | 0.63 |
System.Collections.TryGetValueTrue_String, String
| Method | Toolchain | Size | Mean | Ratio |
|----------- |----------- |----- |---------:|------:|
| Dictionary | base | 512 | 17.92 us | 1.00 |
| Dictionary | diff | 512 | 15.09 us | 0.84 |
@adamsitnik does setting the COMPlus_TieredCompilation=0
environment variable stick for the https://github.com/dotnet/performance repository or should I be passing a flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also seems like the perf repo tests are using unique randomly distributed keys. Might be worth running some perf examples where we have less well distributed key sets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer this hashcode-to-bucket mapping change to be a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure why we are using the reflected polynomial of CRC-32 (CrcReflected
) as the initial hash value for CRC-32C calculation implemented by the intrinsic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also seems like the perf repo tests are using unique randomly distributed keys. Might be worth running some perf examples where we have less well distributed key sets?
Added some test for that, and large values dotnet/performance#938
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
src/System.Private.CoreLib/shared/System/Collections/Generic/Dictionary.cs
Outdated
Show resolved
Hide resolved
Typos Co-Authored-By: Joni <[email protected]>
Yes, this should work fine (the host process is going to derive the env var from your shell and the benchmark process from the host process). The alternative is to use our python3 script which allows for even more:
py .\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter $theFilter --coreRun $coreRuns --dotnet-compilation-mode NoTiering |
private struct Entry | ||
{ | ||
public uint hashCode; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reordered the fields ( I assume by the order of access) and made layout Auto. Just curious if this was on purpose or a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On purpose; order of access, but also Auto in case the TKey and TValue are unusual size structs so the reordering doesn't work for the packing (e.g. byte
, byte
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If Auto considers the initial layout (I do not know), it would be a nice trick to learn.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wondered too, maybe worth a comment.
Sequential keys TryGetValue looks good;
Indeed it does
|
@@ -228,26 +232,36 @@ public void Clear() | |||
} | |||
|
|||
public bool ContainsKey(TKey key) | |||
=> FindEntry(key) >= 0; | |||
=> TryGetValue(key, out _); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems this would do an unnecessary TValue
copy on success.
Another way to avoid copying the value could be by using It may not be as efficient as dealing with unsafe, but could be not that much worse. |
Doesn't help with ContainsKey as the caller then has to default the ref; and when true it would additionally do the assignment (unless additionally passed an enum to indicate mode, but then that eats a register). Returning |
Yes, that is why it may not be as efficient as unsafe version.
That is just in case you (or JIT) stop liking unsafe for some reason :-) |
😉 |
My C# background cries when I see From IL point of view it is ok though and I see how it is safe and efficient in this case. |
@VSadov That should be |
|
I think Then We are digressing :-) |
PR without the hashcode-to-bucket mapping change, will follow up with that in a different PR #27195 |
@benaadams I just came across an interesting way to map hash values to smaller ranges. Linked it in case you want to have a look: |
Opened PR for the other half #27299 |
System.Collections.TryGetValueTrue_String, String
System.Collections.TryGetValueFalse_String, String
System.Collections.TryGetValueTrue_Int32, Int32
System.Collections.TryGetValueFalse_Int32, Int32
More #27149 (comment)