-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use popcount intrinsincs in BitOperations #85944
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsI was trying to add this in #85842, but thought this should be evaluated separately on how much TP impact it makes.
|
I do see we replace the 2 steps comparison with popcnt. The diffs are still off, but I don't think it will regress the execution time. here is the analysis for minopts benchmarks.run windows/x64:
|
@dotnet/jit-contrib |
#elif HOST_ARM64 | ||
return _CountOneBits(value); | ||
#else | ||
return __popcnt(value); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't safe. It will always emit popcnt
which requires SSE4.2
We'd need a cached CPUID check and a branch to use it, falling back to the bit twiddling logic if unsupported.
@@ -116,7 +116,7 @@ inline bool genMaxOneBit(T value) | |||
template <typename T> | |||
inline bool genExactlyOneBit(T value) | |||
{ | |||
return ((value != 0) && genMaxOneBit(value)); | |||
return BitOperations::PopCount(value) == 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we need a branch to test for popcnt
support, I expect the old logic may actually be faster as it generates:
test edi, edi
jz false
lea eax, [rdi - 1]
test edi, eax
sete al
ret
false:
xor eax, eax
ret
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So existing logic has 2 branches vs. just 1 branch with popcount()
. Do you think existing logic would still be faster?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing logic is just 1 branch. setcc
is considered branchless and is specially handled by the CPU. The lea eax, [rdi - 1]
, test edi, eax
, sete al
is 3 cycles which is the same as for popcnt
.
So it'd likely balance out, but with there now being 2 branches for anyone with "very old" hardware. I don't have a particular preference for which we do given that popcnt
has been around for 15 years now, so we're unlikely to have people without it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it'd likely balance out
In that case, I don't think I should go ahead of using popcnt
in genExactlyOneBit()
, and just replace the existing bit twiddling logic with popcnt
on supported hardware. That way the future BitOperations::PopCount()
consumer will be optimized from popcnt
. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems reasonable to me.
@kunalspathak Do you want to convert this to "Draft" so it doesn't get considered stale? |
There is still work to be done to detect if |
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it. |
I was trying to add this in #85842, but thought this should be evaluated separately on how much TP impact it makes.