Use popcount intrinsics in BitOperations #85944

Closed
wants to merge 3 commits into from

kunalspathak
Member

I was trying to add this in #85842, but thought its throughput (TP) impact should be evaluated separately.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 8, 2023
@ghost ghost assigned kunalspathak May 8, 2023

@kunalspathak
Member Author

I do see that we replace the two-step comparison with popcnt.

image

The diffs are still off, but I don't think it will regress the execution time.
cc: @tannergooding

image

Here is the analysis for MinOpts benchmarks.run on windows/x64:

Base: 533135268, Diff: 533175333, +0.0075%

?BuildDefs@LinearScan@@AEAAXPEAUGenTree@@H_K@Z                                                : 794531  : NA      : 32.83% : +0.1490%
?newRefPosition@LinearScan@@AEAAPEAVRefPosition@@PEAVInterval@@IW4RefType@@PEAUGenTree@@_KI@Z : 304820  : +2.51%  : 12.60% : +0.0572%
?BuildCall@LinearScan@@AEAAHPEAUGenTreeCall@@@Z                                               : 123098  : +6.53%  : 5.09%  : +0.0231%
?buildInternalRegisterUses@LinearScan@@AEAAXXZ                                                : 3591    : +1.02%  : 0.15%  : +0.0007%
?genFnProlog@CodeGen@@IEAAXXZ                                                                 : -2498   : -0.28%  : 0.10%  : -0.0005%
?BuildBlockStore@LinearScan@@AEAAHPEAUGenTreeBlk@@@Z                                          : -3270   : -3.42%  : 0.14%  : -0.0006%
?PostOrderVisit@ForwardSubVisitor@@QEAA?AW4fgWalkResult@Compiler@@PEAPEAUGenTree@@PEAU4@@Z    : -6030   : -9.86%  : 0.25%  : -0.0011%
?lvaAssignVirtualFrameOffsetsToLocals@Compiler@@QEAAXXZ                                       : -23002  : -3.14%  : 0.95%  : -0.0043%
?genFinalizeFrame@CodeGen@@IEAAXXZ                                                            : -24372  : -32.17% : 1.01%  : -0.0046%
??$select@$0A@@RegisterSelection@LinearScan@@QEAA_KPEAVInterval@@PEAVRefPosition@@@Z          : -168819 : -0.64%  : 6.98%  : -0.0317%
?BuildDef@LinearScan@@AEAAPEAVRefPosition@@PEAUGenTree@@_KH@Z                                 : -460405 : -7.16%  : 19.02% : -0.0864%
?BuildDefsWithKills@LinearScan@@AEAAXPEAUGenTree@@H_K1@Z                                      : -499582 : -88.58% : 20.64% : -0.0937%
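The thread does not label the columns of this output. The arithmetic below is a small sketch suggesting that the headline figure is the instruction-count delta relative to Base, and that the last column is each function's delta relative to the same Base. The helper names (PercentOfBase, OverallPercent, BuildDefsPercent) are hypothetical and exist only for illustration; the numbers are copied from the output above.

```cpp
#include <cassert>
#include <cstdint>

// Change expressed as a percentage of the baseline instruction count.
inline double PercentOfBase(int64_t delta, int64_t base)
{
    assert(base != 0);
    return 100.0 * static_cast<double>(delta) / static_cast<double>(base);
}

// Totals copied from the "Base: ..., Diff: ..." line.
const int64_t kBase = 533135268;
const int64_t kDiff = 533175333;

// Overall change: (Diff - Base) / Base, reported as +0.0075%.
inline double OverallPercent()
{
    return PercentOfBase(kDiff - kBase, kBase);
}

// First row (LinearScan::BuildDefs): delta 794531, reported as +0.1490%.
inline double BuildDefsPercent()
{
    return PercentOfBase(794531, kBase);
}
```

Under that reading, the listed rows are only the largest contributors, so their deltas need not sum exactly to Diff minus Base.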

@kunalspathak kunalspathak marked this pull request as ready for review May 9, 2023 05:06
@kunalspathak
Member Author

@dotnet/jit-contrib

#elif HOST_ARM64
return _CountOneBits(value);
#else
return __popcnt(value);
Member

This isn't safe. It will always emit popcnt, which requires SSE4.2.

We'd need a cached CPUID check and a branch to use it, falling back to the bit twiddling logic if unsupported.
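A minimal sketch of what such a cached check plus fallback could look like, assuming a 32-bit operand. This is not the code from the PR: the names PopCountFallback, PopCountHardware, and DetectPopCnt are hypothetical, and the fallback is the common SWAR bit-twiddling count rather than necessarily the one in BitOperations. The detection reads CPUID leaf 1, where ECX bit 23 reports POPCNT support.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback: classic SWAR bit-twiddling population count.
static uint32_t PopCountFallback(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                 // 2-bit sums
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u); // 4-bit sums
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                 // 8-bit sums
    return (v * 0x01010101u) >> 24;                   // add the four bytes
}

#if (defined(__x86_64__) || defined(__i386__)) && (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>

// Only this function is compiled with popcnt enabled.
__attribute__((target("popcnt")))
static uint32_t PopCountHardware(uint32_t v)
{
    return static_cast<uint32_t>(__builtin_popcount(v));
}

static bool DetectPopCnt()
{
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & (1u << 23)) != 0; // CPUID.01H:ECX bit 23 = POPCNT
}
#elif defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>

static uint32_t PopCountHardware(uint32_t v)
{
    return __popcnt(v);
}

static bool DetectPopCnt()
{
    int info[4] = {0, 0, 0, 0};
    __cpuid(info, 1);
    return (info[2] & (1 << 23)) != 0; // info[2] holds ECX
}
#else
// Other hosts (for example ARM64): this sketch simply uses the fallback.
static uint32_t PopCountHardware(uint32_t v) { return PopCountFallback(v); }
static bool DetectPopCnt() { return false; }
#endif

static uint32_t PopCount(uint32_t v)
{
    // Cached: CPUID runs once, on first use; later calls pay one branch.
    static const bool s_hasPopCnt = DetectPopCnt();
    const uint32_t result = s_hasPopCnt ? PopCountHardware(v) : PopCountFallback(v);
    assert(result == PopCountFallback(v)); // debug-only cross-check
    return result;
}
```

The branch on the cached flag is the extra cost discussed in the rest of this thread; whether it pays off depends on how the fallback compares with popcnt plus one well-predicted branch.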

@@ -116,7 +116,7 @@ inline bool genMaxOneBit(T value)
 template <typename T>
 inline bool genExactlyOneBit(T value)
 {
-    return ((value != 0) && genMaxOneBit(value));
+    return BitOperations::PopCount(value) == 1;
Member

Given we need a branch to test for popcnt support, I expect the old logic may actually be faster as it generates:

    test edi, edi
    jz false
    lea eax, [rdi - 1]
    test edi, eax
    sete al
    ret
false:
    xor eax, eax
    ret
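For reference, the two forms being compared can be sketched side by side. The body of genMaxOneBit is not shown in this thread; the version below is an assumption inferred from the lea/test/sete sequence above. PopCountLoop is a hypothetical stand-in for BitOperations::PopCount, and CheckForms is an illustrative helper that confirms both forms agree over a range of inputs.

```cpp
#include <cassert>
#include <cstdint>

// Assumed body, inferred from the quoted assembly: true for 0 and powers of two.
template <typename T>
inline bool genMaxOneBit(T value)
{
    return (value & (value - 1)) == 0;
}

// Existing form, as shown on the removed line of the diff.
template <typename T>
inline bool genExactlyOneBit(T value)
{
    return ((value != 0) && genMaxOneBit(value));
}

// Hypothetical stand-in for BitOperations::PopCount (Kernighan's loop).
inline uint32_t PopCountLoop(uint32_t value)
{
    uint32_t count = 0;
    while (value != 0)
    {
        value &= value - 1; // clear the lowest set bit
        count++;
    }
    return count;
}

// Proposed form: exactly one bit set means the population count is 1.
inline bool genExactlyOneBitViaPopCount(uint32_t value)
{
    return PopCountLoop(value) == 1;
}

// Checks that both forms agree for every value in [0, limit).
inline void CheckForms(uint32_t limit)
{
    for (uint32_t v = 0; v < limit; v++)
    {
        assert(genExactlyOneBit(v) == genExactlyOneBitViaPopCount(v));
    }
}
```

Both forms return the same answers, so the choice between them in this discussion is purely about generated code and branch count.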

Member Author

So existing logic has 2 branches vs. just 1 branch with popcount(). Do you think existing logic would still be faster?

Member

The existing logic has just 1 branch. setcc is considered branchless and is handled specially by the CPU. The sequence lea eax, [rdi - 1]; test edi, eax; sete al takes 3 cycles, which is the same as popcnt.

So it'd likely balance out, except that anyone with "very old" hardware would now hit 2 branches. I don't have a particular preference for which we do, given that popcnt has been around for 15 years now, so we're unlikely to have people without it.

Member Author

> So it'd likely balance out

In that case, I don't think I should go ahead with using popcnt in genExactlyOneBit(). Instead, I'll just replace the existing bit twiddling logic in BitOperations::PopCount() with popcnt on supported hardware. That way, future BitOperations::PopCount() consumers will benefit from popcnt. Thoughts?

Member

That seems reasonable to me.

@BruceForstall
Member

@kunalspathak Do you want to convert this to "Draft" so it doesn't get considered stale?

@kunalspathak
Member Author

> @kunalspathak Do you want to convert this to "Draft" so it doesn't get considered stale?

There is still work to be done to detect whether popcnt is supported and, if so, use it. I will mark it "Ready" once that is done.

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 7, 2023
@JulieLeeMSFT JulieLeeMSFT marked this pull request as draft August 14, 2023 16:33
@ghost ghost closed this Sep 13, 2023
@ghost

ghost commented Sep 13, 2023

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@ghost ghost locked as resolved and limited conversation to collaborators Oct 13, 2023
This pull request was closed.