8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions #144

Closed
wants to merge 44 commits into from


jatin-bhateja
Member

@jatin-bhateja jatin-bhateja commented Sep 13, 2020

Summary:

  1. Partial inlining avoids the call overhead for small array copies of sub-word types, for copy sizes below 32 bytes.
  2. At runtime, a conditional check on the copy length either calls an array-copy stub or executes an optimized instruction sequence, built from AVX-512 masked instructions, emitted at the call site.
  3. A new runtime flag, ArrayCopyPartialInlineSize=0/32(default)/64 bytes, sets the maximum size for partial inlining.
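
To make the dispatch in point 2 concrete, here is a minimal Java model of it. This is only a sketch and not the HotSpot implementation: the class and method names are hypothetical, the two short loops stand in for a single masked vector load and store, and the inclusive 32-byte bound is an assumption taken from the flag's default value.

```java
/** Toy model of the length-based dispatch; hypothetical names, not HotSpot code. */
final class PartialInlineSketch {
    // Assumption: mirrors the default value of ArrayCopyPartialInlineSize.
    static final int PARTIAL_INLINE_SIZE_BYTES = 32;

    /** Returns true when the short inline path was taken, false when the "stub" was called. */
    static boolean copyBytes(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length > 0 && length <= PARTIAL_INLINE_SIZE_BYTES) {
            // Stand-in for one masked load: only the first `length` lanes are read.
            byte[] lanes = new byte[PARTIAL_INLINE_SIZE_BYTES];
            for (int i = 0; i < length; i++) {
                lanes[i] = src[srcPos + i];
            }
            // Stand-in for one masked store: only the first `length` lanes are written.
            for (int i = 0; i < length; i++) {
                dst[dstPos + i] = lanes[i];
            }
            return true;
        }
        // Stand-in for the call to the array-copy stub.
        System.arraycopy(src, srcPos, dst, dstPos, length);
        return false;
    }

    public static void main(String[] args) {
        byte[] src = new byte[100];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[100];
        System.out.println(copyBytes(src, 0, dst, 0, 20)); // short copy, inline path
        System.out.println(copyBytes(src, 0, dst, 0, 70)); // long copy, stub path
    }
}
```

Loading all lanes before storing keeps the short path correct even when source and destination ranges overlap.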

Performance Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
ArrayCopyPartialInlineSize : 32

| JMH | Block Size | Baseline (ns/op) | Partial Inlining (ns/op) | Gain |
|---|---|---|---|---|
| ArrayCopyAligned.testByte | 1 | 5.417 | 2.696 | 2.009272997 |
| ArrayCopyAligned.testByte | 3 | 5.494 | 2.702 | 2.03330866 |
| ArrayCopyAligned.testByte | 5 | 5.417 | 2.637 | 2.05422829 |
| ArrayCopyAligned.testByte | 10 | 5.343 | 2.703 | 1.976692564 |
| ArrayCopyAligned.testByte | 20 | 5.837 | 2.636 | 2.214339909 |
| ArrayCopyAligned.testByte | 70 | 5.86 | 6 | 0.976666667 |
| ArrayCopyAligned.testByte | 150 | 6.766 | 6.906 | 0.979727773 |
| ArrayCopyAligned.testByte | 300 | 7.605 | 7.952 | 0.956363179 |
| ArrayCopyAligned.testByte | 600 | 11.989 | 12.007 | 0.998500874 |
| ArrayCopyAligned.testByte | 1200 | 16.447 | 16.585 | 0.991679228 |
| ArrayCopyAligned.testChar | 1 | 5.02 | 2.828 | 1.775106082 |
| ArrayCopyAligned.testChar | 3 | 5.129 | 2.762 | 1.85698769 |
| ArrayCopyAligned.testChar | 5 | 5.041 | 2.762 | 1.82512672 |
| ArrayCopyAligned.testChar | 10 | 5.716 | 2.762 | 2.069514844 |
| ArrayCopyAligned.testChar | 20 | 5.111 | 5.399 | 0.946656788 |
| ArrayCopyAligned.testChar | 70 | 6.271 | 6.242 | 1.004645947 |
| ArrayCopyAligned.testChar | 150 | 7.45 | 7.599 | 0.980392157 |
| ArrayCopyAligned.testChar | 300 | 9.904 | 10.112 | 0.97943038 |
| ArrayCopyAligned.testChar | 600 | 17.131 | 17.167 | 0.997902953 |
| ArrayCopyAligned.testChar | 1200 | 29.556 | 29.851 | 0.990117584 |
| ArrayCopyUnalignedBoth.testByte | 1 | 5.419 | 2.702 | 2.005551443 |
| ArrayCopyUnalignedBoth.testByte | 3 | 5.558 | 2.636 | 2.108497724 |
| ArrayCopyUnalignedBoth.testByte | 5 | 5.43 | 2.636 | 2.059939302 |
| ArrayCopyUnalignedBoth.testByte | 10 | 5.378 | 2.637 | 2.039438756 |
| ArrayCopyUnalignedBoth.testByte | 20 | 5.914 | 2.636 | 2.243550835 |
| ArrayCopyUnalignedBoth.testByte | 70 | 5.882 | 5.954 | 0.987907289 |
| ArrayCopyUnalignedBoth.testByte | 150 | 6.784 | 6.88 | 0.986046512 |
| ArrayCopyUnalignedBoth.testByte | 300 | 7.635 | 7.968 | 0.958207831 |
| ArrayCopyUnalignedBoth.testByte | 600 | 12.226 | 12.129 | 1.007997362 |
| ArrayCopyUnalignedBoth.testByte | 1200 | 16.992 | 20.717 | 0.820195974 |
| ArrayCopyUnalignedBoth.testChar | 1 | 5.019 | 2.828 | 1.774752475 |
| ArrayCopyUnalignedBoth.testChar | 3 | 5.163 | 2.763 | 1.868621064 |
| ArrayCopyUnalignedBoth.testChar | 5 | 5.042 | 2.827 | 1.783516095 |
| ArrayCopyUnalignedBoth.testChar | 10 | 5.718 | 2.828 | 2.021923621 |
| ArrayCopyUnalignedBoth.testChar | 20 | 5.111 | 5.404 | 0.945780903 |
| ArrayCopyUnalignedBoth.testChar | 70 | 6.367 | 6.235 | 1.02117081 |
| ArrayCopyUnalignedBoth.testChar | 150 | 7.367 | 8.269 | 0.890917886 |
| ArrayCopyUnalignedBoth.testChar | 300 | 10.358 | 10.642 | 0.973313287 |
| ArrayCopyUnalignedBoth.testChar | 600 | 20.84 | 17.522 | 1.189361945 |
| ArrayCopyUnalignedBoth.testChar | 1200 | 31.895 | 31.892 | 1.000094067 |
| ArrayCopyUnalignedDst.testByte | 1 | 5.455 | 2.637 | 2.068638604 |
| ArrayCopyUnalignedDst.testByte | 3 | 5.562 | 2.702 | 2.058475204 |
| ArrayCopyUnalignedDst.testByte | 5 | 5.427 | 2.702 | 2.008512213 |
| ArrayCopyUnalignedDst.testByte | 10 | 5.367 | 2.696 | 1.990727003 |
| ArrayCopyUnalignedDst.testByte | 20 | 5.839 | 2.637 | 2.214258627 |
| ArrayCopyUnalignedDst.testByte | 70 | 5.888 | 5.968 | 0.986595174 |
| ArrayCopyUnalignedDst.testByte | 150 | 6.785 | 6.773 | 1.001771741 |
| ArrayCopyUnalignedDst.testByte | 300 | 7.606 | 7.972 | 0.954089313 |
| ArrayCopyUnalignedDst.testByte | 600 | 11.986 | 21.195 | 0.565510734 |
| ArrayCopyUnalignedDst.testByte | 1200 | 16.54 | 16.784 | 0.985462345 |
| ArrayCopyUnalignedDst.testChar | 1 | 5.02 | 2.827 | 1.775733994 |
| ArrayCopyUnalignedDst.testChar | 3 | 5.131 | 2.762 | 1.857711803 |
| ArrayCopyUnalignedDst.testChar | 5 | 5.038 | 2.762 | 1.82404055 |
| ArrayCopyUnalignedDst.testChar | 10 | 5.718 | 2.762 | 2.070238957 |
| ArrayCopyUnalignedDst.testChar | 20 | 5.113 | 5.401 | 0.946676541 |
| ArrayCopyUnalignedDst.testChar | 70 | 6.222 | 6.214 | 1.001287416 |
| ArrayCopyUnalignedDst.testChar | 150 | 7.367 | 8.125 | 0.906707692 |
| ArrayCopyUnalignedDst.testChar | 300 | 10.204 | 10.082 | 1.012100774 |
| ArrayCopyUnalignedDst.testChar | 600 | 16.978 | 17.135 | 0.990837467 |
| ArrayCopyUnalignedDst.testChar | 1200 | 32.351 | 31.996 | 1.011095137 |
| ArrayCopyUnalignedSrc.testByte | 1 | 5.414 | 2.696 | 2.008160237 |
| ArrayCopyUnalignedSrc.testByte | 3 | 5.494 | 2.637 | 2.083428138 |
| ArrayCopyUnalignedSrc.testByte | 5 | 5.431 | 2.637 | 2.059537353 |
| ArrayCopyUnalignedSrc.testByte | 10 | 5.344 | 2.703 | 1.977062523 |
| ArrayCopyUnalignedSrc.testByte | 20 | 5.834 | 2.696 | 2.163946588 |
| ArrayCopyUnalignedSrc.testByte | 70 | 5.883 | 6.009 | 0.979031453 |
| ArrayCopyUnalignedSrc.testByte | 150 | 6.729 | 6.87 | 0.979475983 |
| ArrayCopyUnalignedSrc.testByte | 300 | 7.603 | 7.97 | 0.953952321 |
| ArrayCopyUnalignedSrc.testByte | 600 | 12.004 | 12.16 | 0.987171053 |
| ArrayCopyUnalignedSrc.testByte | 1200 | 16.534 | 16.643 | 0.9934507 |
| ArrayCopyUnalignedSrc.testChar | 1 | 5.021 | 2.762 | 1.81788559 |
| ArrayCopyUnalignedSrc.testChar | 3 | 5.13 | 2.762 | 1.857349747 |
| ArrayCopyUnalignedSrc.testChar | 5 | 5.042 | 2.827 | 1.783516095 |
| ArrayCopyUnalignedSrc.testChar | 10 | 5.726 | 2.761 | 2.073886273 |
| ArrayCopyUnalignedSrc.testChar | 20 | 5.112 | 5.401 | 0.94649139 |
| ArrayCopyUnalignedSrc.testChar | 70 | 6.113 | 6.227 | 0.981692629 |
| ArrayCopyUnalignedSrc.testChar | 150 | 7.493 | 7.888 | 0.949923935 |
| ArrayCopyUnalignedSrc.testChar | 300 | 10.234 | 10.501 | 0.97457385 |
| ArrayCopyUnalignedSrc.testChar | 600 | 17.175 | 17.142 | 1.001925096 |
| ArrayCopyUnalignedSrc.testChar | 1200 | 31.926 | 31.987 | 0.998092975 |
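
For readers checking the numbers: the Gain column appears to be the baseline time divided by the partial-inlining time, so values above 1 mean the patched build is faster. The sketch below recomputes two rows under that assumption; the class and method names are hypothetical.

```java
/** Recomputes the Gain column, assuming gain = baseline / partial inlining. */
final class GainSketch {
    static double gain(double baselineNsPerOp, double partialInliningNsPerOp) {
        return baselineNsPerOp / partialInliningNsPerOp;
    }

    public static void main(String[] args) {
        // ArrayCopyAligned.testByte, block size 1
        System.out.println(gain(5.417, 2.696));
        // ArrayCopyAligned.testChar, block size 20
        System.out.println(gain(5.111, 5.399));
    }
}
```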

Detailed Reports:
Baseline : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed

Issue

  • JDK-8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions

Download

$ git fetch https://git.openjdk.java.net/jdk pull/144/head:pull/144
$ git checkout pull/144

…l inlining using AVX-512 masked instructions.
@bridgekeeper

bridgekeeper bot commented Sep 13, 2020

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk openjdk bot added the rfr Pull request is ready for review label Sep 13, 2020
@openjdk

openjdk bot commented Sep 13, 2020

@jatin-bhateja The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

@jatin-bhateja
Member Author

/label add hotspot-compiler-dev

@openjdk

openjdk bot commented Sep 13, 2020

@jatin-bhateja
The hotspot-compiler label was successfully added.

@mlbridge

mlbridge bot commented Sep 13, 2020

Webrevs

@dholmes-ora
Member

/csr needed

Adding a new product flag requires a CSR request to be filed.

@openjdk openjdk bot added the csr Pull request needs approved CSR before integration label Sep 14, 2020
@openjdk
Copy link

openjdk bot commented Sep 14, 2020

@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@jatin-bhateja please create a CSR request and add link to it in JDK-8252848. This pull request cannot be integrated until the CSR request is approved.

@jatin-bhateja
Member Author

> /csr needed
>
> Adding a new product flag requires a CSR request to be filed.

@dholmes-ora, with 5144190e there has been a cleanup of options, and product options now accept DIAGNOSTIC as an additional parameter. The newly added flag is a DIAGNOSTIC flag.

@mlbridge

mlbridge bot commented Sep 14, 2020

Mailing list message from Andrew Haley on hotspot-dev:

On 13/09/2020 20:12, Jatin Bhateja wrote:

> 1) Partial in-lining technique avoids call overhead penalty for
> sub-word type small array copy operations with size less than 32
> bytes. 2) At runtime, a conditional check based on copy length
> either calls an array-copy stub or executes an optimized instruction
> sequence using AVX-512 masked instructions emitted at the call site.

This may not be a good idea. See my reply at
https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043114.html
https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043155.html

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@jatin-bhateja
Member Author

> Mailing list message from Andrew Haley on hotspot-dev:
> On 13/09/2020 20:12, Jatin Bhateja wrote:
>
> > 1) Partial in-lining technique avoids call overhead penalty for
> > sub-word type small array copy operations with size less than 32
> > bytes. 2) At runtime, a conditional check based on copy length
> > either calls an array-copy stub or executes an optimized instruction
> > sequence using AVX-512 masked instructions emitted at the call site.
>
> This may not be a good idea. See my reply at
> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043114.html
> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043155.html

Frequency level switchover is sensitive to vector size; this has been taken care of by using 32-byte masked vector operations in the default mode.

The default value of ArrayCopyPartialInlineSize is 32, i.e. copy sizes between 1 and 32 are partially inlined at the call site using masked vector moves operating on YMM registers.
Only if the user sets it to 64 do we use ZMM registers, which forces a switchover to a lower frequency level (LVL1).

So an AVX512 lite instruction working on a 32-byte vector (YMM) will operate at the maximum frequency level (LVL0).

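
The relationship between the flag value, the register class, and the frequency level described above can be summarized in a small lookup. This is only an illustrative sketch: the names are hypothetical, and treating 0 as "partial inlining disabled" is an assumption based on the flag being a maximum size.

```java
/** Illustrative lookup for ArrayCopyPartialInlineSize values; hypothetical names. */
final class PartialInlineFlagSketch {
    enum VectorRegister { NONE, YMM, ZMM }

    /** Maps the flag value (in bytes) to the register class used for the inline copy. */
    static VectorRegister registerFor(int arrayCopyPartialInlineSize) {
        switch (arrayCopyPartialInlineSize) {
            case 0:
                return VectorRegister.NONE; // assumption: 0 turns partial inlining off
            case 32:
                return VectorRegister.YMM;  // default: 32-byte masked moves
            case 64:
                return VectorRegister.ZMM;  // opt-in: 64-byte masked moves
            default:
                throw new IllegalArgumentException("expected 0, 32 or 64");
        }
    }

    /** Frequency level as stated in the comment: only ZMM use drops to LVL1. */
    static int frequencyLevel(VectorRegister register) {
        return register == VectorRegister.ZMM ? 1 : 0;
    }

    public static void main(String[] args) {
        for (int size : new int[] {0, 32, 64}) {
            VectorRegister register = registerFor(size);
            System.out.println(size + " -> " + register + ", LVL" + frequencyLevel(register));
        }
    }
}
```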

@mlbridge

mlbridge bot commented Sep 14, 2020

Mailing list message from Andrew Haley on hotspot-dev:

On 14/09/2020 14:18, Jatin Bhateja wrote:

> Frequency level switchover is sensitive to vector size; this has
> been taken care of by using 32-byte masked vector operations in
> the default mode.
>
> The default value of ArrayCopyPartialInlineSize is 32, i.e. copy sizes
> between 1 and 32 are partially inlined at the call site using masked vector
> moves operating on YMM registers. Only if the user sets it to 64 do we
> use ZMM registers, which forces a switchover to a
> lower frequency level (LVL1).
>
> So an AVX512 lite instruction working on a 32-byte vector (YMM)
> will operate at the maximum frequency level (LVL0).

OK, as long as you're keeping watch on this issue. We really do not
want all Java workloads to be running at lower frequency or higher
power just because of some intrinsics. Sure, if we're doing high-power
vector calculations that's fine.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

@dholmes-ora
Member

> /csr needed
> Adding a new product flag requires a CSR request to be filed.
>
> @dholmes-ora, with 5144190 there has been a cleanup of options, and product options now accept DIAGNOSTIC as an additional parameter. The newly added flag is a DIAGNOSTIC flag.

Apologies for that. Yes I got caught out by the new format.

/csr notneeded

@openjdk

openjdk bot commented Sep 15, 2020

@dholmes-ora usage: /csr [needed|unneeded], requires that the issue the pull request refers to links to an approved CSR request.

@dholmes-ora
Member

/csr unneeded

@openjdk openjdk bot removed the csr Pull request needs approved CSR before integration label Sep 15, 2020
@openjdk

openjdk bot commented Sep 15, 2020

@dholmes-ora determined that a CSR request is no longer needed for this pull request.

jddarcy and others added 15 commits September 15, 2020 20:41
Reviewed-by: coleenp, adityam, thartmann
…cros

Remove the KILL_COMPILE_ON_FATAL_ and KILL_COMPILE_ON_ANY macros, replacing uses
of KILL_COMPILE_ON_FATAL_ with CHECK_AND_CLEAR_. Unlike KILL_COMPILE_ON_FATAL_,
CHECK_AND_CLEAR_ ignores ThreadDeath exceptions, which compiler threads should
not receive anyway.

Reviewed-by: vlivanov, neliasso
Reviewed-by: tschatzl, pliden, rkennke, sjohanss
@openjdk

openjdk bot commented Sep 16, 2020

⚠️ @jatin-bhateja This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).

@jatin-bhateja jatin-bhateja deleted the JDK-8252848 branch September 16, 2020 12:50
@mlbridge

mlbridge bot commented Sep 16, 2020

Mailing list message from Bhateja, Jatin on hotspot-compiler-dev:

Hi Nils,
I have closed this pull request (PR-144) and will open a new one for partial inlining.

There is a code overlap with PR-61 because both issues belong to one parent JBS issue (JDK-8251871).
Separate pull requests, PR-61 and PR-144, were created for the two sub-tasks (JDK-8252847 and JDK-8252848).
To keep each patch complete on its own, some assembler routines are duplicated.

However, I guess it will be difficult to integrate them after review, since the bot may encounter merge conflicts.

Is there a way to get them reviewed in parallel as independent patches, without creating one unified patch?

Regards,
Jatin

openjdk-notifier bot pushed a commit that referenced this pull request Oct 5, 2021
r18 should not be used, as it is reserved as the platform register. Linux is
fine with userspace using it, but Windows and, more recently, macOS (
openjdk/jdk11u-dev#301 (comment) )
actually use it on the kernel side.

The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to
specify which registers to spill; fortunately this helper is only used
here:
https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404

I haven't seen this particular instance cause any issues in practice
_yet_, presumably because the stars must align in order to
trigger a problem (between the stp and ldp of r18, a transition to kernel
space must happen *and* the kernel needs to do something with r18). But
jdk11u-dev has more usages of the `::pusha`/`::popa` macro, and that
causes trouble, as explained in the link above.
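
To illustrate the bit pattern mentioned above: bit n of the mask selects register rn, so `0x7fffffff` selects r0 through r30, and clearing bit 18 drops r18 from the spill set. The Java sketch below only demonstrates that arithmetic; the names are hypothetical, and it is not a claim about how the actual fix is written in the macro assembler.

```java
/** Bit arithmetic behind a register spill mask; hypothetical names, not HotSpot code. */
final class SpillMaskSketch {
    static final int ALL_R0_TO_R30 = 0x7fffffff; // bits 0..30 set, as quoted above
    static final int PLATFORM_REGISTER = 18;     // r18

    /** Returns the mask with the bit for r18 cleared. */
    static int withoutPlatformRegister(int mask) {
        return mask & ~(1 << PLATFORM_REGISTER);
    }

    /** True if register number `reg` is selected by the mask. */
    static boolean spills(int mask, int reg) {
        return (mask & (1 << reg)) != 0;
    }

    public static void main(String[] args) {
        int mask = withoutPlatformRegister(ALL_R0_TO_R30);
        System.out.println(Integer.toHexString(mask)); // 7ffbffff
        System.out.println(Integer.bitCount(mask));    // 30
        System.out.println(spills(mask, 18));          // false
    }
}
```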

Output of `-XX:+PrintInterpreter` before this change:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000138809b00, 0x000000013880a280]  1920 bytes
--------------------------------------------------------------------------------
  0x0000000138809b00:   ldr x2, [x12, #16]
  0x0000000138809b04:   ldrh    w2, [x2, #44]
  0x0000000138809b08:   add x24, x20, x2, uxtx #3
  0x0000000138809b0c:   sub x24, x24, #0x8
[...]
  0x0000000138809fa4:   stp x16, x17, [sp, #128]
  0x0000000138809fa8:   stp x18, x19, [sp, #144]
  0x0000000138809fac:   stp x20, x21, [sp, #160]
[...]
  0x0000000138809fc0:   stp x30, xzr, [sp, #240]
  0x0000000138809fc4:   mov x0, x28
 ;; 0x10864ACCC
  0x0000000138809fc8:   mov x9, #0xaccc                 // #44236
  0x0000000138809fcc:   movk    x9, #0x864, lsl #16
  0x0000000138809fd0:   movk    x9, #0x1, lsl #32
  0x0000000138809fd4:   blr x9
  0x0000000138809fd8:   ldp x2, x3, [sp, #16]
[...]
  0x0000000138809ff4:   ldp x16, x17, [sp, #128]
  0x0000000138809ff8:   ldp x18, x19, [sp, #144]
  0x0000000138809ffc:   ldp x20, x21, [sp, #160]
```

After:
```
----------------------------------------------------------------------
method entry point (kind = native)  [0x0000000108e4db00, 0x0000000108e4e280]  1920 bytes

--------------------------------------------------------------------------------
  0x0000000108e4db00:   ldr x2, [x12, #16]
  0x0000000108e4db04:   ldrh    w2, [x2, #44]
  0x0000000108e4db08:   add x24, x20, x2, uxtx #3
  0x0000000108e4db0c:   sub x24, x24, #0x8
[...]
  0x0000000108e4dfa4:   stp x16, x17, [sp, #128]
  0x0000000108e4dfa8:   stp x19, x20, [sp, #144]
  0x0000000108e4dfac:   stp x21, x22, [sp, #160]
[...]
  0x0000000108e4dfbc:   stp x29, x30, [sp, #224]
  0x0000000108e4dfc0:   mov x0, x28
 ;; 0x107E4A06C
  0x0000000108e4dfc4:   mov x9, #0xa06c                 // #41068
  0x0000000108e4dfc8:   movk    x9, #0x7e4, lsl #16
  0x0000000108e4dfcc:   movk    x9, #0x1, lsl #32
  0x0000000108e4dfd0:   blr x9
  0x0000000108e4dfd4:   ldp x2, x3, [sp, #16]
[...]
  0x0000000108e4dff0:   ldp x16, x17, [sp, #128]
  0x0000000108e4dff4:   ldp x19, x20, [sp, #144]
  0x0000000108e4dff8:   ldp x21, x22, [sp, #160]
[...]
```
lewurm added a commit to lewurm/openjdk that referenced this pull request Oct 6, 2021
Restore looks like this now:
```
  0x0000000106e4dfcc:   movk    x9, #0x5e4, lsl #16
  0x0000000106e4dfd0:   movk    x9, #0x1, lsl #32
  0x0000000106e4dfd4:   blr x9
  0x0000000106e4dfd8:   ldp x2, x3, [sp, #16]
  0x0000000106e4dfdc:   ldp x4, x5, [sp, #32]
  0x0000000106e4dfe0:   ldp x6, x7, [sp, #48]
  0x0000000106e4dfe4:   ldp x8, x9, [sp, #64]
  0x0000000106e4dfe8:   ldp x10, x11, [sp, #80]
  0x0000000106e4dfec:   ldp x12, x13, [sp, #96]
  0x0000000106e4dff0:   ldp x14, x15, [sp, #112]
  0x0000000106e4dff4:   ldp x16, x17, [sp, #128]
  0x0000000106e4dff8:   ldp x0, x1, [sp], #144
  0x0000000106e4dffc:   ldp xzr, x19, [sp], #16
  0x0000000106e4e000:   ldp x22, x23, [sp, #16]
  0x0000000106e4e004:   ldp x24, x25, [sp, #32]
  0x0000000106e4e008:   ldp x26, x27, [sp, #48]
  0x0000000106e4e00c:   ldp x28, x29, [sp, #64]
  0x0000000106e4e010:   ldp x30, xzr, [sp, #80]
  0x0000000106e4e014:   ldp x20, x21, [sp], #96
  0x0000000106e4e018:   ldur    x12, [x29, #-24]
  0x0000000106e4e01c:   ldr x22, [x12, #16]
  0x0000000106e4e020:   add x22, x22, #0x30
  0x0000000106e4e024:   ldr x8, [x28, #8]
```
robehn pushed a commit to robehn/jdk that referenced this pull request Nov 13, 2023
fg1417 added a commit to fg1417/jdk that referenced this pull request Jul 12, 2024
…g to pointer

In the cases like:
```
  UNSAFE.putLong(address + off1 + 1030, lseed);
  UNSAFE.putLong(address + 1023, lseed);
  UNSAFE.putLong(address + off2 + 1001, lseed);
```

Unsafe intrinsifies direct memory access using a long as
the base address, generating a `CastX2P` node converting
long to pointer in C2. Then we get optoassembly code like:
```
  ldr  R10, [R15, #120]    # int ! Field: address
  ldr  R11, [R16, #136]    # int ! Field: off1
  ldr  R12, [R16, #144]    # int ! Field: off2
  add  R11, R11, R10
  mov R11, R11    # long -> ptr
  add  R12, R12, R10
  mov R10, R10    # long -> ptr
  add R11, R11, #1030    # ptr
  str  R17, [R11]    # int
  add R10, R10, #1023    # ptr
  str  R17, [R10]    # int
  mov R10, R12    # long -> ptr
  add R10, R10, #1001    # ptr
  str  R17, [R10]    # int
```

On aarch64, the conversion from long to pointer could be
a nop, but C2 doesn't know that. The existing code emits
nothing for `mov dst src` only when `dst` == `src` [1],
so we get this assembly:
```
  ldr    x10, [x15,#120]
  ldp    x11, x12, [x16,#136]
  add    x11, x11, x10
  add    x12, x12, x10
  add    x11, x11, #0x406
  str    x17, [x11]
  add    x10, x10, #0x3ff
  str    x17, [x10]
  mov    x10, x12  <--- extra register copy
  add    x10, x10, #0x3e9
  str    x17, [x10]
```

There is still one extra register copy, which we're trying
to remove in this patch.
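
As a toy model of the existing behavior described above (a move is dropped only when the register allocator happened to pick the same register for source and destination), consider the following sketch. It is written in Java purely for illustration; the class and method names are hypothetical and it does not reflect the actual ADLC rule or its encoding.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy emitter for a long-to-pointer cast; hypothetical, not the aarch64.ad rule. */
final class MoveElisionSketch {
    /** Emits nothing when dst and src are the same register, otherwise one mov. */
    static List<String> emitCastX2P(String dst, String src) {
        List<String> instructions = new ArrayList<>();
        if (!dst.equals(src)) {
            instructions.add("mov " + dst + ", " + src);
        }
        return instructions;
    }

    public static void main(String[] args) {
        System.out.println(emitCastX2P("x10", "x10")); // same register: no instruction
        System.out.println(emitCastX2P("x10", "x12")); // different registers: extra copy
    }
}
```

The second call corresponds to the `mov x10, x12` left over in the listing: the cast is only free when both operands land in the same register.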

This patch folds `CastX2P` into memory operands by introducing
`indirectX2P` and `indOffX2P`. We also create a new opclass
`iRegPorL2P` to remove extra copies from `CastX2P` in pointer
addition.

Tier 1~3 passed on aarch64. No obvious change in size
of libjvm.so

[1] https://github.com/openjdk/jdk/blob/5c612c230b0a852aed5fd36e58b82ebf2e1838af/src/hotspot/cpu/aarch64/aarch64.ad#L7906
fg1417 added a commit to fg1417/jdk that referenced this pull request Jul 12, 2024
This patch forces `CastX2P` to be a two-address instruction,
so that C2 can allocate the same register for dst and
src. Then we can remove the instruction completely from the
assembly.

The motivation comes from cast operations like `castPP`.
For ADLC, the difference between `castPP` and `CastX2P` is
that `CastX2P` always has different types for dst and src.
We can force ADLC to generate an extra `two_adr()` for `CastX2P`,
as it does automatically for `castPP`, which tells the register
allocator that the instruction needs the same register for dst
and src.

However, RA and GCM in C2 sometimes don't work as expected.

For example, the existing code gives this assembly:
```
  ldp    x10, x11, [x17,#136]
  add    x10, x10, x15
  add    x11, x11, x10
  ldr    x12, [x17,#152]
  str    x16, [x10]
  add    x10, x12, x15
  str    x16, [x11]
  str    x16, [x10]
```

After applying the patch, the assembly is:
```
  ldr    x10, [x16,#136]  <--- 1
  add    x10, x10, x15
  ldr    x11, [x16,#144]  <--- 2
  mov    x13, x10         <--- 3
  str    x17, [x13]
  ldr    x12, [x16,#152]
  add    x10, x11, x10
  str    x17, [x10]
  add    x10, x12, x15
  str    x17, [x10]
```

C2 generates an entirely extra mov (see 3), and we even lose the chance
to merge the loads into a load pair (see 1 and 2). That's terrible.

Although this scenario would disappear after combining with
openjdk#20157, I'm
still not sure if this patch is worthwhile.
pfirmstone added a commit to pfirmstone/jdk-with-authorization that referenced this pull request Nov 18, 2024