-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions #144
Conversation
…l inlining using AVX-512 masked instructions.
👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into |
@jatin-bhateja The following label will be automatically applied to this pull request: When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the |
/label add hotspot-compiler-dev |
@jatin-bhateja |
Webrevs
|
Reviewed-by: coleenp
Reviewed-by: iklam, ccheung
/csr needed Adding a new product flag requires a CSR request to be filed. |
@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request. |
@dholmes-ora , with 5144190e there has been a clean up of options and product options now accept DIAGNOSTIC as an additional parameter. Newly added flag is a DIAGNOSTIC flag. |
Reviewed-by: kbarrett, stefank, eosterlund
Reviewed-by: iklam, dholmes
Reviewed-by: kvn, jcm
…s that require type inference Reviewed-by: vromero
Mailing list message from Andrew Haley on hotspot-dev: On 13/09/2020 20:12, Jatin Bhateja wrote:
This may not be a good idea. See my reply at -- |
Frequency level switchover is sensitive to vector size, this has been taken care of by using a 32 byte vector masked operations in default mode. Default value of ArrayCopyPartialInlineSize is 32 i.e. copy sizes b/w 1-32 are partially in lined at the call site using masked vector moves operating over YMM registers. So an AVX512 lite instruction working over a 32 byte vector (YMM) will operate a maximum frequency level (LVL0).
|
Mailing list message from Andrew Haley on hotspot-dev: On 14/09/2020 14:18, Jatin Bhateja wrote:
OK, as long as you're keeping watch on this issue. We really do not -- |
Reviewed-by: vromero
Reviewed-by: dholmes, lucy
Reviewed-by: kcr, herrick, naoto
Apologies for that. Yes I got caught out by the new format. /csr notneeded |
@dholmes-ora usage: |
/csr unneeded |
@dholmes-ora determined that a CSR request is no longer needed for this pull request. |
Reviewed-by: erikj, adityam
…ig screens Reviewed-by: prr
Reviewed-by: lancea, joehw
Reviewed-by: coleenp, adityam, thartmann
Reviewed-by: thartmann, adityam
Reviewed-by: adityam, vlivanov
…cros Remove the KILL_COMPILE_ON_FATAL_ and KILL_COMPILE_ON_ANY macros, replacing uses of KILL_COMPILE_ON_FATAL_ with CHECK_AND_CLEAR_. Unlike KILL_COMPILE_ON_FATAL_, CHECK_AND_CLEAR_ ignores ThreadDeath exceptions, which compiler threads should not receive anyway. Reviewed-by: vlivanov, neliasso
Reviewed-by: rkennke
Reviewed-by: prr, serb
Reviewed-by: shade, dfuchs, alanb, chegar
Reviewed-by: tschatzl
Reviewed-by: tschatzl, kbarrett
Reviewed-by: tschatzl, pliden, rkennke, sjohanss
… num_queues() Reviewed-by: rkennke, zgu
…rify Reviewed-by: rkennke, zgu
|
…l inlining using AVX-512 masked instructions.
… during pattern matching in ArrayCopyNode::may_modify().
…opy before Memory Barrier.
Mailing list message from Bhateja, Jatin on hotspot-compiler-dev: Hi Nils, There is a code overlap with PR-61 because both these issues were related to one parent JBS (JDK-8251871). But, I guess it will be difficult to integrate them post review since bot may encounter merge conflicts. Is there a way to get them review in parallel as independent patches without creating one unified patch? Regards, |
r18 should not be used as it is reserved as platform register. Linux is fine with userspace using it, but Windows and also recently macOS ( openjdk/jdk11u-dev#301 (comment) ) are actually using it on the kernel side. The macro assembler uses the bit pattern `0x7fffffff` (== `r0-r30`) to specify which registers to spill; fortunately this helper is only used here: https://github.com/openjdk/jdk/blob/c05dc268acaf87236f30cf700ea3ac778e3b20e5/src/hotspot/cpu/aarch64/templateInterpreterGenerator_aarch64.cpp#L1400-L1404 I haven't seen causing this particular instance any issues in practice _yet_, presumably because it looks hard to align the stars in order to trigger a problem (between stp and ldp of r18 a transition to kernel space must happen *and* the kernel needs to do something with r18). But jdk11u-dev has more usages of the `::pusha`/`::popa` macro and that causes troubles as explained in the link above. Output of `-XX:+PrintInterpreter` before this change: ``` ---------------------------------------------------------------------- method entry point (kind = native) [0x0000000138809b00, 0x000000013880a280] 1920 bytes -------------------------------------------------------------------------------- 0x0000000138809b00: ldr x2, [x12, #16] 0x0000000138809b04: ldrh w2, [x2, #44] 0x0000000138809b08: add x24, x20, x2, uxtx #3 0x0000000138809b0c: sub x24, x24, #0x8 [...] 0x0000000138809fa4: stp x16, x17, [sp, #128] 0x0000000138809fa8: stp x18, x19, [sp, #144] 0x0000000138809fac: stp x20, x21, [sp, #160] [...] 0x0000000138809fc0: stp x30, xzr, [sp, #240] 0x0000000138809fc4: mov x0, x28 ;; 0x10864ACCC 0x0000000138809fc8: mov x9, #0xaccc // #44236 0x0000000138809fcc: movk x9, #0x864, lsl #16 0x0000000138809fd0: movk x9, #0x1, lsl #32 0x0000000138809fd4: blr x9 0x0000000138809fd8: ldp x2, x3, [sp, #16] [...] 0x0000000138809ff4: ldp x16, x17, [sp, #128] 0x0000000138809ff8: ldp x18, x19, [sp, #144] 0x0000000138809ffc: ldp x20, x21, [sp, #160] ``` After: ``` ---------------------------------------------------------------------- method entry point (kind = native) [0x0000000108e4db00, 0x0000000108e4e280] 1920 bytes -------------------------------------------------------------------------------- 0x0000000108e4db00: ldr x2, [x12, #16] 0x0000000108e4db04: ldrh w2, [x2, #44] 0x0000000108e4db08: add x24, x20, x2, uxtx #3 0x0000000108e4db0c: sub x24, x24, #0x8 [...] 0x0000000108e4dfa4: stp x16, x17, [sp, #128] 0x0000000108e4dfa8: stp x19, x20, [sp, #144] 0x0000000108e4dfac: stp x21, x22, [sp, #160] [...] 0x0000000108e4dfbc: stp x29, x30, [sp, #224] 0x0000000108e4dfc0: mov x0, x28 ;; 0x107E4A06C 0x0000000108e4dfc4: mov x9, #0xa06c // #41068 0x0000000108e4dfc8: movk x9, #0x7e4, lsl #16 0x0000000108e4dfcc: movk x9, #0x1, lsl #32 0x0000000108e4dfd0: blr x9 0x0000000108e4dfd4: ldp x2, x3, [sp, #16] [...] 0x0000000108e4dff0: ldp x16, x17, [sp, #128] 0x0000000108e4dff4: ldp x19, x20, [sp, #144] 0x0000000108e4dff8: ldp x21, x22, [sp, #160] [...] ```
Restore looks like this now: ``` 0x0000000106e4dfcc: movk x9, #0x5e4, lsl openjdk#16 0x0000000106e4dfd0: movk x9, #0x1, lsl openjdk#32 0x0000000106e4dfd4: blr x9 0x0000000106e4dfd8: ldp x2, x3, [sp, openjdk#16] 0x0000000106e4dfdc: ldp x4, x5, [sp, openjdk#32] 0x0000000106e4dfe0: ldp x6, x7, [sp, openjdk#48] 0x0000000106e4dfe4: ldp x8, x9, [sp, openjdk#64] 0x0000000106e4dfe8: ldp x10, x11, [sp, openjdk#80] 0x0000000106e4dfec: ldp x12, x13, [sp, openjdk#96] 0x0000000106e4dff0: ldp x14, x15, [sp, openjdk#112] 0x0000000106e4dff4: ldp x16, x17, [sp, openjdk#128] 0x0000000106e4dff8: ldp x0, x1, [sp], openjdk#144 0x0000000106e4dffc: ldp xzr, x19, [sp], openjdk#16 0x0000000106e4e000: ldp x22, x23, [sp, openjdk#16] 0x0000000106e4e004: ldp x24, x25, [sp, openjdk#32] 0x0000000106e4e008: ldp x26, x27, [sp, openjdk#48] 0x0000000106e4e00c: ldp x28, x29, [sp, openjdk#64] 0x0000000106e4e010: ldp x30, xzr, [sp, openjdk#80] 0x0000000106e4e014: ldp x20, x21, [sp], openjdk#96 0x0000000106e4e018: ldur x12, [x29, #-24] 0x0000000106e4e01c: ldr x22, [x12, openjdk#16] 0x0000000106e4e020: add x22, x22, #0x30 0x0000000106e4e024: ldr x8, [x28, openjdk#8] ```
…g to pointer In the cases like: ``` UNSAFE.putLong(address + off1 + 1030, lseed); UNSAFE.putLong(address + 1023, lseed); UNSAFE.putLong(address + off2 + 1001, lseed); ``` Unsafe intrinsifies direct memory access using a long as the base address, generating a `CastX2P` node converting long to pointer in C2. Then we get optoassembly code like: ``` ldr R10, [R15, openjdk#120] # int ! Field: address ldr R11, [R16, openjdk#136] # int ! Field: off1 ldr R12, [R16, openjdk#144] # int ! Field: off2 add R11, R11, R10 mov R11, R11 # long -> ptr add R12, R12, R10 mov R10, R10 # long -> ptr add R11, R11, openjdk#1030 # ptr str R17, [R11] # int add R10, R10, openjdk#1023 # ptr str R17, [R10] # int mov R10, R12 # long -> ptr add R10, R10, openjdk#1001 # ptr str R17, [R10] # int ``` In aarch64, the conversion from long to pointer could be a nop but C2 doesn't know it. On the existing code, we do nothing for `mov dst src` only when `dst` == `src` [1], then we have assembly: ``` ldr x10, [x15,openjdk#120] ldp x11, x12, [x16,openjdk#136] add x11, x11, x10 add x12, x12, x10 add x11, x11, #0x406 str x17, [x11] add x10, x10, #0x3ff str x17, [x10] mov x10, x12 <--- extra register copy add x10, x10, #0x3e9 str x17, [x10] ``` There is still one extra register copy, which we're trying to remove in this patch. This patch folds `CastX2P` into memory operands by introducing `indirectX2P` and `indOffX2P`. We also create a new opclass `iRegPorL2P` to remove extra copies from `CastX2P` in pointer addition. Tier 1~3 passed on aarch64. No obvious change in size of libjvm.so [1] https://github.com/openjdk/jdk/blob/5c612c230b0a852aed5fd36e58b82ebf2e1838af/src/hotspot/cpu/aarch64/aarch64.ad#L7906
This patch forces `CastX2P` to be a two-address instruction, so that C2 could allocate the same register for dst and src. Then we can remove the instruction completely in the assembly. The motivation comes from some cast operations like `castPP`. The difference for ADLC between `castPP` and `CastX2P` lies in that `CastX2P` always has different types for dst and src. We can force ADLC to generate an extra `two_adr()` for `CastX2P` like it does automatically for `castPP`, which could tell register allocator that the instruction needs the same register for dst and src. However, sometimes, RA and GCM in C2 can't work as we expected. For example, we have Assembly on the existing code: ``` ldp x10, x11, [x17,openjdk#136] add x10, x10, x15 add x11, x11, x10 ldr x12, [x17,openjdk#152] str x16, [x10] add x10, x12, x15 str x16, [x11] str x16, [x10] ``` After applying the patch, the assembly is: ``` ldr x10, [x16,openjdk#136] <--- 1 add x10, x10, x15 ldr x11, [x16,openjdk#144] <--- 2 mov x13, x10 <--- 3 str x17, [x13] ldr x12, [x16,openjdk#152] add x10, x11, x10 str x17, [x10] add x10, x12, x15 str x17, [x10] ``` C2 generate a totally extra mov, see 3, and we even lost the chance to merge load pair, see 1 and 2. That's terrible. Although this scenario would disappear after combining with openjdk#20157, I'm still not sure if this patch is worthwhile.
…river/api/io/X500PrincipalSerializer.java openjdk#144
Summary:
Performance Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
ArrayCopyPartialInlineSize : 32
Detailed Reports:
Baseline : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt
Progress
Issue
Download
$ git fetch https://git.openjdk.java.net/jdk pull/144/head:pull/144
$ git checkout pull/144