8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions #144

jatin-bhateja · 2020-09-13T19:02:59Z

Summary:

Partial inlining avoids the call overhead for small array copies of sub-word element types (copy size less than 32 bytes).
At runtime, a conditional check on the copy length either calls the array-copy stub or executes an optimized sequence of AVX-512 masked instructions emitted at the call site.
A new runtime flag, ArrayCopyPartialInlineSize=0/32(default)/64 bytes, sets the maximum copy size for partial inlining.
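
The length check described above can be modelled in plain Java. This is only an illustrative sketch, not the C2 code from the patch: the class name, `PARTIAL_INLINE_SIZE`, `laneMask`, and `copyBytes` are hypothetical, the masked vector load/store is emulated with a scalar loop over a temporary buffer, and treating the threshold as inclusive is an assumption.

```java
public class PartialInlineSketch {
    // Hypothetical stand-in for ArrayCopyPartialInlineSize (in bytes).
    static final int PARTIAL_INLINE_SIZE = 32;

    // Bit i is set when byte lane i takes part in the copy.
    static int laneMask(int length) {
        return length >= 32 ? -1 : (1 << length) - 1;
    }

    // Returns true when the emulated inline path was taken,
    // false when the copy fell back to the regular arraycopy call.
    static boolean copyBytes(byte[] src, int srcPos, byte[] dst, int dstPos, int length) {
        if (length > 0 && length <= PARTIAL_INLINE_SIZE) {
            int mask = laneMask(length);
            byte[] vector = new byte[PARTIAL_INLINE_SIZE]; // models one vector register
            // Emulated masked load: only active lanes touch memory.
            for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
                if (((mask >>> lane) & 1) != 0) {
                    vector[lane] = src[srcPos + lane];
                }
            }
            // Emulated masked store: inactive lanes leave dst untouched.
            for (int lane = 0; lane < PARTIAL_INLINE_SIZE; lane++) {
                if (((mask >>> lane) & 1) != 0) {
                    dst[dstPos + lane] = vector[lane];
                }
            }
            return true;
        }
        // Larger (or empty) copies go through the normal call.
        System.arraycopy(src, srcPos, dst, dstPos, length);
        return false;
    }

    public static void main(String[] args) {
        byte[] src = new byte[70];
        for (int i = 0; i < src.length; i++) {
            src[i] = (byte) i;
        }
        byte[] dst = new byte[70];
        System.out.println("inline path for 20 bytes: " + copyBytes(src, 0, dst, 0, 20));
        System.out.println("inline path for 70 bytes: " + copyBytes(src, 0, dst, 0, 70));
    }
}
```

Loading the whole masked vector before storing it keeps the result correct when source and destination overlap, which is the same guarantee `System.arraycopy` gives.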

Performance Results:
System : CascadeLake Server, Intel(R) Xeon(R) Platinum 8280L CPU @ 2.70GHz
Micros : test/micro/org/openjdk/bench/java/lang/ArrayCopy*.java
ArrayCopyPartialInlineSize : 32

JMH	Block Size	Baseline (ns/op)	Partial Inlining (ns/op)	Gain
ArrayCopyAligned.testByte	1	5.417	2.696	2.009272997
ArrayCopyAligned.testByte	3	5.494	2.702	2.03330866
ArrayCopyAligned.testByte	5	5.417	2.637	2.05422829
ArrayCopyAligned.testByte	10	5.343	2.703	1.976692564
ArrayCopyAligned.testByte	20	5.837	2.636	2.214339909
ArrayCopyAligned.testByte	70	5.86	6	0.976666667
ArrayCopyAligned.testByte	150	6.766	6.906	0.979727773
ArrayCopyAligned.testByte	300	7.605	7.952	0.956363179
ArrayCopyAligned.testByte	600	11.989	12.007	0.998500874
ArrayCopyAligned.testByte	1200	16.447	16.585	0.991679228
ArrayCopyAligned.testChar	1	5.02	2.828	1.775106082
ArrayCopyAligned.testChar	3	5.129	2.762	1.85698769
ArrayCopyAligned.testChar	5	5.041	2.762	1.82512672
ArrayCopyAligned.testChar	10	5.716	2.762	2.069514844
ArrayCopyAligned.testChar	20	5.111	5.399	0.946656788
ArrayCopyAligned.testChar	70	6.271	6.242	1.004645947
ArrayCopyAligned.testChar	150	7.45	7.599	0.980392157
ArrayCopyAligned.testChar	300	9.904	10.112	0.97943038
ArrayCopyAligned.testChar	600	17.131	17.167	0.997902953
ArrayCopyAligned.testChar	1200	29.556	29.851	0.990117584
ArrayCopyUnalignedBoth.testByte	1	5.419	2.702	2.005551443
ArrayCopyUnalignedBoth.testByte	3	5.558	2.636	2.108497724
ArrayCopyUnalignedBoth.testByte	5	5.43	2.636	2.059939302
ArrayCopyUnalignedBoth.testByte	10	5.378	2.637	2.039438756
ArrayCopyUnalignedBoth.testByte	20	5.914	2.636	2.243550835
ArrayCopyUnalignedBoth.testByte	70	5.882	5.954	0.987907289
ArrayCopyUnalignedBoth.testByte	150	6.784	6.88	0.986046512
ArrayCopyUnalignedBoth.testByte	300	7.635	7.968	0.958207831
ArrayCopyUnalignedBoth.testByte	600	12.226	12.129	1.007997362
ArrayCopyUnalignedBoth.testByte	1200	16.992	20.717	0.820195974
ArrayCopyUnalignedBoth.testChar	1	5.019	2.828	1.774752475
ArrayCopyUnalignedBoth.testChar	3	5.163	2.763	1.868621064
ArrayCopyUnalignedBoth.testChar	5	5.042	2.827	1.783516095
ArrayCopyUnalignedBoth.testChar	10	5.718	2.828	2.021923621
ArrayCopyUnalignedBoth.testChar	20	5.111	5.404	0.945780903
ArrayCopyUnalignedBoth.testChar	70	6.367	6.235	1.02117081
ArrayCopyUnalignedBoth.testChar	150	7.367	8.269	0.890917886
ArrayCopyUnalignedBoth.testChar	300	10.358	10.642	0.973313287
ArrayCopyUnalignedBoth.testChar	600	20.84	17.522	1.189361945
ArrayCopyUnalignedBoth.testChar	1200	31.895	31.892	1.000094067
ArrayCopyUnalignedDst.testByte	1	5.455	2.637	2.068638604
ArrayCopyUnalignedDst.testByte	3	5.562	2.702	2.058475204
ArrayCopyUnalignedDst.testByte	5	5.427	2.702	2.008512213
ArrayCopyUnalignedDst.testByte	10	5.367	2.696	1.990727003
ArrayCopyUnalignedDst.testByte	20	5.839	2.637	2.214258627
ArrayCopyUnalignedDst.testByte	70	5.888	5.968	0.986595174
ArrayCopyUnalignedDst.testByte	150	6.785	6.773	1.001771741
ArrayCopyUnalignedDst.testByte	300	7.606	7.972	0.954089313
ArrayCopyUnalignedDst.testByte	600	11.986	21.195	0.565510734
ArrayCopyUnalignedDst.testByte	1200	16.54	16.784	0.985462345
ArrayCopyUnalignedDst.testChar	1	5.02	2.827	1.775733994
ArrayCopyUnalignedDst.testChar	3	5.131	2.762	1.857711803
ArrayCopyUnalignedDst.testChar	5	5.038	2.762	1.82404055
ArrayCopyUnalignedDst.testChar	10	5.718	2.762	2.070238957
ArrayCopyUnalignedDst.testChar	20	5.113	5.401	0.946676541
ArrayCopyUnalignedDst.testChar	70	6.222	6.214	1.001287416
ArrayCopyUnalignedDst.testChar	150	7.367	8.125	0.906707692
ArrayCopyUnalignedDst.testChar	300	10.204	10.082	1.012100774
ArrayCopyUnalignedDst.testChar	600	16.978	17.135	0.990837467
ArrayCopyUnalignedDst.testChar	1200	32.351	31.996	1.011095137
ArrayCopyUnalignedSrc.testByte	1	5.414	2.696	2.008160237
ArrayCopyUnalignedSrc.testByte	3	5.494	2.637	2.083428138
ArrayCopyUnalignedSrc.testByte	5	5.431	2.637	2.059537353
ArrayCopyUnalignedSrc.testByte	10	5.344	2.703	1.977062523
ArrayCopyUnalignedSrc.testByte	20	5.834	2.696	2.163946588
ArrayCopyUnalignedSrc.testByte	70	5.883	6.009	0.979031453
ArrayCopyUnalignedSrc.testByte	150	6.729	6.87	0.979475983
ArrayCopyUnalignedSrc.testByte	300	7.603	7.97	0.953952321
ArrayCopyUnalignedSrc.testByte	600	12.004	12.16	0.987171053
ArrayCopyUnalignedSrc.testByte	1200	16.534	16.643	0.9934507
ArrayCopyUnalignedSrc.testChar	1	5.021	2.762	1.81788559
ArrayCopyUnalignedSrc.testChar	3	5.13	2.762	1.857349747
ArrayCopyUnalignedSrc.testChar	5	5.042	2.827	1.783516095
ArrayCopyUnalignedSrc.testChar	10	5.726	2.761	2.073886273
ArrayCopyUnalignedSrc.testChar	20	5.112	5.401	0.94649139
ArrayCopyUnalignedSrc.testChar	70	6.113	6.227	0.981692629
ArrayCopyUnalignedSrc.testChar	150	7.493	7.888	0.949923935
ArrayCopyUnalignedSrc.testChar	300	10.234	10.501	0.97457385
ArrayCopyUnalignedSrc.testChar	600	17.175	17.142	1.001925096
ArrayCopyUnalignedSrc.testChar	1200	31.926	31.987	0.998092975
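
For readers checking the table, the Gain column appears to be the baseline time divided by the partial-inlining time, so values above 1 mean the optimized build is faster. A small sketch of that arithmetic follows; the class and method names are hypothetical, and the inputs are taken from the first and sixth `ArrayCopyAligned.testByte` rows.

```java
import java.util.Locale;

public class GainColumn {
    // Gain = baseline ns/op divided by optimized ns/op.
    static double gain(double baselineNsPerOp, double optimizedNsPerOp) {
        return baselineNsPerOp / optimizedNsPerOp;
    }

    public static void main(String[] args) {
        // ArrayCopyAligned.testByte, block size 1: about 2.009
        System.out.println(String.format(Locale.ROOT, "%.3f", gain(5.417, 2.696)));
        // ArrayCopyAligned.testByte, block size 70: about 0.977
        System.out.println(String.format(Locale.ROOT, "%.3f", gain(5.86, 6.0)));
    }
}
```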

Detailed Reports:
Baseline : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_Baseline.txt
WithOpt : http://cr.openjdk.java.net/~jbhateja/8252848/JMH_results/JMH_With_PI_Opts.txt

Progress

Change must not contain extraneous whitespace
Commit message must refer to an issue
Change must be properly reviewed

Issue

JDK-8252848: Optimize small primitive arrayCopy operations through partial inlining using AVX-512 masked instructions

Download

$ git fetch https://git.openjdk.java.net/jdk pull/144/head:pull/144
$ git checkout pull/144


bridgekeeper · 2020-09-13T19:04:24Z

👋 Welcome back jbhateja! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2020-09-13T19:05:18Z

@jatin-bhateja The following label will be automatically applied to this pull request: hotspot.

When this pull request is ready to be reviewed, an RFR email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label (add|remove) "label" command.

jatin-bhateja · 2020-09-13T19:05:58Z

/label add hotspot-compiler-dev

openjdk · 2020-09-13T19:06:21Z

@jatin-bhateja
The hotspot-compiler label was successfully added.

mlbridge · 2020-09-13T19:12:02Z

Webrevs


dholmes-ora · 2020-09-14T04:02:10Z

/csr needed

Adding a new product flag requires a CSR request to be filed.

openjdk · 2020-09-14T04:02:50Z

@dholmes-ora has indicated that a compatibility and specification (CSR) request is needed for this pull request.
@jatin-bhateja please create a CSR request and add link to it in JDK-8252848. This pull request cannot be integrated until the CSR request is approved.

jatin-bhateja · 2020-09-14T05:01:24Z

> /csr needed
>
> Adding a new product flag requires a CSR request to be filed.

@dholmes-ora, since 5144190e the option declarations have been cleaned up, and product options now accept DIAGNOSTIC as an additional parameter. The newly added flag is a DIAGNOSTIC flag.


mlbridge · 2020-09-14T08:41:20Z

Mailing list message from Andrew Haley on hotspot-dev:

On 13/09/2020 20:12, Jatin Bhateja wrote:

> 1) Partial in-lining technique avoids call overhead penalty for
> sub-word type small array copy operations with size less than 32
> bytes. 2) At runtime, a conditional check based on copy length
> either calls an array-copy stub or executes an optimized instruction
> sequence using AVX-512 masked instructions emitted at the call site.

This may not be a good idea. See my reply at
https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043114.html
https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043155.html

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671

jatin-bhateja · 2020-09-14T13:16:02Z

> Mailing list message from Andrew Haley on hotspot-dev:
> On 13/09/2020 20:12, Jatin Bhateja wrote:
>
> 1) Partial in-lining technique avoids call overhead penalty for
> sub-word type small array copy operations with size less than 32
> bytes. 2) At runtime, a conditional check based on copy length
> either calls an array-copy stub or executes an optimized instruction
> sequence using AVX-512 masked instructions emitted at the call site.
>
> This may not be a good idea. See my reply at
> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043114.html
> https://mail.openjdk.java.net/pipermail/hotspot-dev/2020-September/043155.html

The frequency-level switchover is sensitive to vector size; this is handled by using 32-byte masked vector operations in the default mode.

The default value of ArrayCopyPartialInlineSize is 32, i.e. copy sizes between 1 and 32 bytes are partially inlined at the call site using masked vector moves operating on YMM registers.
Only if the user sets it to 64 do we use ZMM registers, which forces a switch to a lower frequency level (LVL1).

So an AVX512 lite instruction operating on a 32-byte vector (YMM) runs at the maximum frequency level (LVL0).
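
The relationship between the flag value, the register width, and the number of elements that qualify can be sketched as follows. This is a hedged illustration only: the class and method names are hypothetical, and reading the value 0 as "partial inlining disabled" is an assumption based on the flag description.

```java
public class InlineLimit {
    // Register class used for a given ArrayCopyPartialInlineSize value (bytes).
    static String registerFor(int partialInlineSizeBytes) {
        switch (partialInlineSizeBytes) {
            case 0:  return "none"; // assumed: partial inlining disabled
            case 32: return "YMM";  // default, stays at frequency level LVL0
            case 64: return "ZMM";  // opt-in, may drop to frequency level LVL1
            default: throw new IllegalArgumentException("expected 0, 32 or 64");
        }
    }

    // Largest element count whose copy still fits in one masked vector move.
    static int maxInlinedElements(int partialInlineSizeBytes, int elementSizeBytes) {
        return partialInlineSizeBytes / elementSizeBytes;
    }

    public static void main(String[] args) {
        System.out.println("byte elements at 32: " + maxInlinedElements(32, Byte.BYTES));
        System.out.println("char elements at 32: " + maxInlinedElements(32, Character.BYTES));
        System.out.println("register at 64: " + registerFor(64));
    }
}
```

With the default of 32 bytes this gives 32 byte elements but only 16 char elements, which is consistent with the results table: `testByte` with block size 20 shows a gain, while `testChar` with block size 20 (40 bytes) does not.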


mlbridge · 2020-09-14T16:45:16Z

Mailing list message from Andrew Haley on hotspot-dev:

On 14/09/2020 14:18, Jatin Bhateja wrote:

> Frequency level switchover is sensitive to vector size, this has
> been taken care of by using a 32 byte vector masked operations in
> default mode.
>
> Default value of ArrayCopyPartialInlineSize is 32 i.e. copy sizes
> b/w 1-32 are partially in lined at the call site using masked vector
> moves operating over YMM registers. Only if user sets it to 64 we
> use ZMMs registers which forces a frequency level switch over to a
> lower frequency level (LVL1).
>
> So an AVX512 lite instruction working over a 32 byte vector (YMM)
> will operate a maximum frequency level (LVL0).

OK, as long as you're keeping watch on this issue. We really do not
want all Java workloads to be running at lower frequency or higher
power just because of some intrinsics. Sure, if we're doing high-power
vector calculations that's fine.

--
Andrew Haley (he/him)
Java Platform Lead Engineer
Red Hat UK Ltd. <https://www.redhat.com>
https://keybase.io/andrewhaley
EAC8 43EB D3EF DB98 CC77 2FAD A5CD 6035 332F A671


dholmes-ora · 2020-09-15T02:22:49Z

> /csr needed
> Adding a new product flag requires a CSR request to be filed.
>
> @dholmes-ora , with 5144190 there has been a clean up of options and product options now accept DIAGNOSTIC as an additional parameter. Newly added flag is a DIAGNOSTIC flag.

Apologies for that. Yes I got caught out by the new format.

/csr notneeded

openjdk · 2020-09-15T02:23:29Z

@dholmes-ora usage: /csr [needed|unneeded], requires that the issue the pull request refers to links to an approved CSR request.

dholmes-ora · 2020-09-15T02:26:48Z

/csr unneeded

openjdk · 2020-09-15T02:27:14Z

@dholmes-ora determined that a CSR request is no longer needed for this pull request.


openjdk · 2020-09-16T12:38:42Z

⚠️ @jatin-bhateja This pull request contains merges that bring in commits not present in the target repository. Since this is not a "merge style" pull request, these changes will be squashed when this pull request is integrated. If this is your intention, then please ignore this message. If you want to preserve the commit structure, you must change the title of this pull request to Merge <project>:<branch> where <project> is the name of another project in the OpenJDK organization (for example Merge jdk:master).


mlbridge · 2020-09-16T13:10:18Z

Mailing list message from Bhateja, Jatin on hotspot-compiler-dev:

Hi Nils,
I have closed this pull request-144 and will re-open a new one for partial in-lining.

There is a code overlap with PR-61 because both these issues were related to one parent JBS (JDK-8251871).
Different pull requests PR61 and PR144 were created for each of the sub-tasks (JDK-8252847 and JDK-8252848).
For completeness of the independent patches there is some duplication of assembler routines.

However, I expect it will be difficult to integrate them after review, since the bot may encounter merge conflicts.

Is there a way to get them reviewed in parallel as independent patches without creating one unified patch?

Regards,
Jatin


8252848: Optimize small primitive arrayCopy operations through partia…

1601fba

…l inlining using AVX-512 masked instructions.

openjdk bot added the rfr Pull request is ready for review label Sep 13, 2020

openjdk bot added the hotspot label Sep 13, 2020

openjdk bot added the hotspot-compiler label Sep 13, 2020

iklam and others added 2 commits September 13, 2020 19:20

8248186: Move CDS C++ vtable code to cppVtables.cpp

c5e63b6

Reviewed-by: coleenp

8252689: Classes are loaded from jrt:/java.base even when CDS is used

f978f6f

Reviewed-by: iklam, ccheung

openjdk bot added the csr Pull request needs approved CSR before integration label Sep 14, 2020

pliden and others added 4 commits September 14, 2020 07:06

8253030: ZGC: Change ZMarkCompleteTimeout unit to microseconds

07da3a1

Reviewed-by: kbarrett, stefank, eosterlund

8253084: Zero VM is broken after JDK-8252689

779d2c3

Reviewed-by: iklam, dholmes

8252898: remove bulk registration of JFR CompilerPhaseType names

b05290a

Reviewed-by: kvn, jcm

8240658: Code completion not working for lambdas in method invocation…

68da63d

…s that require type inference Reviewed-by: vromero

Pavel Rappo and others added 3 commits September 14, 2020 17:21

8252882: Clean up jdk.javadoc and the related parts of jdk.compiler

e6a493a

Reviewed-by: vromero

8253029: [PPC64] Remove obsolete Power6 code

9c24a56

Reviewed-by: dholmes, lucy

8223187: Remove setLocale() call in jpackage native launcher

ac9d1b0

Reviewed-by: kcr, herrick, naoto

openjdk bot removed the csr Pull request needs approved CSR before integration label Sep 15, 2020

jddarcy and others added 15 commits September 15, 2020 20:41

8253034: Update symbol generation to accomodate Git as the SCM

fc36328

Reviewed-by: erikj, adityam

8253147: The javax/swing/JPopupMenu/7154841/bug7154841.java fail on b…

65bfe09

…ig screens Reviewed-by: prr

8220483: Calendar.setTime(Date date) throws NPE with Date date = null

57f92d2

Reviewed-by: lancea, joehw

8250668: Clean up method_oop names in adlc

2caa20a

Reviewed-by: coleenp, adityam, thartmann

8253146: C2: Purge unused MachCallNode::_arg_size field

7c564e1

Reviewed-by: thartmann, adityam

8253040: Remove unused Matcher::regnum_to_fpu_offset()

fbf4699

Reviewed-by: adityam, vlivanov

8253222: Shenandoah: unused AlwaysTrueClosure after JDK-8246591

dd43533

Reviewed-by: rkennke

8253016: Box.Filler components should be unfocusable by default

60c4902

Reviewed-by: prr, serb

8245309: Re-examine use of ThreadLocalCoders in sun.net.www.ParseUtil

e0cf023

Reviewed-by: shade, dfuchs, alanb, chegar

8253220: Epsilon: clean up unused code/declarations

7f9b5d9

Reviewed-by: tschatzl

8253219: Epsilon: clean up unnecessary includes

f509eb0

Reviewed-by: tschatzl, kbarrett

8253173: Print heap before and after GC lacks a newline

33f8e70

Reviewed-by: tschatzl, pliden, rkennke, sjohanss

8253224: Shenandoah: ShenandoahStrDedupQueue destructor calls virtual…

c781594

… num_queues() Reviewed-by: rkennke, zgu

8253226: Shenandoah: remove unimplemented ShenandoahStrDedupQueue::ve…

300b851

…rify Reviewed-by: rkennke, zgu

Jatin Bhateja and others added 4 commits September 16, 2020 18:10

8252848: Optimize small primitive arrayCopy operations through partia…

ec0ca37

…l inlining using AVX-512 masked instructions.

8252848: Updating pull request-144, added a safety check on node type…

780b344

… during pattern matching in ArrayCopyNode::may_modify().

8252848: Strengthening the check to detect partially in-lined array c…

53f58e0

…opy before Memory Barrier.

8252848: Rebase patch with branch tip.

b9eaa46

jatin-bhateja closed this Sep 16, 2020

jatin-bhateja deleted the JDK-8252848 branch September 16, 2020 12:50

vnkozlov mentioned this pull request Sep 22, 2020

8173585: Intrinsify StringLatin1.indexOf(char) #71

Closed

3 tasks

