
JIT: align Tier1 methods at 16 byte boundaries for xarch [WIP] #21518

Closed

Conversation

AndyAyersMS
Member

Align Tier1, small and IBC hot methods to 16 byte boundaries for x64 and x86.
Consensus from various folks I polled was that this isn't as helpful for arm
architectures, so for now this is xarch only.

This ensures that instruction prefetch pulls in as much code as possible.

It should also improve performance stability in some benchmarks, and it opens
the door to possible loop-top alignment padding.

Resolves #16873.
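
For illustration, a minimal sketch of the padding arithmetic this implies (C#, with made-up names; the real change lives in the jit and the VM's code allocator):

```csharp
// Illustrative only -- not coreclr code. Pad bytes needed so a method
// body starting at `address` begins on a 16 byte boundary.
static int PadToAlignment(ulong address, uint alignment = 16)
{
    uint rem = (uint)(address % alignment);
    return rem == 0 ? 0 : (int)(alignment - rem);
}
// e.g. PadToAlignment(0x7ffe75d17fd2) == 14
```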

@AndyAyersMS
Member Author

Keeping this as [WIP] for now: I'd like to see some of the other R2R and tiering perf changes land first, evaluation of this change may be tricky, and nothing depends on it. I'll also be on vacation a fair amount in the coming weeks.

I don't expect 16 byte alignment to universally improve perf, since whatever method alignment we get now on a given run may happen to work out better. But over time this alignment should be more perf-stable, and it also gives the jit the opportunity to pad internally to avoid bad instruction-fetch behavior at hot loops, eventually ending up in an overall better place.

I plan to make multiple measurements on different machines and over time, to try to show that results are indeed more stable, and (for benchmarks that consistently stand out as better or worse) to drill in and understand why.

I don't think this will cause much extra code space fragmentation -- though aligning small methods may have a notable impact. So another goal is to actually measure the size impact, especially on realistic apps.

Using the performance repo, I have done a couple of base and diff runs of local builds on a Skylake i7-6700 and compared them. Data below is filtered to an absolute difference of at least 1ms, as results for shorter-running tests seem to swing wildly. @adamsitnik if you want to experiment with this change and see why there's such volatility in the shorter tests, I'd love to see any insights you uncover.

cc @dotnet/jit-contrib @fiigii

BASE vs DIFF 1

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| LinqBenchmarks.Order00ManualX | 1.13 | 128679250.00 | 145741725.00 | |
| SciMark2.kernel.benchMonteCarlo | 1.06 | 720936900.00 | 761832900.00 | can have several modes |
| LinqBenchmarks.Where00ForX | 1.04 | 407904300.00 | 424595200.00 | |
| BenchmarksGame.BinaryTrees_2.RunBench | 1.04 | 100067050.00 | 103815100.00 | |
| ByteMark.BenchNeural | 1.03 | 684599100.00 | 706311100.00 | |
| CscBench.DatflowTest | 1.03 | 375537650.00 | 386855100.00 | |
| Benchstone.BenchF.NewtE.Test | 1.02 | 396415000.00 | 405177150.00 | bimodal |
| Benchstone.BenchF.Regula.Test | 1.02 | 238591750.00 | 243644100.00 | |
| LinqBenchmarks.Count00ForX | 1.02 | 328557500.00 | 334972000.00 | |
| LinqBenchmarks.Where01LinqMethodNestedX | 1.01 | 367130500.00 | 370483800.00 | |
| LinqBenchmarks.Count00LinqMethodX | 1.00 | 829226900.00 | 832690600.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Burgers.Test3 | 2.54 | 1063852600.00 | 418235300.00 | |
| SciMark2.kernel.benchFFT | 1.11 | 767675800.00 | 694001500.00 | |
| SeekUnroll.Test(boxedIndex: 27) | 1.02 | 1679506400.00 | 1649932800.00 | |
| Benchstone.BenchF.LLoops.Test | 1.02 | 566127600.00 | 556443250.00 | |
| Benchstone.BenchF.FFT.Test | 1.02 | 172081000.00 | 169311400.00 | |
| SeekUnroll.Test(boxedIndex: 19) | 1.02 | 1358964600.00 | 1338244400.00 | |

BASE vs DIFF 2

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Benchstone.BenchF.DMath.Test | 1.08 | 651020750.00 | 701478200.00 | |
| SIMD.ConsoleMandel.VectorFloatSinglethreadADTNoInt | 1.06 | 276949100.00 | 293837350.00 | |
| LinqBenchmarks.Order00LinqQueryX | 1.05 | 88074675.00 | 92741887.50 | |
| LinqBenchmarks.Count00ForX | 1.05 | 328557500.00 | 344875100.00 | |
| PerfLabTests.DelegatePerf.MulticastDelegateCombineInvoke | 1.03 | 206379175.00 | 212145600.00 | |
| LinqBenchmarks.Where00ForX | 1.03 | 407904300.00 | 419107750.00 | |
| ByteMark.BenchNumericSortJagged | 1.02 | 1182589800.00 | 1209204000.00 | |
| Benchstone.BenchF.Regula.Test | 1.02 | 238591750.00 | 243943800.00 | |
| ByteMark.BenchFourier | 1.02 | 483880550.00 | 492823600.00 | |
| Benchstone.BenchF.NewtE.Test | 1.01 | 396415000.00 | 402287800.00 | |
| SciMark2.kernel.benchMonteCarlo | 1.01 | 720936900.00 | 731035700.00 | |
| LinqBenchmarks.Where00LinqQueryX | 1.01 | 534897850.00 | 541721300.00 | |
| CscBench.DatflowTest | 1.01 | 375537650.00 | 379996200.00 | |
| ByteMark.BenchEmFloat | 1.01 | 3250239200.00 | 3273381850.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Burgers.Test3 | 2.53 | 1063852600.00 | 420375300.00 | |
| SciMark2.kernel.benchFFT | 1.16 | 767675800.00 | 663166950.00 | |
| Benchstone.BenchI.BubbleSort2.Test | 1.05 | 30900981.25 | 29298325.00 | |
| Benchstone.BenchI.NDhrystone.Test | 1.03 | 356696700.00 | 347896800.00 | |
| BenchmarksGame.MandelBrot_7.Bench(size: 4000, lineLength: 500, checksum: "C7-E6-66-43-66-73-F8-A8-D3 | 1.02 | 117919350.00 | 115333875.00 | |
| Benchstone.BenchF.LLoops.Test | 1.02 | 566127600.00 | 556375000.00 | |
| LinqBenchmarks.Where01LinqQueryX | 1.01 | 320124900.00 | 316781200.00 | |
| SIMD.RayTracerBench.Bench | 1.01 | 470632650.00 | 467229150.00 | |
| SeekUnroll.Test(boxedIndex: 27) | 1.01 | 1679506400.00 | 1669661100.00 | |
| SeekUnroll.Test(boxedIndex: 19) | 1.01 | 1358964600.00 | 1351673500.00 | |
| Benchstone.BenchF.Whetsto.Test | 1.00 | 529907750.00 | 527334600.00 | |

@tannergooding
Member

> This ensures that instruction prefetch pulls in as much code as possible.

For AMD machines, the optimization manuals going back to 2014 (so family 15h and onward) indicate that they have a 32-byte aligned fetch window. This could potentially be beneficial for the Azure Lv2-Series VMs (powered by the EPYC-series processors).
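
As a rough illustration of why the fetch-window size interacts with alignment (hypothetical helper, not coreclr code): the number of aligned windows a hot region touches depends on its start offset, so a poorly placed loop body can cost an extra fetch per iteration.

```csharp
// Illustrative only: count of `window`-byte aligned fetch windows that a
// code region of `size` bytes touches when it starts at `offset`.
static int FetchWindowsTouched(ulong offset, uint size, uint window = 32)
{
    ulong first = offset / window;
    ulong last = (offset + size - 1) / window;
    return (int)(last - first + 1);
}
// A 40-byte loop starting at a 32-byte boundary touches 2 windows;
// the same loop starting at offset 30 touches 3.
```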

@fiigii

fiigii commented Dec 13, 2018

> as well as opening the door for possible loop-top alignment padding.

It would be great to extend this optimization to other branch targets (loops).
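A sketch of the kind of heuristic that could apply here (illustrative only, not coreclr's actual policy): pad a loop head to the next boundary only when the padding is cheap enough to be worth the code-size cost.

```csharp
// Illustrative only: bytes of NOP padding to emit ahead of a loop head,
// capped so we don't burn too much space for a marginal gain.
static int LoopHeadPad(ulong loopHeadOffset, uint alignment = 16, int maxPad = 7)
{
    int pad = (int)((alignment - (loopHeadOffset % alignment)) % alignment);
    return pad <= maxPad ? pad : 0;
}
```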

@adamsitnik
Member

@AndyAyersMS awesome! I am going to clone your fork and re-run the tests on my machines.

@AndreyAkinshin you might also be interested in this change in case you need to update your book before publishing it ;)

@fiigii

fiigii commented Dec 18, 2018

@AndyAyersMS I did a bit of VTune profiling for this change on Burgers, but I did not see such a big perf difference (running the compiled tests directly, without tiered JIT). How can I run the benchmark with ProfileData?
Additionally, I detected that Vector<T>.op_Multiply is not inlined (the function body generates SIMD code). Is it intentional?

That causes a lot of caller-save code in Burgers::GetCalculated3.

Disasm of GetCalculated3 from VTune

```asm
Address Assembly
0x7ffe75d17fd0 Block 1:
0x7ffe75d17fd0 push r15
0x7ffe75d17fd2 push r14
0x7ffe75d17fd4 push r13
0x7ffe75d17fd6 push r12
0x7ffe75d17fd8 push rdi
0x7ffe75d17fd9 push rsi
0x7ffe75d17fda push rbp
0x7ffe75d17fdb push rbx
0x7ffe75d17fdc sub rsp, 0x188
0x7ffe75d17fe3 vzeroupper
0x7ffe75d17fe6 vmovaps xmmword ptr [rsp+0x170], xmm6
0x7ffe75d17fef vmovaps xmmword ptr [rsp+0x160], xmm7
0x7ffe75d17ff8 vmovaps xmmword ptr [rsp+0x150], xmm8
0x7ffe75d18001 vmovaps xmmword ptr [rsp+0x140], xmm9
0x7ffe75d1800a vmovaps xmmword ptr [rsp+0x130], xmm10
0x7ffe75d18013 vmovaps xmmword ptr [rsp+0x120], xmm11
0x7ffe75d1801c vmovaps xmmword ptr [rsp+0x110], xmm12
0x7ffe75d18025 vmovaps xmmword ptr [rsp+0x100], xmm13
0x7ffe75d1802e vmovaps xmmword ptr [rsp+0xf0], xmm14
0x7ffe75d18037 vmovaps xmmword ptr [rsp+0xe0], xmm15
0x7ffe75d18040 mov edi, ecx
0x7ffe75d18042 mov esi, edx
0x7ffe75d18044 vmovaps xmm6, xmm2
0x7ffe75d18048 vmovaps xmm7, xmm3
0x7ffe75d1804c mov rbx, qword ptr [rsp+0x1f8]
0x7ffe75d18054 mov edx, esi
0x7ffe75d18056 sar edx, 0x1f
0x7ffe75d18059 and edx, 0x3
0x7ffe75d1805c add edx, esi
0x7ffe75d1805e and edx, 0xfffffffc
0x7ffe75d18061 mov ecx, esi
0x7ffe75d18063 sub ecx, edx
0x7ffe75d18065 mov edx, ecx
0x7ffe75d18067 neg edx
0x7ffe75d18069 lea ebp, ptr [rdx+rsi*1+0x4]
0x7ffe75d1806d movsxd rdx, ebp
0x7ffe75d18070 mov rcx, 0x7ffed54d3940
0x7ffe75d1807a call 0x7ffed5880900
0x7ffe75d1807f Block 2:
0x7ffe75d1807f mov r14, rax
0x7ffe75d18082 movsxd rdx, ebp
0x7ffe75d18085 mov rcx, 0x7ffed54d3940
0x7ffe75d1808f call 0x7ffed5880900
0x7ffe75d18094 Block 3:
0x7ffe75d18094 mov r15, rax
0x7ffe75d18097 mov r8d, dword ptr [rbx+0x8]
0x7ffe75d1809b mov rcx, rbx
0x7ffe75d1809e mov rdx, r15
0x7ffe75d180a1 call 0x7ffed5144c80
0x7ffe75d180a6 Block 4:
0x7ffe75d180a6 vmovaps xmm0, xmm7
0x7ffe75d180aa vmulsd xmm0, xmm0, qword ptr [rsp+0x1f0]
0x7ffe75d180b3 vdivsd xmm0, xmm0, xmm6
0x7ffe75d180b7 vmovsd xmm1, qword ptr [rip+0x3e9]
0x7ffe75d180bf call 0x7ffed5ae2860
0x7ffe75d180c4 Block 5:
0x7ffe75d180c4 vmovaps xmm8, xmm0
0x7ffe75d180c8 xor ebx, ebx
0x7ffe75d180ca test edi, edi
0x7ffe75d180cc jle 0x7ffe75d183d5 <Block 28>
0x7ffe75d180d2 Block 6:
0x7ffe75d180d2 mov r12d, 0x1
0x7ffe75d180d8 lea r13d, ptr [rbp-0x3]
0x7ffe75d180dc cmp r13d, 0x1
0x7ffe75d180e0 jle 0x7ffe75d182c6 <Block 21>
0x7ffe75d180e6 Block 7:
0x7ffe75d180e6 mov ecx, dword ptr [r15+0x8]
0x7ffe75d180ea mov eax, ecx
0x7ffe75d180ec vmovsd qword ptr [rsp+0x1e0], xmm6
0x7ffe75d180f5 vmovaps xmm9, xmm7
0x7ffe75d180f9 vdivsd xmm9, xmm9, xmm6
0x7ffe75d180fd vmovsd qword ptr [rsp+0xb8], xmm8
0x7ffe75d18106 Block 8:
0x7ffe75d18106 cmp r12d, eax
0x7ffe75d18109 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1810f Block 9:
0x7ffe75d1810f lea ecx, ptr [r12+0x3]
0x7ffe75d18114 cmp ecx, eax
0x7ffe75d18116 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1811c Block 10:
0x7ffe75d1811c vmovupd ymm10, ymmword ptr [r15+r12*8+0x10]
0x7ffe75d18123 lea ecx, ptr [r12-0x1]
0x7ffe75d18128 cmp ecx, eax
0x7ffe75d1812a jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18130 Block 11:
0x7ffe75d18130 lea ecx, ptr [r12+0x2]
0x7ffe75d18135 cmp ecx, eax
0x7ffe75d18137 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1813d Block 12:
0x7ffe75d1813d lea ecx, ptr [r12-0x1]
0x7ffe75d18142 vmovupd ymm11, ymmword ptr [r15+rcx*8+0x10]
0x7ffe75d18149 lea ecx, ptr [r12+0x1]
0x7ffe75d1814e cmp ecx, eax
0x7ffe75d18150 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18156 Block 13:
0x7ffe75d18156 lea ecx, ptr [r12+0x4]
0x7ffe75d1815b mov dword ptr [rsp+0x2c], eax
0x7ffe75d1815f cmp ecx, eax
0x7ffe75d18161 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18167 Block 14:
0x7ffe75d18167 lea ecx, ptr [r12+0x1]
0x7ffe75d1816c vmovupd ymm12, ymmword ptr [r15+rcx*8+0x10]
0x7ffe75d18173 vmovaps ymm13, ymm10
0x7ffe75d18178 lea rcx, ptr [rsp+0x90]
0x7ffe75d18180 vmovupd ymmword ptr [rsp+0x30], ymm10
0x7ffe75d18186 lea rdx, ptr [rsp+0x30]
0x7ffe75d1818b vmovaps xmm2, xmm9
0x7ffe75d18190 vextractf128 xmm14, ymm10, 0x1
0x7ffe75d18196 vextractf128 xmm15, ymm13, 0x1
0x7ffe75d1819c vextractf128 xmm4, ymm11, 0x1
0x7ffe75d181a2 vmovupd xmmword ptr [rsp+0xd0], xmm4
0x7ffe75d181ab vextractf128 xmm4, ymm12, 0x1
0x7ffe75d181b1 vmovupd xmmword ptr [rsp+0xc0], xmm4
0x7ffe75d181ba call 0x7ffe75d11388
0x7ffe75d181bf Block 15:
0x7ffe75d181bf vmovupd xmm4, xmmword ptr [rsp+0xc0]
0x7ffe75d181c8 vinsertf128 ymm12, ymm12, xmm4, 0x1
0x7ffe75d181ce vmovupd xmm4, xmmword ptr [rsp+0xd0]
0x7ffe75d181d7 vinsertf128 ymm11, ymm11, xmm4, 0x1
0x7ffe75d181dd vinsertf128 ymm13, ymm13, xmm15, 0x1
0x7ffe75d181e3 vinsertf128 ymm10, ymm10, xmm14, 0x1
0x7ffe75d181e9 vsubpd ymm1, ymm10, ymm11
0x7ffe75d181ee vmovupd ymm0, ymmword ptr [rsp+0x90]
0x7ffe75d181f7 vmulpd ymm0, ymm0, ymm1
0x7ffe75d181fb vsubpd ymm6, ymm13, ymm0
0x7ffe75d181ff lea rcx, ptr [rsp+0x70]
0x7ffe75d18204 vmovupd ymmword ptr [rsp+0x30], ymm10
0x7ffe75d1820a vmovsd xmm1, qword ptr [rip+0x29e]
0x7ffe75d18212 lea r8, ptr [rsp+0x30]
0x7ffe75d18217 vextractf128 xmm8, ymm6, 0x1
0x7ffe75d1821d vextractf128 xmm10, ymm11, 0x1
0x7ffe75d18223 vextractf128 xmm13, ymm12, 0x1
0x7ffe75d18229 call 0x7ffe75d11390
0x7ffe75d1822e Block 16:
0x7ffe75d1822e vinsertf128 ymm12, ymm12, xmm13, 0x1
0x7ffe75d18234 vinsertf128 ymm11, ymm11, xmm10, 0x1
0x7ffe75d1823a vinsertf128 ymm6, ymm6, xmm8, 0x1
0x7ffe75d18240 lea rcx, ptr [rsp+0x50]
0x7ffe75d18245 vmovupd ymm1, ymmword ptr [rsp+0x70]
0x7ffe75d1824b vsubpd ymm12, ymm12, ymm1
0x7ffe75d1824f vaddpd ymm12, ymm12, ymm11
0x7ffe75d18254 vmovupd ymmword ptr [rsp+0x30], ymm12
0x7ffe75d1825a vmovsd xmm1, qword ptr [rsp+0xb8]
0x7ffe75d18263 lea r8, ptr [rsp+0x30]
0x7ffe75d18268 vextractf128 xmm8, ymm6, 0x1
0x7ffe75d1826e call 0x7ffe75d11390
0x7ffe75d18273 Block 17:
0x7ffe75d18273 vinsertf128 ymm6, ymm6, xmm8, 0x1
0x7ffe75d18279 vmovupd ymm0, ymmword ptr [rsp+0x50]
0x7ffe75d1827f vaddpd ymm0, ymm6, ymm0
0x7ffe75d18283 cmp r12d, dword ptr [r14+0x8]
0x7ffe75d18287 jnb 0x7ffe75d1844e <Block 30>
0x7ffe75d1828d Block 18:
0x7ffe75d1828d lea eax, ptr [r12+0x3]
0x7ffe75d18292 cmp eax, dword ptr [r14+0x8]
0x7ffe75d18296 jnb 0x7ffe75d18453 <Block 31>
0x7ffe75d1829c Block 19:
0x7ffe75d1829c vmovupd ymmword ptr [r14+r12*8+0x10], ymm0
0x7ffe75d182a3 add r12d, 0x4
0x7ffe75d182a7 cmp r12d, r13d
0x7ffe75d182aa mov eax, dword ptr [rsp+0x2c]
0x7ffe75d182ae jl 0x7ffe75d18106 <Block 8>
0x7ffe75d182b4 Block 20:
0x7ffe75d182b4 vmovsd xmm6, qword ptr [rsp+0x1e0]
0x7ffe75d182bd vmovsd xmm8, qword ptr [rsp+0xb8]
0x7ffe75d182c6 Block 21:
0x7ffe75d182c6 mov r13d, dword ptr [r15+0x8]
0x7ffe75d182ca cmp r13d, 0x0
0x7ffe75d182ce jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d182d4 Block 22:
0x7ffe75d182d4 vmovsd xmm2, qword ptr [r15+0x10]
0x7ffe75d182da vmovaps xmm0, xmm2
0x7ffe75d182de vmulsd xmm0, xmm0, xmm7
0x7ffe75d182e2 vdivsd xmm0, xmm0, xmm6
0x7ffe75d182e6 lea eax, ptr [rsi-0x1]
0x7ffe75d182e9 cmp eax, r13d
0x7ffe75d182ec jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d182f2 Block 23:
0x7ffe75d182f2 movsxd rcx, eax
0x7ffe75d182f5 vmovsd xmm1, qword ptr [r15+rcx*8+0x10]
0x7ffe75d182fc vmovaps xmm3, xmm2
0x7ffe75d18300 vsubsd xmm3, xmm3, xmm1
0x7ffe75d18304 vmulsd xmm0, xmm0, xmm3
0x7ffe75d18308 vmovaps xmm3, xmm2
0x7ffe75d1830c vsubsd xmm3, xmm3, xmm0
0x7ffe75d18310 cmp r13d, 0x1
0x7ffe75d18314 jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d1831a Block 24:
0x7ffe75d1831a vmovsd xmm0, qword ptr [r15+0x18]
0x7ffe75d18320 vmulsd xmm2, xmm2, qword ptr [rip+0x190]
0x7ffe75d18328 vsubsd xmm0, xmm0, xmm2
0x7ffe75d1832c vaddsd xmm0, xmm0, xmm1
0x7ffe75d18330 vmulsd xmm0, xmm0, xmm8
0x7ffe75d18335 vmovaps xmm2, xmm3
0x7ffe75d18339 vaddsd xmm2, xmm2, xmm0
0x7ffe75d1833d cmp dword ptr [r14+0x8], 0x0
0x7ffe75d18342 jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d18348 Block 25:
0x7ffe75d18348 vmovsd qword ptr [r14+0x10], xmm2
0x7ffe75d1834e vmovsd xmm2, qword ptr [r15+rcx*8+0x10]
0x7ffe75d18355 vmovaps xmm0, xmm2
0x7ffe75d18359 vmulsd xmm0, xmm0, xmm7
0x7ffe75d1835d vdivsd xmm0, xmm0, xmm6
0x7ffe75d18361 lea edx, ptr [rsi-0x2]
0x7ffe75d18364 cmp edx, r13d
0x7ffe75d18367 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1836d Block 26:
0x7ffe75d1836d lea edx, ptr [rsi-0x2]
0x7ffe75d18370 movsxd rdx, edx
0x7ffe75d18373 vmovsd xmm1, qword ptr [r15+rdx*8+0x10]
0x7ffe75d1837a vmovaps xmm3, xmm2
0x7ffe75d1837e vsubsd xmm3, xmm3, xmm1
0x7ffe75d18382 vmulsd xmm0, xmm0, xmm3
0x7ffe75d18386 vmovaps xmm3, xmm2
0x7ffe75d1838a vsubsd xmm3, xmm3, xmm0
0x7ffe75d1838e vmovsd xmm0, qword ptr [r15+0x10]
0x7ffe75d18394 vmulsd xmm2, xmm2, qword ptr [rip+0x124]
0x7ffe75d1839c vsubsd xmm0, xmm0, xmm2
0x7ffe75d183a0 vaddsd xmm0, xmm0, xmm1
0x7ffe75d183a4 vmulsd xmm0, xmm0, xmm8
0x7ffe75d183a9 vmovaps xmm2, xmm3
0x7ffe75d183ad vaddsd xmm2, xmm2, xmm0
0x7ffe75d183b1 cmp eax, dword ptr [r14+0x8]
0x7ffe75d183b5 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d183bb Block 27:
0x7ffe75d183bb vmovsd qword ptr [r14+rcx*8+0x10], xmm2
0x7ffe75d183c2 mov r12, r15
0x7ffe75d183c5 mov r15, r14
0x7ffe75d183c8 inc ebx
0x7ffe75d183ca cmp ebx, edi
0x7ffe75d183cc mov r14, r12
0x7ffe75d183cf jl 0x7ffe75d180d2 <Block 6>
0x7ffe75d183d5 Block 28:
0x7ffe75d183d5 mov rax, r15
0x7ffe75d183d8 vmovaps xmm6, xmmword ptr [rsp+0x170]
0x7ffe75d183e1 vmovaps xmm7, xmmword ptr [rsp+0x160]
0x7ffe75d183ea vmovaps xmm8, xmmword ptr [rsp+0x150]
0x7ffe75d183f3 vmovaps xmm9, xmmword ptr [rsp+0x140]
0x7ffe75d183fc vmovaps xmm10, xmmword ptr [rsp+0x130]
0x7ffe75d18405 vmovaps xmm11, xmmword ptr [rsp+0x120]
0x7ffe75d1840e vmovaps xmm12, xmmword ptr [rsp+0x110]
0x7ffe75d18417 vmovaps xmm13, xmmword ptr [rsp+0x100]
0x7ffe75d18420 vmovaps xmm14, xmmword ptr [rsp+0xf0]
0x7ffe75d18429 vmovaps xmm15, xmmword ptr [rsp+0xe0]
0x7ffe75d18432 vzeroupper
0x7ffe75d18435 add rsp, 0x188
0x7ffe75d1843c pop rbx
0x7ffe75d1843d pop rbp
0x7ffe75d1843e pop rsi
0x7ffe75d1843f pop rdi
0x7ffe75d18440 pop r12
0x7ffe75d18442 pop r13
0x7ffe75d18444 pop r14
0x7ffe75d18446 pop r15
0x7ffe75d18448 ret
0x7ffe75d18449 Block 29:
0x7ffe75d18449 call 0x7ffed59bd450
0x7ffe75d1844e Block 30:
0x7ffe75d1844e call 0x7ffed59bddf0
0x7ffe75d18453 Block 31:
0x7ffe75d18453 call 0x7ffed59bdd50
0x7ffe75d18458 Block 32:
0x7ffe75d18458 int3
```

@fiigii

fiigii commented Dec 18, 2018

> Additionally, I detected that `Vector<T>.op_Multiply` is not inlined (the function body generates SIMD code). Is it intentional?

Ah, `Vector<T>.op_Multiply` is not an intrinsic for the `(T factor, Vector<T> value)` overload, so we probably need an AggressiveInlining for it.

```
Method op_Multiply is NOT a SIMD intrinsic
  Known type SIMD Vector<Double>
Calling impNormStructVal on:
               [000174] ------------              *  LCL_VAR   simd32 V13 loc7
  Known type SIMD Vector<Double>
resulting tree:
               [000177] n-----------              *  OBJ(32)   simd32
               [000176] L-----------              \--*  ADDR      byref
               [000174] ------------                 \--*  LCL_VAR   simd32 V13 loc7
INLINER: during 'impMarkInlineCandidate' result 'failed this callee' reason 'too many il bytes' for 'Burgers:GetCalculated3(int,int,double,double,double,ref):ref' calling 'Vector`1:op_Multiply(double,struct):struct'

INLINER: Marking Vector`1:op_Multiply(double,struct):struct as NOINLINE because of too many il bytes
INLINER: during 'impMarkInlineCandidate' result 'failed this callee' reason 'too many il bytes'
```

```csharp
// This method is intrinsic only for certain types. It cannot access fields directly unless we are sure the context is unaccelerated.
/// <summary>
/// Multiplies a vector by the given scalar.
/// </summary>
/// <param name="factor">The scalar value.</param>
/// <param name="value">The source vector.</param>
/// <returns>The scaled vector.</returns>
public static Vector<T> operator *(T factor, Vector<T> value)
```
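
A sketch of the suggested fix (assumed shape, not the actual corefx change) would mark that overload for aggressive inlining, for example:

```csharp
using System.Runtime.CompilerServices;

// Sketch only: inside Vector<T>, bypass the inliner's IL-size heuristic
// for the scalar * vector overload. The body broadcasts the scalar and
// reuses the vector * vector overload, which is recognized as a SIMD
// intrinsic.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector<T> operator *(T factor, Vector<T> value)
{
    return new Vector<T>(factor) * value;
}
```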

@tannergooding
Member

> Ah, `Vector<T>.op_Multiply` is not an intrinsic for the `(T factor, Vector<T> value)` overload, so we probably need an AggressiveInlining for it.

@fiigii, it's worth noting that, as of right now, the types in System.Numerics.Vectors do not actually respect the [Intrinsic] attribute:

  • The types not in CoreLib, which is everything but Vector<T>, aren't even considered for SetIsJitIntrinsic today
  • For types in CoreLib (so Vector and Vector<T>), impSIMDIntrinsic doesn't do any checks for CORINFO_FLG_JIT_INTRINSIC and instead just relies on clsHnd being a SIMD class and the method name/parameter types matching up to one of the entries in the SIMD intrinsic list

I actually hit this the other day when doing some investigation on https://github.com/dotnet/corefx/issues/31425. The fix to get the SIMD intrinsics to start respecting the [Intrinsic] flag is fairly trivial (and I have it done locally for both the SIMD assembly and CoreLib), but this causes some assembly diffs due to certain methods being treated as intrinsic implicitly.

  • I will put up a PR after I ensure no codegen diffs (which just requires marking some methods explicitly as intrinsic)
  • There are also some methods marked as Intrinsic which will never be handled, since we don't have entries in the intrinsic list. This should probably be looked at as well

@fiigii

fiigii commented Dec 18, 2018

@tannergooding Thanks for the note. But this problem is not related to "intrinsic"; let's discuss it in dotnet/corefx#31425 😄

@AndyAyersMS
Member Author

It turns out for x64 we're already doing 16 byte alignment of methods. The jit requests default alignment from the runtime, and so the alignment is ultimately governed by the definition of CODE_SIZE_ALIGN which is found in vm\amd64\cgencpu.h, and that is 16.
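
A quick, hedged way to spot-check this from C# (note that GetFunctionPointer may return a precode stub rather than the jitted body, so treat the result as approximate):

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class AlignmentCheck
{
    static void Payload() { }

    static void Main()
    {
        // Force Payload to be jitted, then inspect where its code landed.
        MethodInfo m = typeof(AlignmentCheck).GetMethod(
            nameof(Payload), BindingFlags.NonPublic | BindingFlags.Static);
        RuntimeHelpers.PrepareMethod(m.MethodHandle);
        long code = m.MethodHandle.GetFunctionPointer().ToInt64();
        Console.WriteLine($"0x{code:x} mod 16 = {code % 16}");
    }
}
```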

So the impacts seen above are possibly just normal variation given 16 byte alignment -- note that there is some known perf impact on Intel processors from even stronger alignments, say 32 bytes, which may explain some of the above variability.

We could of course consider asking for stronger alignments but it may make sense to look at adjusting internal alignments first to take advantage of the 16 byte alignments we already have.

So I am going to close this one out.

@AndyAyersMS AndyAyersMS closed this Jan 5, 2019