
JIT: align Tier1 methods at 16 byte boundaries for xarch [WIP] #21518

Closed

Conversation

AndyAyersMS
Member

Align Tier1, small and IBC hot methods to 16 byte boundaries for x64 and x86.
Consensus from various folks I polled was that this isn't as helpful for arm
architectures, so for now this is xarch only.

This ensures that instruction prefetch pulls in as much code as possible.

It should also improve performance stability in some benchmarks, and it opens
the door to possible loop-top alignment padding.

Resolves #16873.
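
For illustration, a minimal sketch of the padding arithmetic this implies (C#, with made-up names; the real change lives in the jit and the VM's code allocator):

```csharp
// Illustrative only -- not coreclr code. Pad bytes needed so a method
// body starting at `address` begins on a 16 byte boundary.
static int PadToAlignment(ulong address, uint alignment = 16)
{
    uint rem = (uint)(address % alignment);
    return rem == 0 ? 0 : (int)(alignment - rem);
}
// e.g. PadToAlignment(0x7ffe75d17fd2) == 14
```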

@AndyAyersMS
Member Author

Keeping this as [WIP] for now: I'd like to see some of the other R2R and tiering perf changes land first, evaluation of this change may be tricky, and nothing depends on it. I'll also be on vacation a fair amount in the coming weeks.

I don't expect 16 byte alignment to universally improve perf, since whatever method alignment we get now on a given run may happen to work out better. But over time this alignment should be more perf-stable, and it also gives the jit the opportunity to pad internally to avoid bad instruction-fetch behavior at hot loops, eventually ending up in an overall better place.

I plan to make multiple measurements on different machines and over time, to try to show that results are indeed more stable, and (for benchmarks that consistently stand out as better or worse) to drill in and understand why.

I don't think this will cause much extra code space fragmentation -- though aligning small methods may have a notable impact. So another goal is to actually measure the size impact, especially on realistic apps.

Using the performance repo, I have done a couple of base and diff runs of local builds on a Skylake i7-6700 and compared them. Data below is filtered to an absolute difference of at least 1ms, as results for shorter-running tests seem to swing wildly. @adamsitnik if you want to experiment with this change and see why there's such volatility in the shorter tests, I'd love to see any insights you uncover.

cc @dotnet/jit-contrib @fiigii

BASE vs DIFF 1

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| LinqBenchmarks.Order00ManualX | 1.13 | 128679250.00 | 145741725.00 | |
| SciMark2.kernel.benchMonteCarlo | 1.06 | 720936900.00 | 761832900.00 | can have several modes |
| LinqBenchmarks.Where00ForX | 1.04 | 407904300.00 | 424595200.00 | |
| BenchmarksGame.BinaryTrees_2.RunBench | 1.04 | 100067050.00 | 103815100.00 | |
| ByteMark.BenchNeural | 1.03 | 684599100.00 | 706311100.00 | |
| CscBench.DatflowTest | 1.03 | 375537650.00 | 386855100.00 | |
| Benchstone.BenchF.NewtE.Test | 1.02 | 396415000.00 | 405177150.00 | bimodal |
| Benchstone.BenchF.Regula.Test | 1.02 | 238591750.00 | 243644100.00 | |
| LinqBenchmarks.Count00ForX | 1.02 | 328557500.00 | 334972000.00 | |
| LinqBenchmarks.Where01LinqMethodNestedX | 1.01 | 367130500.00 | 370483800.00 | |
| LinqBenchmarks.Count00LinqMethodX | 1.00 | 829226900.00 | 832690600.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Burgers.Test3 | 2.54 | 1063852600.00 | 418235300.00 | |
| SciMark2.kernel.benchFFT | 1.11 | 767675800.00 | 694001500.00 | |
| SeekUnroll.Test(boxedIndex: 27) | 1.02 | 1679506400.00 | 1649932800.00 | |
| Benchstone.BenchF.LLoops.Test | 1.02 | 566127600.00 | 556443250.00 | |
| Benchstone.BenchF.FFT.Test | 1.02 | 172081000.00 | 169311400.00 | |
| SeekUnroll.Test(boxedIndex: 19) | 1.02 | 1358964600.00 | 1338244400.00 | |

BASE vs DIFF 2

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Benchstone.BenchF.DMath.Test | 1.08 | 651020750.00 | 701478200.00 | |
| SIMD.ConsoleMandel.VectorFloatSinglethreadADTNoInt | 1.06 | 276949100.00 | 293837350.00 | |
| LinqBenchmarks.Order00LinqQueryX | 1.05 | 88074675.00 | 92741887.50 | |
| LinqBenchmarks.Count00ForX | 1.05 | 328557500.00 | 344875100.00 | |
| PerfLabTests.DelegatePerf.MulticastDelegateCombineInvoke | 1.03 | 206379175.00 | 212145600.00 | |
| LinqBenchmarks.Where00ForX | 1.03 | 407904300.00 | 419107750.00 | |
| ByteMark.BenchNumericSortJagged | 1.02 | 1182589800.00 | 1209204000.00 | |
| Benchstone.BenchF.Regula.Test | 1.02 | 238591750.00 | 243943800.00 | |
| ByteMark.BenchFourier | 1.02 | 483880550.00 | 492823600.00 | |
| Benchstone.BenchF.NewtE.Test | 1.01 | 396415000.00 | 402287800.00 | |
| SciMark2.kernel.benchMonteCarlo | 1.01 | 720936900.00 | 731035700.00 | |
| LinqBenchmarks.Where00LinqQueryX | 1.01 | 534897850.00 | 541721300.00 | |
| CscBench.DatflowTest | 1.01 | 375537650.00 | 379996200.00 | |
| ByteMark.BenchEmFloat | 1.01 | 3250239200.00 | 3273381850.00 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | --- | --- | --- | --- |
| Burgers.Test3 | 2.53 | 1063852600.00 | 420375300.00 | |
| SciMark2.kernel.benchFFT | 1.16 | 767675800.00 | 663166950.00 | |
| Benchstone.BenchI.BubbleSort2.Test | 1.05 | 30900981.25 | 29298325.00 | |
| Benchstone.BenchI.NDhrystone.Test | 1.03 | 356696700.00 | 347896800.00 | |
| BenchmarksGame.MandelBrot_7.Bench(size: 4000, lineLength: 500, checksum: "C7-E6-66-43-66-73-F8-A8-D3 | 1.02 | 117919350.00 | 115333875.00 | |
| Benchstone.BenchF.LLoops.Test | 1.02 | 566127600.00 | 556375000.00 | |
| LinqBenchmarks.Where01LinqQueryX | 1.01 | 320124900.00 | 316781200.00 | |
| SIMD.RayTracerBench.Bench | 1.01 | 470632650.00 | 467229150.00 | |
| SeekUnroll.Test(boxedIndex: 27) | 1.01 | 1679506400.00 | 1669661100.00 | |
| SeekUnroll.Test(boxedIndex: 19) | 1.01 | 1358964600.00 | 1351673500.00 | |
| Benchstone.BenchF.Whetsto.Test | 1.00 | 529907750.00 | 527334600.00 | |

@tannergooding
Member

> This ensures that instruction prefetch pulls in as much code as possible.

For AMD machines, the optimization manuals going back to 2014 (so family 15h and onward) indicate that they have a 32-byte aligned fetch window. This could potentially be beneficial for the Azure Lv2-Series VMs (powered by the EPYC-series processors).
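
As a rough illustration of why the fetch-window size interacts with alignment (hypothetical helper, not coreclr code): the number of aligned windows a hot region touches depends on its start offset, so a poorly placed loop body can cost an extra fetch per iteration.

```csharp
// Illustrative only: count of `window`-byte aligned fetch windows that a
// code region of `size` bytes touches when it starts at `offset`.
static int FetchWindowsTouched(ulong offset, uint size, uint window = 32)
{
    ulong first = offset / window;
    ulong last = (offset + size - 1) / window;
    return (int)(last - first + 1);
}
// A 40-byte loop starting at a 32-byte boundary touches 2 windows;
// the same loop starting at offset 30 touches 3.
```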

@fiigii

fiigii commented Dec 13, 2018

> as well as opening the door for possible loop-top alignment padding.

It would be great to extend this optimization to other branch targets (loops).
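A sketch of the kind of heuristic that could apply here (illustrative only, not coreclr's actual policy): pad a loop head to the next boundary only when the padding is cheap enough to be worth the code-size cost.

```csharp
// Illustrative only: bytes of NOP padding to emit ahead of a loop head,
// capped so we don't burn too much space for a marginal gain.
static int LoopHeadPad(ulong loopHeadOffset, uint alignment = 16, int maxPad = 7)
{
    int pad = (int)((alignment - (loopHeadOffset % alignment)) % alignment);
    return pad <= maxPad ? pad : 0;
}
```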

@adamsitnik
Member

@AndyAyersMS awesome! I am going to clone your fork and re-run the tests on my machines.

@AndreyAkinshin you might also be interested in this change in case you need to update your book before publishing it ;)

@fiigii

fiigii commented Dec 18, 2018

@AndyAyersMS I did a bit of VTune profiling for this change on Burgers, but I did not see such a big perf difference (running the compiled tests directly, without tiered JIT). How can I run the benchmark with ProfileData?
Additionally, I detected that Vector<T>.op_Multiply is not inlined (the function body generates SIMD code). Is it intentional?

That causes a lot of caller-save code in Burgers::GetCalculated3.

Disasm of GetCalculated3 from VTune

```asm
Address Assembly
0x7ffe75d17fd0 Block 1:
0x7ffe75d17fd0 push r15
0x7ffe75d17fd2 push r14
0x7ffe75d17fd4 push r13
0x7ffe75d17fd6 push r12
0x7ffe75d17fd8 push rdi
0x7ffe75d17fd9 push rsi
0x7ffe75d17fda push rbp
0x7ffe75d17fdb push rbx
0x7ffe75d17fdc sub rsp, 0x188
0x7ffe75d17fe3 vzeroupper
0x7ffe75d17fe6 vmovaps xmmword ptr [rsp+0x170], xmm6
0x7ffe75d17fef vmovaps xmmword ptr [rsp+0x160], xmm7
0x7ffe75d17ff8 vmovaps xmmword ptr [rsp+0x150], xmm8
0x7ffe75d18001 vmovaps xmmword ptr [rsp+0x140], xmm9
0x7ffe75d1800a vmovaps xmmword ptr [rsp+0x130], xmm10
0x7ffe75d18013 vmovaps xmmword ptr [rsp+0x120], xmm11
0x7ffe75d1801c vmovaps xmmword ptr [rsp+0x110], xmm12
0x7ffe75d18025 vmovaps xmmword ptr [rsp+0x100], xmm13
0x7ffe75d1802e vmovaps xmmword ptr [rsp+0xf0], xmm14
0x7ffe75d18037 vmovaps xmmword ptr [rsp+0xe0], xmm15
0x7ffe75d18040 mov edi, ecx
0x7ffe75d18042 mov esi, edx
0x7ffe75d18044 vmovaps xmm6, xmm2
0x7ffe75d18048 vmovaps xmm7, xmm3
0x7ffe75d1804c mov rbx, qword ptr [rsp+0x1f8]
0x7ffe75d18054 mov edx, esi
0x7ffe75d18056 sar edx, 0x1f
0x7ffe75d18059 and edx, 0x3
0x7ffe75d1805c add edx, esi
0x7ffe75d1805e and edx, 0xfffffffc
0x7ffe75d18061 mov ecx, esi
0x7ffe75d18063 sub ecx, edx
0x7ffe75d18065 mov edx, ecx
0x7ffe75d18067 neg edx
0x7ffe75d18069 lea ebp, ptr [rdx+rsi*1+0x4]
0x7ffe75d1806d movsxd rdx, ebp
0x7ffe75d18070 mov rcx, 0x7ffed54d3940
0x7ffe75d1807a call 0x7ffed5880900
0x7ffe75d1807f Block 2:
0x7ffe75d1807f mov r14, rax
0x7ffe75d18082 movsxd rdx, ebp
0x7ffe75d18085 mov rcx, 0x7ffed54d3940
0x7ffe75d1808f call 0x7ffed5880900
0x7ffe75d18094 Block 3:
0x7ffe75d18094 mov r15, rax
0x7ffe75d18097 mov r8d, dword ptr [rbx+0x8]
0x7ffe75d1809b mov rcx, rbx
0x7ffe75d1809e mov rdx, r15
0x7ffe75d180a1 call 0x7ffed5144c80
0x7ffe75d180a6 Block 4:
0x7ffe75d180a6 vmovaps xmm0, xmm7
0x7ffe75d180aa vmulsd xmm0, xmm0, qword ptr [rsp+0x1f0]
0x7ffe75d180b3 vdivsd xmm0, xmm0, xmm6
0x7ffe75d180b7 vmovsd xmm1, qword ptr [rip+0x3e9]
0x7ffe75d180bf call 0x7ffed5ae2860
0x7ffe75d180c4 Block 5:
0x7ffe75d180c4 vmovaps xmm8, xmm0
0x7ffe75d180c8 xor ebx, ebx
0x7ffe75d180ca test edi, edi
0x7ffe75d180cc jle 0x7ffe75d183d5 <Block 28>
0x7ffe75d180d2 Block 6:
0x7ffe75d180d2 mov r12d, 0x1
0x7ffe75d180d8 lea r13d, ptr [rbp-0x3]
0x7ffe75d180dc cmp r13d, 0x1
0x7ffe75d180e0 jle 0x7ffe75d182c6 <Block 21>
0x7ffe75d180e6 Block 7:
0x7ffe75d180e6 mov ecx, dword ptr [r15+0x8]
0x7ffe75d180ea mov eax, ecx
0x7ffe75d180ec vmovsd qword ptr [rsp+0x1e0], xmm6
0x7ffe75d180f5 vmovaps xmm9, xmm7
0x7ffe75d180f9 vdivsd xmm9, xmm9, xmm6
0x7ffe75d180fd vmovsd qword ptr [rsp+0xb8], xmm8
0x7ffe75d18106 Block 8:
0x7ffe75d18106 cmp r12d, eax
0x7ffe75d18109 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1810f Block 9:
0x7ffe75d1810f lea ecx, ptr [r12+0x3]
0x7ffe75d18114 cmp ecx, eax
0x7ffe75d18116 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1811c Block 10:
0x7ffe75d1811c vmovupd ymm10, ymmword ptr [r15+r12*8+0x10]
0x7ffe75d18123 lea ecx, ptr [r12-0x1]
0x7ffe75d18128 cmp ecx, eax
0x7ffe75d1812a jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18130 Block 11:
0x7ffe75d18130 lea ecx, ptr [r12+0x2]
0x7ffe75d18135 cmp ecx, eax
0x7ffe75d18137 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1813d Block 12:
0x7ffe75d1813d lea ecx, ptr [r12-0x1]
0x7ffe75d18142 vmovupd ymm11, ymmword ptr [r15+rcx*8+0x10]
0x7ffe75d18149 lea ecx, ptr [r12+0x1]
0x7ffe75d1814e cmp ecx, eax
0x7ffe75d18150 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18156 Block 13:
0x7ffe75d18156 lea ecx, ptr [r12+0x4]
0x7ffe75d1815b mov dword ptr [rsp+0x2c], eax
0x7ffe75d1815f cmp ecx, eax
0x7ffe75d18161 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d18167 Block 14:
0x7ffe75d18167 lea ecx, ptr [r12+0x1]
0x7ffe75d1816c vmovupd ymm12, ymmword ptr [r15+rcx*8+0x10]
0x7ffe75d18173 vmovaps ymm13, ymm10
0x7ffe75d18178 lea rcx, ptr [rsp+0x90]
0x7ffe75d18180 vmovupd ymmword ptr [rsp+0x30], ymm10
0x7ffe75d18186 lea rdx, ptr [rsp+0x30]
0x7ffe75d1818b vmovaps xmm2, xmm9
0x7ffe75d18190 vextractf128 xmm14, ymm10, 0x1
0x7ffe75d18196 vextractf128 xmm15, ymm13, 0x1
0x7ffe75d1819c vextractf128 xmm4, ymm11, 0x1
0x7ffe75d181a2 vmovupd xmmword ptr [rsp+0xd0], xmm4
0x7ffe75d181ab vextractf128 xmm4, ymm12, 0x1
0x7ffe75d181b1 vmovupd xmmword ptr [rsp+0xc0], xmm4
0x7ffe75d181ba call 0x7ffe75d11388
0x7ffe75d181bf Block 15:
0x7ffe75d181bf vmovupd xmm4, xmmword ptr [rsp+0xc0]
0x7ffe75d181c8 vinsertf128 ymm12, ymm12, xmm4, 0x1
0x7ffe75d181ce vmovupd xmm4, xmmword ptr [rsp+0xd0]
0x7ffe75d181d7 vinsertf128 ymm11, ymm11, xmm4, 0x1
0x7ffe75d181dd vinsertf128 ymm13, ymm13, xmm15, 0x1
0x7ffe75d181e3 vinsertf128 ymm10, ymm10, xmm14, 0x1
0x7ffe75d181e9 vsubpd ymm1, ymm10, ymm11
0x7ffe75d181ee vmovupd ymm0, ymmword ptr [rsp+0x90]
0x7ffe75d181f7 vmulpd ymm0, ymm0, ymm1
0x7ffe75d181fb vsubpd ymm6, ymm13, ymm0
0x7ffe75d181ff lea rcx, ptr [rsp+0x70]
0x7ffe75d18204 vmovupd ymmword ptr [rsp+0x30], ymm10
0x7ffe75d1820a vmovsd xmm1, qword ptr [rip+0x29e]
0x7ffe75d18212 lea r8, ptr [rsp+0x30]
0x7ffe75d18217 vextractf128 xmm8, ymm6, 0x1
0x7ffe75d1821d vextractf128 xmm10, ymm11, 0x1
0x7ffe75d18223 vextractf128 xmm13, ymm12, 0x1
0x7ffe75d18229 call 0x7ffe75d11390
0x7ffe75d1822e Block 16:
0x7ffe75d1822e vinsertf128 ymm12, ymm12, xmm13, 0x1
0x7ffe75d18234 vinsertf128 ymm11, ymm11, xmm10, 0x1
0x7ffe75d1823a vinsertf128 ymm6, ymm6, xmm8, 0x1
0x7ffe75d18240 lea rcx, ptr [rsp+0x50]
0x7ffe75d18245 vmovupd ymm1, ymmword ptr [rsp+0x70]
0x7ffe75d1824b vsubpd ymm12, ymm12, ymm1
0x7ffe75d1824f vaddpd ymm12, ymm12, ymm11
0x7ffe75d18254 vmovupd ymmword ptr [rsp+0x30], ymm12
0x7ffe75d1825a vmovsd xmm1, qword ptr [rsp+0xb8]
0x7ffe75d18263 lea r8, ptr [rsp+0x30]
0x7ffe75d18268 vextractf128 xmm8, ymm6, 0x1
0x7ffe75d1826e call 0x7ffe75d11390
0x7ffe75d18273 Block 17:
0x7ffe75d18273 vinsertf128 ymm6, ymm6, xmm8, 0x1
0x7ffe75d18279 vmovupd ymm0, ymmword ptr [rsp+0x50]
0x7ffe75d1827f vaddpd ymm0, ymm6, ymm0
0x7ffe75d18283 cmp r12d, dword ptr [r14+0x8]
0x7ffe75d18287 jnb 0x7ffe75d1844e <Block 30>
0x7ffe75d1828d Block 18:
0x7ffe75d1828d lea eax, ptr [r12+0x3]
0x7ffe75d18292 cmp eax, dword ptr [r14+0x8]
0x7ffe75d18296 jnb 0x7ffe75d18453 <Block 31>
0x7ffe75d1829c Block 19:
0x7ffe75d1829c vmovupd ymmword ptr [r14+r12*8+0x10], ymm0
0x7ffe75d182a3 add r12d, 0x4
0x7ffe75d182a7 cmp r12d, r13d
0x7ffe75d182aa mov eax, dword ptr [rsp+0x2c]
0x7ffe75d182ae jl 0x7ffe75d18106 <Block 8>
0x7ffe75d182b4 Block 20:
0x7ffe75d182b4 vmovsd xmm6, qword ptr [rsp+0x1e0]
0x7ffe75d182bd vmovsd xmm8, qword ptr [rsp+0xb8]
0x7ffe75d182c6 Block 21:
0x7ffe75d182c6 mov r13d, dword ptr [r15+0x8]
0x7ffe75d182ca cmp r13d, 0x0
0x7ffe75d182ce jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d182d4 Block 22:
0x7ffe75d182d4 vmovsd xmm2, qword ptr [r15+0x10]
0x7ffe75d182da vmovaps xmm0, xmm2
0x7ffe75d182de vmulsd xmm0, xmm0, xmm7
0x7ffe75d182e2 vdivsd xmm0, xmm0, xmm6
0x7ffe75d182e6 lea eax, ptr [rsi-0x1]
0x7ffe75d182e9 cmp eax, r13d
0x7ffe75d182ec jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d182f2 Block 23:
0x7ffe75d182f2 movsxd rcx, eax
0x7ffe75d182f5 vmovsd xmm1, qword ptr [r15+rcx*8+0x10]
0x7ffe75d182fc vmovaps xmm3, xmm2
0x7ffe75d18300 vsubsd xmm3, xmm3, xmm1
0x7ffe75d18304 vmulsd xmm0, xmm0, xmm3
0x7ffe75d18308 vmovaps xmm3, xmm2
0x7ffe75d1830c vsubsd xmm3, xmm3, xmm0
0x7ffe75d18310 cmp r13d, 0x1
0x7ffe75d18314 jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d1831a Block 24:
0x7ffe75d1831a vmovsd xmm0, qword ptr [r15+0x18]
0x7ffe75d18320 vmulsd xmm2, xmm2, qword ptr [rip+0x190]
0x7ffe75d18328 vsubsd xmm0, xmm0, xmm2
0x7ffe75d1832c vaddsd xmm0, xmm0, xmm1
0x7ffe75d18330 vmulsd xmm0, xmm0, xmm8
0x7ffe75d18335 vmovaps xmm2, xmm3
0x7ffe75d18339 vaddsd xmm2, xmm2, xmm0
0x7ffe75d1833d cmp dword ptr [r14+0x8], 0x0
0x7ffe75d18342 jbe 0x7ffe75d18449 <Block 29>
0x7ffe75d18348 Block 25:
0x7ffe75d18348 vmovsd qword ptr [r14+0x10], xmm2
0x7ffe75d1834e vmovsd xmm2, qword ptr [r15+rcx*8+0x10]
0x7ffe75d18355 vmovaps xmm0, xmm2
0x7ffe75d18359 vmulsd xmm0, xmm0, xmm7
0x7ffe75d1835d vdivsd xmm0, xmm0, xmm6
0x7ffe75d18361 lea edx, ptr [rsi-0x2]
0x7ffe75d18364 cmp edx, r13d
0x7ffe75d18367 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d1836d Block 26:
0x7ffe75d1836d lea edx, ptr [rsi-0x2]
0x7ffe75d18370 movsxd rdx, edx
0x7ffe75d18373 vmovsd xmm1, qword ptr [r15+rdx*8+0x10]
0x7ffe75d1837a vmovaps xmm3, xmm2
0x7ffe75d1837e vsubsd xmm3, xmm3, xmm1
0x7ffe75d18382 vmulsd xmm0, xmm0, xmm3
0x7ffe75d18386 vmovaps xmm3, xmm2
0x7ffe75d1838a vsubsd xmm3, xmm3, xmm0
0x7ffe75d1838e vmovsd xmm0, qword ptr [r15+0x10]
0x7ffe75d18394 vmulsd xmm2, xmm2, qword ptr [rip+0x124]
0x7ffe75d1839c vsubsd xmm0, xmm0, xmm2
0x7ffe75d183a0 vaddsd xmm0, xmm0, xmm1
0x7ffe75d183a4 vmulsd xmm0, xmm0, xmm8
0x7ffe75d183a9 vmovaps xmm2, xmm3
0x7ffe75d183ad vaddsd xmm2, xmm2, xmm0
0x7ffe75d183b1 cmp eax, dword ptr [r14+0x8]
0x7ffe75d183b5 jnb 0x7ffe75d18449 <Block 29>
0x7ffe75d183bb Block 27:
0x7ffe75d183bb vmovsd qword ptr [r14+rcx*8+0x10], xmm2
0x7ffe75d183c2 mov r12, r15
0x7ffe75d183c5 mov r15, r14
0x7ffe75d183c8 inc ebx
0x7ffe75d183ca cmp ebx, edi
0x7ffe75d183cc mov r14, r12
0x7ffe75d183cf jl 0x7ffe75d180d2 <Block 6>
0x7ffe75d183d5 Block 28:
0x7ffe75d183d5 mov rax, r15
0x7ffe75d183d8 vmovaps xmm6, xmmword ptr [rsp+0x170]
0x7ffe75d183e1 vmovaps xmm7, xmmword ptr [rsp+0x160]
0x7ffe75d183ea vmovaps xmm8, xmmword ptr [rsp+0x150]
0x7ffe75d183f3 vmovaps xmm9, xmmword ptr [rsp+0x140]
0x7ffe75d183fc vmovaps xmm10, xmmword ptr [rsp+0x130]
0x7ffe75d18405 vmovaps xmm11, xmmword ptr [rsp+0x120]
0x7ffe75d1840e vmovaps xmm12, xmmword ptr [rsp+0x110]
0x7ffe75d18417 vmovaps xmm13, xmmword ptr [rsp+0x100]
0x7ffe75d18420 vmovaps xmm14, xmmword ptr [rsp+0xf0]
0x7ffe75d18429 vmovaps xmm15, xmmword ptr [rsp+0xe0]
0x7ffe75d18432 vzeroupper
0x7ffe75d18435 add rsp, 0x188
0x7ffe75d1843c pop rbx
0x7ffe75d1843d pop rbp
0x7ffe75d1843e pop rsi
0x7ffe75d1843f pop rdi
0x7ffe75d18440 pop r12
0x7ffe75d18442 pop r13
0x7ffe75d18444 pop r14
0x7ffe75d18446 pop r15
0x7ffe75d18448 ret
0x7ffe75d18449 Block 29:
0x7ffe75d18449 call 0x7ffed59bd450
0x7ffe75d1844e Block 30:
0x7ffe75d1844e call 0x7ffed59bddf0
0x7ffe75d18453 Block 31:
0x7ffe75d18453 call 0x7ffed59bdd50
0x7ffe75d18458 Block 32:
0x7ffe75d18458 int3
```

@fiigii

fiigii commented Dec 18, 2018

> Additionally, I detected that `Vector<T>.op_Multiply` is not inlined (the function body generates SIMD code). Is it intentional?

Ah, `Vector<T>.op_Multiply` is not an intrinsic for the `(T factor, Vector<T> value)` overload, so we probably need an AggressiveInlining for it.

```
Method op_Multiply is NOT a SIMD intrinsic
  Known type SIMD Vector<Double>
Calling impNormStructVal on:
               [000174] ------------              *  LCL_VAR   simd32 V13 loc7
  Known type SIMD Vector<Double>
resulting tree:
               [000177] n-----------              *  OBJ(32)   simd32
               [000176] L-----------              \--*  ADDR      byref
               [000174] ------------                 \--*  LCL_VAR   simd32 V13 loc7
INLINER: during 'impMarkInlineCandidate' result 'failed this callee' reason 'too many il bytes' for 'Burgers:GetCalculated3(int,int,double,double,double,ref):ref' calling 'Vector`1:op_Multiply(double,struct):struct'

INLINER: Marking Vector`1:op_Multiply(double,struct):struct as NOINLINE because of too many il bytes
INLINER: during 'impMarkInlineCandidate' result 'failed this callee' reason 'too many il bytes'
```

```csharp
// This method is intrinsic only for certain types. It cannot access fields directly unless we are sure the context is unaccelerated.
/// <summary>
/// Multiplies a vector by the given scalar.
/// </summary>
/// <param name="factor">The scalar value.</param>
/// <param name="value">The source vector.</param>
/// <returns>The scaled vector.</returns>
public static Vector<T> operator *(T factor, Vector<T> value)
```
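
A sketch of the suggested fix (assumed shape, not the actual corefx change) would mark that overload for aggressive inlining, for example:

```csharp
using System.Runtime.CompilerServices;

// Sketch only: inside Vector<T>, bypass the inliner's IL-size heuristic
// for the scalar * vector overload. The body broadcasts the scalar and
// reuses the vector * vector overload, which is recognized as a SIMD
// intrinsic.
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Vector<T> operator *(T factor, Vector<T> value)
{
    return new Vector<T>(factor) * value;
}
```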

@tannergooding
Member

> Ah, `Vector<T>.op_Multiply` is not an intrinsic for the `(T factor, Vector<T> value)` overload, so we probably need an AggressiveInlining for it.

@fiigii, it's worth noting that, as of right now, the types in System.Numerics.Vectors do not actually respect the [Intrinsic] attribute:

  • The types not in CoreLib, which is everything but Vector<T>, aren't even considered for SetIsJitIntrinsic today
  • For types in CoreLib (so Vector and Vector<T>), impSIMDIntrinsic doesn't do any checks for CORINFO_FLG_JIT_INTRINSIC and instead just relies on clsHnd being a SIMD class and the method name/parameter types matching up to one of the entries in the SIMD intrinsic list

I actually hit this the other day when doing some investigation on https://github.com/dotnet/corefx/issues/31425. The fix to get the SIMD intrinsics to start respecting the [Intrinsic] flag is fairly trivial (and I have it done locally for both the SIMD assembly and CoreLib), but this causes some assembly diffs due to certain methods being treated as intrinsic implicitly.

  • I will put up a PR after I ensure no codegen diffs (which just requires marking some methods explicitly as intrinsic)
  • There are also some methods marked as Intrinsic which will never be handled, since we don't have entries in the intrinsic list. This should probably be looked at as well

@fiigii

fiigii commented Dec 18, 2018

@tannergooding Thanks for the note. But this problem is not related to "intrinsic"; let's discuss it in dotnet/corefx#31425 😄

@AndyAyersMS
Member Author

It turns out for x64 we're already doing 16 byte alignment of methods. The jit requests default alignment from the runtime, and so the alignment is ultimately governed by the definition of CODE_SIZE_ALIGN which is found in vm\amd64\cgencpu.h, and that is 16.
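
A quick, hedged way to spot-check this from C# (note that GetFunctionPointer may return a precode stub rather than the jitted body, so treat the result as approximate):

```csharp
using System;
using System.Reflection;
using System.Runtime.CompilerServices;

class AlignmentCheck
{
    static void Payload() { }

    static void Main()
    {
        // Force Payload to be jitted, then inspect where its code landed.
        MethodInfo m = typeof(AlignmentCheck).GetMethod(
            nameof(Payload), BindingFlags.NonPublic | BindingFlags.Static);
        RuntimeHelpers.PrepareMethod(m.MethodHandle);
        long code = m.MethodHandle.GetFunctionPointer().ToInt64();
        Console.WriteLine($"0x{code:x} mod 16 = {code % 16}");
    }
}
```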

So the impacts seen above are possibly just normal variation given 16 byte alignment -- note that there is some known perf impact on Intel processors from even stronger alignments, say 32 bytes, which may explain some of the above variability.

We could of course consider asking for stronger alignments but it may make sense to look at adjusting internal alignments first to take advantage of the 16 byte alignments we already have.

So I am going to close this one out.

@AndyAyersMS AndyAyersMS closed this Jan 5, 2019