
x64: Migrate selectif and selectif_spectre_guard to ISLE #4619

Merged: 3 commits into bytecodealliance:main on Aug 5, 2022

Conversation

@elliottt (Member) commented on Aug 4, 2022:

Lower selectif and selectif_spectre_guard in ISLE, which allows a lot of code related to emit_cmp to be removed from the x64 lower.rs module.
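
For readers less familiar with the ISLE side, the migration amounts to replacing handwritten Rust in lower.rs with pattern-matching rules. The snippet below is only a rough sketch of the shape such a rule takes, not the literal rule added by this PR: the pattern and the emit_cmp constructor here are assumptions, while select_icmp is the helper declared in the diff excerpt further down.

;; Rough sketch, for illustration only: lower `selectif` by emitting the
;; comparison and feeding the resulting IcmpCondResult into the new
;; `select_icmp` helper; `selectif_spectre_guard` would follow the same shape.
(rule (lower (has_type ty (selectif cc (ifcmp a b) x y)))
      (select_icmp (emit_cmp cc a b) x y))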

@elliottt changed the title from "x64: Lower selectif and selectif_spectre_guard in ISLE" to "x64: Migrate selectif and selectif_spectre_guard to ISLE" on Aug 4, 2022
@elliottt marked this pull request as ready for review on August 4, 2022 at 22:36
@@ -3157,6 +3156,11 @@
(rule (lower_icmp_bool (IcmpCondResult.Condition producer cc))
(with_flags producer (x64_setcc cc)))

;; Emit a conditional move based on the result of an icmp.
(decl select_icmp (IcmpCondResult Value Value) ValueRegs)
@elliottt (Member, Author) commented:

This might not be the most reusable signature; should I rework it to take the type as an argument and ValueRegs arguments instead of Value?

@cfallin (Member) replied:

That might be slightly more idiomatic for the "inst helpers" layer, though I don't think it's too important either way; strong types mean we can always refactor easily and relatively safely...

@github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:area:x64 (Issues related to x64 codegen) labels on Aug 4, 2022
@cfallin (Member) left a review comment:

Very happy to see more of the handwritten code go away!

; cmpq %rdi, %rcx
; cmovnbeq %rdx, %rax, %rax
; cmovnbeq const(VCodeConstant(0)), %rax, %rax
@cfallin (Member) commented:

Here, I think, is a good reason to force the inputs into registers: otherwise the heap legalization is going to put a load in the hot path.

Although, after saying that, I am a bit curious whether the reduced register pressure (no rdx needed here anymore) has an effect in the opposite direction... if it's not too much trouble, would you be willing to run a quick set of Sightglass benchmarks (on at least, say, bz2, spidermonkey, and pulldown-cmark)?

@elliottt (Member, Author) replied:

Sure! I have a fix that forces the value into registers now, so this PR shouldn't change performance, but I'll run the Sightglass benchmarks without that fix just to see what the difference is.

@elliottt (Member, Author) commented on Aug 5, 2022:

Here are the results; branch.so loads a 0 from memory, while main.so uses xor r, r, r: https://gist.github.com/elliottt/67be59da8bbdc3a9f946837efc81dedf

It looks like there were some improvements in instantiation, but no change otherwise. The command I ran for each benchmark was:

RUST_LOG=trace ./target/release/sightglass-cli benchmark benchmarks/<benchmark>/benchmark.wasm --engine ../main.so --engine ../branch.so --iterations-per-process=100 --processes=1

@elliottt (Member, Author) commented on Aug 5, 2022:

My naming of the two outputs was bothering me, so I ran another set of benchmarks with the sha of each commit as the name of its shared object. Here are the results: https://gist.github.com/elliottt/cd2ad5c97d53c57a315b76b4b674af46. This time the results show very little difference between the two, with the xor commit lagging slightly behind the load commit where there is a difference.

25c436f is the head of this branch, loading 0 via xor
643f537 is the previous commit that would load 0 via memory

The command that I ran each time was:

RUST_LOG=trace ./target/release/sightglass-cli benchmark benchmarks/<benchmark>/benchmark.wasm --engine ../643f53789890f2a776839fe4f5a662d0ec7f891b.so --engine ../25c436f4480b5e20f832eab4d165734692c298a0.so --iterations-per-process=100 --processes=1

@cfallin (Member) commented:

A bit late but I realized this morning that we probably aren't actually benchmarking the thing with the default Sightglass config: it should be using static memories with a large enough guard region (4+2GiB by default I think?) that bounds checks don't actually happen. We would need to configure it to use dynamic bounds to see a delta; we may just be reading into noise here.

If you're up for it, --static-memory-maximum-size 0 --static-memory-guard-size 0 will I think give a completely dynamic (bounds-checked) configuration; running the two different wasmtimes and timing them (with e.g. hyperfine, running a cwasm with --allow-precompiled) is probably easier than hacking the default config to make Sightglass use it.

I still remain a little skeptical that loads of 0 from memory are faster, but I'd love to be proven wrong...
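
For reference, a fully dynamic (bounds-checked) comparison along the lines suggested above might look roughly like the following; the two wasmtime binary names and the .cwasm file names are placeholders, and the flags are the ones mentioned in this thread:

# Sketch only: compile the benchmark once per build with dynamic, bounds-checked
# memories, then time the precompiled modules with hyperfine. The memory flags are
# repeated at run time so the runtime configuration matches the compiled module.
./wasmtime-xor compile --static-memory-maximum-size 0 --static-memory-guard-size 0 benchmark.wasm -o bench-xor.cwasm
./wasmtime-load compile --static-memory-maximum-size 0 --static-memory-guard-size 0 benchmark.wasm -o bench-load.cwasm
hyperfine \
  './wasmtime-xor run --static-memory-maximum-size 0 --static-memory-guard-size 0 --allow-precompiled bench-xor.cwasm' \
  './wasmtime-load run --static-memory-maximum-size 0 --static-memory-guard-size 0 --allow-precompiled bench-load.cwasm'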

@elliottt (Member, Author) replied:

Happy to run some more tests in the background -- I've still got the shared objects hanging around. I figured it was better to merge the version that didn't introduce the loads, as it would be pretty easy to revert the last commit on this PR if we decided that it was worthwhile 👍

@elliottt (Member, Author) commented:

The results for bz2 are pretty interesting: instantiation really takes a hit from the load.

instantiation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  Δ = 26233.52 ± 5394.98 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [43591 48321.01 90806] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [43783 74554.53 122088] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 94444.02 ± 19422.58 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [156936 173963.32 326914] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [157626 268407.34 439534] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

execution :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  Δ = 269493.01 ± 178023.21 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [30236428 30873377.89 31490866] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [30184192 30603884.88 31578144] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 970207.46 ± 640905.23 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [108854820 111147916.48 113370948] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [108666764 110177709.02 113685162] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [235950166 248161159.74 266311074] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [237357286 247903675.98 301295484] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [65539498 68931326.53 73972799] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [65930352 68859805.71 83690362] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

As a note about how I set these flags: I modified crates/bench-api/src/lib.rs to initialize both of the fields @cfallin mentioned to 0, so that I could continue using Sightglass to generate the benchmarks.
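
Concretely, that tweak amounts to roughly the following in the Config that bench-api builds (a sketch only; these are the Config methods that mirror the CLI flags above, not the exact diff):

// Rough sketch, not the actual bench-api change: zeroing both static-memory
// settings forces dynamic, bounds-checked heaps, which is what exercises the
// Spectre-guarded cmov sequence on every heap access.
let mut config = wasmtime::Config::new();
config.static_memory_maximum_size(0);
config.static_memory_guard_size(0);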

@cfallin (Member) commented:

Interesting, so the load-of-zero approach is still 0-1% faster, if I'm reading this right? Great if so; perhaps the register pressure is the deciding factor!

@cfallin (Member) commented:

Actually after discussing offline with @elliottt it seems that what's on main now is the xor-based approach; the gap is small enough and xor is otherwise the idiomatic way to zero a register (optimized as such by microarchitectures etc) that I think I'd prefer to keep that approach. Thanks for running the benchmarks here!

@elliottt merged commit 0c2a48f into bytecodealliance:main on Aug 5, 2022