
x64: Migrate selectif and selectif_spectre_guard to ISLE #4619

Merged: 3 commits into bytecodealliance:main on Aug 5, 2022

Conversation

@elliottt (Member) commented on Aug 4, 2022:

Lower selectif and selectif_spectre_guard in ISLE, which allows a lot of code related to emit_cmp to be removed from the x64 lower.rs module.
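
For readers less familiar with the ISLE side, the migration amounts to replacing handwritten Rust in lower.rs with pattern-matching rules. The snippet below is only a rough sketch of the shape such a rule takes, not the literal rule added by this PR: the pattern and the emit_cmp constructor here are assumptions, while select_icmp is the helper declared in the diff excerpt further down.

;; Rough sketch, for illustration only: lower `selectif` by emitting the
;; comparison and feeding the resulting IcmpCondResult into the new
;; `select_icmp` helper; `selectif_spectre_guard` would follow the same shape.
(rule (lower (has_type ty (selectif cc (ifcmp a b) x y)))
      (select_icmp (emit_cmp cc a b) x y))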

@elliottt changed the title from "x64: Lower selectif and selectif_spectre_guard in ISLE" to "x64: Migrate selectif and selectif_spectre_guard to ISLE" on Aug 4, 2022
@elliottt marked this pull request as ready for review on August 4, 2022 at 22:36
@@ -3157,6 +3156,11 @@
(rule (lower_icmp_bool (IcmpCondResult.Condition producer cc))
(with_flags producer (x64_setcc cc)))

;; Emit a conditional move based on the result of an icmp.
(decl select_icmp (IcmpCondResult Value Value) ValueRegs)
@elliottt (Member, Author) commented:

This might not be the most reusable signature; should I rework it to take the type as an argument and ValueRegs arguments instead of Value?

@cfallin (Member) replied:

That might be slightly more idiomatic for the "inst helpers" layer, though I don't think it's too important either way; strong types mean we can always refactor easily and relatively safely...

@github-actions bot added the cranelift (Issues related to the Cranelift code generator) and cranelift:area:x64 (Issues related to x64 codegen) labels on Aug 4, 2022
@cfallin (Member) left a review comment:

Very happy to see more of the handwritten code go away!

; cmpq %rdi, %rcx
; cmovnbeq %rdx, %rax, %rax
; cmovnbeq const(VCodeConstant(0)), %rax, %rax
@cfallin (Member) commented:

Here, I think, is a good reason to force the inputs into registers: otherwise the heap legalization is going to put a load in the hot path.

Although, after saying that, I am a bit curious whether the reduced register pressure (no rdx needed here anymore) has an effect in the opposite direction... if it's not too much trouble, would you be willing to run a quick set of Sightglass benchmarks (on at least, say, bz2, spidermonkey, and pulldown-cmark)?

@elliottt (Member, Author) replied:

Sure! I have a fix that forces the value into registers now, so this PR shouldn't change performance, but I'll run the Sightglass benchmarks without that fix just to see what the difference is.

@elliottt (Member, Author) commented on Aug 5, 2022:

Here are the results; branch.so loads a 0 from memory, while main.so uses xor r, r, r: https://gist.github.com/elliottt/67be59da8bbdc3a9f946837efc81dedf

It looks like there were some improvements in instantiation, but no change otherwise. The command I ran for each benchmark was:

RUST_LOG=trace ./target/release/sightglass-cli benchmark benchmarks/<benchmark>/benchmark.wasm --engine ../main.so --engine ../branch.so --iterations-per-process=100 --processes=1

@elliottt (Member, Author) commented on Aug 5, 2022:

My naming of the two outputs was bothering me, so I ran another set of benchmarks with the sha of each commit as the name of its shared object. Here are the results: https://gist.github.com/elliottt/cd2ad5c97d53c57a315b76b4b674af46. This time the results show very little difference between the two, with the xor commit lagging slightly behind the load commit where there is a difference.

25c436f is the head of this branch, loading 0 via xor
643f537 is the previous commit that would load 0 via memory

The command that I ran each time was:

RUST_LOG=trace ./target/release/sightglass-cli benchmark benchmarks/<benchmark>/benchmark.wasm --engine ../643f53789890f2a776839fe4f5a662d0ec7f891b.so --engine ../25c436f4480b5e20f832eab4d165734692c298a0.so --iterations-per-process=100 --processes=1

@cfallin (Member) commented:

A bit late but I realized this morning that we probably aren't actually benchmarking the thing with the default Sightglass config: it should be using static memories with a large enough guard region (4+2GiB by default I think?) that bounds checks don't actually happen. We would need to configure it to use dynamic bounds to see a delta; we may just be reading into noise here.

If you're up for it, --static-memory-maximum-size 0 --static-memory-guard-size 0 will I think give a completely dynamic (bounds-checked) configuration; running the two different wasmtimes and timing them (with e.g. hyperfine, running a cwasm with --allow-precompiled) is probably easier than hacking the default config to make Sightglass use it.

I still remain a little skeptical that loads of 0 from memory are faster, but I'd love to be proven wrong...
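
For reference, a fully dynamic (bounds-checked) comparison along the lines suggested above might look roughly like the following; the two wasmtime binary names and the .cwasm file names are placeholders, and the flags are the ones mentioned in this thread:

# Sketch only: compile the benchmark once per build with dynamic, bounds-checked
# memories, then time the precompiled modules with hyperfine. The memory flags are
# repeated at run time so the runtime configuration matches the compiled module.
./wasmtime-xor compile --static-memory-maximum-size 0 --static-memory-guard-size 0 benchmark.wasm -o bench-xor.cwasm
./wasmtime-load compile --static-memory-maximum-size 0 --static-memory-guard-size 0 benchmark.wasm -o bench-load.cwasm
hyperfine \
  './wasmtime-xor run --static-memory-maximum-size 0 --static-memory-guard-size 0 --allow-precompiled bench-xor.cwasm' \
  './wasmtime-load run --static-memory-maximum-size 0 --static-memory-guard-size 0 --allow-precompiled bench-load.cwasm'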

@elliottt (Member, Author) replied:

Happy to run some more tests in the background -- I've still got the shared objects hanging around. I figured it was better to merge the version that didn't introduce the loads, as it would be pretty easy to revert the last commit on this PR if we decided that it was worthwhile 👍

@elliottt (Member, Author) commented:

The results for bz2 are pretty interesting: instantiation really takes a hit from the load.

instantiation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  Δ = 26233.52 ± 5394.98 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [43591 48321.01 90806] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [43783 74554.53 122088] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 94444.02 ± 19422.58 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 1.43x to 1.65x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 0.58x to 0.72x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [156936 173963.32 326914] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [157626 268407.34 439534] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

execution :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  Δ = 269493.01 ± 178023.21 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [30236428 30873377.89 31490866] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [30184192 30603884.88 31578144] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  Δ = 970207.46 ± 640905.23 (confidence = 99%)

  25c436f4480b5e20f832eab4d165734692c298a0.so is 0.99x to 1.00x faster than 643f53789890f2a776839fe4f5a662d0ec7f891b.so!
  643f53789890f2a776839fe4f5a662d0ec7f891b.so is 1.00x to 1.01x faster than 25c436f4480b5e20f832eab4d165734692c298a0.so!

  [108854820 111147916.48 113370948] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [108666764 110177709.02 113685162] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [235950166 248161159.74 266311074] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [237357286 247903675.98 301295484] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

compilation :: nanoseconds :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [65539498 68931326.53 73972799] 25c436f4480b5e20f832eab4d165734692c298a0.so
  [65930352 68859805.71 83690362] 643f53789890f2a776839fe4f5a662d0ec7f891b.so

As a note about how I set these flags: I modified crates/bench-api/src/lib.rs to initialize both of the fields @cfallin mentioned to 0, so that I could continue using Sightglass to generate the benchmarks.
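
Concretely, that tweak amounts to roughly the following in the Config that bench-api builds (a sketch only; these are the Config methods that mirror the CLI flags above, not the exact diff):

// Rough sketch, not the actual bench-api change: zeroing both static-memory
// settings forces dynamic, bounds-checked heaps, which is what exercises the
// Spectre-guarded cmov sequence on every heap access.
let mut config = wasmtime::Config::new();
config.static_memory_maximum_size(0);
config.static_memory_guard_size(0);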

@cfallin (Member) commented:

Interesting, so the load-of-zero approach is still 0-1% faster, if I'm reading this right? Great if so; perhaps the register pressure is the deciding factor!

@cfallin (Member) commented:

Actually after discussing offline with @elliottt it seems that what's on main now is the xor-based approach; the gap is small enough and xor is otherwise the idiomatic way to zero a register (optimized as such by microarchitectures etc) that I think I'd prefer to keep that approach. Thanks for running the benchmarks here!

@elliottt merged commit 0c2a48f into bytecodealliance:main on Aug 5, 2022