Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasmtime: Implement fast Wasm stack walking #4431

Merged
merged 46 commits into from
Jul 28, 2022

Conversation

fitzgen
Copy link
Member

@fitzgen fitzgen commented Jul 11, 2022

Why do we want Wasm stack walking to be fast? Because we capture stacks whenever
there is a trap and traps actually happen fairly frequently with short-lived
programs and WASI's exit.

Previously, we would rely on generating the system unwind info (e.g.
.eh_frame) and using the system unwinder (via the backtracecrate) to walk
the full stack and filter out any non-Wasm stack frames. This can,
unfortunately, be slow for two primary reasons:

  1. The system unwinder is doing O(all-kinds-of-frames) work rather than
    O(wasm-frames) work.

  2. System unwind info and the system unwinder need to be much more general than
    a purpose-built stack walker for Wasm needs to be. It has to handle any kind of
    stack frame that any compiler might emit where as our Wasm frames are emitted by
    Cranelift and always have frame pointers. This translates into implementation
    complexity and general overhead. There can also be unnecessary-for-our-use-cases
    global synchronization and locks involved, further slowing down stack walking in
    the presence of multiple threads trying to capture stacks in parallel.

This commit introduces a purpose-built stack walker for traversing just our Wasm
frames. To find all the sequences of Wasm-to-Wasm stack frames, and ignore
non-Wasm stack frames, we keep a linked list of (entry stack pointer, exit frame pointer) pairs. This linked list is maintained via Wasm-to-host and
host-to-Wasm trampolines. Within a sequence of Wasm-to-Wasm calls, we can use
frame pointers (which Cranelift preserves) to find the next older Wasm frame on
the stack, and we keep doing this until we reach the entry stack pointer,
meaning that the next older frame will be a host frame.

The trampolines need to avoid a couple stumbling blocks. First, they need to be
compiled ahead of time, since we may not have access to a compiler at
runtime (e.g. if the cranelift feature is disabled) but still want to be able
to call functions that have already been compiled and get stack traces for those
functions. Usually this means we would compile the appropriate trampolines
inside Module::new and the compiled module object would hold the
trampolines. However, we also need to support calling host functions that are
wrapped into wasmtime::Funcs and there doesn't exist any ahead-of-time
compiled module object to hold the appropriate trampolines:

// Define a host function.
let func_type = wasmtime::FuncType::new(
    vec![wasmtime::ValType::I32],
    vec![wasmtime::ValType::I32],
);
let func = Func::new(&mut store, func_type, |_, params, results| {
    // ...
    Ok(())
});

// Call that host function.
let mut results = vec![wasmtime::Val::I32(0)];
func.call(&[wasmtime::Val::I32(0)], &mut results)?;

Therefore, we define one host-to-Wasm trampoline and one Wasm-to-host trampoline
in assembly that work for all Wasm and host function signatures. These
trampolines are careful to only use volatile registers, avoid touching any
register that is an argument in the calling convention ABI, and tail call to the
target callee function. This allows forwarding any set of arguments and any
returns to and from the callee, while also allowing us to maintain our linked
list of Wasm stack and frame pointers before transferring control to the
callee. These trampolines are not used in Wasm-to-Wasm calls, only when crossing
the host-Wasm boundary, so they do not impose overhead on regular calls. (And if
using one trampoline for all host-Wasm boundary crossing ever breaks branch
prediction enough in the CPU to become any kind of bottleneck, we can do fun
things like have multiple copies of the same trampoline and choose a random copy
for each function, sharding the functions across branch predictor entries.)

Finally, this commit also ends the use of a synthetic Module and allocating a
stubbed out VMContext for host functions. Instead, we define a
VMHostFuncContext with its own magic value, similar to VMComponentContext,
specifically for host functions.

Benchmarks

Traps and Stack Traces

Large improvements to taking stack traces on traps, ranging from shaving off 64%
to 99.95% of the time it used to take.

multi-threaded-traps/0  time:   [2.5686 us 2.5808 us 2.5934 us]
                        thrpt:  [0.0000  elem/s 0.0000  elem/s 0.0000  elem/s]
                 change:
                        time:   [-85.419% -85.153% -84.869%] (p = 0.00 < 0.05)
                        thrpt:  [+560.90% +573.56% +585.84%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
multi-threaded-traps/1  time:   [2.9021 us 2.9167 us 2.9322 us]
                        thrpt:  [341.04 Kelem/s 342.86 Kelem/s 344.58 Kelem/s]
                 change:
                        time:   [-91.455% -91.294% -91.096%] (p = 0.00 < 0.05)
                        thrpt:  [+1023.1% +1048.6% +1070.3%]
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
multi-threaded-traps/2  time:   [2.9996 us 3.0145 us 3.0295 us]
                        thrpt:  [660.18 Kelem/s 663.47 Kelem/s 666.76 Kelem/s]
                 change:
                        time:   [-94.040% -93.910% -93.762%] (p = 0.00 < 0.05)
                        thrpt:  [+1503.1% +1542.0% +1578.0%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
multi-threaded-traps/4  time:   [5.5768 us 5.6052 us 5.6364 us]
                        thrpt:  [709.68 Kelem/s 713.63 Kelem/s 717.25 Kelem/s]
                 change:
                        time:   [-93.193% -93.121% -93.052%] (p = 0.00 < 0.05)
                        thrpt:  [+1339.2% +1353.6% +1369.1%]
                        Performance has improved.
multi-threaded-traps/8  time:   [8.6408 us 9.1212 us 9.5438 us]
                        thrpt:  [838.24 Kelem/s 877.08 Kelem/s 925.84 Kelem/s]
                 change:
                        time:   [-94.754% -94.473% -94.202%] (p = 0.00 < 0.05)
                        thrpt:  [+1624.7% +1709.2% +1806.1%]
                        Performance has improved.
multi-threaded-traps/16 time:   [10.152 us 10.840 us 11.545 us]
                        thrpt:  [1.3858 Melem/s 1.4760 Melem/s 1.5761 Melem/s]
                 change:
                        time:   [-97.042% -96.823% -96.577%] (p = 0.00 < 0.05)
                        thrpt:  [+2821.5% +3048.1% +3281.1%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

many-modules-registered-traps/1
                        time:   [2.6278 us 2.6361 us 2.6447 us]
                        thrpt:  [378.11 Kelem/s 379.35 Kelem/s 380.55 Kelem/s]
                 change:
                        time:   [-85.311% -85.108% -84.909%] (p = 0.00 < 0.05)
                        thrpt:  [+562.65% +571.51% +580.76%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
many-modules-registered-traps/8
                        time:   [2.6294 us 2.6460 us 2.6623 us]
                        thrpt:  [3.0049 Melem/s 3.0235 Melem/s 3.0425 Melem/s]
                 change:
                        time:   [-85.895% -85.485% -85.022%] (p = 0.00 < 0.05)
                        thrpt:  [+567.63% +588.95% +608.95%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
many-modules-registered-traps/64
                        time:   [2.6218 us 2.6329 us 2.6452 us]
                        thrpt:  [24.195 Melem/s 24.308 Melem/s 24.411 Melem/s]
                 change:
                        time:   [-93.629% -93.551% -93.470%] (p = 0.00 < 0.05)
                        thrpt:  [+1431.4% +1450.6% +1469.5%]
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
many-modules-registered-traps/512
                        time:   [2.6569 us 2.6737 us 2.6923 us]
                        thrpt:  [190.17 Melem/s 191.50 Melem/s 192.71 Melem/s]
                 change:
                        time:   [-99.277% -99.268% -99.260%] (p = 0.00 < 0.05)
                        thrpt:  [+13417% +13566% +13731%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
many-modules-registered-traps/4096
                        time:   [2.7258 us 2.7390 us 2.7535 us]
                        thrpt:  [1.4876 Gelem/s 1.4955 Gelem/s 1.5027 Gelem/s]
                 change:
                        time:   [-99.956% -99.955% -99.955%] (p = 0.00 < 0.05)
                        thrpt:  [+221417% +223380% +224881%]
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

many-stack-frames-traps/1
                        time:   [1.4658 us 1.4719 us 1.4784 us]
                        thrpt:  [676.39 Kelem/s 679.38 Kelem/s 682.21 Kelem/s]
                 change:
                        time:   [-90.368% -89.947% -89.586%] (p = 0.00 < 0.05)
                        thrpt:  [+860.23% +894.72% +938.21%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
many-stack-frames-traps/8
                        time:   [2.4772 us 2.4870 us 2.4973 us]
                        thrpt:  [3.2034 Melem/s 3.2167 Melem/s 3.2294 Melem/s]
                 change:
                        time:   [-85.550% -85.370% -85.199%] (p = 0.00 < 0.05)
                        thrpt:  [+575.65% +583.51% +592.03%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
many-stack-frames-traps/64
                        time:   [10.109 us 10.171 us 10.236 us]
                        thrpt:  [6.2525 Melem/s 6.2925 Melem/s 6.3309 Melem/s]
                 change:
                        time:   [-78.144% -77.797% -77.336%] (p = 0.00 < 0.05)
                        thrpt:  [+341.22% +350.38% +357.55%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
many-stack-frames-traps/512
                        time:   [126.16 us 126.54 us 126.96 us]
                        thrpt:  [4.0329 Melem/s 4.0461 Melem/s 4.0583 Melem/s]
                 change:
                        time:   [-65.364% -64.933% -64.453%] (p = 0.00 < 0.05)
                        thrpt:  [+181.32% +185.17% +188.71%]
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high severe

Calls

There is, however, a small regression in raw Wasm-to-host and host-to-Wasm call
performance due the new trampolines. It seems to be on the order of about 2-10
nanoseconds per call, depending on the benchmark.

I believe this regression is ultimately acceptable because

  1. this overhead will be vastly dominated by whatever work a non-nop callee
    actually does,

  2. we will need these trampolines, or something like them, when implementing the
    Wasm exceptions proposal to do things like translate Wasm's exceptions into
    Rust's Results,

  3. and because the performance improvements to trapping and capturing stack
    traces are of such a larger magnitude than this call regressions.

sync/no-hook/host-to-wasm - typed - nop
                        time:   [28.683 ns 28.757 ns 28.844 ns]
                        change: [+16.472% +17.183% +17.904%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  5 (5.00%) high severe
sync/no-hook/host-to-wasm - untyped - nop
                        time:   [42.515 ns 42.652 ns 42.841 ns]
                        change: [+12.371% +14.614% +17.462%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) high mild
  10 (10.00%) high severe
sync/no-hook/host-to-wasm - unchecked - nop
                        time:   [33.936 ns 34.052 ns 34.179 ns]
                        change: [+25.478% +26.938% +28.369%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
sync/no-hook/host-to-wasm - typed - nop-params-and-results
                        time:   [34.290 ns 34.388 ns 34.502 ns]
                        change: [+40.802% +42.706% +44.526%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) high mild
  8 (8.00%) high severe
sync/no-hook/host-to-wasm - untyped - nop-params-and-results
                        time:   [62.546 ns 62.721 ns 62.919 ns]
                        change: [+2.5014% +3.6319% +4.8078%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe
sync/no-hook/host-to-wasm - unchecked - nop-params-and-results
                        time:   [42.609 ns 42.710 ns 42.831 ns]
                        change: [+20.966% +22.282% +23.475%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

sync/hook-sync/host-to-wasm - typed - nop
                        time:   [29.546 ns 29.675 ns 29.818 ns]
                        change: [+20.693% +21.794% +22.836%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
sync/hook-sync/host-to-wasm - untyped - nop
                        time:   [45.448 ns 45.699 ns 45.961 ns]
                        change: [+17.204% +18.514% +19.590%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
sync/hook-sync/host-to-wasm - unchecked - nop
                        time:   [34.334 ns 34.437 ns 34.558 ns]
                        change: [+23.225% +24.477% +25.886%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
sync/hook-sync/host-to-wasm - typed - nop-params-and-results
                        time:   [36.594 ns 36.763 ns 36.974 ns]
                        change: [+41.967% +47.261% +52.086%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe
sync/hook-sync/host-to-wasm - untyped - nop-params-and-results
                        time:   [63.541 ns 63.831 ns 64.194 ns]
                        change: [-4.4337% -0.6855% +2.7134%] (p = 0.73 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
sync/hook-sync/host-to-wasm - unchecked - nop-params-and-results
                        time:   [43.968 ns 44.169 ns 44.437 ns]
                        change: [+18.772% +21.802% +24.623%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) high mild
  12 (12.00%) high severe

async/no-hook/host-to-wasm - typed - nop
                        time:   [4.9612 us 4.9743 us 4.9889 us]
                        change: [+9.9493% +11.911% +13.502%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe
async/no-hook/host-to-wasm - untyped - nop
                        time:   [5.0030 us 5.0211 us 5.0439 us]
                        change: [+10.841% +11.873% +12.977%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
async/no-hook/host-to-wasm - typed - nop-params-and-results
                        time:   [4.9273 us 4.9468 us 4.9700 us]
                        change: [+4.7381% +6.8445% +8.8238%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
async/no-hook/host-to-wasm - untyped - nop-params-and-results
                        time:   [5.1151 us 5.1338 us 5.1555 us]
                        change: [+9.5335% +11.290% +13.044%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe

async/hook-sync/host-to-wasm - typed - nop
                        time:   [4.9330 us 4.9394 us 4.9467 us]
                        change: [+10.046% +11.038% +12.035%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
async/hook-sync/host-to-wasm - untyped - nop
                        time:   [5.0073 us 5.0183 us 5.0310 us]
                        change: [+9.3828% +10.565% +11.752%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
async/hook-sync/host-to-wasm - typed - nop-params-and-results
                        time:   [4.9610 us 4.9839 us 5.0097 us]
                        change: [+9.0857% +11.513% +14.359%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe
async/hook-sync/host-to-wasm - untyped - nop-params-and-results
                        time:   [5.0995 us 5.1272 us 5.1617 us]
                        change: [+9.3600% +11.506% +13.809%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  6 (6.00%) high mild
  4 (4.00%) high severe

async-pool/no-hook/host-to-wasm - typed - nop
                        time:   [2.4242 us 2.4316 us 2.4396 us]
                        change: [+7.8756% +8.8803% +9.8346%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
async-pool/no-hook/host-to-wasm - untyped - nop
                        time:   [2.5102 us 2.5155 us 2.5210 us]
                        change: [+12.130% +13.194% +14.270%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
async-pool/no-hook/host-to-wasm - typed - nop-params-and-results
                        time:   [2.4203 us 2.4310 us 2.4440 us]
                        change: [+4.0380% +6.3623% +8.7534%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
async-pool/no-hook/host-to-wasm - untyped - nop-params-and-results
                        time:   [2.5501 us 2.5593 us 2.5700 us]
                        change: [+8.8802% +10.976% +12.937%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe

async-pool/hook-sync/host-to-wasm - typed - nop
                        time:   [2.4135 us 2.4190 us 2.4254 us]
                        change: [+8.3640% +9.3774% +10.435%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
async-pool/hook-sync/host-to-wasm - untyped - nop
                        time:   [2.5172 us 2.5248 us 2.5357 us]
                        change: [+11.543% +12.750% +13.982%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
async-pool/hook-sync/host-to-wasm - typed - nop-params-and-results
                        time:   [2.4214 us 2.4353 us 2.4532 us]
                        change: [+1.5158% +5.0872% +8.6765%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  2 (2.00%) high mild
  13 (13.00%) high severe
async-pool/hook-sync/host-to-wasm - untyped - nop-params-and-results
                        time:   [2.5499 us 2.5607 us 2.5748 us]
                        change: [+10.146% +12.459% +14.919%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe

sync/no-hook/wasm-to-host - nop - typed
                        time:   [6.6135 ns 6.6288 ns 6.6452 ns]
                        change: [+37.927% +38.837% +39.869%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - typed
                        time:   [15.930 ns 15.993 ns 16.067 ns]
                        change: [+3.9583% +5.6286% +7.2430%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  11 (11.00%) high mild
  1 (1.00%) high severe
sync/no-hook/wasm-to-host - nop - untyped
                        time:   [20.596 ns 20.640 ns 20.690 ns]
                        change: [+4.3293% +5.2047% +6.0935%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - untyped
                        time:   [42.659 ns 42.882 ns 43.159 ns]
                        change: [-2.1466% -0.5079% +1.2554%] (p = 0.58 > 0.05)
                        No change in performance detected.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) high mild
  14 (14.00%) high severe
sync/no-hook/wasm-to-host - nop - unchecked
                        time:   [10.671 ns 10.691 ns 10.713 ns]
                        change: [+83.911% +87.620% +92.062%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
sync/no-hook/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.136 ns 11.190 ns 11.263 ns]
                        change: [-29.719% -28.446% -27.029%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe

sync/hook-sync/wasm-to-host - nop - typed
                        time:   [6.7964 ns 6.8087 ns 6.8226 ns]
                        change: [+21.531% +24.206% +27.331%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  4 (4.00%) high mild
  10 (10.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - typed
                        time:   [15.865 ns 15.921 ns 15.985 ns]
                        change: [+4.8466% +6.3330% +7.8317%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
sync/hook-sync/wasm-to-host - nop - untyped
                        time:   [21.505 ns 21.587 ns 21.677 ns]
                        change: [+8.0908% +9.1943% +10.254%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - untyped
                        time:   [44.018 ns 44.128 ns 44.261 ns]
                        change: [-1.4671% -0.0458% +1.2443%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe
sync/hook-sync/wasm-to-host - nop - unchecked
                        time:   [11.264 ns 11.326 ns 11.387 ns]
                        change: [+80.225% +81.659% +83.068%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
sync/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.816 ns 11.865 ns 11.920 ns]
                        change: [-29.152% -28.040% -26.957%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) high mild
  6 (6.00%) high severe

async/no-hook/wasm-to-host - nop - typed
                        time:   [6.6221 ns 6.6385 ns 6.6569 ns]
                        change: [+43.618% +44.755% +45.965%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - typed
                        time:   [15.884 ns 15.929 ns 15.983 ns]
                        change: [+3.5987% +5.2053% +6.7846%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  3 (3.00%) high mild
  13 (13.00%) high severe
async/no-hook/wasm-to-host - nop - untyped
                        time:   [20.615 ns 20.702 ns 20.821 ns]
                        change: [+6.9799% +8.1212% +9.2819%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - untyped
                        time:   [41.956 ns 42.207 ns 42.521 ns]
                        change: [-4.3057% -2.7730% -1.2428%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
async/no-hook/wasm-to-host - nop - unchecked
                        time:   [10.440 ns 10.474 ns 10.513 ns]
                        change: [+83.959% +85.826% +87.541%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.476 ns 11.512 ns 11.554 ns]
                        change: [-29.857% -28.383% -26.978%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe
async/no-hook/wasm-to-host - nop - async-typed
                        time:   [26.427 ns 26.478 ns 26.532 ns]
                        change: [+6.5730% +7.4676% +8.3983%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
async/no-hook/wasm-to-host - nop-params-and-results - async-typed
                        time:   [28.557 ns 28.693 ns 28.880 ns]
                        change: [+1.9099% +3.7332% +5.9731%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) high mild
  14 (14.00%) high severe

async/hook-sync/wasm-to-host - nop - typed
                        time:   [6.7488 ns 6.7630 ns 6.7784 ns]
                        change: [+19.935% +22.080% +23.683%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - typed
                        time:   [15.928 ns 16.031 ns 16.149 ns]
                        change: [+5.5188% +6.9567% +8.3839%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
async/hook-sync/wasm-to-host - nop - untyped
                        time:   [21.930 ns 22.114 ns 22.296 ns]
                        change: [+4.6674% +7.7588% +10.375%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - untyped
                        time:   [42.684 ns 42.858 ns 43.081 ns]
                        change: [-5.2957% -3.4693% -1.6217%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  2 (2.00%) high mild
  12 (12.00%) high severe
async/hook-sync/wasm-to-host - nop - unchecked
                        time:   [11.026 ns 11.053 ns 11.086 ns]
                        change: [+70.751% +72.378% +73.961%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) high mild
  5 (5.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.840 ns 11.900 ns 11.982 ns]
                        change: [-27.977% -26.584% -24.887%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe
async/hook-sync/wasm-to-host - nop - async-typed
                        time:   [27.601 ns 27.709 ns 27.882 ns]
                        change: [+8.1781% +9.1102% +10.030%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe
async/hook-sync/wasm-to-host - nop-params-and-results - async-typed
                        time:   [28.955 ns 29.174 ns 29.413 ns]
                        change: [+1.1226% +3.0366% +5.1126%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  7 (7.00%) high mild
  6 (6.00%) high severe

async-pool/no-hook/wasm-to-host - nop - typed
                        time:   [6.5626 ns 6.5733 ns 6.5851 ns]
                        change: [+40.561% +42.307% +44.514%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - typed
                        time:   [15.820 ns 15.886 ns 15.969 ns]
                        change: [+4.1044% +5.7928% +7.7122%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 17 outliers among 100 measurements (17.00%)
  4 (4.00%) high mild
  13 (13.00%) high severe
async-pool/no-hook/wasm-to-host - nop - untyped
                        time:   [20.481 ns 20.521 ns 20.566 ns]
                        change: [+6.7962% +7.6950% +8.7612%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - untyped
                        time:   [41.834 ns 41.998 ns 42.189 ns]
                        change: [-3.8185% -2.2687% -0.7541%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) high mild
  10 (10.00%) high severe
async-pool/no-hook/wasm-to-host - nop - unchecked
                        time:   [10.353 ns 10.380 ns 10.414 ns]
                        change: [+82.042% +84.591% +87.205%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.123 ns 11.168 ns 11.228 ns]
                        change: [-30.813% -29.285% -27.874%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  11 (11.00%) high mild
  1 (1.00%) high severe
async-pool/no-hook/wasm-to-host - nop - async-typed
                        time:   [27.442 ns 27.528 ns 27.638 ns]
                        change: [+7.5215% +9.9795% +12.266%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe
async-pool/no-hook/wasm-to-host - nop-params-and-results - async-typed
                        time:   [29.014 ns 29.148 ns 29.312 ns]
                        change: [+2.0227% +3.4722% +4.9047%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

async-pool/hook-sync/wasm-to-host - nop - typed
                        time:   [6.7916 ns 6.8116 ns 6.8325 ns]
                        change: [+20.937% +22.050% +23.281%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  5 (5.00%) high mild
  6 (6.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - typed
                        time:   [15.917 ns 15.975 ns 16.051 ns]
                        change: [+4.6404% +6.4217% +8.3075%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  5 (5.00%) high mild
  11 (11.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - untyped
                        time:   [21.558 ns 21.612 ns 21.679 ns]
                        change: [+8.1158% +9.1409% +10.217%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - untyped
                        time:   [42.475 ns 42.614 ns 42.775 ns]
                        change: [-6.3613% -4.4709% -2.7647%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 18 outliers among 100 measurements (18.00%)
  3 (3.00%) high mild
  15 (15.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - unchecked
                        time:   [11.150 ns 11.195 ns 11.247 ns]
                        change: [+74.424% +77.056% +79.811%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) high mild
  11 (11.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - unchecked
                        time:   [11.639 ns 11.695 ns 11.760 ns]
                        change: [-30.212% -29.023% -27.954%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  7 (7.00%) high mild
  8 (8.00%) high severe
async-pool/hook-sync/wasm-to-host - nop - async-typed
                        time:   [27.480 ns 27.712 ns 27.984 ns]
                        change: [+2.9764% +6.5061% +9.8914%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
async-pool/hook-sync/wasm-to-host - nop-params-and-results - async-typed
                        time:   [29.218 ns 29.380 ns 29.600 ns]
                        change: [+5.2283% +7.7247% +10.822%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) high mild
  14 (14.00%) high severe

@fitzgen
Copy link
Member Author

fitzgen commented Jul 11, 2022

TODO

  • aarch64 support (I think I can implement this myself, but cc @akirilov-arm anyways)
  • s390x support (I think I will need some help with this, cc @uweigand)
  • leaf functions that can trap need to preserve frame pointers to avoid missing frames in stack traces
  • use the macros from wasmtime-fiber for defining the asm trampolines
  • write a fuzz target that makes sure we are visiting all stack Wasm frames and aren't getting any host frames in our traces

@github-actions github-actions bot added wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:ref-types Issues related to reference types and GC in Wasmtime labels Jul 11, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @fitzgen, @peterhuene

This issue or pull request has been labeled: "wasmtime:api", "wasmtime:ref-types"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: wasmtime:ref-types
  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

crates/environ/src/vmoffsets.rs Outdated Show resolved Hide resolved
crates/environ/src/vmoffsets/vm_host_func_offsets.rs Outdated Show resolved Hide resolved
crates/wasmtime/src/func.rs Outdated Show resolved Hide resolved
crates/runtime/src/traphandlers/unix.rs Outdated Show resolved Hide resolved
crates/runtime/src/traphandlers.rs Outdated Show resolved Hide resolved
@fitzgen fitzgen force-pushed the fast-stack-walking branch from 0f98b46 to d18b329 Compare July 12, 2022 00:23
@github-actions github-actions bot added the fuzzing Issues related to our fuzzing infrastructure label Jul 12, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @fitzgen

This issue or pull request has been labeled: "fuzzing"

Thus the following users have been cc'd because of the following labels:

  • fitzgen: fuzzing

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@uweigand
Copy link
Member

I still need to have a closer look. But just one comment on this:

Within a sequence of Wasm-to-Wasm calls, we can use
frame pointers (which Cranelift preserves) to find the next older Wasm frame on
the stack, and we keep doing this until we reach the entry stack pointer,
meaning that the next older frame will be a host frame.

Actually, the Cranelift s390x back end does not maintain frame pointers at all in Wasm code - frame pointers are really only necessary in our ABI for frames that use dynamic stack allocation, and that never happens in Wasm, so right now we don't ever use frame pointers ... I guess it would be possible to add (the equivalent of) frame pointers, but that would incur a runtime overhead for all Wasm code.

@fitzgen fitzgen force-pushed the fast-stack-walking branch 2 times, most recently from 6f53aaf to 40d383c Compare July 12, 2022 18:46
@fitzgen
Copy link
Member Author

fitzgen commented Jul 13, 2022

Thanks for responding @uweigand.

Actually, the Cranelift s390x back end does not maintain frame pointers at all in Wasm code - frame pointers are really only necessary in our ABI for frames that use dynamic stack allocation, and that never happens in Wasm, so right now we don't ever use frame pointers ... I guess it would be possible to add (the equivalent of) frame pointers, but that would incur a runtime overhead for all Wasm code.

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

So I guess our options are to

  1. preserve frame pointers in s390x (or equivalent; unsure what you meant by "or equivalent of" since I'm not very familiar with s390x nor its calling conventions),
  2. continue to depend on the backtrace crate and the system unwinder on s390x, or
  3. write a custom stack walker for s390x that is not based on frame pointers (and is instead based on... DWARF/eh_frame? something else?)

I think (1) would be simplest, although it will incur some slight overhead to Wasm execution, but we pay this overhead on other architectures and it is generally considered acceptable. We could definitely make (2) work, but it wouldn't be as tidy as I'd like, since we would need to add more cfgs in more places than just choosing which bits of global assembly or frame pointer recovery code we'd like to compile with. Option (3) is the most nebulous to me, and if we are talking about DWARF/eh_frame then it is probably the most complicated as well. I'd like to avoid this option unless you have other implementation ideas to cut down on complexity here.

So ultimately, I'm personally leaning towards (1) but could be convinced of (2) and I'd prefer not to do (3) unless I'm missing something.

@fitzgen fitzgen force-pushed the fast-stack-walking branch 2 times, most recently from 9abc309 to d917643 Compare July 13, 2022 23:14
@uweigand
Copy link
Member

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

To be honest, I never understood that either. In my mind, a frame pointer, i.e. a reserved register separate from the stack pointer, which is being using by generated code to access local variables, spill slots, and other objects in a function's stack frame, is only ever necessary if those objects cannot be accessed relative to the stack pointer instead. This is normally only the case if the function performs dynamic stack space allocation (the C alloca or some equivalent), because then you no longer know at compile time what the offset of your stack frame objects relative to the stack pointer is.

Given that there is no dynamic stack allocation in Cranelift, I do not think maintaining an actual frame pointer register is ever needed for any architecture, so I do not understand why the Intel and Arm back-ends are currently doing so. I would expect that as optimization of Cranelift proceeds, we would want to re-examine that choice - I believe unnecessarily maintaining a frame pointer does cost a noticable performance overhead. Note that this overhead is not (or not only) due to the extra saves and restores in prolog and epilog, but primarily due to the fact that we've blocked one whole register that we could have used freely for register allocation otherwise.

I do understand that, once we do maintain a frame pointer register, there are ancillary benefits for stack walking, since the frame pointer must be callee-saved, and the saved copies of that register on the stack happen to form a nice chain that can be easily walked to perform a simple backtrace. However, as soon as we enforce maintaining a frame pointer just for that purpose, this is not actually "free", but does incur the runtime overhead described above, which would otherwise be unnecessary. I think there may be alternative solutions for the debugging problem without that overhead ...

So I guess our options are to

1. preserve frame pointers in s390x (or equivalent; unsure what you meant by "or equivalent of" since I'm not very familiar with s390x nor its calling conventions),

2. continue to depend on the `backtrace` crate and the system unwinder on s390x, or

3. write a custom stack walker for s390x that is not based on frame pointers (and is instead based on... DWARF/eh_frame? something else?)

I think (1) would be simplest, although it will incur some slight overhead to Wasm execution, but we pay this overhead on other architectures and it is generally considered acceptable. We could definitely make (2) work, but it wouldn't be as tidy as I'd like, since we would need to add more cfgs in more places than just choosing which bits of global assembly or frame pointer recovery code we'd like to compile with. Option (3) is the most nebulous to me, and if we are talking about DWARF/eh_frame then it is probably the most complicated as well. I'd like to avoid this option unless you have other implementation ideas to cut down on complexity here.

So ultimately, I'm personally leaning towards (1) but could be convinced of (2) and I'd prefer not to do (3) unless I'm missing something.

As to the "equivalent" I mention under (1), sorry for being vague :-) What I meant is in the s390x ABI we actually have an optional feature to maintain an explicit stack backchain, which forms a linked list of frames on the stack, just like the saved copies of a frame pointer register would, except we're not actually maintaining a frame pointer register. Basically, the backchain link is formed by prolog code storing a pointer to the caller's stack frame into the lowest word of the callee's stack frame - and that's it, there is no frame pointer register reserved throughout the function.

Now, we decided more than 20 years ago to even switch this feature off by default in the Linux toolchains, because with DWARF CFI it is no longer needed for debugging, and even that simple store has measurable overhead. (It is still used in special circumstances, e.g. in Linux kernel code.) But I could certainly enable that feature for Cranelift in the s390x back end.

That said, this would still leave us with the problem of frameless leaf functions, just like in the Arm case. I could also force any function that contains a trap to always allocate a frame anyway, just so we can store the backchain and the return address. But that would be even further run-time overhead ...

There may be a more elegant solution to the frameless leaf function problem, also for Arm. Once we know we are currently in a frameless leaf function, this is actually simple to handle: to unwind from a leaf function, you only need to notice that the caller's SP equals the current SP, and the caller's PC equals the value currently in the link register. (Assuming you have one; on Intel you'd get the return address from the lowest word of the stack instead.)

The only difficulty is to know whether we are in a frameless leaf function or not. But we should (be able to) know that - after all, we did compile that function ourselves earlier on :-) So, assuming we can associate the current PC with a data structure identifying the function (which I believe we must do anyway to get at the function name - which is the whole point of generating stack backtraces in the first place), the only thing we'd need to do is to have the compiler set a flag in that structure that this function is frameless.

In fact, spinning that thought experiment a bit further: if we'd store not just a "frameless" flag, but the actual size of the stack frame, we should be able to use that information to unwind from non-leaf functions as well without requiring any explicit backchain or frame pointer: just compute the caller's SP as the callee's SP plus the callee's (known) stack frame size. Note that this makes explicit use of the fact that frame sizes in Cranelift are constants known at compile time, because there is no dynamic stack allocation. (Glossing over some details here; this assumes that the stack pointer doesn't change at all during function execution, at least not at the "relevant" places - traps and function call sites. This is true on s390x and -as far as I know- on Arm, but not on Intel. There'd be extra information required to handle Intel correctly.)

@akirilov-arm
Copy link
Contributor

... I do not think maintaining an actual frame pointer register is ever needed for any architecture, so I do not understand why the Intel and Arm back-ends are currently doing so.

This is easy to answer in the AArch64 case - because the ABI (and its implementation on Linux in particular) pretty much requires it.

@uweigand
Copy link
Member

... I do not think maintaining an actual frame pointer register is ever needed for any architecture, so I do not understand why the Intel and Arm back-ends are currently doing so.

This is easy to answer in the AArch64 case - because the ABI (and its implementation on Linux in particular) pretty much requires it.

Huh, interesting. The ABI text still seems to leave it open to the "platform", up to and including the choice:

It may elect not to maintain a frame chain and to use the frame pointer register as a general-purpose callee-saved register.

Do you know why Linux specifically requires the frame pointer?

Anyway, I agree if that's the platform ABI standard, then we need to follow it in Cranelift.

@alexcrichton
Copy link
Member

Personally I see two routes for s390x w.r.t. this PR:

  • Match what x86_64 and aarch64 are doing
  • Temporarily drop s390x from CI until @uweigand you've got time to implement an alternative stack-walking scheme

I don't think we can keep around the backtrace-based traces for maintenance reasons. I feel that the code necessary to keep that working is so different than the code for the frame-pointer-based approach that it's too much to take on for something that most of us aren't running locally to test.

Otherwise this PR will require some more work regardless to handle things like leaf frames on AArch64. I think it would be worth evaluating at that point whether the frame pointer approach makes sense. If some metadata is plumbed through for leaf frames it might be "easy enough" to simply plumb through stack frame sizes as well making frame pointers unnecessary. If it's not as easy as originally thought though the frame pointers may still be needed.

@uweigand
Copy link
Member

To be clear, I certainly agree it wouldn't make sense to keep the DWARF backtrace in place just for s390x. As I said, I'd be happy to implement the backchain based mechanism on s390x described above, that's easy enough. That would still leave the leaf function problem to be solved, just like on aarch64. Just wondering whether this is best approach overall ...

@fitzgen
Copy link
Member Author

fitzgen commented Jul 14, 2022

In fact, spinning that thought experiment a bit further: if we'd store not just a "frameless" flag, but the actual size of the stack frame, we should be able to use that information to unwind from non-leaf functions as well without requiring any explicit backchain or frame pointer: just compute the caller's SP as the callee's SP plus the callee's (known) stack frame size. Note that this makes explicit use of the fact that frame sizes in Cranelift are constants known at compile time, because there is no dynamic stack allocation. (Glossing over some details here; this assumes that the stack pointer doesn't change at all during function execution, at least not at the "relevant" places - traps and function call sites. This is true on s390x and -as far as I know- on Arm, but not on Intel. There'd be extra information required to handle Intel correctly.)

When discussing how to speed up stack walking with Chris, I originally proposed a side table approach, and he pushed back on that (and I ultimately agreed with him) because frame pointers are so much simpler and easier to implement correctly. This is an important property for anything that is critical for safety1.

Frame pointers are a local property of compiling a single function that Cranelift can make sure are preserved and it is hard to get wrong.

Side tables on the other hand are more of a global property of all of the module's compilation and we also have to make sure that the runtime is interpreting and doing lookup in the table correctly. It isn't hard to have bugs with this kind of global integration between two far away components, as we've seen with multiple stack map-related CVEs.

This is why I've been pursuing frame pointers rather than side table metadata, even though I am aware that frame pointers are not strictly necessary (outside of ABI constraints) and impose some overhead (which we already happen to be paying on our most-used backends).

Footnotes

  1. We walk the stack to identify on-stack GC roots, for example, and failing to find a stack frame can lead to the collector reclaiming something it shouldn't, resulting in use-after-free bugs. In fact, we've had problems with system unwinders before where they fail to fully walk the stack and if we fully trusted the system unwinder we would have exactly these use-after-free bugs! This was the motivation for the "stack canary" infrastructure that Wasmtime has in main now (and is obsoleted in this PR).

@fitzgen
Copy link
Member Author

fitzgen commented Jul 14, 2022

My current hope for leaf functions is that we won't need a side table for identifying whether we are in a leaf function at the start of the backtrace, and instead there is an easy way to identify whether a leaf function has trapping instructions and maintain frame pointers in that case (as discussed at the last Cranelift meeting).

If that doesn't work out tho, I may need to maintain an "is this PC in a leaf function?" side table, and at that point it may be worth re-evaluating whether frame poiinters are the right approach, or if we can use a side table of static frame sizes, as you described @uweigand.

@uweigand
Copy link
Member

If that doesn't work out tho, I may need to maintain an "is this PC in a leaf function?" side table

As an alternate approach for this part, I think this question will only ever get asked for PCs that are trap locations, and we already have a side table for trap locations (holding the trap code), so this could just be an extra bit in there.

@fitzgen fitzgen force-pushed the fast-stack-walking branch from d917643 to f2a3748 Compare July 14, 2022 18:21
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:x64 Issues related to x64 codegen labels Jul 14, 2022
@bjorn3
Copy link
Contributor

bjorn3 commented Jul 14, 2022

Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64).

To be honest, I never understood that either. In my mind, a frame pointer, i.e. a reserved register separate from the stack pointer, which is being using by generated code to access local variables, spill slots, and other objects in a function's stack frame, is only ever necessary if those objects cannot be accessed relative to the stack pointer instead. This is normally only the case if the function performs dynamic stack space allocation (the C alloca or some equivalent), because then you no longer know at compile time what the offset of your stack frame objects relative to the stack pointer is.

Frame pointers are often used for sampling profilers to have lightning fast stack unwinding as necessary for accurate profiling. For example on macOS frame pointers are mandated to have Instruments work I believe. They don't even support DWARF unwinding AFAIK.

@fitzgen fitzgen force-pushed the fast-stack-walking branch from f2a3748 to 24d2af4 Compare July 14, 2022 20:17
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
crates/runtime/src/trampolines/aarch64.rs Outdated Show resolved Hide resolved
fitzgen added 3 commits July 27, 2022 16:01
…king imported host functions

Also clean up assertions surrounding our saved entry/exit registers.
Regardless if we are doing wasm-to-host or host-to-wasm
@fitzgen fitzgen force-pushed the fast-stack-walking branch from 676ff0f to 0cd3d32 Compare July 27, 2022 23:02
@fitzgen fitzgen merged commit 46782b1 into bytecodealliance:main Jul 28, 2022
@fitzgen fitzgen deleted the fast-stack-walking branch July 28, 2022 22:46
@alexcrichton
Copy link
Member

🎊

fitzgen added a commit to fitzgen/wasmtime that referenced this pull request Aug 8, 2022
This method configures whether native unwind information (e.g. `.eh_frame` on
Linux) is generated or not.

This helps integrate with third-party stack capturing tools, such as the system
unwinder or the `backtrace` crate. It does not affect whether Wasmtime can
capture stack traces in Wasm code that it is running or not.

Unwind info is always enabled on Windows, since the Windows ABI requires it.

This configuration option defaults to true.

Additionally, we deprecate `Config::wasm_backtrace` since we can always cheaply
capture stack traces ever since
bytecodealliance#4431.

Fixes bytecodealliance#4554
fitzgen added a commit to fitzgen/wasmtime that referenced this pull request Aug 8, 2022
This method configures whether native unwind information (e.g. `.eh_frame` on
Linux) is generated or not.

This helps integrate with third-party stack capturing tools, such as the system
unwinder or the `backtrace` crate. It does not affect whether Wasmtime can
capture stack traces in Wasm code that it is running or not.

Unwind info is always enabled on Windows, since the Windows ABI requires it.

This configuration option defaults to true.

Additionally, we deprecate `Config::wasm_backtrace` since we can always cheaply
capture stack traces ever since
bytecodealliance#4431.

Fixes bytecodealliance#4554
fitzgen added a commit that referenced this pull request Aug 8, 2022
This method configures whether native unwind information (e.g. `.eh_frame` on
Linux) is generated or not.

This helps integrate with third-party stack capturing tools, such as the system
unwinder or the `backtrace` crate. It does not affect whether Wasmtime can
capture stack traces in Wasm code that it is running or not.

Unwind info is always enabled on Windows, since the Windows ABI requires it.

This configuration option defaults to true.

Additionally, we deprecate `Config::wasm_backtrace` since we can always cheaply
capture stack traces ever since
#4431.

Fixes #4554
bnjbvr added a commit to bnjbvr/wasmtime that referenced this pull request Aug 11, 2022
alexcrichton pushed a commit that referenced this pull request Aug 11, 2022
sunfishcode added a commit to sunfishcode/wasmtime that referenced this pull request Jan 11, 2023
Update the documentation for `Caller::get_export` to clarify that it's
not expected to be removed in the future. Components do offer an
alternative to `Caller::get_export`, so add a brief note mentioning
that.

Also, as of bytecodealliance#4431 `get_export` now works for all exports, not just
memories and functions.
abrown pushed a commit that referenced this pull request Jan 11, 2023
Update the documentation for `Caller::get_export` to clarify that it's
not expected to be removed in the future. Components do offer an
alternative to `Caller::get_export`, so add a brief note mentioning
that.

Also, as of #4431 `get_export` now works for all exports, not just
memories and functions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:area:x64 Issues related to x64 codegen cranelift:meta Everything related to the meta-language. cranelift Issues related to the Cranelift code generator fuzzing Issues related to our fuzzing infrastructure wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:config Issues related to the configuration of Wasmtime wasmtime:ref-types Issues related to reference types and GC in Wasmtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants