wasmtime: Implement fast Wasm stack walking #4431

Conversation
I still need to have a closer look. But just one comment on this:
Actually, the Cranelift s390x back end does not maintain frame pointers at all in Wasm code - frame pointers are really only necessary in our ABI for frames that use dynamic stack allocation, and that never happens in Wasm, so right now we don't ever use frame pointers ... I guess it would be possible to add (the equivalent of) frame pointers, but that would incur a runtime overhead for all Wasm code.
Thanks for responding @uweigand.
Hm, I didn't realize this, since our other backends preserve frame pointers (other than leaf functions on aarch64). So I guess our options are to:

1. maintain frame pointers (or the equivalent) on s390x as well,
2. plumb through extra side metadata so the runtime can unwind s390x frames without frame pointers, or
3. keep using the system unwinder on s390x.

I think (1) would be simplest, although it will incur some slight overhead to Wasm execution; we pay this overhead on other architectures and it is generally considered acceptable. We could definitely make (2) work, but it wouldn't be as tidy as I'd like, since we would need to add more metadata plumbing. So ultimately, I'm personally leaning towards (1) but could be convinced of (2), and I'd prefer not to do (3) unless I'm missing something.
To be honest, I never understood that either. In my mind, a frame pointer, i.e. a reserved register separate from the stack pointer, which is being used by generated code to access local variables, spill slots, and other objects in a function's stack frame, is only ever necessary if those objects cannot be accessed relative to the stack pointer instead. This is normally only the case if the function performs dynamic stack space allocation (the C `alloca` builtin).

Given that there is no dynamic stack allocation in Cranelift, I do not think maintaining an actual frame pointer register is ever needed for any architecture, so I do not understand why the Intel and Arm back-ends are currently doing so. I would expect that as optimization of Cranelift proceeds, we would want to re-examine that choice - I believe unnecessarily maintaining a frame pointer does cost a noticeable performance overhead. Note that this overhead is not (or not only) due to the extra saves and restores in prolog and epilog, but primarily due to the fact that we've blocked one whole register that we could have used freely for register allocation otherwise.

I do understand that, once we do maintain a frame pointer register, there are ancillary benefits for stack walking, since the frame pointer must be callee-saved, and the saved copies of that register on the stack happen to form a nice chain that can be easily walked to perform a simple backtrace. However, as soon as we enforce maintaining a frame pointer just for that purpose, this is not actually "free", but does incur the runtime overhead described above, which would otherwise be unnecessary. I think there may be alternative solutions for the debugging problem without that overhead ...
As to the "equivalent" I mention under (1), sorry for being vague :-) What I meant is that in the s390x ABI we actually have an optional feature to maintain an explicit stack backchain, which forms a linked list of frames on the stack, just like the saved copies of a frame pointer register would, except we're not actually maintaining a frame pointer register. Basically, the backchain link is formed by prolog code storing a pointer to the caller's stack frame into the lowest word of the callee's stack frame - and that's it, there is no frame pointer register reserved throughout the function.

Now, we decided more than 20 years ago to switch this feature off by default in the Linux toolchains, because with DWARF CFI it is no longer needed for debugging, and even that simple store has measurable overhead. (It is still used in special circumstances, e.g. in Linux kernel code.) But I could certainly enable that feature for Cranelift in the s390x back end.

That said, this would still leave us with the problem of frameless leaf functions, just like in the Arm case. I could also force any function that contains a trap to always allocate a frame anyway, just so we can store the backchain and the return address. But that would be even further run-time overhead ...

There may be a more elegant solution to the frameless leaf function problem, also for Arm. Once we know we are currently in a frameless leaf function, this is actually simple to handle: to unwind from a leaf function, you only need to notice that the caller's SP equals the current SP, and the caller's PC equals the value currently in the link register. (Assuming you have one; on Intel you'd get the return address from the lowest word of the stack instead.) The only difficulty is to know whether we are in a frameless leaf function or not. But we should (be able to) know that - after all, we did compile that function ourselves earlier on :-) So, assuming we can associate the current PC with a data structure identifying the function (which I believe we must do anyway to get at the function name - which is the whole point of generating stack backtraces in the first place), the only thing we'd need to do is to have the compiler set a flag in that structure saying that this function is frameless.

In fact, spinning that thought experiment a bit further: if we'd store not just a "frameless" flag, but the actual size of the stack frame, we should be able to use that information to unwind from non-leaf functions as well, without requiring any explicit backchain or frame pointer: just compute the caller's SP as the callee's SP plus the callee's (known) stack frame size. Note that this makes explicit use of the fact that frame sizes in Cranelift are constants known at compile time, because there is no dynamic stack allocation. (Glossing over some details here; this assumes that the stack pointer doesn't change at all during function execution, at least not at the "relevant" places - traps and function call sites. This is true on s390x and - as far as I know - on Arm, but not on Intel. There'd be extra information required to handle Intel correctly.)
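To make the thought experiment concrete, here is a minimal Rust sketch of unwinding via per-function static frame sizes. All names (`FuncMeta`, `lookup_func_meta`, and so on) are invented for illustration; this is not Wasmtime's or Cranelift's actual API.

```rust
/// Per-function metadata recorded at compile time (illustrative only).
struct FuncMeta {
    /// Total frame size in bytes: a compile-time constant in Cranelift,
    /// since there is no dynamic stack allocation.
    frame_size: usize,
    /// Whether this function never allocates a frame (a frameless leaf).
    is_frameless_leaf: bool,
}

/// Map a PC to the function containing it. Some such mapping must exist
/// anyway in order to recover function names for the backtrace.
fn lookup_func_meta(_pc: usize) -> Option<&'static FuncMeta> {
    unimplemented!("illustrative only")
}

/// Compute the caller's stack pointer from the callee's. Assumes the
/// stack pointer does not move between the prologue and the trap or call
/// site, which (per the discussion above) holds on s390x and Arm, but
/// not on Intel, where extra information would be needed.
fn caller_sp(callee_sp: usize, meta: &FuncMeta) -> usize {
    if meta.is_frameless_leaf {
        // A frameless leaf never adjusted SP: the caller's SP equals
        // ours, and the return address is still in the link register.
        callee_sp
    } else {
        callee_sp + meta.frame_size
    }
}
```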
This is easy to answer in the AArch64 case - because the ABI (and its implementation on Linux in particular) pretty much requires it.
Huh, interesting. The ABI text still seems to leave it open to the "platform", up to and including the choice of not maintaining frame pointers at all.

Do you know why Linux specifically requires the frame pointer? Anyway, I agree that if that's the platform ABI standard, then we need to follow it in Cranelift.
Personally I see two routes for s390x w.r.t. this PR: implement the backchain-based mechanism described above, or plumb through side metadata (such as static frame sizes) instead. I don't think we can keep around the DWARF-based backtrace path just for s390x.

Otherwise this PR will require some more work regardless to handle things like leaf frames on AArch64. I think it would be worth evaluating at that point whether the frame pointer approach makes sense. If some metadata is plumbed through for leaf frames it might be "easy enough" to simply plumb through stack frame sizes as well, making frame pointers unnecessary. If it's not as easy as originally thought, though, the frame pointers may still be needed.
To be clear, I certainly agree it wouldn't make sense to keep the DWARF backtrace in place just for s390x. As I said, I'd be happy to implement the backchain-based mechanism on s390x described above; that's easy enough. That would still leave the leaf function problem to be solved, just like on aarch64. Just wondering whether this is the best approach overall ...
When discussing how to speed up stack walking with Chris, I originally proposed a side table approach, and he pushed back on that (and I ultimately agreed with him) because frame pointers are so much simpler and easier to implement correctly. This is an important property for anything that is critical for safety.

Frame pointers are a local property of compiling a single function that Cranelift can make sure are preserved, and it is hard to get wrong. Side tables, on the other hand, are more of a global property of all of the module's compilation, and we also have to make sure that the runtime is interpreting and doing lookup in the table correctly. It isn't hard to have bugs with this kind of global integration between two far-away components, as we've seen with multiple stack map-related CVEs.

This is why I've been pursuing frame pointers rather than side table metadata, even though I am aware that frame pointers are not strictly necessary (outside of ABI constraints) and impose some overhead (which we already happen to be paying on our most-used backends).
My current hope for leaf functions is that we won't need a side table for identifying whether we are in a leaf function at the start of the backtrace, and instead there is an easy way to identify whether a leaf function has trapping instructions and maintain frame pointers in that case (as discussed at the last Cranelift meeting). If that doesn't work out though, I may need to maintain an "is this PC in a leaf function?" side table, and at that point it may be worth re-evaluating whether frame pointers are the right approach, or if we can use a side table of static frame sizes, as you described @uweigand.
As an alternate approach for this part, I think this question will only ever get asked for PCs that are trap locations, and we already have a side table for trap locations (holding the trap code), so this could just be an extra bit in there. |
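Sketched in Rust, that extra bit in the trap side table might look like the following. The types here are illustrative, not Wasmtime's real trap-table layout.

```rust
#[derive(Clone, Copy)]
enum TrapCode {
    StackOverflow,
    MemoryOutOfBounds,
    IntegerDivisionByZero,
    // ...
}

struct TrapRecord {
    /// Offset of the trapping instruction within the module's code.
    code_offset: u32,
    /// Why this instruction traps (the information already stored today).
    trap_code: TrapCode,
    /// The proposed extra bit: is this PC inside a frameless leaf
    /// function, so the unwinder should start from LR/SP directly?
    in_frameless_leaf: bool,
}

/// Find the trap record for the exact trapping PC, if any. The table is
/// sorted by code offset, so a binary search suffices.
fn lookup_trap(table: &[TrapRecord], offset: u32) -> Option<&TrapRecord> {
    table
        .binary_search_by_key(&offset, |r| r.code_offset)
        .ok()
        .map(|i| &table[i])
}
```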
Frame pointers are often used by sampling profilers to get lightning-fast stack unwinding, which is necessary for accurate profiling. For example, on macOS frame pointers are mandated for Instruments to work, I believe. They don't even support DWARF unwinding AFAIK.
…king imported host functions

Also clean up assertions surrounding our saved entry/exit registers, regardless of whether we are doing wasm-to-host or host-to-wasm.
🎊
This method configures whether native unwind information (e.g. `.eh_frame` on Linux) is generated or not. This helps integrate with third-party stack capturing tools, such as the system unwinder or the `backtrace` crate. It does not affect whether Wasmtime can capture stack traces in Wasm code that it is running or not. Unwind info is always enabled on Windows, since the Windows ABI requires it. This configuration option defaults to true. Additionally, we deprecate `Config::wasm_backtrace` since we can always cheaply capture stack traces ever since #4431. Fixes #4554
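A usage sketch of the knob described above. The commit text does not name the method; `native_unwind_info` is an assumption here.

```rust
use wasmtime::{Config, Engine};

fn main() -> anyhow::Result<()> {
    let mut config = Config::new();
    // Disable native unwind info (e.g. `.eh_frame`) if no third-party
    // stack-capturing tools need it; per the text above, Wasmtime's own
    // Wasm backtraces keep working either way. (Method name assumed.)
    config.native_unwind_info(false);
    let _engine = Engine::new(&config)?;
    Ok(())
}
```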
Thanks to bytecodealliance#4431 and @fitzgen who implemented it!
Update the documentation for `Caller::get_export` to clarify that it's not expected to be removed in the future. Components do offer an alternative to `Caller::get_export`, so add a brief note mentioning that. Also, as of #4431 `get_export` now works for all exports, not just memories and functions.
Why do we want Wasm stack walking to be fast? Because we capture stacks whenever there is a trap, and traps actually happen fairly frequently with short-lived programs and WASI's `exit`.

Previously, we would rely on generating the system unwind info (e.g. `.eh_frame`) and using the system unwinder (via the `backtrace` crate) to walk the full stack and filter out any non-Wasm stack frames. This can, unfortunately, be slow for two primary reasons:
1. The system unwinder is doing `O(all-kinds-of-frames)` work rather than `O(wasm-frames)` work.

2. System unwind info and the system unwinder need to be much more general than a purpose-built stack walker for Wasm needs to be. They have to handle any kind of stack frame that any compiler might emit, whereas our Wasm frames are emitted by Cranelift and always have frame pointers. This translates into implementation complexity and general overhead. There can also be unnecessary-for-our-use-cases global synchronization and locks involved, further slowing down stack walking in the presence of multiple threads trying to capture stacks in parallel.
This commit introduces a purpose-built stack walker for traversing just our Wasm frames. To find all the sequences of Wasm-to-Wasm stack frames, and ignore non-Wasm stack frames, we keep a linked list of `(entry stack pointer, exit frame pointer)` pairs. This linked list is maintained via Wasm-to-host and host-to-Wasm trampolines. Within a sequence of Wasm-to-Wasm calls, we can use frame pointers (which Cranelift preserves) to find the next older Wasm frame on the stack, and we keep doing this until we reach the entry stack pointer, meaning that the next older frame will be a host frame.
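A simplified sketch of that walk, assuming a downward-growing stack and the usual frame-pointer layout (saved frame pointer at `[fp]`, return address at `[fp + 8]`). Names here are illustrative rather than Wasmtime's actual implementation.

```rust
/// One contiguous run of Wasm frames, delimited by the trampolines:
/// `entry_sp` is the stack pointer recorded on host-to-Wasm entry, and
/// `exit_fp` is the frame pointer recorded on Wasm-to-host exit.
struct WasmActivation {
    entry_sp: usize,
    exit_fp: usize,
    /// The next older run of Wasm frames on the stack, if any.
    prev: *const WasmActivation,
}

/// Visit the PC of every Wasm frame, newest to oldest, skipping all
/// host frames in between.
unsafe fn walk_wasm_frames(
    mut activation: *const WasmActivation,
    mut visit_pc: impl FnMut(usize),
) {
    while !activation.is_null() {
        // Start at the newest Wasm frame in this run...
        let mut fp = (*activation).exit_fp;
        // ...and chase saved frame pointers toward older frames until we
        // cross the stack pointer recorded on entry, at which point the
        // next older frame is a host frame.
        while fp < (*activation).entry_sp {
            let return_pc = *((fp + 8) as *const usize);
            visit_pc(return_pc);
            // The saved frame pointer links to the next older frame.
            fp = *(fp as *const usize);
        }
        activation = (*activation).prev;
    }
}
```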
The trampolines need to avoid a couple of stumbling blocks. First, they need to be compiled ahead of time, since we may not have access to a compiler at runtime (e.g. if the `cranelift` feature is disabled) but still want to be able to call functions that have already been compiled and get stack traces for those functions. Usually this means we would compile the appropriate trampolines inside `Module::new` and the compiled module object would hold the trampolines. However, we also need to support calling host functions that are wrapped into `wasmtime::Func`s, and there doesn't exist any ahead-of-time compiled module object to hold the appropriate trampolines.
Therefore, we define one host-to-Wasm trampoline and one Wasm-to-host trampoline
in assembly that work for all Wasm and host function signatures. These
trampolines are careful to only use volatile registers, avoid touching any
register that is an argument in the calling convention ABI, and tail call to the
target callee function. This allows forwarding any set of arguments and any
returns to and from the callee, while also allowing us to maintain our linked
list of Wasm stack and frame pointers before transferring control to the
callee. These trampolines are not used in Wasm-to-Wasm calls, only when crossing
the host-Wasm boundary, so they do not impose overhead on regular calls. (And if
using one trampoline for all host-Wasm boundary crossing ever breaks branch
prediction enough in the CPU to become any kind of bottleneck, we can do fun
things like have multiple copies of the same trampoline and choose a random copy
for each function, sharding the functions across branch predictor entries.)
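Conceptually, the bookkeeping those trampolines perform around the boundary crossing could be written as below; the real trampolines are hand-written assembly precisely so that they can leave every argument register untouched and tail call the callee. Field and function names here are illustrative.

```rust
#[repr(C)]
struct StackWalkRegisters {
    /// Frame pointer of the newest Wasm frame, recorded when Wasm last
    /// called out to the host.
    last_wasm_exit_fp: usize,
    /// Stack pointer recorded when the host last called into Wasm.
    last_wasm_entry_sp: usize,
}

/// What the Wasm-to-host trampoline does before tail calling the host
/// function: record where the Wasm frames end, so a stack walk started
/// from host code knows where to begin.
unsafe fn on_wasm_exit(regs: *mut StackWalkRegisters, current_fp: usize) {
    (*regs).last_wasm_exit_fp = current_fp;
}

/// What the host-to-Wasm trampoline does before jumping into Wasm:
/// record where the Wasm frames will begin, so the walk knows where
/// to stop.
unsafe fn on_wasm_entry(regs: *mut StackWalkRegisters, current_sp: usize) {
    (*regs).last_wasm_entry_sp = current_sp;
}
```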
Finally, this commit also ends the use of a synthetic `Module` and allocating a stubbed out `VMContext` for host functions. Instead, we define a `VMHostFuncContext` with its own magic value, similar to `VMComponentContext`, specifically for host functions.
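The magic-value idea, sketched: each context type begins with a distinct magic word, so runtime code handed an opaque context pointer can assert which kind it received. Constants and types here are illustrative, not Wasmtime's actual definitions.

```rust
const VMCONTEXT_MAGIC: u32 = u32::from_le_bytes(*b"core");
const VMHOSTFUNC_MAGIC: u32 = u32::from_le_bytes(*b"host");

#[repr(C)]
struct VMContextHeader {
    magic: u32,
}

/// Debug-assert that `ctx` really is a host-function context before
/// interpreting the memory behind it as one.
unsafe fn check_host_func_context(ctx: *const VMContextHeader) {
    debug_assert_eq!((*ctx).magic, VMHOSTFUNC_MAGIC);
}
```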
Benchmarks
Traps and Stack Traces
Large improvements to taking stack traces on traps, ranging from shaving off 64%
to 99.95% of the time it used to take.
Calls
There is, however, a small regression in raw Wasm-to-host and host-to-Wasm call performance due to the new trampolines. It seems to be on the order of about 2-10 nanoseconds per call, depending on the benchmark.
I believe this regression is ultimately acceptable because:

1. this overhead will be vastly dominated by whatever work a non-nop callee actually does,

2. we will need these trampolines, or something like them, when implementing the Wasm exceptions proposal to do things like translate Wasm's exceptions into Rust's `Result`s, and

3. the performance improvements to trapping and capturing stack traces are of a much larger magnitude than this call regression.