memfd/madvise-based CoW pooling allocator #3697
I've hooked up your implementation to our benchmarks; here are the results (legend and result charts not reproduced here).

And here's the comparison in raw numeric values (times are in microseconds) for the `call_empty_function` and `dirty_1mb_of_memory` benchmarks (tables not reproduced).
So there's still some way to go performance-wise. (:
Yes, that's me. (:
Can you share the branch and/or code to reproduce the results with this PR? I was able to reproduce somewhat locally where I got:
I think that roughly aligns with what you measured, but I wanted to confirm by peeking at the code if possible.

Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented. I couldn't drill down for sure into what's causing the issue, but my best guess is that this loop is the slow part. That's a one-time initialization which is pretty branch-y and happens once per function in a module, and the modules you're loading are quite large.

I personally think that this PR's allocation strategy is more viable in terms of long-term maintenance, so I'd like to continue to measure this and push on it if we can, but "the numbers don't lie" and your PR has some very impressive numbers! I think we should be working towards that as a goal because it would be awesome to reinstantiate instances that fast.

Personally I think it would be fruitful to diff the performance between these two PRs. Without memfd enabled I was measuring something like 132us for the benchmark above, so I think this PR shows that 100us of that can be shaved off with memfd/CoW, but it looks like there's a remaining ~25us or so left to shave. I believe the function initialization is probably one of those issues, but there may be others lurking. Quantifying the difference and where the remaining ~25us can go would, I think, help figure out the best viable implementation strategy going forward.
Sure. Here's the branch: https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_with_cfallin_memfd_cow

And here's how to run them to get numbers for this PR (essentially the same as described in my PR, just with an extra cargo feature):
If you have any further questions feel free to ask! (I can also make myself available for discussion on Zulip.)
Indeed! One of the reasons why I implemented my PR the way I did is that you have to punch through fewer layers of abstraction, so it's easier to make it faster. (:
OK, so I've spent some time digging into this more as well, and thinking about potential designs. My benchmarking (mainly with the …)

A few thoughts: …
Anyway, thoughts on the above? Sorry for the lengthy braindump; I wanted to try to represent how I'm thinking about this and framing the design space in my head, is all. And @koute I want to repeat again a "thank you" for spawning all of this thinking and exploration!
I just implemented the lazy-anyfunc-initialization scheme I mentioned above: b8bb1d5. It seems to help in my quick-and-dirty local testing; I'm curious what it will do to the benchmarks above, but will save that benchmarking (and benchmarking in general with enough precision to have quotable numbers) for Monday :-)
> This is a good point; I imagine that there should be use cases on both sides of the spectrum. I guess maybe it'd be nice to survey other …

I can only speak for ourselves, but in our main use case it's less of a security issue and more of a "we just don't want consecutive instantiations to interfere with each other by accident". Our WASM module is - in general - trusted, and (simplifying a lot) we run a distributed system where the nodes essentially "check" each other's results, so leaking information locally within the same node is not a critical issue.

Originally we didn't even clear the linear memory at all - we just manually reinitialized the globals and statics on every "fake" instantiation. But that ended up being problematic, since depending on the flags with which the WASM module was compiled, the information about which statics need to be cleared on reinstantiation might not be emitted, so we decided to just start clearing the memory for simplicity as a quick fix, and started working on speeding it up, which resulted in my PR.

As for the new features - in general we're pretty conservative here, and we disable every …
Thank you for tackling this problem!
I updated my …
Thanks for writing all that up @cfallin, and at least my own personal opinion is to primarily err on the side of this PR in the sense of information leakage and spec compliance. I very much want to get to the performance numbers in #3691, but I am not 100% convinced yet that it's necessarily worth the loss in internal structuring (e.g. implementing an "undo" in addition to implementing a fast "redo"). Before making that conclusion, though, I think it's best to get this approach as close to #3691 as we possibly can.
Originally I didn't think this would work, but digging more into your patch it looks like it's doing everything that would be necessary (although it is pretty gnarly). That being said, I believe that the construction of an owned … I personally still think that the biggest win is going to be not creating a …

Another possible idea I might have is that the …

```rust
#[repr(C)]
pub struct VMCallerCheckedAnyfunc {
    pub func_ptr: NonNull<VMFunctionBody>,
    pub type_index: VMSharedSignatureIndex,
    pub vmctx: *mut VMContext,
}
```

but we could probably change this to:

```rust
#[repr(C)]
pub struct VMCallerCheckedAnyfunc {
    pub info: *mut VMCallerCheckedAnyfuncStaticInfo,
    pub vmctx: *mut VMContext,
}

#[repr(C)]
pub struct VMCallerCheckedAnyfuncStaticInfo {
    pub func_ptr: NonNull<VMFunctionBody>,
    pub type_index: VMSharedSignatureIndex,
}
```

Here …
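To make the size difference concrete, here's a small, self-contained sketch (using stand-in types rather than the real `wasmtime-runtime` definitions) that just measures the two layouts on a 64-bit target; the numbers, 24 vs. 16 bytes per entry, are the point:

```rust
use std::mem::size_of;
use std::ptr::NonNull;

// Stand-ins for the real wasmtime-runtime types, just to measure layout.
struct VMFunctionBody(u8);
struct VMContext(u8);
struct VMSharedSignatureIndex(u32);

// Current layout: three fields per entry.
#[repr(C)]
struct AnyfuncToday {
    func_ptr: NonNull<VMFunctionBody>,
    type_index: VMSharedSignatureIndex,
    vmctx: *mut VMContext,
}

// Proposed layout: point at shared, per-module static info instead.
#[repr(C)]
struct AnyfuncStaticInfo {
    func_ptr: NonNull<VMFunctionBody>,
    type_index: VMSharedSignatureIndex,
}

#[repr(C)]
struct AnyfuncProposed {
    info: *mut AnyfuncStaticInfo,
    vmctx: *mut VMContext,
}

fn main() {
    // On a 64-bit target: 24 bytes today (8 + 4 + padding + 8) vs 16 proposed.
    println!("today:    {} bytes per entry", size_of::<AnyfuncToday>());
    println!("proposed: {} bytes per entry", size_of::<AnyfuncProposed>());
}
```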
@alexcrichton I may be misunderstanding something about the internal workings but my intent with the lazy initialization was to actually subsume this; i.e. with lazy init we should at least only construct the anyfuncs that are possibly exported, and ideally even fewer than that. (So in other words I think we should already be getting that benefit with the patch.) Or are you imagining something different here?
I do like this, but it involves quite a lot more changes to generated code, so I'd like to see if we can avoid doing it if we can. I think the most promising approach (what I'm poking at right now) is to share more of the init work wrt the …

(I've been testing by instantiating …
@cfallin oh that's a good point; I was mostly thinking of other ways to get the speedup without laziness, now that I'm thinking about it, due to the complexity here. Otherwise, though, I think your approach is currently slower for reasons unrelated to laziness; the creation of …
@koute I spent some time trying to reproduce numbers on your branch, and ran into a series of issues:
I'm benchmarking locally with …
I've modified the scheme in this PR to not recompute signature info on every instantiation; now it's showing a decent speedup on SpiderMonkey instantiation, from ~1.69us to ~1.48us on my machine (raw instance creation only, no start function invocation). Without the signature fix I was seeing ~1.6us IIRC. I'm redoing some profiling now to see what else might be done...
@koute I've now improved perf a bit more -- in my local tests (raw instantiate, no start function, …
I'm curious if the factor of ~2 will translate into your benchmark as well, and/or what constant factors are left. (I'd still like to somehow be able to get your benchmark to run locally too...) The trick in my latest changes, beyond lazy init of anyfuncs, was to represent non-init with a zeroed bitmap in the vmctx, rather than zeroes in the sparse array. Zeroing the latter was slow: whether as per-field writes or even just a memset, the array is too sparse. After the change it doesn't seem to show up in my profiles any more; the actual table-element init (building anyfuncs that are actually referenced) is the bulk of the instance init time.
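To illustrate the idea (this is an invented, self-contained sketch, not the actual vmctx layout or wasmtime code; a production version would also need to drop initialized entries), the pattern is to zero only a dense bitmap at instantiation time, leave the large sparse array logically uninitialized, and build each entry on first use:

```rust
use std::mem::MaybeUninit;

/// Sketch of "a zeroed bitmap guards a sparse, never-zeroed array":
/// instantiation only zeroes 1 bit per entry instead of the whole array.
struct LazyTable<T> {
    init_bits: Vec<u64>,          // zeroed up front; cheap because it's dense
    entries: Vec<MaybeUninit<T>>, // never zeroed; validity tracked by the bitmap
}

impl<T> LazyTable<T> {
    fn new(len: usize) -> Self {
        let mut entries: Vec<MaybeUninit<T>> = Vec::with_capacity(len);
        // SAFETY: `MaybeUninit<T>` is allowed to be left uninitialized.
        unsafe { entries.set_len(len) };
        LazyTable { init_bits: vec![0u64; (len + 63) / 64], entries }
    }

    /// Return the entry at `idx`, constructing it on first use.
    fn get_or_init(&mut self, idx: usize, build: impl FnOnce() -> T) -> &T {
        let (word, bit) = (idx / 64, idx % 64);
        if self.init_bits[word] & (1 << bit) == 0 {
            self.entries[idx] = MaybeUninit::new(build());
            self.init_bits[word] |= 1 << bit;
        }
        // SAFETY: the bit is set, so this slot was written (now or earlier).
        unsafe { self.entries[idx].assume_init_ref() }
    }
}

fn main() {
    let mut anyfuncs = LazyTable::new(100_000);
    // Only entries that are actually referenced ever get built.
    println!("{}", anyfuncs.get_or_init(42, || "anyfunc for index 42".to_string()));
}
```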
@cfallin Sorry about the non-working branch! I've updated the code, but I haven't updated the …

I've pulled in your most recent changes and updated the branches; here are the numbers:
Looks like they're mostly... unchanged again?
@koute -- thanks! @alexcrichton and I tracked down what we think is the delta in runtime perf; it led back to a comment you left at your use of …
@koute: I'm unfortunately still not able to run your benchmarks: I'm reaching the same "Incompatible allocation strategy" error I quoted above. I'm on the latest commit (…).

In any case, I suspect that this PR is much closer now -- I'll wait for your confirmation, or not, on this to be sure; if we know we can get close with the "from-scratch" approach, or otherwise if we can quantify what the remaining delta is, then this should help us decide what to do next.
@cfallin: This might be a silly question, but... are you using the right branch and calling the benchmarks exactly as described? (: There's the branch from my original PR (which is unmodified, and won't work), and there's the branch for this PR (link is here: #3697 (comment)). You also have to …

Anyway, I reran the benchmarks again, and here are the results:
There's an improvement!
Oh, I definitely missed that bit; I was running … However, on a clean checkout of …
If I should be running a different command, or from a different commit base, please do let me know!
It's good to see confirmation here, thanks! So I think at this point we can conclude something like: when the instance runs for long enough, the performance of this PR converges to that of your PR, because the underlying memory mapping is the same (anonymous mmap for zeroes, CoW of memfd otherwise). There is still an instantiation-time penalty, and this still mostly has to do with anyfunc initialization and initialization of the table elements that refer to those anyfunc structs.

One thing that @alexcrichton and I noticed when looking at your benchmark in particular today is that it does not appear to contain any …

Anyway, at this point I think it's fair to say that we've closed part of the gap (between stock wasmtime and your instance-reuse PR), know what the remaining gap can be attributed to, and this PR's approach might be slightly easier to maintain (as per @alexcrichton above); so it comes down to a choice between squeezing every last microsecond with a "trusted module" (your case) and taking a still-significant improvement over today's mainline with the existing security boundaries (this PR). That's a choice that depends on relative priorities; do others want to weigh in more here?

One last thing I might say is that it does seem possible to build a snapshot/rewind-like approach on top of a memfd-as-part-of-normal-instantiation foundation (this PR); basically we would provide an extra hook that does the …
Sorry, the way I quickly hacked different instantiation strategies into the benchmarks is a little janky, so you can't run all of them with a single command. For each instantiation strategy you basically have to run them separately; e.g., to get the numbers for your PR:
To get the numbers for just recreating each instance from scratch with the ondemand strategy without any pooling:
And to get the numbers for my PR:
Well, yes, but the whole point here is to reduce the overhead of starting fresh, relatively short-lived instances, right? (: Even just using the ondemand strategy converges to the same performance if it runs long enough.
Indeed. Do you want me to add some extra benchmarks for that? Hmm... also, couldn't the …
Since we're talking about the API, I just want to chime in here that from our perspective as …

Consider the scenario where someone is currently using the ondemand instance allocator and wants to switch to a faster approach. With the approach from my PR it's basically this (I'm obviously simplifying, as this is not the actual patch from our code, but that's essentially what it took):
It's fast, gives the user control, and is super simple. There's no need to define any limits on the WASM module itself (so the WASM executor won't suddenly get bricked if the module has more functions/globals/larger memory/etc. than expected), it will still work if we instantiate too many modules at the same time (the extra modules just won't be reused), and it allows the user to tweak how many instances are being cached without reinitializing everything. (And if we wanted to make it even more transparent, the reuse could maybe be integrated into …)

Now consider switching to the pooling allocator. Yes, turning it on is essentially just a single line of code calling …

So I totally understand that my approach might be harder to maintain; however, purely as a user, I think the pooling approach (completely ignoring how it performs) is not very convenient to use. (Unless it would truly be a "set and forget" without having to explicitly set any hard limits.)
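As an illustration of the usage pattern being described (a generic, self-contained sketch; this is not the actual substrate patch nor a wasmtime API), an external reuse cache where a miss simply falls back to fresh instantiation can never fail just because a limit was exceeded:

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Sketch of a bounded external reuse cache: take a cached object if one is
/// available, otherwise build a fresh one; returning past the cap just drops
/// the object instead of erroring.
struct ReusePool<T> {
    cached: Mutex<VecDeque<T>>,
    max_cached: usize,
}

impl<T> ReusePool<T> {
    fn new(max_cached: usize) -> Self {
        ReusePool { cached: Mutex::new(VecDeque::new()), max_cached }
    }

    /// Get an object, reusing a cached one when possible.
    fn get(&self, fresh: impl FnOnce() -> T, reset: impl FnOnce(&mut T)) -> T {
        match self.cached.lock().unwrap().pop_front() {
            Some(mut obj) => {
                reset(&mut obj); // e.g. the cheap madvise-style "rewind"
                obj
            }
            None => fresh(), // extra demand just allocates fresh, never fails
        }
    }

    /// Return an object to the cache; excess objects are simply dropped.
    fn put(&self, obj: T) {
        let mut cached = self.cached.lock().unwrap();
        if cached.len() < self.max_cached {
            cached.push_back(obj);
        }
    }
}

fn main() {
    let pool = ReusePool::new(8);
    let s = pool.get(|| String::with_capacity(1024), |s| s.clear());
    pool.put(s);
}
```

The cap here only bounds how much gets cached for reuse, not how many objects can exist at once, which is the usability property being argued for above.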
My personal take and opinion is that we should follow a sequence of steps that looks like:
I believe we can solve all the problems here using a hybrid of this PR and @koute's; I don't think that this is an either/or choice. In isolation I don't think that anyone would disagree that start-from-scratch instantiation is more robust and easier to reason about. The main reason to implement some sort of snapshot-like approach would be performance, but I'm pretty sure we have a good handle on performance from poking around at the examples in your substrate fork, @koute. This so far has revealed:
I'm pretty confident that we have basically quantified the difference in numbers that you're seeing, @koute, between …

I'd also like to ideally address some of our API-wise concerns, @koute. You've mentioned that for your use case it's important that instantiation always succeeds, and that you don't want to configure the pooling allocator so far. I think much of this can be solved by having a default pooling allocator for memories within engines, like I mentioned above. We can pretty easily remove the need to have limits on the memory pool as well. I think it's worthwhile to acknowledge, though, that the desire to run an unlimited number of instances concurrently isn't a constraint we've seen yet and is so far unique to you. A memory pool would enable the CoW/memfd strategy implemented in this PR to be usable with the on-demand allocation strategy. I think that this would solve the issues around trying to configure the pooling allocator, since you wouldn't be using it and you could allocate as many stores/instances as the system can support. The limits related to memory could also be relaxed in this situation so they don't need to be specified up-front.

The final thing you've mentioned is that it's easy to build pooling allocators externally, so why not do that? The problem with this is that it places a lot of constraints on the API design of the …

Well, that's a bit of a long-winded post; sorry for the wall of text. I think that captures my thoughts on all this well, though, and I'd like to hear what others think about this all too!
Just as an addendum here: @alexcrichton and I discussed the possibility of using the implementation bits here (…

I took a bit of a deeper look now and I think it's probably best as a followup PR -- it involves some more refactoring in the allocator machinery (e.g. the …
Actually, it turned out to be pretty simple to extend this to the on-demand allocator too; I pushed another commit for this, but I'm happy to split it out into a second PR if that makes review easier, @alexcrichton.
```rust
// N.B.: this comes after the `mmap` field above because it must
// be destructed first. It puts a placeholder mapping in place on
// drop, then the `mmap` above completely unmaps the region.
memfd: Option<MemFdSlot>,
```
Unlike the `Static` variant of memory, I think we might be able to skip the storage here entirely (and also make the `instantiate` a little faster below with fewer syscalls), since with a runtime memory all memfd really is is the initial mapping of the CoW image overlaid on top of the underlying anonymous mapping. In that sense I think we can skip most of the logic of `MemFdSlot` and avoid storing redundant data.

Now that being said, I think the implementation here is correct and will work well regardless. I think it'd be fine for a follow-up to clean this up a bit.
I played with this a bit, and it's a good thought! I think though it may be safer to go through the `MemFdSlot` for heap growth specifically, in case we choose to implement some other technique (such as ftruncate on a second memfd) in the future. At least if I'm reading your suggestion right; otherwise we'd use the `MemFdSlot`'s `instantiate` logic to set up the actual CoW mapping, then throw it away (without its dtor running) and rely on the implicit alignment of the `mprotect`-growth strategy between memfd and the usual dynamic memory.

Re: syscalls on instantiate, we could indeed avoid one if we give the `MemFdSlot` a "no fixed mapping" mode, but then it takes responsibility for the pre-guard too, which complicates the pooling implementation somewhat. I'll think a bit more about this though.
One of my worries is that there's a fair bit of tricky state involved in handling linear memories, and having that duplicated across a few locations is, I think, not ideal. The `MemFdSlot` tries to take over a fair bit of the management of the memory, which is also tracked by the runtime `Memory`. We discussed this a bit in person as well, but to reiterate here: I think it might be good to investigate whether some functions below could grow an extra parameter or two instead of storing the state internally, so it could be passed in by the "source of truth".

I don't have a great idea, though, for whether this is possible, so I'll leave it up to your discretion whether it ends up working out or not.
`crates/runtime/Cargo.toml` (outdated):

```diff
@@ -37,6 +37,7 @@ winapi = { version = "0.3.7", features = ["winbase", "memoryapi", "errhandlingap

 [target.'cfg(target_os = "linux")'.dependencies]
 userfaultfd = { version = "0.4.1", optional = true }
+memfd = { version = "0.4.1", optional = true }
```
Given where this PR now is and the trajectory of this feature, I think this is good to have as an optional dependency, but I think it should be enabled by default and probably just be called `memfd` (enabled-by-default probably from the `wasmtime` crate, not here). Especially with this making such a drastic difference for the on-demand allocator, it seems like using `memfd` is a no-brainer.

To make this tweak, as well as improve the portability here (right now if you enable `memfd` on Windows it'd probably just fail to compile), I think a few small changes are all that's needed. To be clear, though, this is icing on the cake and it's totally fine to do this as a follow-up, but I think the steps would be:

- Remove the `memfd-allocator` feature
- Add a `memfd` feature to the `wasmtime` crate that forwards to `wasmtime-runtime/memfd`
- Add `memfd` to the default feature set of the `wasmtime` crate
- Add a `crates/runtime/build.rs` that detects that the target's OS is Linux (by looking at `CARGO_CFG_TARGET_OS`) and that the `memfd` feature is active (by looking at `CARGO_FEATURE_MEMFD`), and if both are present then printing `println!("cargo:rustc-cfg=memfd")` (see the sketch below)
- Change all `cfg(feature = "memfd-allocator")` in the crate to `cfg(memfd)`

With that I think we'll have on-by-default memfd for Linux only, with the opt-in ability to turn it off. Again, though, it's fine to defer this to a future PR.
As I type all this out, though, I also would question whether we even need a feature for this. I suppose one thing that could come up is that if you have a million modules in a process, that's a million file descriptors, which can blow kernel limits, but other than that I'm not sure why anyone would explicitly want to disable memfd.
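For illustration, here's roughly what that `crates/runtime/build.rs` step could look like (a sketch assuming the feature is literally named `memfd`; not necessarily the file as eventually merged):

```rust
// build.rs: emit `--cfg memfd` only when targeting Linux with the `memfd`
// feature enabled, so `cfg(memfd)` can gate the implementation in the crate.
use std::env;

fn main() {
    println!("cargo:rerun-if-changed=build.rs");

    // Cargo sets CARGO_CFG_TARGET_OS for the *target*, not the host.
    let target_os = env::var("CARGO_CFG_TARGET_OS").unwrap_or_default();
    // Cargo sets CARGO_FEATURE_<NAME> when the corresponding feature is active.
    let memfd_feature = env::var("CARGO_FEATURE_MEMFD").is_ok();

    if target_os == "linux" && memfd_feature {
        println!("cargo:rustc-cfg=memfd");
    }
}
```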
I've gone ahead and done all of this; the default build now will use memfd, validated by building the `wasmtime` binary and strace'ing an execution. Waiting for CI to see if moving `memfd` out of the target-only deps into the main deps will work on non-Linux, but I am hoping it will (the crate's got a toplevel `#![cfg(target_os ...)]` that should make it a no-op on other platforms...?).
Sorry, I'm heading out for the day and only recently noticed the addition to the on-demand allocator. I skimmed it and it looks good to me, but I'll need to dig in with a bit more depth tomorrow. Otherwise I've got one instance of what I think is a bug, but beyond that just a bunch of nits which can all be deferred to later.
(This was not a correctness bug, but is an obvious performance bug...)
Thanks for being patient with me! There's a few more minor things below which I think are worth addressing but nothing major. Otherwise I'm pretty confident in the state of this PR and I'm happy to approve here as well.
@koute we haven't heard from you in a bit, but to reiterate: this PR brings all the CoW benefits to the on-demand allocator as well, although the reuse case isn't simply an `madvise` to reset. That being said, for your "empty" function benchmark from before, I'm clocking this PR (re-instantiation in a loop) at around 40us and "just `madvise`", your PR, at around 5ns. A huge portion of the remaining time is table initialization, scheduled to be addressed in #3733 after some further work (which should make table initialization effectively zero-cost). Basically I want to reiterate that we continue to be very interested in solving your use case and accommodating your performance constraints. If you'd like, we'd be happy to ping you when other optimization work has settled down and we're ready to re-benchmark.
Thanks @alexcrichton! I did a few last refactors based on your comments; given the criticality of this code I want to make sure you're happy with the last commit before merging! I played around a bit with simplifying …
Yes, I saw you're pretty busy working on this, so that's why I was keeping quiet until you finish. (: Please do ping me once this is ready to re-benchmark!
I've got one question about a subtraction mainly but otherwise looks good 👍
(tagging myself as a reviewer on this just so I don't lose track of it, but please don't hold up the merge on me)
Hmm, running into the …
Add a pooling allocator mode based on copy-on-write mappings of memfds.

As first suggested by Jan on the Zulip here [1], a cheap and effective way to obtain copy-on-write semantics of a "backing image" for a Wasm memory is to mmap a file with `MAP_PRIVATE`. The `memfd` mechanism provided by the Linux kernel allows us to create anonymous, in-memory-only files that we can use for this mapping, so we can construct the image contents on-the-fly and then effectively create a CoW overlay. Furthermore, and importantly, `madvise(MADV_DONTNEED, ...)` will discard the CoW overlay, returning the mapping to its original state.

By itself this is almost enough for a very fast instantiation-termination loop of the same image over and over, without changing the address space mapping at all (which is expensive). The only missing bit is how to implement heap growth. But here memfds can help us again: if we create another anonymous file and map it where the extended parts of the heap would go, we can take advantage of the fact that a `mmap()` mapping can be larger than the file itself, with accesses beyond the end generating a `SIGBUS`, and the fact that we can cheaply resize the file with `ftruncate`, even after a mapping exists. So we can map the "heap extension" file once with the maximum memory-slot size and grow the memfd itself as `memory.grow` operations occur.

The above CoW technique and heap-growth technique together allow us a fastpath of only `madvise()` and `ftruncate()` when we re-instantiate the same module over and over, as long as we can reuse the same slot. This fastpath avoids all whole-process address-space locks in the Linux kernel, which should mean it is highly scalable. It also avoids the cost of copying data on read, as the `uffd` heap backend does when servicing pagefaults; the kernel's own optimized CoW logic (the same as used by all file mmaps) is used instead.
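As a rough illustration of the CoW-image and madvise-reset part of this (a minimal, Linux-only sketch using the `libc` crate, with illustrative sizes, no error handling, and none of the real `MemFdSlot` structure; the ftruncate-based heap-extension file is omitted):

```rust
// Sketch only: build a memfd "image", overlay it copy-on-write inside a
// reserved slot, dirty it, then reset it with a single madvise() call.
use std::ffi::CString;

const PAGE: usize = 4096;
const IMAGE_SIZE: usize = 16 * PAGE;  // illustrative initial-heap image size
const SLOT_SIZE: usize = 1024 * PAGE; // illustrative whole-slot reservation

fn main() {
    unsafe {
        // 1. Create an anonymous, in-memory-only file and fill it with the
        //    "backing image" contents (here just a recognizable byte pattern).
        let name = CString::new("wasm-memory-image").unwrap();
        let image_fd = libc::memfd_create(name.as_ptr(), libc::MFD_CLOEXEC);
        let image = vec![0xAAu8; IMAGE_SIZE];
        libc::write(image_fd, image.as_ptr() as *const libc::c_void, IMAGE_SIZE);

        // 2. Reserve the whole slot once as inaccessible memory, so guard
        //    regions and not-yet-grown heap trap on access.
        let slot = libc::mmap(
            std::ptr::null_mut(),
            SLOT_SIZE,
            libc::PROT_NONE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_NORESERVE,
            -1,
            0,
        );

        // 3. Map the image privately (copy-on-write) on top of the reservation.
        libc::mmap(
            slot,
            IMAGE_SIZE,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_FIXED,
            image_fd,
            0,
        );

        // The "instance" now writes freely; writes land in private CoW pages.
        *(slot as *mut u8) = 0x42;

        // 4. Reset for the next instantiation: discard the CoW overlay and
        //    fall back to the pristine image, without touching the mapping.
        libc::madvise(slot, IMAGE_SIZE, libc::MADV_DONTNEED);
        assert_eq!(*(slot as *const u8), 0xAA);
    }
}
```

The key property is step 4: resetting the slot is a single `madvise` call, with no munmap/mmap churn and no page-by-page copying.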
There are still a few loose ends in this PR, which I intend to tie up before merging:

- There is no `InstanceAllocationStrategy` yet that attempts to actually reuse instance slots; that should be added ASAP. For testing so far, I have just instantiated the same one module repeatedly (so reuse naturally occurs).
- The guard-page strategy is slightly wrong; I need to implement the pre-heap guard region as well. This will be done by performing another mapping once, to reserve the whole address range, then mmap'ing the image and extension file on top at appropriate offsets (2GiB, and 2GiB plus image size).

Thanks to Jan on Zulip (are you also @koute from #3691?) for the initial idea/inspiration! This PR is meant to demonstrate my thoughts on how to build the feature and spawn discussion; now that we see both approaches, hopefully we can work out a way to meet the needs of both of our use-cases.

[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772