Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

memfd/madvise-based CoW pooling allocator #3697

Merged
merged 12 commits into from
Feb 2, 2022

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Jan 19, 2022

Add a pooling allocator mode based on copy-on-write mappings of memfds.

As first suggested by Jan on the Zulip here [1], a cheap and effective
way to obtain copy-on-write semantics of a "backing image" for a Wasm
memory is to mmap a file with MAP_PRIVATE. The memfd mechanism
provided by the Linux kernel allows us to create anonymous,
in-memory-only files that we can use for this mapping, so we can
construct the image contents on-the-fly then effectively create a CoW
overlay. Furthermore, and importantly, madvise(MADV_DONTNEED, ...)
will discard the CoW overlay, returning the mapping to its original
state.

By itself this is almost enough for a very fast
instantiation-termination loop of the same image over and over,
without changing the address space mapping at all (which is
expensive). The only missing bit is how to implement
heap growth. But here memfds can help us again: if we create another
anonymous file and map it where the extended parts of the heap would
go, we can take advantage of the fact that a mmap() mapping can
be larger than the file itself, with accesses beyond the end
generating a SIGBUS, and the fact that we can cheaply resize the
file with ftruncate, even after a mapping exists. So we can map the
"heap extension" file once with the maximum memory-slot size and grow
the memfd itself as memory.grow operations occur.

The above CoW technique and heap-growth technique together allow us a
fastpath of madvise() and ftruncate() only when we re-instantiate
the same module over and over, as long as we can reuse the same
slot. This fastpath avoids all whole-process address-space locks in
the Linux kernel, which should mean it is highly scalable. It also
avoids the cost of copying data on read, as the uffd heap backend
does when servicing pagefaults; the kernel's own optimized CoW
logic (same as used by all file mmaps) is used instead.

There are still a few loose ends in this PR, which I intend to tie up
before merging:

  • There is no InstanceAllocationStrategy yet that attempts to
    actually reuse instance slots; that should be added ASAP. For testing
    so far, I have just instantiated the same one module repeatedly (so
    reuse naturally occurs).

  • The guard-page strategy is slightly wrong; I need to implement the
    pre-heap guard region as well. This will be done by performing
    another mapping once, to reserve the whole address range, then
    mmap'ing the image and extension file on top at appropriate
    offsets (2GiB, 2GiB plus image size).

Thanks to Jan on Zulip (are you also @koute from #3691?) for the initial
idea/inspiration! This PR is meant to demonstrate my thoughts on how
to build the feature and spawn discussion; now that we see both approaches
hopefully we can work out a way to meet the needs of both of our use-cases.

[1] https://bytecodealliance.zulipchat.com/#narrow/stream/206238-general/topic/Copy.20on.20write.20based.20instance.20reuse/near/266657772

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Jan 19, 2022
@github-actions
Copy link

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

@koute
Copy link
Contributor

koute commented Jan 19, 2022

I've hooked up your implementation to our benchmarks; here are the results:

call_empty_function

dirty_1mb_of_memory

Legend:

  • instance_pooling_memfd - this PR
  • native_instance_reuse - Add copy-on-write based instance reuse mechanism #3691
  • recreate_instance - create a fresh instance with InstanceAllocationStrategy::OnDemand strategy
  • instance_pooling_with_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy with uffd turned on and without this PR
  • instance_pooling_without_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy without uffd turned on and without this PR

And here's the comparison in raw numeric values (times are in microseconds):

call_empty_function

threads instance_pooling_with_uffd instance_pooling_without_uffd recreate_instance native_instance_reuse instance_pooling_memfd
1 41 78 80 4 27
2 63 110 114 11 43
4 76 172 188 17 66
8 123 360 439 24 89
16 286 695 1117 36 127

dirty_1mb_of_memory

threads instance_pooling_with_uffd instance_pooling_without_uffd recreate_instance native_instance_reuse instance_pooling_memfd
1 312 216 221 144 238
2 844 325 334 209 385
4 1400 526 549 306 453
8 1979 1047 1168 581 618
16 3084 1814 1954 765 840

So there's still some way to go performance-wise. (:

Thanks to Jan on Zulip (are you also @koute from #3691?) for the initial
idea/inspiration!

Yes that's me. (:

@alexcrichton
Copy link
Member

alexcrichton commented Jan 19, 2022

Can you share the branch and/or code to reproduce the results with this PR? I was able to reproduce somewhat locally where I got:

  • call_empty_function_from_kusama_runtime_with_legacy_instance_reuse_on_1_threads - 97us
  • call_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads - 7.1us (your PR)
  • call_empty_function_from_kusama_runtime_with_native_instance_reuse_on_1_threads - 32us (this PR)

I think that roughly aligns with what you measured, but I wanted to confirm by peeking at the code if possible.

Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.

I couldn't for sure drill down into what's causing the issue but my best guess is that this loop is the slow part. That's a one-time-initialization which is pretty branch-y which happens once-per-function in a module, and the modules you're loading are quite large.

I personally think that this PR's allocation strategy is more viable in terms of long term maintenance so I'd like to continue to measure this and push on this if we can, but "the numbers don't lie" and your PR has some very impressive numbers! I think we should be working towards that as a goal because that would be awesome to reinstantiate instances that fast.

Personally I think it would be fruitful to diff the performance between these two PRs. Without memfd enabled I was measuring like 132us for the benchmark above, so I think that this PR shows that 100us of that can be shaved off with memfd/cow, but it looks like there's a remaining ~25us or so to get shaved off. I believe the function-initialization is probably one of those issues, but there may be others lurking. Quantifying the difference and where the remaining 25us or so can go I think would be helpful to figure out the best viable implementation strategy going forward.

@koute
Copy link
Contributor

koute commented Jan 20, 2022

Can you share the branch and/or code to reproduce the results with this PR?

Sure. Here's the branch:

https://github.com/koute/substrate/tree/master_wasmtime_benchmarks_with_cfallin_memfd_cow

And here's how to run them to get numbers for this PR (essentially the same as described in my PR, just with an extra cargo feature):

cd substrate/client/executor/benches
FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_kusama_runtime_with_recreate_instance_on_1_threads

If you have any further questions feel free to ask! (I can also make myself available for discussion on Zulip.)

Some quick benchmarking shows that this function is quite hot in this PR. That makes sense to me because @koute your PR skips that function entirely with the reuse mechanism you've implemented.

Indeed! One of the reasons why I implemented my PR the way I did is because you have to punch through less layers of abstraction so it's easier to make it faster. (:

@cfallin cfallin force-pushed the memfd-cow branch 2 times, most recently from bc2fe0a to 9d15443 Compare January 21, 2022 01:53
@cfallin
Copy link
Member Author

cfallin commented Jan 22, 2022

OK, so I've spent some time digging into this more as well, and thinking about potential designs. My benchmarking (mainly with the server benchmark included here, tweaked to load a spidermonkey.wasm to stress instantiation more) also shows that the VMCallerCheckedAnyfunc initialization is a large part of the VMContext initialization time.

A few thoughts:

  • At the highest level, there's the question of what re-instantiation strategy different use-cases require. At least in cases I'm familiar with, and as I understand it, the "freshly instantiated for every request" property is a hard security requirement. The instance-reuse PR Add copy-on-write based instance reuse mechanism #3691 relaxes the boundary somewhat by effectively emulating a fresh instantiation, but without going through all initialization; I remain pretty concerned with the risk here, especially as we add more Wasm features in the future, that we'll forget something and have a small but critical information leak. (In other words, it inverts the initialization: we have to positively erase state or else stale state might come through, rather than today where we'd have to explicitly copy data for it to leak.)

    That question feels biggest to me. If all use-cases are OK with reuse and accepting risk of leakage, then let's pursue something like your PR. But if there's still a case for the stronger isolation, then I think it's interesting to see if we can approach it from the "other side" -- making initialization from scratch as fast as possible.

  • There is also the question of whether we want both, for different circumstances. Currently we seem to be taking for granted that "the fastest approach wins", or in other words, that if we can't get this approach to be at ~parity with Add copy-on-write based instance reuse mechanism #3691, then we pick that approach instead. But the requirements addressed by both are different; this one does more work, for lower risk-profile. Maybe we can optimize it to be close, I'm not sure; but if there is still a gap, I think we should consider them as solving slightly different problems and evaluate appropriately.

  • All that said, I think it may be possible to optimize and come close. As @alexcrichton pointed out above, the major speed delta between the two is I think almost completely in the function-table initialization in the VMContext. (There won't be a significant difference in globals or tables unless the initializers are complex: the reuse-based PR splats a saved array of values over both, while this PR uses the ordinary from-scratch initialization.)

    We can't fully skip the function-table initialization; the VMCallerCheckedAnyfuncs are referred to by pointers in others' tables when one instance imports another's function, as far as I can tell. We also cannot share them between all instances of a module (which would have let us either add indirection to share one block of them, or else do a fast memcpy):

    • This module's functions (possibly exported) embed the vmctx in the 3-tuple, so one of every 3 words needs to be specific to the VMContext we're building;
    • Imported functions also may have different vmctxs for each instantiation (ie not just the default-callee vmctx for host functions).

    However, I think we can get away with initializing lazily in wasmtime_runtime::Instance::get_caller_checked_anyfunc; just check if null, and if so, lazily init then return. (Re &self and concurrency, use atomic stores and store the checked-for-null ptr last; updates are idempotent so benign race is safe.) This is sort of loosely inspired by late-binding with dynamic linkers and such, except here the binding happens when someone else asks for the reference (ie during module linking or when filling in table entries), not when a call first occurs. I'm curious what @alexcrichton thinks of all this.

Anyway, thoughts on the above? Sorry for the lengthy braindump; I wanted to try to represent how I'm thinking about this and framing the design space in my head, is all. And @koute I want to repeat again a "thank you" for spawning all of this thinking and exploration!

@cfallin
Copy link
Member Author

cfallin commented Jan 22, 2022

I just implemented the lazy-anyfunc-initialization scheme I mentioned above: b8bb1d5

It seems to help in my quick-and-dirty local testing; I'm curious what it will do to the benchmarks above but will save that benchmarking (and benchmarking in general with enough precision to have quotable numbers) for Monday :-)

@koute
Copy link
Contributor

koute commented Jan 24, 2022

At the highest level, there's the question of what re-instantiation strategy different use-cases require. At least in cases I'm familiar with, and as I understand it, the "freshly instantiated for every request" property is a hard security requirement. The instance-reuse PR Add copy-on-write based instance reuse mechanism #3691 relaxes the boundary somewhat by effectively emulating a fresh instantiation, but without going through all initialization; I remain pretty concerned with the risk here, especially as we add more Wasm features in the future, that we'll forget something and have a small but critical information leak. (In other words, it inverts the initialization: we have to positively erase state or else stale state might come through, rather than today where we'd have to explicitly copy data for it to leak.)

That question feels biggest to me. If all use-cases are OK with reuse and accepting risk of leakage, then let's pursue something like your PR. But if there's still a case for the stronger isolation, then I think it's interesting to see if we can approach it from the "other side" -- making initialization from scratch as fast as possible.

This is a good point; I imagine that there should be use cases on both sides of the spectrum. I guess maybe it'd be nice to survey other wasmtime users' and ask their opinion about this?

I can only speak for ourselves, but in our main use case it's less of a security issue, and more of a "we just don't want consecutive instantiations to interfere with each other by accident". Our WASM module is - in general - trusted, and (simplifying a lot) we run a distributed system where the nodes essentially "check" each other's results, so leaking information locally within the same node is not a critical issue.

Originally we didn't even clear the linear memory at all - we just manually reinitialized the globals and statics on every "fake" instantiation. But that ended up being problematic, since depending on the flags with which the WASM module was compiled the information about which statics need to be cleared on reinstantiation might not be emitted, so we decided to start just clearing the memory for simplicity as a quick fix, and started working on speeding it up, which resulted in my PR.

As for the new features - in general we're pretty conservative here, and we disable every wasm_* feature flag by default until we can validate that there's a benefit to enabling them, and that everything still works as it should.

And @koute I want to repeat again a "thank you" for spawning all of this thinking and exploration!

Thank you for tackling this problem!

It seems to help in my quick-and-dirty local testing; I'm curious what it will do to the benchmarks above but will save that benchmarking (and benchmarking in general with enough precision to have quotable numbers) for Monday :-)

I updated my wasmtime benchmarking branch (I took your branch and merged the branch from my PR) and reran all the benchmarks and (unless I brainfarted somewhere and did something wrong) unfortunately it looks like the results are unchanged from the previous run for our benchmarks (within the margin of error):

call_empty_function
| threads | instance_pooling_memfd_v2 | instance_pooling_memfd_v1 |
|---------|---------------------------|---------------------------|
| 1       | 27                        | 27                        |
| 2       | 47                        | 43                        |
| 4       | 67                        | 66                        |
| 8       | 90                        | 89                        |
| 16      | 125                       | 127                       |

dirty_1mb_of_memory
| threads | instance_pooling_memfd_v2 | instance_pooling_memfd_v1 |
|---------|---------------------------|---------------------------|
| 1       | 239                       | 238                       |
| 2       | 380                       | 385                       |
| 4       | 456                       | 453                       |
| 8       | 613                       | 618                       |
| 16      | 838                       | 840                       |

(instance_pooling_memfd_v2 is the new run, instance_pooling_memfd_v1 are the numbers from the original run)

@alexcrichton
Copy link
Member

Thanks for writing all that up @cfallin, and at least my own personal opinion is to primarily err on the side of this PR in the sense of information leakage and spec-compliance. I very much want to get to the performance numbers in #3691 but I am not 100% convinced yet it's necessarily worth the loss in internal structuring (e.g. implementing an "undo" in addition to implementing a fast "redo".). Before making that conclusion though I think it's best to get this approach as close to #3691 as we possibly can.

However, I think we can get away with initializing lazily

Originally I didn't think this would work but digging more into your patch it looks like it's doing everything that would be necessary (although it is pretty gnarly). That being said I believe that the construction of an owned InstanceAllocationInfo is probably a large hit to instantiation-time costs, there's some large-ish arrays internally which we don't want to reallocate on all instantiations so to get a speedup from this that would probably need to be cached across instantiations somehow (and may explain why the measured numbers aren't all that different.


I personally still think that the biggest win is going to be not creating a VMCallerCheckedAnyfunc for all functions in the module, only for the "possibly exported" ones. That set is at least an order of magnitude smaller than the set of all functions and will make whatever we do that much faster since there's less work to be done. I don't think that it will close the gap fully, though, and further investigation/tweaking may still be warranted.

Another possible idea I might have is that the VMCallerCheckedAnyfunc representation today is:

#[repr(C)]
pub struct VMCallerCheckedAnyfunc {
    pub func_ptr: NonNull<VMFunctionBody>,
    pub type_index: VMSharedSignatureIndex,
    pub vmctx: *mut VMContext,
}

but we could probably change this to:

#[repr(C)]
pub struct VMCallerCheckedAnyfunc {
    pub info: *mut VMCallerCheckedAnyfuncStaticInfo,
    pub vmctx: *mut VMContext,
}

#[repr(C)]
pub struct VMCallerCheckedAnyfuncStaticInfo {
    pub func_ptr: NonNull<VMFunctionBody>,
    pub type_index: VMSharedSignatureIndex,
}

Here VMCallerCheckedAnyfunc is per-instance but VMCallerCheckedAnyfuncStaticInfo can be calculated per-module. This means that at Module construction time we could build up an array of the "static info" and then instantiation would have less cross-correlation to do across a number of lists (I think we could entirely skip the SharedSignatures thing being passed down. I think that this will probably speed up that loop but I don't know by how much. This'll also add one more load or two to call_indirect and I don't know the performance implications of that. Theoretically though instantiating a module many times consumes less memory since VMCallerCheckedAnyfunc is a pointer smaller!

@cfallin
Copy link
Member Author

cfallin commented Jan 24, 2022

I personally still think that the biggest win is going to be not creating a VMCallerCheckedAnyfunc for all functions in the module, only for the "possibly exported" ones. That set is at least an order of magnitude smaller than the set of all functions and will make whatever we do that much faster since there's less work to be done. I don't think that it will close the gap fully, though, and further investigation/tweaking may still be warranted.

@alexcrichton I may be misunderstanding something about the internal workings but my intent with the lazy initialization was to actually subsume this; i.e. with lazy init we should at least only construct the anyfuncs that are possibly exported, and ideally even fewer than that. (So in other words I think we should already be getting that benefit with the patch.) Or are you imagining something different here?

[factoring the anyfunc into two parts]

I do like this, but it involves quite a lot more changes to generated code so I'd like to see if we can avoid doing it if we can. I think the most promising approach (what I'm poking at right now) is to share more of the init work wrt the SharedSignatures. I'm not sure but I suspect this may be why @koute 's benchmark still isn't showing improvement from lazy-init.

(I've been testing by instantiating spidermonkey.wasm, so the benefit from skipping initialization of an anyfunc per function is probably significantly magnified compared to smaller modules; I'm planning to get the substrate benchmark working locally for me too, so I can steer based on the same numbers...)

@alexcrichton
Copy link
Member

@cfallin oh that's a good point, I was mostly thinking of other ways to get the speedup without laziness now that I'm thinking about it due to the complexity here. Otherwise though I think your approach is currently slower for unrelated reasons to laziness, the creation of InstanceAllocationInfo on each instantiation?

@cfallin
Copy link
Member Author

cfallin commented Jan 24, 2022

@koute I spent some time trying to reproduce numbers on your branch, and ran into a series of issues:

  • A clone of the branch linked above refers to a no-longer-existing commit in your wasmtime fork (I guess because of a rebase?); I edited client/executor/wasmtime/Cargo.toml and primitives/wasm-interface/Cargo.toml to refer to my local clone of your wasmtime;
  • When trying to build with your cargo bench commandline above, I hit a number of build errors in client/executor/wasmtime/src/tests.rs related to the Semantics struct (no fast_instance_reuse bool, instead an instantiation_strategy field);
  • When I tried to hack that to work (using a "reuse" or "recreate" strategy), I ran into: thread 'main' panicked at 'creating a full node doesn't fail: Client(VersionInvalid("RuntimeConstruction(Other(\"failed to instantiate a new WASM module instance: Incompatible allocation strategy\"))"))', bin/node/cli/benches/block_production.rs:114:57

I'm benchmarking locally with benches/server.rs in this PR, modified to load some other Wasm modules; so far a big one (spidermonkey.wasm) but I'll test with some others as well. I'm planning to improve the lazy-anyfunc-init and will hopefully have another round of changes today at least verified locally :-)

@cfallin
Copy link
Member Author

cfallin commented Jan 25, 2022

I've modified the scheme in this PR to not recompute signature info on every instantiation; now it's showing a decent speedup on SpiderMonkey instantiation, from ~1.69us to ~1.48us on my machine (raw instance creation only, no start function invocation). WIthout the signature fix I was seeing ~1.6us IIRC. I'm redoing some profiling now to see what else might be done...

@cfallin cfallin force-pushed the memfd-cow branch 2 times, most recently from 6bb241e to e2b49cc Compare January 25, 2022 00:51
@cfallin
Copy link
Member Author

cfallin commented Jan 25, 2022

@koute I've now improved perf a bit more -- in my local tests (raw instantiate, no start function, spidermonkey.wasm) I went from 1.68us earlier today to 773ns (!):

strategy pooling, occupancy 1000, benches ["spidermonkey.wasm"]
                        time:   [679.58 ns 773.45 ns 876.66 ns]
                        change: [-58.530% -52.164% -44.172%] (p = 0.00 < 0.05)

I'm curious if the factor of ~2 will translate into your benchmark as well, and/or what constant factors are left. (I'd still like to somehow be able to get your benchmark to run locally too...)

The trick in my latest changes, beyond lazy init of anyfuncs, was to represent non-init with a zeroed bitmap in the vmctx, rather than zeroes in the sparse array. Zeroing the latter was slow; per-field writes, even just a memset, array is too sparse. After the change it doesn't seem to show up in my profiles any more; the actual table-element init (building anyfuncs that are actually referenced) is the bulk of the instance init time.

@koute
Copy link
Contributor

koute commented Jan 25, 2022

@cfallin Sorry about the non-working branch! I've updated the code, but I haven't updated the Cargo.lock so it was referring to the old commit. (Updating the Cargo.lock would have fixed it.)

I've pulled in your most recent changes and updated the branches; here are the numbers:

call_empty_function
| threads | native_instance_reuse | instance_pooling_memfd_v3 | instance_pooling_memfd_v1 |
|---------|-----------------------|---------------------------|---------------------------|
| 1       | 4                     | 28                        | 27                        |
| 2       | 11                    | 49                        | 43                        |
| 4       | 17                    | 71                        | 66                        |
| 8       | 24                    | 98                        | 89                        |
| 16      | 35                    | 127                       | 127                       |

dirty_1mb_of_memory
| threads | native_instance_reuse | instance_pooling_memfd_v3 | instance_pooling_memfd_v1 |
|---------|-----------------------|---------------------------|---------------------------|
| 1       | 147                   | 242                       | 238                       |
| 2       | 214                   | 381                       | 385                       |
| 4       | 309                   | 453                       | 453                       |
| 8       | 580                   | 629                       | 618                       |
| 16      | 732                   | 831                       | 840                       |

Looks like they're mostly... unchanged again?

@cfallin
Copy link
Member Author

cfallin commented Jan 25, 2022

@koute -- thanks! @alexcrichton and I tracked down what we think is the delta in runtime perf; it led back to a comment you left at your use of Mmap::populated_range regarding performance of touching anonymous-zero memory vs CoW-mapped memory. Obvious in hindsight (zeroed, or even better pre-zeroed, is faster than copied) but I had missed it. I think this PR should be on par with yours wrt runtime perf now; instantiation still TBD, likely some gap remaining, but the lazy anyfunc init or some other scheme (knowing which functions are reference-taken and having a separate index space for those, for example) is the key for large modules I think.

@cfallin
Copy link
Member Author

cfallin commented Jan 26, 2022

@koute: I'm unfortunately still not able to run your benchmarks: I'm reaching the same "Incompatible allocation strategy" error I quoted above. I'm on the latest commit (25917541d78), with Cargo.tomls modified to refer to a local checkout of your Wasmtime branch, commit ec95b9365e3. I had to comment out tests in client/executor/wasmtime/src/tests.rs to get the build to complete.

In any case, I suspect that this PR is much closer now -- I'll wait for your confirmation, or not, on this to be sure; if we know we can get close with the "from-scratch" approach, or otherwise if we can quantify what the remaining delta is, then this should help us decide what to do next.

@koute
Copy link
Contributor

koute commented Jan 26, 2022

I'm unfortunately still not able to run your benchmarks: I'm reaching the same "Incompatible allocation strategy" error I quoted above.

@cfallin: This might be a silly question, but... are you using the right branch and calling the benchmarks exactly as described? (:

There's the branch from my original PR (which is unmodified, and won't work), and there's the branch for this PR (link is here: #3697 (comment)). You also have to cd into the right directory first. (I don't remember whether the whole codebase compiled on that branch, but it might not, since I was only interested in our WASM executor and the benchmarks.)

Anyway, I reran the benchmarks again, and here are the results:

call_empty_function
| threads | native_instance_reuse | instance_pooling_memfd_v4 | instance_pooling_memfd_v1 |
|---------|-----------------------|---------------------------|---------------------------|
| 1       | 5                     | 24                        | 27                        |
| 2       | 11                    | 46                        | 43                        |
| 4       | 17                    | 69                        | 66                        |
| 8       | 24                    | 97                        | 89                        |
| 16      | 35                    | 128                       | 127                       |

dirty_1mb_of_memory
| threads | native_instance_reuse | instance_pooling_memfd_v4 | instance_pooling_memfd_v1 |
|---------|-----------------------|---------------------------|---------------------------|
| 1       | 144                   | 170                       | 238                       |
| 2       | 217                   | 241                       | 385                       |
| 4       | 307                   | 338                       | 453                       |
| 8       | 589                   | 605                       | 618                       |
| 16      | 764                   | 815                       | 840                       |

There's an improvement!

@cfallin
Copy link
Member Author

cfallin commented Jan 26, 2022

You also have to cd into the right directory first.

Oh, I definitely missed that bit; I was running cargo bench from the root of your tree, and indeed hitting build failures. I confirm I can run some of your benchmarks (the "legacy reuse" ones specifically) locally now.

However, on a clean checkout of 3d80c9593624dc9f28ab877bae6e7d86459523db (your latest commit), with no local changes, I get the following error for a "native reuse" benchmark:

$ cd client/executor/benches/
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads

[ ... ]
Benchmarking call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads: Warming up for 3.0000 sthread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RuntimeConstruction(Other("failed to instantiate a new WASM module instance: Incompatible allocation strategy"))', client/executor/benches/bench.rs:171:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

If I should be running a different command, or from a different commit base, please do let me know!

[new numbers]

It's good to see confirmation here, thanks!

So I think at this point we can conclude something like: when the instance runs for long enough, the performance of this PR converges to that of your PR, because the underlying memory mapping is the same (anonymous mmap for zeroes, CoW of memfd otherwise). There is still an instantiation-time penalty, and this still mostly has to do with anyfunc initialization and initialization of the table elements that refer to those anyfunc structs.

One thing that @alexcrichton and I noticed when looking at your benchmark in particular today is that it does not appear to contain any memory.grow instruction at all; so the memory growth path is not stressed here. That is a hot path for other use-cases, and the instance-reuse PR needs to call mprotect, which takes the process-wide mmap_lock; this is something we want to avoid in highly-concurrent-server contexts. This PR uses a per-slot memfd with ftruncate to avoid that.

Anyway, at this point I think it might be fair to say that we've closed part of the gap (between stock wasmtime and your instance-reuse PR), know what the remaining gap can be attributed to, and this PR's approach might be slightly easier to maintain (as per @alexcrichton above); so it comes down to a choice of squeezing every last microsecond with a "trusted module" (your case) vs taking a still-significant improvement over today's mainline with existing security boundaries (this PR). That's a choice that depends on relative priorities; do others want to weigh in more here?

One last thing I might say is that it does seem possible to build a snapshot/rewind-like approach on top of a memfd-as-part-of-normal-instantiation foundation (this PR); basically we would provide an extra hook that does the madvise(), and a Snapshot object attached to an instance that saves and restores globals/table entries, and splats them back en-masse. This layering -- putting the memfd and mmap magic in the traditional instantiation machinery -- lets us use it for a "flash-reset" too, while the other way around, building all of this as a side-mechanism, precludes us from getting any benefit in a "transparent behind existing API" scenario. So this might be a way to satisfy both use-cases with one (or 1.2-ish) implementation(s). Thoughts?

@koute
Copy link
Contributor

koute commented Jan 26, 2022

However, on a clean checkout of 3d80c9593624dc9f28ab877bae6e7d86459523db (your latest commit), with no local changes, I get the following error for a "native reuse" benchmark:

$ cd client/executor/benches/
$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads

[ ... ]
Benchmarking call_empty_function_from_test_runtime_with_native_instance_reuse_on_1_threads: Warming up for 3.0000 sthread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: RuntimeConstruction(Other("failed to instantiate a new WASM module instance: Incompatible allocation strategy"))', client/executor/benches/bench.rs:171:67
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

If I should be running a different command, or from a different commit base, please do let me know!

Sorry, the way I quickly hacked different instantiation strategies into the benchmarks is little janky, so you can't run all of them with a single command. So for each instantiation strategy you basically have to run them separately, e.g. to get the numbers for your PR:

$ FORCE_WASMTIME_INSTANCE_POOLING=1 rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime,sc-executor-wasmtime/memfd-allocator with_recreate_instance

To get the numbers for just recreating each instance from scratch with the ondemand strategy without any pooling:

$ rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime with_recreate_instance

And to get the numbers for my PR:

$ rustup run nightly-2021-11-01-x86_64-unknown-linux-gnu cargo bench --features wasmtime native_instance_reuse

So I think at this point we can conclude something like: when the instance runs for long enough, the performance of this PR converges to that of your PR, because the underlying memory mapping is the same (anonymous mmap for zeroes, CoW of memfd otherwise).

Well, yes, but the whole point here is to reduce the overhead of starting fresh, relatively short-lived instances, right? (: Even just using the ondemand strategy converges to the same performance if it runs long enough.

One thing that @alexcrichton and I noticed when looking at your benchmark in particular today is that it does not appear to contain any memory.grow instruction at all; so the memory growth path is not stressed here. That is a hot path for other use-cases, and the instance-reuse PR needs to call mprotect, which takes the process-wide mmap_lock; this is something we want to avoid in highly-concurrent-server contexts. This PR uses a per-slot memfd with ftruncate to avoid that.

Indeed. Do you want me to add some extra benchmarks for that?

Hmm... also, couldn't the ftruncate trick be also used with my approach? (I can try adding it in, but I'm not sure if there's a point if the chances of my PR being accepted are low anyway?)

lets us use it for a "flash-reset" too, while the other way around, building all of this as a side-mechanism, precludes us from getting any benefit in a "transparent behind existing API" scenario

Since we're talking about the API, I just want to chime in here that from our perspective as wasmtime users (this might or might not be true for other people with different use cases) the "side-mechanism" is actually more of a "transparent behind existing API" approach. (:

Consider the scenario where someone is currently using the ondemand instance allocator and wants to switch to a faster approach. With the approach from my PR it's basically this (I'm obviously simplifying as this is not the actual patch from our code, but that's essentially what it took):

@@ -1,3 +1,13 @@
+if let Some((instance, store)) = instance_cache.pop() {
+    return Ok((instance, store));
+} else {
     let store = initialize_store()?;
-    let instance = instance_pre.instantiate(&mut store)?;
+    let instance = instance_pre.instantiate_reusable(&mut store)?
     return Ok((instance, store));
+}
+
+// ...and when we're done with the instance...
+if instance_cache.len() < MAXIMUM_CACHED_INSTANCES {
+    instance.reset()?;
+    instance_cache.push((instance, store));
+}

It's fast, gives the user control and is super simple. There's no need to define any limits on the WASM module itself (so the WASM executor won't suddenly get bricked if the module has more functions/globals/larger memory/etc. than expected), it will still work if we instantiate too many modules at the same time (the extra modules just won't be reused), and it allows the user to tweak how many instances are being cached without reinitializing everything.

(And if we wanted to make it even more transparent the reuse could maybe be integrated into InstancePre, which would implicitly create reusable instances with its normal instantiate and cache them once they're dropped, without the user having to do anything besides setting a single flag. That would obviously require some more API changes in general since e.g. Instance is currently Copy, among other things. I didn't do this because it would have been even more controversial.)

Now consider switching to the pooling allocator. Yes, turning it on is essentially just a single line of code calling Config::allocation_strategy, but that's not all it takes! Now suddenly you have limitations which you didn't have before. Your WASM module can't exceed certain limit, and you can only have a certain number of instances active at the same time. It's not transparent. Now you need to write some code to have a fallback, and since the point where the strategy is configured is when creating the Engine and not when instantiating the module it's not very convenient.

So I totally understand that my approach might be harder to maintain, however purely as a user I think the pooling approach (completely ignoring how it performs) is not very convenient to use. (Unless it would be truly a "set and forget" without having to explicitly set any hard limits.)

@alexcrichton
Copy link
Member

My personal take an opinion is that we should follow a sequence of steps that looks like:

  • Focus on the strategy outlined in this PR, accelerating instantiation rather than implementing snapshots.
  • Update this PR to be simply memfd-backed-CoW for the pooling allocator
  • Enumerate remaining optimizations around instantiation which need to be improved. I think an important point to note here is that we literally haven't been able to measure these other bits of instantiation because the memory initialization was so expensive. That basically means that I think there's a lot of low-hanging fruit to optimize here:
    • Optimize the VMCallerCheckedAnyfunc array by making it smaller, lazily initialized, faster to initialize, or some combination thereof.
    • Remove the need to grab an rwlock for inserting a host function into a store.
    • Remove the need to clone function types when inserting a host function into a store.
    • Optimize tracking of trampolines in the store to probably not use HashMap vanilla (either a different data structure or a more optimized hash algorithm)
  • Add a new feature to Wasmtime which is enabling the memory pooling allocator with the on-demand instance allocator. Perhaps with some default settings about how many memories to pool or similar. (or something like this).
  • Work towards enabling memfd by default on Linux where we can with the on-demand instance allocator.

I believe we can solve all the problems here using a hybrid of this PR and @koute's, I don't think that this is an either/or choice. In isolation I don't think that anyone would disagree that start-from-scratch instantiation is more robust and easier to reason about. The main reason to implement some sort of snapshot-like approach would be performance, but I'm pretty sure we have a good handle on performance from poking around at the examples from @koute your substrate fork. This so far has revealed:

  • As mentioned above the initialize-an-instance path hasn't really been optimized to this degree because we couldn't measure things. There's lots of low-hanging fruit to optimize.
  • @cfallin's original PR and implementation showed that CoW-ing a page of zeros from memfd is much slower than from an anonymous mapping. This led to tweaking the design to, as your PR does @koute, minimizing the size of the memfd image as well as using an anonymous mmap for the initial heap contents that are all zero.
  • Locally I'm seeing that for the threaded benchmarks you have @koute the main performance bottleneck seems to be the usage of std::sync::RwLock<T> around the signatures in an engine and hitting that rwlock ~100 times during instantiation. (this is also one of those easy "low hanging fruit" to optimize)

I'm pretty confident that we have basically quantified the difference in numbers that you're seeing @koute between native_instance_reuse and instance_pooling_memfd_v*. I don't mean to pretend that they're the same and we don't have to worry about the difference, my point is that I feel like we have a good handle on why there's a difference and a number of avenues ahead of us to make the difference even smaller. It may be the case that this PR's strategy of always-initialize-the-instace may not microsecond-for-microsecond come out on par with "reuse a snapshot", but I believe that there's still legwork to be done to definitively say "this is the difference". Until such a conclusion is reached I don't personally think that it's worthwhile to implement a snapshot-like-scheme in Wasmtime.


I'd like to also ideally address some of our API-wise concerns @koute. You've mentioned that for your use case it's important that instantiation always succeeds and you don't want to configure the pooling allocator so far. I think much of this can be solved by having a default pooling allocator for memories within engines like I mentioned above. We can pretty easily remove the need to have limits on the memory pool as well. I think it's worthwhile to acknowledge, though, that the desire to run an unlimited number of instances concurrently isn't a constraint we've seen yet and is so far unique to you.

With a memory pool that enables the CoW/memfd strategy implemented in this PR to be usable with the on-demand allocation strategy. I think that this would solve the issues around trying to configure the pooling allocator since you wouldn't be using it and you could allocate as many stores/instances as the system can support. The limits related to memory could also be relaxed in this situation so they don't need to be specified up-front.

The final thing you've mentioned is that it's easy to build pooling allocators externally, so why not do that? The problem with this is that it places a lot of constraints on the API design of the wasmtime crate itself. We originally wanted to do precisely that with the current pooling allocator, allowing an external crate to implement it rather than baking it into wasmtime itself. The problem is that the API hooks necessary to implement this were far too complicated and invasive. It effectively would have been a massive amount of effort (too much) to design and support such APIs. Similarly with your approach you're enabling a pooling-style of allocation but it comes at the cost of wasmtime having to maintain a snapshot-like API. This snapshot-like API probably won't always be exclusively used for pooling which means that it has to be maintained and grow over time for a variety of use-cases. Overall at least my personal conclusion is that it's better for now for us to implement common practices in Wasmtime to give us the maximum flexibility in the API and support over time to ensure we don't lock ourselves into anything.


Well that's a bit of a long-winded post, sorry for the wall of text. I think that captures well my thoughts on all this, though, and I'd like to hear what others hink about this all too!

@cfallin
Copy link
Member Author

cfallin commented Jan 31, 2022

Just as an addendum here: @alexcrichton and I discussed the possibility of using the implementation bits here (MemFdSlot) in the on-demand allocator as well. This would address any use-case that requires that allocator (@koute , we're still very interested in getting something into the codebase that serves your needs!).

I took a bit of a deeper look now and I think it's probably best as a followup PR -- it involves some more refactoring in the allocator machinery (e.g. the RuntimeMemoryCreator and RuntimeLinearMemory traits). But I'm happy to do this work!

@cfallin
Copy link
Member Author

cfallin commented Jan 31, 2022

Actually, it turned out to be pretty simple to extend this to the on-demand allocator too; pushed another commit for this, but happy to split it to a second PR if that makes review easier @alexcrichton .

crates/runtime/src/instance/allocator/memfd.rs Outdated Show resolved Hide resolved
crates/runtime/src/instance/allocator/memfd.rs Outdated Show resolved Hide resolved
crates/runtime/src/instance/allocator/memfd.rs Outdated Show resolved Hide resolved
// N.B.: this comes after the `mmap` field above because it must
// be destructed first. It puts a placeholder mapping in place on
// drop, then the `mmap` above completely unmaps the region.
memfd: Option<MemFdSlot>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike the Static variant of memory I think we might be able to skip the storage here entirely (and also make the instantiate a litte faster below with fewer syscalls) since with a runtime memory all memfd really is is the initial mapping of the cow image overlaid on top of the underlying anonymous mapping. In that sense I think we can skip most of the logic of MemFdSlot and avoid storing redundant data.

Now that being said I think the implementation here is correct and will work well regardless. I think it'd be fine for a follow-up to clean this up a bit.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I played with this a bit, and it's a good thought! I think though it may be safer to go through the MemFdSlot for heap growth specifically in case we choose to implement some other technique (such as ftruncate on a second memfd) in the future. At least if I'm reading your suggestion right, otherwise, we'd use the MemFdSlot's instantiate logic to set up the actual CoW mapping then throw it away (without its dtor running) and rely on the implicit alignment of the mprotect-growth strategy between memfd and the usual dynamic memory.

Re: syscalls on instantiate, we could indeed avoid one if we give the MemFdSlot a "no fixed mapping" mode, but then it takes responsibility for the pre-guard too, which complicates the pooling implementation somewhat. I'll think a bit more about this though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my worries is that there's a fair bit of tricky state with handling linear memories and having that duplicated across a few locations I think is not ideal. The MemFdSlot tries to take over a fair bit of the management of the memory which is also tracked by the runtime Memory. We discussed this a bit in person as well, but to reiterate here I think it might be good to investigate a bit to figure out whether some functions below could grow an extra parameter or two instead of storing the state internally so it could be passed in by the "source of truth".

I don't have a great idea though for whether this is possible, so I'll leave it up to your discretion about whether it ends up working out or not.

@@ -37,6 +37,7 @@ winapi = { version = "0.3.7", features = ["winbase", "memoryapi", "errhandlingap

[target.'cfg(target_os = "linux")'.dependencies]
userfaultfd = { version = "0.4.1", optional = true }
memfd = { version = "0.4.1", optional = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given where this PR now is and the trajectory of this feature, I think this is good to have as an optional dependency but I think it should be enabled by default and probably just called memfd (enabled-by-default probably from the wasmtime crate, not here). Especially with this making such a drastric difference for the on-demand allocator it seems like using memfd is a no-brainer.

To make this tweak as well as improve the portability here (right now if you enable memfd on Windows it'd probably just fail to compile) I think a few small changes are all that's needed. To be clear though this is icing on the cake and it's totally fine to do this as a follow-up, but I think the steps would be:

  • Remove the memfd-allocator
  • Add a memfd feature to the wasmtime crate that forwards to wasmtime-runtime/memfd
  • Add memfd to the default feature set of the wasmtime crate
  • Add a crates/runtime/build.rs that detects that the target's os is Linux (by looking at CARGO_CFG_TARGET_OS) and that the memfd feature is active (by looking at CARGO_FEATURE_MEMFD), and if both are present then printing println!("cargo:rustc-cfg=memfd")
  • Change all cfg(feature = "memfd-allocator") in the crate to cfg(memfd)

With that I think we'll have on-by-default memfd for Linux-only, with the opt-in ability to turn it off. Again though it's fine to defer this to a future PR.

As I type all this out though I also would question maybe we don't even need a feature for this. I suppose one thing that could come up is that if you have a million modules in a process that's a million file descriptors which can blow kernel limits, but other than that I'm not sure why anyone would explicitly want to disable memfd.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone ahead and done all of this; the default build now will use memfd, validated by building the wasmtime binary and strace'ing an execution. Waiting for CI to see if moving memfd out of the target-only deps into the main deps will work on non-Linux, but I am hoping it will (the crate's got a toplevel #![cfg(target_os ...)] that should make it a no-op on other platforms...?).

@alexcrichton
Copy link
Member

Sorry I'm heading out for the day and only recently noticed the addition to the on-demand allocator, I skimmed it and it looks good to me, but I'll need to dig in a bit more depth tomorrow, otherwise I've got one instance of what I think is a bug but otherwise just a bunch of nits which can all be deferred to later.

@cfallin cfallin force-pushed the memfd-cow branch 2 times, most recently from 60bb2d3 to 2184a82 Compare February 1, 2022 01:13
(This was not a correctness bug, but is an obvious performance bug...)
@cfallin cfallin force-pushed the memfd-cow branch 2 times, most recently from d07530d to 2efa0f5 Compare February 1, 2022 06:12
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for being patient with me! There's a few more minor things below which I think are worth addressing but nothing major. Otherwise I'm pretty confident in the state of this PR and I'm happy to approve here as well.

@koute we haven't heard from you in a bit, but to reiterate this PR brings all the CoW benefits to the on-demand allocator as well although the reuse case isn't simply an madvise to reset. That being said for your "empty" function benchmark from before I'm clocking this PR (re-instantiation in a loop) at around 40us and "'just madvise", your PR, at around 5ns. A huge portion of the remaining time is table initialization, scheduled to be addressed in #3733 after some further work (which should make table initialization effectively zero-cost). Basically I want to reiterate we continue to be very interested in solving your use case and accomodating your performance constraints. If you'd like we'd be happy to ping you when other optimization work has settled down and we're ready to re-benchmark.

crates/runtime/build.rs Outdated Show resolved Hide resolved
Cargo.toml Show resolved Hide resolved
crates/runtime/src/instance/allocator/pooling.rs Outdated Show resolved Hide resolved
crates/runtime/src/lib.rs Outdated Show resolved Hide resolved
crates/runtime/src/memfd.rs Show resolved Hide resolved
crates/wasmtime/Cargo.toml Outdated Show resolved Hide resolved
crates/runtime/src/memfd.rs Outdated Show resolved Hide resolved
crates/runtime/src/memory.rs Show resolved Hide resolved
// N.B.: this comes after the `mmap` field above because it must
// be destructed first. It puts a placeholder mapping in place on
// drop, then the `mmap` above completely unmaps the region.
memfd: Option<MemFdSlot>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of my worries is that there's a fair bit of tricky state with handling linear memories and having that duplicated across a few locations I think is not ideal. The MemFdSlot tries to take over a fair bit of the management of the memory which is also tracked by the runtime Memory. We discussed this a bit in person as well, but to reiterate here I think it might be good to investigate a bit to figure out whether some functions below could grow an extra parameter or two instead of storing the state internally so it could be passed in by the "source of truth".

I don't have a great idea though for whether this is possible, so I'll leave it up to your discretion about whether it ends up working out or not.

crates/runtime/src/memfd.rs Show resolved Hide resolved
@cfallin
Copy link
Member Author

cfallin commented Feb 1, 2022

Thanks @alexcrichton !

I did a few last refactors based on your comments; given the criticality of this code I want to make sure you're happy with the last commit before merging!

I played around a bit with simplifying MemFdSlot to carry less state, but in the end I think as it manages reuse, and has to know what the current mapping state is to reuse mappings, there is not much complexity reduction in externalizing that state and requiring it to be passed back in. It felt a little more brittle to me. Perhaps we could go the other way and have an abstraction for "something that manages an address range", build another one that doesn't use memfds, and make this "slot manager" the one source of truth. We can probably play with this in followup PRs -- what do you think?

@koute
Copy link
Contributor

koute commented Feb 2, 2022

@koute we haven't heard from you in a bit, but to reiterate this PR brings all the CoW benefits to the on-demand allocator as well although the reuse case isn't simply an madvise to reset. That being said for your "empty" function benchmark from before I'm clocking this PR (re-instantiation in a loop) at around 40us and "'just madvise", your PR, at around 5ns. A huge portion of the remaining time is table initialization, scheduled to be addressed in #3733 after some further work (which should make table initialization effectively zero-cost). Basically I want to reiterate we continue to be very interested in solving your use case and accomodating your performance constraints. If you'd like we'd be happy to ping you when other optimization work has settled down and we're ready to re-benchmark.

Yes, I saw you're pretty busy working on this so that's why I was keeping quiet until you finish. (:

Please do ping me once this is ready to rebenchmark!

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've got one question about a subtraction mainly but otherwise looks good 👍

crates/runtime/src/memfd.rs Outdated Show resolved Hide resolved
crates/runtime/src/memfd.rs Show resolved Hide resolved
crates/runtime/src/memfd.rs Show resolved Hide resolved
crates/runtime/src/memfd.rs Outdated Show resolved Hide resolved
@acfoltzer acfoltzer self-requested a review February 2, 2022 17:39
@acfoltzer
Copy link
Contributor

(tagging myself as a reviewer on this just so I don't lose track of it, but please don't hold up the merge on me)

@cfallin
Copy link
Member Author

cfallin commented Feb 2, 2022

Hmm, running into the madvise semantics gaps in qemu on s390x it seems -- I recall a similar issue when we were adding tests for aarch64 in the uffd work. I'll find a way to skip the tests when on qemu.

@cfallin cfallin merged commit 99ed8cc into bytecodealliance:main Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wasmtime:api Related to the API of the `wasmtime` crate itself
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants