Investigate instance pooling strategy in wasmtime #10244
@koute do you maybe want to work on this?
Sure, I can take a stab at it (since I'm in the area anyway); just one question: does anyone remember if we already have a benchmark for this? (If not, I'll add one.)
We don't have a benchmark for this yet. Feel free to add one 🙂
So I've done some experiments. Here's a quick rundown of the numbers based on a simple benchmark (calls …).
So pooling doesn't actually net that much extra speed (at least in this benchmark). However, even without pooling, after my PRs the codepath without fast instance reuse becomes pretty fast, although unfortunately still not as fast as our own fast instance reuse method. So now the question is: what do you want to do here? We could maybe close the gap even further with some optimization work on …
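For reference, here's a minimal, self-contained sketch of this kind of instantiation micro-benchmark. It is not the actual benchmark from the PRs, and the pooling configuration API has changed across wasmtime versions, so treat the exact `Config` calls as illustrative:

```rust
use wasmtime::{
    Config, Engine, Instance, InstanceAllocationStrategy, Module, PoolingAllocationConfig, Store,
};

// Instantiate a trivial module and call an exported no-op function in a
// loop, so the measurement is dominated by instantiation overhead.
fn bench(engine: &Engine, label: &str) -> anyhow::Result<()> {
    let module = Module::new(engine, r#"(module (func (export "noop")))"#)?;
    let iters = 10_000u32;
    let start = std::time::Instant::now();
    for _ in 0..iters {
        let mut store = Store::new(engine, ());
        let instance = Instance::new(&mut store, &module, &[])?;
        let noop = instance.get_typed_func::<(), ()>(&mut store, "noop")?;
        noop.call(&mut store, ())?;
    }
    println!("{label}: {:?} per instantiation+call", start.elapsed() / iters);
    Ok(())
}

fn main() -> anyhow::Result<()> {
    // Default strategy: memories and tables are freshly allocated on demand.
    bench(&Engine::default(), "on-demand")?;

    // Pooling strategy: slots are pre-allocated up front and reused.
    let mut config = Config::new();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(
        PoolingAllocationConfig::default(),
    ));
    bench(&Engine::new(&config)?, "pooling")?;
    Ok(())
}
```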
Thanks for such a detailed investigation. Could you try the same but with a real runtime? One of the biggest sources of overhead for Fast Instance Reuse (FIR) is copying over the data segments; in a real runtime the data segments can reach 200-300 KiB. I expect that would also influence the pooling strategy. However, we could also try wasmtime's userfaultfd support, which may help; in that case we should be careful to use a more representative workload. That would also depend on a fresh Linux kernel, but that may be OK. Besides that, we could also try to see if things like … Admittedly, FIR is a hack and I'd be happy to get rid of it. At some point we may run out of tricks up our sleeves to keep it working in the face of changes related to new features. Ideally we'd improve wasmtime so that it works for this ultra-low-latency scenario. I'm optimistic, because AFAIU the folks behind wasmtime also value low latency a lot.
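To make the data-segment cost concrete, here's a hedged sketch (a made-up module, not a real runtime) that compares instantiation time with and without a data segment of roughly the size mentioned above. Note that modern wasmtime enables copy-on-write memory initialization by default, so on recent versions the difference may be small; the sketch illustrates the experiment rather than reproducing the original numbers:

```rust
use wasmtime::{Engine, Instance, Module, Store};

// Average instantiation time over `iters` fresh instantiations.
fn time_instantiation(engine: &Engine, module: &Module, iters: u32) -> std::time::Duration {
    let start = std::time::Instant::now();
    for _ in 0..iters {
        let mut store = Store::new(engine, ());
        Instance::new(&mut store, module, &[]).unwrap();
    }
    start.elapsed() / iters
}

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    // A ~256 KiB data segment, in the 200-300 KiB ballpark of a real runtime.
    let blob = "a".repeat(256 * 1024);
    let with_data = format!(r#"(module (memory 4) (data (i32.const 0) "{blob}"))"#);
    let without_data = "(module (memory 4))";
    let m1 = Module::new(&engine, &with_data)?;
    let m2 = Module::new(&engine, without_data)?;
    println!("with data segment:    {:?}", time_instantiation(&engine, &m1, 1_000));
    println!("without data segment: {:?}", time_instantiation(&engine, &m2, 1_000));
    Ok(())
}
```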
Okay, here are some numbers calling into the Kusama runtime (the function called within the runtime is exactly the same; I copy-pasted it from the output of …).
...so this does look promising if we enable both pooling and uffd. So it looks like we can probably just delete the fast instance reuse codepath? Before we commit, it'd be nice to run one more test that's more real-world, e.g. importing a bunch of blocks from an actual production chain and timing that, or something along those lines. @pepyakin Any idea what would be good to run?
OK, this is very useful and a good sign. I think even without UFFD it's not that bad: it's a 2x regression compared to the 5x we witnessed before. Relying on UFFD is still kind of annoying, though. I checked the code and it seems to require Linux 4.11. It seems we can afford to impose that minimum kernel version requirement. At first glance, though, it seems that if we enable UFFD, the executor will fail at runtime on kernels older than 4.11, which is really annoying.
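A hypothetical guard for the kernel-version problem (this is not existing Substrate code) could probe `uname(2)` once at startup and only opt into UFFD on Linux >= 4.11, falling back to on-demand allocation otherwise:

```rust
// Hypothetical helper: returns true if the running kernel is at least 4.11,
// the minimum version required for userfaultfd-based memory initialization.
// Requires the `libc` crate; Linux-only.
fn kernel_supports_uffd() -> bool {
    let mut uts: libc::utsname = unsafe { std::mem::zeroed() };
    if unsafe { libc::uname(&mut uts) } != 0 {
        return false;
    }
    let release = unsafe { std::ffi::CStr::from_ptr(uts.release.as_ptr()) }
        .to_string_lossy()
        .into_owned();
    // Parse the leading "major.minor" out of e.g. "5.15.0-91-generic".
    let mut parts = release.split(|c: char| !c.is_ascii_digit());
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    (major, minor) >= (4, 11)
}

fn main() {
    println!("kernel supports uffd: {}", kernel_supports_uffd());
}
```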
Regarding the tests: I think block importing may be a good start, although once again it may be a bit deceiving, at least in the case of a production chain. E.g. with UFFD we can basically skip data segment initialization, but at the price of more costly access to the not-yet-paged-in memory areas where the data segments reside (or maybe even to untouched/zeroed areas; I haven't dived that deep into the implementation). In a production chain the first many blocks will be empty and will hit only the same access pattern, and I imagine only a subset of the data segments will actually be used. If that's the case, maybe it would be better to come up with a synthetic test; see the sketch below for one way to make such a test pay the deferred cost inside the measured region.
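If we do go the synthetic route, one way to surface the "costly first access" effect is to actually touch the instance's memory after instantiating, so that any lazily deferred initialization cost is paid where it gets measured. A rough sketch (illustrative module and sizes, not an actual test from the codebase):

```rust
use wasmtime::{Engine, Instance, Module, Store};

fn main() -> anyhow::Result<()> {
    let engine = Engine::default();
    let blob = "a".repeat(256 * 1024);
    let wat = format!(
        r#"(module (memory (export "memory") 4) (data (i32.const 0) "{blob}"))"#
    );
    let module = Module::new(&engine, &wat)?;

    let mut store = Store::new(&engine, ());
    let instance = Instance::new(&mut store, &module, &[])?;
    let memory = instance.get_memory(&mut store, "memory").unwrap();

    // Read one byte per 4 KiB page to fault every page in. With eager
    // initialization this is nearly free; with lazy (UFFD/CoW-style)
    // initialization, this is where the deferred cost shows up.
    let start = std::time::Instant::now();
    let data = memory.data(&store);
    let mut checksum = 0u64;
    for page in data.chunks(4096) {
        checksum = checksum.wrapping_add(page[0] as u64);
    }
    println!(
        "touched {} pages in {:?} (checksum {checksum})",
        data.len() / 4096,
        start.elapsed()
    );
    Ok(())
}
```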
For the record, there is a discussion going on about using CoW pages here. Quoting @koute: …
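For readers unfamiliar with the trick, here's a loose sketch of the CoW mechanics being discussed, built from standard `mmap`/`madvise` semantics rather than from the actual implementation in the linked discussion: a pristine memory image is mapped privately (copy-on-write), dirtied pages become private copies, and "resetting" is a cheap `madvise(MADV_DONTNEED)` that drops the copies so the next access faults the original contents back in.

```rust
// Requires the `libc` crate; Linux-only.
use std::{ffi::CString, fs::File, io::Write, os::unix::io::FromRawFd};

fn main() {
    unsafe {
        let page = 4096usize;

        // The pristine memory image lives in an anonymous in-memory file.
        let name = CString::new("image").unwrap();
        let fd = libc::memfd_create(name.as_ptr(), 0);
        assert!(fd >= 0);
        let mut file = File::from_raw_fd(fd);
        file.write_all(&vec![0xAA; page]).unwrap();

        // Private (CoW) mapping: writes land in private copies of the pages,
        // while the underlying file stays pristine.
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            page,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE,
            fd,
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED);
        let mem = ptr as *mut u8;

        *mem = 0x42; // "run" the instance: dirty a page
        assert_eq!(*mem, 0x42);

        // "Reset" the instance: drop the private copies. The next access
        // repopulates the page from the untouched file contents.
        libc::madvise(ptr, page, libc::MADV_DONTNEED);
        assert_eq!(*mem, 0xAA);
    }
}
```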
Continuing the discussion from the PR, for now I'm thinking we should do this: …
For posterity, here I'm copy-pasting the results of the final benchmarks of my new CoW-based instance reuse mechanism. Legend: …
The measurements are only for the main thread; the thread count at the bottom signifies how many threads were running in total, with the extra threads doing exactly the same thing in the background as the main thread. E.g. for 4 threads there was one thread (the main thread) being benchmarked while the other 3 threads were running in the background. The benchmarks are not yet fully committed; I'll add them in a PR after the new instantiation mechanism is merged into …
For some time now, wasmtime has supported instance pooling. In theory, this mechanism should improve startup latency.
If so, we may not need to employ our own hand-rolled hacks such as the "fast instance reuse" mechanism, which already causes us some pain: for example, it takes extra effort to maintain (#9164 (comment)) and can potentially lead to hard-to-debug errors (#10095).