
Investigate instance pooling strategy in wasmtime #10244

Closed
pepyakin opened this issue Nov 12, 2021 · 10 comments · Fixed by #11232
Labels
I7-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@pepyakin
Contributor

For some time now, wasmtime has supported instance pooling. In theory, this mechanism should improve startup latency.

If so, we may not need our own hand-rolled hacks such as the "fast instance reuse" mechanism, which already causes us some pain. For example, it takes extra effort to maintain (#9164 (comment)) and can potentially lead to hard-to-debug errors (#10095).
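For context, here is a minimal sketch of what switching to wasmtime's pooling allocator looks like. The exact configuration API differs between wasmtime versions; this follows the shape of the newer PoolingAllocationConfig API and is illustrative only:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn pooling_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Pre-allocate memory/table/instance slots up front so that instantiation
    // only has to hand out a slot instead of mmap-ing fresh memory every time.
    // (Limits are left at their defaults here; a real setup would tune them.)
    let pooling = PoolingAllocationConfig::default();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
    Engine::new(&config)
}
```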

@pepyakin pepyakin added I7-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. labels Nov 12, 2021
@bkchr
Member

bkchr commented Nov 12, 2021

@koute do you maybe want to work on this?

@koute
Contributor

koute commented Nov 12, 2021

Sure, I can take a stab at it (since I'm in the area anyway); just one question: does anyone remember if we already have a benchmark for this? (If not, I'll add one.)

@bkchr
Member

bkchr commented Nov 12, 2021

We don't have a benchmark for this yet. Feel free to add one 🙂

@koute
Contributor

koute commented Dec 2, 2021

So I've done some experiments. Here's a quick rundown of the numbers, based on a simple benchmark (it calls new_instance and then calls test_empty_return within the runtime):

  • Each iteration with fast instance reuse currently takes ~5us.
  • Each iteration without fast instance reuse currently takes ~1ms.
  • After Statically register host WASM functions #10394 (plus one more PR that I'll put up after that one) is merged, each iteration without fast instance reuse and without the pooling strategy will take ~30us.
  • And finally, on top of that, turning on pooling only gains us an extra ~5us, so ~25us per iteration.

So pooling doesn't actually net us that much extra speed (at least in this benchmark). However, even without pooling, after my PRs the codepath without fast instance reuse becomes pretty fast, although unfortunately still not as fast as our own fast instance reuse method.

So now the question is: what do you want to do here? We could maybe close the gap even further with some optimization work on wasmtime's side, we could just eat the ~25us loss, or we could keep things as-is (with fast instance reuse remaining the default) and reinvestigate another day.
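For reference, a rough sketch of the shape of such a micro-benchmark, written directly against the wasmtime API rather than through the executor (the iteration count and the timing loop are illustrative; the real benchmark goes through Substrate's executor, and API details vary between wasmtime versions):

```rust
use std::time::Instant;
use wasmtime::{Engine, Linker, Module, Store};

fn bench_instantiation(engine: &Engine, module: &Module) -> anyhow::Result<()> {
    let linker: Linker<()> = Linker::new(engine);
    // In the real executor the linker is also populated with all host functions;
    // moving that work out of the hot path is what #10394 is about.
    let iterations = 10_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        // The measured unit of work: create a fresh instance and call a
        // trivial export, mirroring `new_instance` + `test_empty_return`.
        let mut store = Store::new(engine, ());
        let instance = linker.instantiate(&mut store, module)?;
        let func = instance.get_typed_func::<(), ()>(&mut store, "test_empty_return")?;
        func.call(&mut store, ())?;
    }
    println!("{:?} per iteration", start.elapsed() / iterations);
    Ok(())
}
```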

@pepyakin
Contributor Author

pepyakin commented Dec 2, 2021

Thanks for such a detailed investigation.

Could you try the same but with a real runtime? One of the biggest sources of overhead for Fast Instance Reuse (FIR) is copying over the data segments. In a real runtime the data segments can reach 200-300 KiB. I expect this would also affect the pooling strategy; however, we could also try wasmtime's userfaultfd support, which may help. In that case we should be careful to use a more representative workload. It would also depend on a fairly recent Linux kernel, but that may be OK.

Besides that, we could also check whether things like wasmtime::InstancePre or registering the host functions in the config can shave off some more time. I wouldn't hold my breath for those, though.

Admittedly, FIR is a hack and I'd be happy to get rid of it. At some point we may run out of tricks up our sleeves to keep it working as new features bring changes. Ideally we improve wasmtime so that it works for this ultra-low-latency scenario. I am optimistic, because AFAIU the folks behind wasmtime also value low latency a lot.
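To illustrate the wasmtime::InstancePre idea: host functions are registered once on a Linker, import resolution and type-checking happen once in instantiate_pre, and each call then only pays for instantiation. This is a minimal sketch; the `ext_log` host function is made up for illustration, and the exact instantiate_pre signature differs between wasmtime versions:

```rust
use wasmtime::{Engine, Linker, Module, Store};

fn precompile_and_reuse(engine: &Engine, module: &Module) -> anyhow::Result<()> {
    let mut linker: Linker<()> = Linker::new(engine);
    // Host functions are registered once on the linker
    // ("ext_log" is a hypothetical host function, for illustration only).
    linker.func_wrap("env", "ext_log", |value: i32| {
        println!("runtime says: {}", value);
    })?;

    // Import resolution and type-checking happen once here, so the per-call
    // instantiations below skip that work.
    // (In older wasmtime versions `instantiate_pre` also takes a store.)
    let instance_pre = linker.instantiate_pre(module)?;

    for _ in 0..3 {
        let mut store = Store::new(engine, ());
        let _instance = instance_pre.instantiate(&mut store)?;
        // ... call into the runtime ...
    }
    Ok(())
}
```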

@koute
Contributor

koute commented Dec 6, 2021

Okay, here are some numbers from calling into the Kusama runtime (the function called within the runtime is exactly the same; I copy-pasted it from the output of cargo expand for the test runtime and recompiled it as part of the full Kusama runtime):

  • With fast instance reuse: ~49us
  • Without fast instance reuse: ~4.9ms
  • Without fast instance reuse, after my PRs, no pooling: ~83us
  • Without fast instance reuse, after my PRs, with pooling, without uffd: ~83us
  • Without fast instance reuse, after my PRs, with pooling, with uffd: ~48us

...so this does look promising if we enable both pooling and uffd, basically giving us the same performance as fast instance reuse! (Enabling only one of them gives the usual ~83us.)

So it looks like we can probably just delete the fast instance reuse codepath? Before we commit, it'd be nice to run one more test that's more real-world, e.g. importing a bunch of blocks from an actual production chain and timing that, or something along those lines. @pepyakin Any idea what would be good to run?

@pepyakin
Contributor Author

pepyakin commented Dec 6, 2021

Ok, this is very useful and a good sign.

I think even without UFFD it is not that bad: it's a ~2x regression compared to the ~5x we witnessed before.

Relying on UFFD is still kind of annoying, though. I checked the code and it seems to require Linux 4.11. It seems we can afford to impose that minimum kernel version requirement. At first glance, however, it looks like if we enable UFFD the executor will fail at run time on kernels older than 4.11, which is really annoying.

While browsing the code, I stumbled upon the paged_memory_initialization configuration option, which I had completely forgotten about. Did you enable it? Never mind, paged_memory_initialization is enabled by default when UFFD is enabled.

Regarding the tests: I think block importing may be a good start, although once again it may be a bit deceiving, at least in the case of a production chain. E.g. with UFFD we can basically skip data segment initialization, but at the price of more costly access to the not-yet-paged-in memory areas where the data segments reside (or maybe even to untouched/zeroed areas; I haven't dived that deep into the implementation). In a production chain many of the first blocks will be empty and will hit only the same access pattern, and I imagine only a subset of the data segments will actually be used. If that's the case, it may be better to come up with a synthetic test.
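As a sketch of how the run-time failure could be avoided, here is a hypothetical helper that detects whether the running kernel is new enough for userfaultfd, so the executor could fall back to the non-UFFD strategy instead of failing. Linux-only; the helper name and the fallback wiring are assumptions, not existing code:

```rust
use std::fs;

/// Hypothetical helper: returns true when the running Linux kernel is at
/// least 4.11, the minimum that userfaultfd needs here. On any read/parse
/// failure it conservatively reports `false`, so the caller can fall back
/// to the non-uffd instantiation strategy instead of failing at run time.
fn kernel_supports_uffd() -> bool {
    let release = match fs::read_to_string("/proc/sys/kernel/osrelease") {
        Ok(s) => s,
        Err(_) => return false,
    };
    // osrelease looks like "5.15.0-91-generic"; take the major.minor part.
    let mut parts = release.trim().split(|c: char| !c.is_ascii_digit());
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    (major, minor) >= (4, 11)
}
```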

@pepyakin
Contributor Author

For the record, there is a discussion going on about using COW pages here. Quoting @koute:

Anyway, I did some experiments on wasmtime with COW memory and... it actually looks promising. On our benchmarks with the Kusama runtime the invocation time dropped from ~48us (which is what we get with either fast instance reuse or instance pooling + uffd) to ~20us, so it might be worth it to investigate this even further. (Of course this is just a YOLO proof of concept implementation; doing this properly would require more work to handle all of the corner cases.) From the profiling I did it might be possible to go even lower, since now after COW-ing the linear memory I see a bunch of normal memory allocations related to imports which dominate the runtime and which should also be cacheable across invocations.

[edit]After switching to preinitializing the module the invocation time with COW'd memory goes down to ~14us.[/edit]
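The OS mechanism that makes CoW-based reuse cheap can be sketched as follows. This is not wasmtime's actual implementation, just an illustration using a private mapping over a prepared linear-memory image (assumes the libc crate):

```rust
use std::{fs::File, os::unix::io::AsRawFd};

// Illustration of copy-on-write instance memory: the initialized
// linear-memory image lives in a file (or memfd), and every new "instance"
// gets a MAP_PRIVATE mapping of it. Pages are shared read-only until the
// instance writes to them, at which point the kernel copies only the
// touched pages.
fn map_cow_image(image: &File, len: usize) -> *mut u8 {
    unsafe {
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE, // copy-on-write: writes never reach the file
            image.as_raw_fd(),
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
        ptr as *mut u8
    }
}
```

Discarding such an "instance" is just an munmap; nothing ever has to be copied back, which is why reuse becomes nearly free compared to re-copying the data segments.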

@koute
Contributor

koute commented Dec 16, 2021

Continuing the discussion from the PR, for now I'm thinking we should do this:

  1. Merge in the refactoring from Refactor WASM module instantiation #10480 but do not enable instance pooling (I'll update the PR to strip it out).
  2. See if we can maybe contribute COW-based instance spawning to wasmtime, basically using @pepyakin's idea to have an InstancePre which would also keep a fossilized copy of the initialized memory and allow spawning cheap Instances. (I'm already looking into this.)
  3. If we can get COW-based spawning contributed, then we'll switch to that, forget about instance pooling, and rip out the fast instance reuse. (This will simplify our code and be faster; a rough sketch of what the switch could look like follows after this list.)
  4. If (3) doesn't work out for some reason, we can always fall back to instance pooling with the current code from my PR (but first get the uffd feature to fail gracefully on older kernels, which should be an easier and less controversial feature to contribute to wasmtime).
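A rough sketch of what (3) could look like once upstream support exists, assuming the copy-on-write memory-initialization knob that later landed in wasmtime (`Config::memory_init_cow`) combined with the InstancePre approach from earlier; the details are illustrative, not the final integration:

```rust
use wasmtime::{Config, Engine, Linker, Module, Store};

fn cow_engine_sketch(wasm: &[u8]) -> anyhow::Result<()> {
    let mut config = Config::new();
    // Ask wasmtime to back initialized linear memory with copy-on-write
    // mappings, so instantiation does not copy the data segments.
    config.memory_init_cow(true);
    let engine = Engine::new(&config)?;

    let module = Module::new(&engine, wasm)?;
    let linker: Linker<()> = Linker::new(&engine);
    // Resolve imports once; each call then only needs a fresh store plus a
    // cheap CoW-backed instance.
    let instance_pre = linker.instantiate_pre(&module)?;

    let mut store = Store::new(&engine, ());
    let _instance = instance_pre.instantiate(&mut store)?;
    Ok(())
}
```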

@koute
Contributor

koute commented Jan 24, 2022

For posterity, here I'm copy-pasting the results of the final benchmarks of my new CoW-based instance reuse mechanism:

[benchmark result charts comparing the instantiation strategies listed below across thread counts]

Legend:

  • native_instance_reuse: new CoW-based reuse
  • legacy_instance_reuse: our current reuse mechanism
  • instance_pooling_with_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy with uffd turned on
  • instance_pooling_without_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy without uffd turned on
  • recreate_instance: create a fresh instance with InstanceAllocationStrategy::OnDemand strategy
  • interpreted: wasmi

The measurements are only for the main thread; the thread count at the bottom signifies how many threads in total were running and doing exactly the same thing, e.g. for 4 threads there was 1 thread (the main thread) being benchmarked while the other 3 threads were running in the background.

The benchmarks are not yet fully committed; I'll add them in a PR after the new instantiation mechanism is merged into wasmtime and we can switch to it.
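For reference, the measurement setup described above can be sketched roughly like this (names and iteration counts are illustrative, not the actual benchmark code):

```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    thread,
    time::Instant,
};

// Rough sketch of the setup described above: only the main thread is
// measured, while `threads - 1` background threads keep running the same
// workload to create contention. `workload` stands in for
// "instantiate the runtime and call into it".
fn bench_with_background_threads(threads: u32, workload: fn()) {
    let stop = Arc::new(AtomicBool::new(false));

    // Spawn the unmeasured background threads.
    let background: Vec<_> = (1..threads)
        .map(|_| {
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                while !stop.load(Ordering::Relaxed) {
                    workload();
                }
            })
        })
        .collect();

    // Time the main thread only.
    let iterations = 1_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        workload();
    }
    println!("{} thread(s): {:?}/iter", threads, start.elapsed() / iterations);

    stop.store(true, Ordering::Relaxed);
    for handle in background {
        let _ = handle.join();
    }
}
```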

@andresilva andresilva moved this to In Progress 🛠 in SDK Node Apr 26, 2022
@koute koute moved this from In Progress 🛠 to Code in review 🧐 in SDK Node Apr 27, 2022
Repository owner moved this from Code in review 🧐 to Blocked ⛔️ in SDK Node May 19, 2022
@koute koute moved this from Blocked ⛔️ to Done ✅ in SDK Node May 20, 2022
Projects
Status: done

3 participants