
Investigate instance pooling strategy in wasmtime #10244

Closed
pepyakin opened this issue Nov 12, 2021 · 10 comments · Fixed by #11232
Labels
I7-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task.

Comments

@pepyakin
Contributor

For some time now, wasmtime has supported instance pooling. In theory, this mechanism should improve startup latency.

If so, we may not need our own hand-rolled hacks such as the "fast instance reuse" mechanism, which already causes us some pain. For example, it takes extra effort to maintain (#9164 (comment)) and can potentially lead to hard-to-debug errors (#10095).
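For context, here is a minimal sketch of what switching to wasmtime's pooling allocator looks like. The exact configuration API differs between wasmtime versions; this follows the shape of the newer PoolingAllocationConfig API and is illustrative only:

```rust
use wasmtime::{Config, Engine, InstanceAllocationStrategy, PoolingAllocationConfig};

fn pooling_engine() -> anyhow::Result<Engine> {
    let mut config = Config::new();
    // Pre-allocate memory/table/instance slots up front so that instantiation
    // only has to hand out a slot instead of mmap-ing fresh memory every time.
    // (Limits are left at their defaults here; a real setup would tune them.)
    let pooling = PoolingAllocationConfig::default();
    config.allocation_strategy(InstanceAllocationStrategy::Pooling(pooling));
    Engine::new(&config)
}
```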

@pepyakin pepyakin added I7-refactor Code needs refactoring. I9-optimisation An enhancement to provide better overall performance in terms of time-to-completion for a task. labels Nov 12, 2021
@bkchr
Member

bkchr commented Nov 12, 2021

@koute do you maybe want to work on this?

@koute
Contributor

koute commented Nov 12, 2021

Sure, I can take a stab at it (since I'm in the area anyway); just one question: does anyone remember if we already have a benchmark for this? (If not, I'll add one.)

@bkchr
Member

bkchr commented Nov 12, 2021

We don't have a benchmark for this yet. Feel free to add one 🙂

@koute
Contributor

koute commented Dec 2, 2021

So I've done some experiments. Here's a quick rundown of the numbers, based on a simple benchmark (it calls new_instance and then calls test_empty_return within the runtime):

  • Each iteration with fast instance reuse currently takes ~5us.
  • Each iteration without fast instance reuse currently takes ~1ms.
  • After Statically register host WASM functions #10394 (plus one more PR that I'll put up after that one) is merged, each iteration without fast instance reuse and without the pooling strategy will take ~30us.
  • And finally, on top of that, turning on pooling only gains us an extra ~5us, so ~25us per iteration.

So pooling doesn't actually net us that much extra speed (at least in this benchmark). However, even without pooling, after my PRs the codepath without fast instance reuse becomes pretty fast, although unfortunately still not as fast as our own fast instance reuse method.

So now the question is: what do you want to do here? We could maybe close the gap even further with some optimization work on wasmtime's side, we could just eat the ~25us loss, or we could keep things as-is (with fast instance reuse remaining the default) and reinvestigate another day.
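For reference, a rough sketch of the shape of such a micro-benchmark, written directly against the wasmtime API rather than through the executor (the iteration count and the timing loop are illustrative; the real benchmark goes through Substrate's executor, and API details vary between wasmtime versions):

```rust
use std::time::Instant;
use wasmtime::{Engine, Linker, Module, Store};

fn bench_instantiation(engine: &Engine, module: &Module) -> anyhow::Result<()> {
    let linker: Linker<()> = Linker::new(engine);
    // In the real executor the linker is also populated with all host functions;
    // moving that work out of the hot path is what #10394 is about.
    let iterations = 10_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        // The measured unit of work: create a fresh instance and call a
        // trivial export, mirroring `new_instance` + `test_empty_return`.
        let mut store = Store::new(engine, ());
        let instance = linker.instantiate(&mut store, module)?;
        let func = instance.get_typed_func::<(), ()>(&mut store, "test_empty_return")?;
        func.call(&mut store, ())?;
    }
    println!("{:?} per iteration", start.elapsed() / iterations);
    Ok(())
}
```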

@pepyakin
Contributor Author

pepyakin commented Dec 2, 2021

Thanks for such a detailed investigation.

Could you try the same but with a real runtime? One of the biggest sources of overhead for Fast Instance Reuse (FIR) is copying over the data segments. In a real runtime the data segments can reach 200-300 KiB. I expect this would also affect the pooling strategy; however, we could also try wasmtime's userfaultfd support, which may help. In that case we should be careful to use a more representative workload. It would also depend on a fairly recent Linux kernel, but that may be OK.

Besides that, we could also check whether things like wasmtime::InstancePre or registering the host functions in the config can shave off some more time. I wouldn't hold my breath for those, though.

Admittedly, FIR is a hack and I'd be happy to get rid of it. At some point we may run out of tricks up our sleeves to keep it working as new features bring changes. Ideally we improve wasmtime so that it works for this ultra-low-latency scenario. I am optimistic, because AFAIU the folks behind wasmtime also value low latency a lot.
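To illustrate the wasmtime::InstancePre idea: host functions are registered once on a Linker, import resolution and type-checking happen once in instantiate_pre, and each call then only pays for instantiation. This is a minimal sketch; the `ext_log` host function is made up for illustration, and the exact instantiate_pre signature differs between wasmtime versions:

```rust
use wasmtime::{Engine, Linker, Module, Store};

fn precompile_and_reuse(engine: &Engine, module: &Module) -> anyhow::Result<()> {
    let mut linker: Linker<()> = Linker::new(engine);
    // Host functions are registered once on the linker
    // ("ext_log" is a hypothetical host function, for illustration only).
    linker.func_wrap("env", "ext_log", |value: i32| {
        println!("runtime says: {}", value);
    })?;

    // Import resolution and type-checking happen once here, so the per-call
    // instantiations below skip that work.
    // (In older wasmtime versions `instantiate_pre` also takes a store.)
    let instance_pre = linker.instantiate_pre(module)?;

    for _ in 0..3 {
        let mut store = Store::new(engine, ());
        let _instance = instance_pre.instantiate(&mut store)?;
        // ... call into the runtime ...
    }
    Ok(())
}
```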

@koute
Contributor

koute commented Dec 6, 2021

Okay, here are some numbers from calling into the Kusama runtime (the function called within the runtime is exactly the same; I copy-pasted it from the output of cargo expand for the test runtime and recompiled it as part of the full Kusama runtime):

  • With fast instance reuse: ~49us
  • Without fast instance reuse: ~4.9ms
  • Without fast instance reuse, after my PRs, no pooling: ~83us
  • Without fast instance reuse, after my PRs, with pooling, without uffd: ~83us
  • Without fast instance reuse, after my PRs, with pooling, with uffd: ~48us

...so this does look promising if we enable both pooling and uffd, basically giving us the same performance as fast instance reuse! (Enabling only one of them gives the usual ~83us.)

So it looks like we can probably just delete the fast instance reuse codepath? Before we commit, it'd be nice to run one more test that's more real-world, e.g. importing a bunch of blocks from an actual production chain and timing that, or something along those lines. @pepyakin Any idea what would be good to run?

@pepyakin
Contributor Author

pepyakin commented Dec 6, 2021

Ok, this is very useful and a good sign.

I think even without UFFD it is not that bad: it's a ~2x regression compared to the ~5x we witnessed before.

Relying on UFFD is still kind of annoying, though. I checked the code and it seems to require Linux 4.11. It seems we can afford to impose that minimum kernel version requirement. At first glance, however, it looks like if we enable UFFD the executor will fail at run time on kernels older than 4.11, which is really annoying.

While browsing the code, I stumbled upon the paged_memory_initialization configuration option, which I had completely forgotten about. Did you enable it? Never mind, paged_memory_initialization is enabled by default when UFFD is enabled.

Regarding the tests: I think block importing may be a good start, although once again it may be a bit deceiving, at least in the case of a production chain. E.g. with UFFD we can basically skip data segment initialization, but at the price of more costly access to the not-yet-paged-in memory areas where the data segments reside (or maybe even to untouched/zeroed areas; I haven't dived that deep into the implementation). In a production chain many of the first blocks will be empty and will hit only the same access pattern, and I imagine only a subset of the data segments will actually be used. If that's the case, it may be better to come up with a synthetic test.
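As a sketch of how the run-time failure could be avoided, here is a hypothetical helper that detects whether the running kernel is new enough for userfaultfd, so the executor could fall back to the non-UFFD strategy instead of failing. Linux-only; the helper name and the fallback wiring are assumptions, not existing code:

```rust
use std::fs;

/// Hypothetical helper: returns true when the running Linux kernel is at
/// least 4.11, the minimum that userfaultfd needs here. On any read/parse
/// failure it conservatively reports `false`, so the caller can fall back
/// to the non-uffd instantiation strategy instead of failing at run time.
fn kernel_supports_uffd() -> bool {
    let release = match fs::read_to_string("/proc/sys/kernel/osrelease") {
        Ok(s) => s,
        Err(_) => return false,
    };
    // osrelease looks like "5.15.0-91-generic"; take the major.minor part.
    let mut parts = release.trim().split(|c: char| !c.is_ascii_digit());
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    (major, minor) >= (4, 11)
}
```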

@pepyakin
Contributor Author

For the record, there is a discussion going on about using COW pages here. Quoting @koute:

Anyway, I did some experiments on wasmtime with COW memory and... it actually looks promising. On our benchmarks with the Kusama runtime the invocation time dropped from ~48us (which is what we get with either fast instance reuse or instance pooling + uffd) to ~20us, so it might be worth it to investigate this even further. (Of course this is just a YOLO proof of concept implementation; doing this properly would require more work to handle all of the corner cases.) From the profiling I did it might be possible to go even lower, since now after COW-ing the linear memory I see a bunch of normal memory allocations related to imports which dominate the runtime and which should also be cacheable across invocations.

[edit]After switching to preinitializing the module the invocation time with COW'd memory goes down to ~14us.[/edit]
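The OS mechanism that makes CoW-based reuse cheap can be sketched as follows. This is not wasmtime's actual implementation, just an illustration using a private mapping over a prepared linear-memory image (assumes the libc crate):

```rust
use std::{fs::File, os::unix::io::AsRawFd};

// Illustration of copy-on-write instance memory: the initialized
// linear-memory image lives in a file (or memfd), and every new "instance"
// gets a MAP_PRIVATE mapping of it. Pages are shared read-only until the
// instance writes to them, at which point the kernel copies only the
// touched pages.
fn map_cow_image(image: &File, len: usize) -> *mut u8 {
    unsafe {
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE, // copy-on-write: writes never reach the file
            image.as_raw_fd(),
            0,
        );
        assert_ne!(ptr, libc::MAP_FAILED, "mmap failed");
        ptr as *mut u8
    }
}
```

Discarding such an "instance" is just an munmap; nothing ever has to be copied back, which is why reuse becomes nearly free compared to re-copying the data segments.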

@koute
Contributor

koute commented Dec 16, 2021

Continuing the discussion from the PR, for now I'm thinking we should do this:

  1. Merge in the refactoring from Refactor WASM module instantiation #10480 but do not enable instance pooling (I'll update the PR to strip it out).
  2. See if we can maybe contribute COW-based instance spawning to wasmtime, basically using @pepyakin's idea to have an InstancePre which would also keep a fossilized copy of the initialized memory and allow spawning cheap Instances. (I'm already looking into this.)
  3. If we can get COW-based spawning contributed, then we'll switch to that, forget about instance pooling, and rip out the fast instance reuse. (This will simplify our code and be faster; a rough sketch of what the switch could look like follows after this list.)
  4. If (3) doesn't work out for some reason, we can always fall back to instance pooling with the current code from my PR (but first get the uffd feature to fail gracefully on older kernels, which should be an easier and less controversial feature to contribute to wasmtime).
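A rough sketch of what (3) could look like once upstream support exists, assuming the copy-on-write memory-initialization knob that later landed in wasmtime (`Config::memory_init_cow`) combined with the InstancePre approach from earlier; the details are illustrative, not the final integration:

```rust
use wasmtime::{Config, Engine, Linker, Module, Store};

fn cow_engine_sketch(wasm: &[u8]) -> anyhow::Result<()> {
    let mut config = Config::new();
    // Ask wasmtime to back initialized linear memory with copy-on-write
    // mappings, so instantiation does not copy the data segments.
    config.memory_init_cow(true);
    let engine = Engine::new(&config)?;

    let module = Module::new(&engine, wasm)?;
    let linker: Linker<()> = Linker::new(&engine);
    // Resolve imports once; each call then only needs a fresh store plus a
    // cheap CoW-backed instance.
    let instance_pre = linker.instantiate_pre(&module)?;

    let mut store = Store::new(&engine, ());
    let _instance = instance_pre.instantiate(&mut store)?;
    Ok(())
}
```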

@koute
Contributor

koute commented Jan 24, 2022

For posterity, here I'm copy-pasting the results of the final benchmarks of my new CoW-based instance reuse mechanism:

[benchmark result charts comparing the instantiation strategies listed below across thread counts]

Legend:

  • native_instance_reuse: new CoW-based reuse
  • legacy_instance_reuse: our current reuse mechanism
  • instance_pooling_with_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy with uffd turned on
  • instance_pooling_without_uffd: create a fresh instance with InstanceAllocationStrategy::Pooling strategy without uffd turned on
  • recreate_instance: create a fresh instance with InstanceAllocationStrategy::OnDemand strategy
  • interpreted: wasmi

The measurements are only for the main thread; the thread count at the bottom signifies how many threads in total were running and doing exactly the same thing, e.g. for 4 threads there was 1 thread (the main thread) being benchmarked while the other 3 threads were running in the background.

The benchmarks are not yet fully committed; I'll add them in a PR after the new instantiation mechanism is merged into wasmtime and we can switch to it.
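For reference, the measurement setup described above can be sketched roughly like this (names and iteration counts are illustrative, not the actual benchmark code):

```rust
use std::{
    sync::{
        atomic::{AtomicBool, Ordering},
        Arc,
    },
    thread,
    time::Instant,
};

// Rough sketch of the setup described above: only the main thread is
// measured, while `threads - 1` background threads keep running the same
// workload to create contention. `workload` stands in for
// "instantiate the runtime and call into it".
fn bench_with_background_threads(threads: u32, workload: fn()) {
    let stop = Arc::new(AtomicBool::new(false));

    // Spawn the unmeasured background threads.
    let background: Vec<_> = (1..threads)
        .map(|_| {
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                while !stop.load(Ordering::Relaxed) {
                    workload();
                }
            })
        })
        .collect();

    // Time the main thread only.
    let iterations = 1_000u32;
    let start = Instant::now();
    for _ in 0..iterations {
        workload();
    }
    println!("{} thread(s): {:?}/iter", threads, start.elapsed() / iterations);

    stop.store(true, Ordering::Relaxed);
    for handle in background {
        let _ = handle.join();
    }
}
```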

@andresilva andresilva moved this to In Progress 🛠 in SDK Node Apr 26, 2022
@koute koute moved this from In Progress 🛠 to Code in review 🧐 in SDK Node Apr 27, 2022
Repository owner moved this from Code in review 🧐 to Blocked ⛔️ in SDK Node May 19, 2022
@koute koute moved this from Blocked ⛔️ to Done ✅ in SDK Node May 20, 2022
Projects
Status: done

3 participants