Tail-calling interpreter #642
Comments
In the example, shouldn't these two lines be swapped? ISTM that …
To clarify, this is backwards: Clang already has `musttail`. So the tail-calling idea can be tested immediately, but it may have performance issues without the calling-convention change. And, of course, it only works on Clang builds.
Hello, I work on Pyodide at Cloudflare, and I did a quick and dirty implementation of this that seems to work. I'm mainly focused on Emscripten performance, so I haven't tried building with clang-19 and the preserve_none calling convention. Feel free to take a look and hit me up if you run into any issues.
(Coming in as a bit of an outsider to CPython.) Why keep this as just a tier 1 interpreter improvement? Wouldn't the tier 2 interpreter also benefit from this?

Also, given there are several ways to dispatch in an interpreter (direct threading, indirect threading, token threading, call threading, and tail calls, as this proposal suggests), would it be feasible to compare the performance of all of them to see which would fit CPython best? In particular, with call threading, each handler returns the address of the next handler to the main loop. The next handler's address is then laid out in the return-value register, ready to be called again, until the loop terminates when program execution ends.
It seems GCC now supports a guaranteed tail call attribute: https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html#index-musttail-statement-attribute. Not sure which version, though. With the interpreter generator in 3.14, we can generate multiple interpreters. For platforms that support it (mostly Clang and GCC), we can generate a tail-calling one, and leave MSVC alone with the normal interpreter. This should also improve the perf on Wasm-based platforms.
The tier 2 interpreter is only used for debugging; we're only concerned with the perf of the tier 2 JIT. However, if we can get this working, entering the JIT-compiled code should no longer require an expensive shim frame, and should be significantly cheaper. This also assumes we get the calling convention. I'll put this on my queue of things to work on.

Edit: I'm happy to shepherd such a change, but after hacking on it myself, it seems pretty intrusive to upstream. Relevant branch: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tail-call

Note: according to this blog post, https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, it seems we need …
Looking at @garrettgu10's implementation, I'm pretty impressed at the tricks you used to make the tail calls more efficient (like returning an enum of the error state). I naively tried to clone every error label.
I got a 10% geometric mean speedup on pyperformance (up to 40% on some benchmarks) on my branch with the tail-calling interpreter and clang-19: https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21
Woah, those numbers look incredible!! I appreciate the credit :-) I've never tried running pyperformance on Pyodide but I shall posthaste 😮
Ah, alright. But wouldn't this conflict with the computed-goto dispatch that CPython currently has in place for certain compilers?
We would only enable this dispatch mode for the small subset of compilers and architectures that support musttail. The existing switch and computed-goto dispatch modes would still be available.
I tried building on wasmtime + WASI using the instructions here (https://devguide.python.org/getting-started/setup-building/#wasi) but couldn't get it to work. I get the following:
I don't know how to proceed further. But @garrettgu10, hopefully this helps you: you can run a quick and dirty pystones benchmark to test instead of the full pyperformance: https://gist.github.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf. On Ubuntu 22.04 x86-64 I get a 25% speedup with tail calls (ThinLTO + PGO + clang-19 on both systems). Hopefully you get higher.
Presumably you need to pass …
@Fidget-Spinner If you look at the diff for …

I tried building your branch just now, but I'm getting stuck in dependency hell while building the bootstrap Python because I haven't built in a few months. Will try again later.
Where did that show up? My guess is you need to tweak the flags used by wasmtime: |
I think there's no longer a runtime flag, since Wasm tail calls are at stage 4. The WebAssembly features table says:
So this error message is presumably from the compiler toolchain. |
Yup, it was my compiler. Thanks for the help! WASI results are disappointing, mainly because …
Trying Emscripten next.
Unfortunately, I would imagine the results for WASI and Emscripten will be about the same.
Yeah, those results are consistent with what I found when I tried this a few months ago. Both Cranelift and V8 fail to preserve all the function inputs within registers between tail calls, so the code generation ends up being quite suboptimal. It seems difficult (impossible?) to communicate such a requirement through the compiled Wasm code. I also observed about a 10% slowdown on x64 without the preserve_none calling convention, so that's likely the issue. |
It's done 🎉 |
An alternative dispatch mechanism to switch or computed gotos is to use tail calls.
In a tail calling interpreter, dispatch is implemented as a tail call to the next instruction.
For example, the C code generated for `NOP` would look something like: …

If we can get the C compiler to generate jumps for tail calls and avoid lots of expensive spills, this can be faster than computed gotos.
The problem is that C compilers do not always generate tail calls, and the platform ABI forces them to spill a lot of registers.
However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.
This has the potential to speed up the tier 1 interpreter by several percent.