
Tail-calling interpreter #642

Closed · markshannon opened this issue Jan 8, 2024 · 19 comments

@markshannon (Member) commented Jan 8, 2024

An alternative dispatch mechanism to switch or computed gotos is to use tail calls.
In a tail calling interpreter, dispatch is implemented as a tail call to the next instruction.

For example: the C code generated for NOP would look something like:

typedef RETURN_TYPE (*funcptr)(PyObject **sp, PyInterpreterFrame *fp,
                               PyThreadState *ts, _Py_CODEUNIT *next_instr,
                               int oparg);

funcptr INSTRUCTION_TABLE[256] = {
    [NOP] = nop_func,
    ...
};

RETURN_TYPE nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts, _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;                    /* NOP itself does nothing */
    int opcode = next_instr->op.code;   /* decode the next instruction */
    oparg = next_instr->op.arg;         /* reuse the parameter; redeclaring
                                           `int oparg` here would shadow it */
    funcptr next = INSTRUCTION_TABLE[opcode];
    return next(sp, fp, ts, next_instr, oparg);   /* dispatch = tail call */
}

If we can get the C compiler to generate jumps for tail calls and avoid lots of expensive spills, this can be faster than computed gotos.

The problem is that C compilers do not always generate tail calls, and the platform ABI forces them to spill a lot of registers.
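
For contrast, here is a minimal sketch of computed-goto dispatch, loosely mirroring what CPython's ceval.c does today; the names are illustrative, and CPython's internal headers are assumed for _Py_CODEUNIT and NOP. Because the whole interpreter is one function, the compiler can keep sp, fp, next_instr, etc. in registers across opcodes with no calling convention in the way:

static void interp(_Py_CODEUNIT *next_instr)
{
    /* GCC/Clang labels-as-values extension: the table holds the
       addresses of the labels below. */
    static void *dispatch_table[256] = {
        [NOP] = &&TARGET_NOP,
        /* ... one label per opcode ... */
    };
    int opcode, oparg;

#define DISPATCH()                        \
    do {                                  \
        opcode = next_instr->op.code;     \
        oparg  = next_instr->op.arg;      \
        goto *dispatch_table[opcode];     \
    } while (0)

    DISPATCH();

TARGET_NOP:
    next_instr += 1;
    DISPATCH();

    /* ... remaining opcode handlers ... */
}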

However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

This has the potential to speed up the tier 1 interpreter by several percent.

@gvanrossum (Collaborator) commented:

In the example, shouldn't these two lines be swapped?

    funcptr next = INSTRUCTION_TABLE[opcode];
    int opcode = next_instr->op.code;

ISTM that next depends on the new value of opcode, which is computed in the second line.

@brandtbucher (Member) commented:

However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

To clarify, this is backwards: Clang already has __attribute__((musttail)), and may be getting the GHC calling convention in the future.

So the tail-calling idea can be tested immediately, but may have performance issues without the calling convention change. And, of course, it only works on Clang builds.
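
Since musttail is available today, here is a minimal sketch of the dispatch with it, reusing the illustrative names from the example at the top of the issue (not CPython's actual code):

/* Sketch: Clang's __attribute__((musttail)) applied to the dispatch from
   the opening example. Clang rejects the program outright if it cannot
   compile the call below as a jump. */
RETURN_TYPE nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
                     _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;
    int opcode = next_instr->op.code;
    oparg = next_instr->op.arg;
    /* musttail requires the callee's prototype to match the caller's,
       which the uniform handler signature guarantees here. */
    __attribute__((musttail))
    return INSTRUCTION_TABLE[opcode](sp, fp, ts, next_instr, oparg);
}

With a plain return, the compiler is merely allowed to emit a tail call; musttail turns a missed one into a compile-time error rather than silent stack growth.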

@garrettgu10 commented Jun 3, 2024

Hello, I work on Pyodide at Cloudflare, and I did a quick and dirty implementation of this that seems to work. I'm mainly focused on Emscripten performance, so I haven't tried building with clang-19 and the preserve_none calling convention. Feel free to take a look and hit me up if you run into any issues.

garrettgu10/cpython#1

@TheShermanTanker commented:

(Coming in as a bit of an outsider to cpython)

Why keep this as just a tier 1 Interpreter improvement? Wouldn't the tier 2 Interpreter also benefit from this?

Also, given there are several ways to dispatch in an Interpreter (Direct Threading, Indirect Threading, Token Threading, Call Threading, and Tail Calls, as this proposal suggests), would it be feasible to compare the performance of all of them to see which would fit cpython the best? In particular, with call threading, each handler returns the address of the next handler to the main loop, which then has that address nicely laid out in the return-value register to call again; the loop terminates when program execution ends.
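
A rough sketch of that call-threading scheme, with all types and names invented here for illustration (this is not an actual CPython proposal):

/* Call threading: each handler returns the next handler, and a tiny
   driver loop calls it, so the next handler's address arrives in the
   return-value register. */
typedef struct { unsigned char code, arg; } Instr;
typedef struct { Instr *next_instr; } VMState;
typedef struct handler handler;
struct handler { handler (*fn)(VMState *vm); };

static handler HANDLER_TABLE[256];   /* one entry per opcode, e.g.
                                        HANDLER_TABLE[0] = (handler){ op_nop } */

static handler op_nop(VMState *vm)
{
    vm->next_instr += 1;             /* NOP: just advance */
    return HANDLER_TABLE[vm->next_instr->code];
}

static void run(VMState *vm)
{
    handler h = HANDLER_TABLE[vm->next_instr->code];
    while (h.fn != NULL)             /* a halt handler returns {NULL} */
        h = h.fn(vm);
}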

@Fidget-Spinner (Collaborator) commented Jan 4, 2025

It seems GCC now supports a guaranteed tail-call attribute: https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html#index-musttail-statement-attribute. Not sure which version, though.

With the interpreter generator in 3.14, we can generate multiple interpreters. For platforms that support it (mostly clang and gcc), we can generate a tail-calling one. We can leave MSVC alone with the normal interpreter.

This should also improve the perf on wasm-based platforms.

Wouldn't the tier 2 Interpreter also benefit from this?

The tier 2 interpreter is only used for debugging. We're only concerned with the perf of the tier 2 JIT.

However, if we can get this working, entering the JITted code should no longer require an expensive shim frame, and should be significantly cheaper. This also assumes we get the calling convention.

I'll put this on my queue of things to work on. Edit: I'm happy to shepherd such a change, but after hacking on it myself, it seems pretty intrusive to upstream. Relevant branch: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tail-call

Note: according to this blog post, https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, we need preserve_none or preserve_most (either one of the two) to make it perform well.
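
As a sketch of how the two attributes combine (assuming Clang 19's preserve_none on x86-64; names illustrative as before):

/* preserve_none applied to every handler so the chained musttail calls
   stop spilling caller-saved registers. Not CPython's actual code. */
typedef RETURN_TYPE (__attribute__((preserve_none)) *funcptr)(
    PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
    _Py_CODEUNIT *next_instr, int oparg);

extern funcptr INSTRUCTION_TABLE[256];

__attribute__((preserve_none)) RETURN_TYPE
nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
         _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;
    /* Caller and callee share the same convention, as musttail requires. */
    __attribute__((musttail))
    return INSTRUCTION_TABLE[next_instr->op.code](sp, fp, ts, next_instr,
                                                  next_instr->op.arg);
}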

@Fidget-Spinner (Collaborator) commented:

Looking at @garrettgu10's implementation, I'm pretty impressed by the tricks you used to make the tail calls more efficient (like returning an enum of the error state). I naively tried to clone every error label.

Fidget-Spinner self-assigned this Jan 5, 2025
@Fidget-Spinner (Collaborator) commented Jan 5, 2025

On my branch, with the tail-calling interpreter and clang-19, I got a 10% geometric-mean speedup on pyperformance, with up to 40% on individual benchmarks:
Base: clang-19 compiled CPython with ThinLTO+PGO
My branch: clang-19 compiled CPython with ThinLTO+PGO and tail-calling interpreter
System: Ubuntu 22.04

https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21

@garrettgu10 commented:

Woah, those numbers look incredible!! I appreciate the credit :-)

I've never tried running pyperformance on Pyodide but I shall posthaste 😮

@TheShermanTanker commented:

With the interpreter generator in 3.14, we can generate multiple interpreters. For platforms that support it (mostly clang and gcc), we can generate a tail-calling one. We can leave MSVC alone with the normal interpreter.

Ah, alright. But wouldn't this conflict with the computed goto thing that cpython has in place for certain compilers currently?

@garrettgu10 commented:

Ah, alright. But wouldn't this conflict with the computed goto thing that cpython has in place for certain compilers currently?

We would only enable this dispatch mode for a small subset of architectures that have support for musttail. The existing switch and goto dispatch modes would still be available.
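
A hypothetical sketch of how such a build-time selection might look; the macro names here are invented for illustration and are not CPython's actual configuration macros:

/* Compilers with musttail get the tail-calling interpreter, other
   GCC-compatible compilers keep computed gotos, and everything else
   (e.g. MSVC) keeps the portable switch. */
#ifndef __has_attribute
#  define __has_attribute(x) 0          /* portability shim */
#endif

#if __has_attribute(musttail)
#  define DISPATCH_MODE_TAIL_CALL 1
#elif defined(__GNUC__)
#  define DISPATCH_MODE_COMPUTED_GOTO 1
#else
#  define DISPATCH_MODE_SWITCH 1
#endif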

@Fidget-Spinner (Collaborator) commented:

I tried building on wasmtime+wasi using the instructions here: https://devguide.python.org/getting-started/setup-building/#wasi, but couldn't get it to work. I get the following error:

WebAssembly 'tail-call' feature not enabled

I don't know how to proceed further. But @garrettgu10, hopefully this helps you: you can run a quick and dirty pystones benchmark rather than the full pyperformance suite: https://gist.github.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf. On Ubuntu 22.04 x86-64 I get a 25% speedup with tail calls (ThinLTO + PGO + clang-19 on both systems). Hopefully you get something even higher.

@hoodmane commented Jan 6, 2025

Presumably you need to pass -mtail-call to the compiler/linker.

@garrettgu10 commented Jan 6, 2025

@Fidget-Spinner If you look at the diff for configure here you'll see where the -mtail-call option can be added for WASI and Emscripten targets.

I tried building your branch just now, but I'm getting stuck in dependency hell while building the bootstrap Python because I haven't built in a few months. Will try again later.

@brettcannon commented:

I tried building on wasmtime+wasi using the instructions here: https://devguide.python.org/getting-started/setup-building/#wasi, but couldn't get it to work. I get the following error:

WebAssembly 'tail-call' feature not enabled

Where did that show up? My guess is you need to tweak the flags used by wasmtime: --wasm tail-call=y turns them on in the runtime. You can tweak the args passed to wasmtime using the --host-runner flag to wasi.py.

@hoodmane commented Jan 6, 2025

I think there's no longer a runtime flag, since WASM tail calls are a stage 4 proposal. The WebAssembly features page says:

Enabled by default when using the Cranelift backend since wasmtime version 22, the s390x architecture supports it since 24

So this error message is presumably from the compiler toolchain.

@Fidget-Spinner (Collaborator) commented:

Yup, it was my compiler. Thanks for the help!

WASI results are disappointing, mainly because preserve_none is not supported:

WASI:

../../Python/generated_cases_tail_call.c.h:10846:16: warning: 'preserve_none' calling convention is not supported for this target [-Wignored-attributes]
 10846 | __attribute__((preserve_none)) static PyObject *
       |                ^

Tail-calling interpreter:

Pystone(1.1) time for 1000000 passes = 8.58486
This machine benchmarks at 116484 pystones/second


Main:
Pystone(1.1) time for 1000000 passes = 4.27952
This machine benchmarks at 233671 pystones/second

Trying Emscripten next.

@hoodmane commented Jan 6, 2025

Unfortunately, I would imagine the results for WASI and Emscripten will be about the same.

@garrettgu10 commented:

Yeah, those results are consistent with what I found when I tried this a few months ago. Both Cranelift and V8 fail to preserve all the function inputs within registers between tail calls, so the code generation ends up being quite suboptimal. It seems difficult (impossible?) to communicate such a requirement through the compiled Wasm code.

I also observed about a 10% slowdown on x64 without the preserve_none calling convention, so that's likely the issue.

@markshannon (Member, Author) commented:

It's done 🎉
