
Tail-calling interpreter #642

Closed · markshannon opened this issue Jan 8, 2024 · 19 comments

@markshannon (Member) commented Jan 8, 2024

An alternative dispatch mechanism to switch or computed gotos is to use tail calls.
In a tail calling interpreter, dispatch is implemented as a tail call to the next instruction.

For example: the C code generated for NOP would look something like:

typedef RETURN_TYPE (*funcptr)(PyObject **sp, PyInterpreterFrame *fp,
                               PyThreadState *ts, _Py_CODEUNIT *next_instr,
                               int oparg);

funcptr INSTRUCTION_TABLE[256] = {
    [NOP] = nop_func,
    ...
};

RETURN_TYPE nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts, _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;                    /* NOP itself does nothing */
    int opcode = next_instr->op.code;   /* decode the next instruction */
    oparg = next_instr->op.arg;         /* reuse the parameter; redeclaring
                                           `int oparg` here would shadow it */
    funcptr next = INSTRUCTION_TABLE[opcode];
    return next(sp, fp, ts, next_instr, oparg);   /* dispatch = tail call */
}

If we can get the C compiler to generate jumps for tail calls and avoid lots of expensive spills, this can be faster than computed gotos.

The problem is that C compilers do not always generate tail calls, and the platform ABI forces them to spill a lot of registers.
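
For contrast, here is a minimal sketch of computed-goto dispatch, loosely mirroring what CPython's ceval.c does today; the names are illustrative, and CPython's internal headers are assumed for _Py_CODEUNIT and NOP. Because the whole interpreter is one function, the compiler can keep sp, fp, next_instr, etc. in registers across opcodes with no calling convention in the way:

static void interp(_Py_CODEUNIT *next_instr)
{
    /* GCC/Clang labels-as-values extension: the table holds the
       addresses of the labels below. */
    static void *dispatch_table[256] = {
        [NOP] = &&TARGET_NOP,
        /* ... one label per opcode ... */
    };
    int opcode, oparg;

#define DISPATCH()                        \
    do {                                  \
        opcode = next_instr->op.code;     \
        oparg  = next_instr->op.arg;      \
        goto *dispatch_table[opcode];     \
    } while (0)

    DISPATCH();

TARGET_NOP:
    next_instr += 1;
    DISPATCH();

    /* ... remaining opcode handlers ... */
}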

However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

This has the potential to speed up the tier 1 interpreter by several percent.

@gvanrossum (Collaborator) commented:

In the example, shouldn't these two lines be swapped?

    funcptr next = INSTRUCTION_TABLE[opcode];
    int opcode = next_instr->op.code;

ISTM that next depends on the new value of opcode, which is computed in the second line.

@brandtbucher (Member) commented:

However, upcoming versions of Clang may support annotations to force tail calls, and LLVM already supports the GHC (Haskell) calling convention which should eliminate the spills.

To clarify, this is backwards: Clang already has __attribute__((musttail)), and may be getting the GHC calling convention in the future.

So the tail-calling idea can be tested immediately, but may have performance issues without the calling convention change. And, of course, it only works on Clang builds.
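
Since musttail is available today, here is a minimal sketch of the dispatch with it, reusing the illustrative names from the example at the top of the issue (not CPython's actual code):

/* Sketch: Clang's __attribute__((musttail)) applied to the dispatch from
   the opening example. Clang rejects the program outright if it cannot
   compile the call below as a jump. */
RETURN_TYPE nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
                     _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;
    int opcode = next_instr->op.code;
    oparg = next_instr->op.arg;
    /* musttail requires the callee's prototype to match the caller's,
       which the uniform handler signature guarantees here. */
    __attribute__((musttail))
    return INSTRUCTION_TABLE[opcode](sp, fp, ts, next_instr, oparg);
}

With a plain return, the compiler is merely allowed to emit a tail call; musttail turns a missed one into a compile-time error rather than silent stack growth.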

@garrettgu10 commented Jun 3, 2024

Hello, I work on Pyodide at Cloudflare, and I did a quick and dirty implementation of this that seems to work. I'm mainly focused on Emscripten performance, so I haven't tried building with clang-19 and the preserve_none calling convention. Feel free to take a look and hit me up if you run into any issues.

garrettgu10/cpython#1

@TheShermanTanker commented:

(Coming in as a bit of an outsider to cpython)

Why keep this as just a tier 1 Interpreter improvement? Wouldn't the tier 2 Interpreter also benefit from this?

Also, given there are several ways to dispatch in an Interpreter (Direct Threading, Indirect Threading, Token Threading, Call Threading, and Tail Calls, as this proposal suggests), would it be feasible to compare the performance of all of them to see which would fit cpython the best? In particular, with call threading, each handler returns the address of the next handler to the main loop, which then has that address nicely laid out in the return-value register to call again; the loop terminates when program execution ends.
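
A rough sketch of that call-threading scheme, with all types and names invented here for illustration (this is not an actual CPython proposal):

/* Call threading: each handler returns the next handler, and a tiny
   driver loop calls it, so the next handler's address arrives in the
   return-value register. */
typedef struct { unsigned char code, arg; } Instr;
typedef struct { Instr *next_instr; } VMState;
typedef struct handler handler;
struct handler { handler (*fn)(VMState *vm); };

static handler HANDLER_TABLE[256];   /* one entry per opcode, e.g.
                                        HANDLER_TABLE[0] = (handler){ op_nop } */

static handler op_nop(VMState *vm)
{
    vm->next_instr += 1;             /* NOP: just advance */
    return HANDLER_TABLE[vm->next_instr->code];
}

static void run(VMState *vm)
{
    handler h = HANDLER_TABLE[vm->next_instr->code];
    while (h.fn != NULL)             /* a halt handler returns {NULL} */
        h = h.fn(vm);
}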

@Fidget-Spinner (Collaborator) commented Jan 4, 2025

It seems GCC now supports a guaranteed tail-call attribute: https://gcc.gnu.org/onlinedocs/gcc/Statement-Attributes.html#index-musttail-statement-attribute. Not sure which version, though.

With the interpreter generator in 3.14, we can generate multiple interpreters. For platforms that support it (mostly clang and gcc), we can generate a tail-calling one. We can leave MSVC alone with the normal interpreter.

This should also improve the perf on wasm-based platforms.

Wouldn't the tier 2 Interpreter also benefit from this?

The tier 2 interpreter is only used for debugging. We're only concerned with the perf of the tier 2 JIT.

However, if we can get this working, entering the JITted code should no longer require an expensive shim frame, and should be significantly cheaper. This also assumes we get the calling convention.

I'll put this on my queue of things to work on. Edit: I'm happy to shepherd such a change, but after hacking on it myself, it seems pretty intrusive to upstream. Relevant branch: https://github.com/Fidget-Spinner/cpython/pull/new/Fidget-Spinner:cpython:tail-call

Note: according to this blog post, https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html, we need preserve_none or preserve_most (either one of the two) to make it perform well.
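
As a sketch of how the two attributes combine (assuming Clang 19's preserve_none on x86-64; names illustrative as before):

/* preserve_none applied to every handler so the chained musttail calls
   stop spilling caller-saved registers. Not CPython's actual code. */
typedef RETURN_TYPE (__attribute__((preserve_none)) *funcptr)(
    PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
    _Py_CODEUNIT *next_instr, int oparg);

extern funcptr INSTRUCTION_TABLE[256];

__attribute__((preserve_none)) RETURN_TYPE
nop_func(PyObject **sp, PyInterpreterFrame *fp, PyThreadState *ts,
         _Py_CODEUNIT *next_instr, int oparg)
{
    next_instr += 1;
    /* Caller and callee share the same convention, as musttail requires. */
    __attribute__((musttail))
    return INSTRUCTION_TABLE[next_instr->op.code](sp, fp, ts, next_instr,
                                                  next_instr->op.arg);
}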

@Fidget-Spinner (Collaborator) commented:

Looking at @garrettgu10's implementation, I'm pretty impressed by the tricks you used to make the tail calls more efficient (like returning an enum of the error state). I naively tried to clone every error label.

Fidget-Spinner self-assigned this Jan 5, 2025
@Fidget-Spinner (Collaborator) commented Jan 5, 2025

On my branch, with the tail-calling interpreter and clang-19, I got a 10% geometric-mean speedup on pyperformance, with up to 40% on individual benchmarks:
Base: clang-19 compiled CPython with ThinLTO+PGO
My branch: clang-19 compiled CPython with ThinLTO+PGO and tail-calling interpreter
System: Ubuntu 22.04

https://gist.github.com/Fidget-Spinner/497c664eef389622d146d632990b0d21

@garrettgu10 commented:

Woah, those numbers look incredible!! I appreciate the credit :-)

I've never tried running pyperformance on Pyodide but I shall posthaste 😮

@TheShermanTanker commented:

With the interpreter generator in 3.14, we can generate multiple interpreters. For platforms that support it (mostly clang and gcc), we can generate a tail-calling one. We can leave MSVC alone with the normal interpreter.

Ah, alright. But wouldn't this conflict with the computed goto thing that cpython has in place for certain compilers currently?

@garrettgu10 commented:

Ah, alright. But wouldn't this conflict with the computed goto thing that cpython has in place for certain compilers currently?

We would only enable this dispatch mode for a small subset of architectures that have support for musttail. The existing switch and goto dispatch modes would still be available.
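
A hypothetical sketch of how such a build-time selection might look; the macro names here are invented for illustration and are not CPython's actual configuration macros:

/* Compilers with musttail get the tail-calling interpreter, other
   GCC-compatible compilers keep computed gotos, and everything else
   (e.g. MSVC) keeps the portable switch. */
#ifndef __has_attribute
#  define __has_attribute(x) 0          /* portability shim */
#endif

#if __has_attribute(musttail)
#  define DISPATCH_MODE_TAIL_CALL 1
#elif defined(__GNUC__)
#  define DISPATCH_MODE_COMPUTED_GOTO 1
#else
#  define DISPATCH_MODE_SWITCH 1
#endif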

@Fidget-Spinner (Collaborator) commented:

I tried building on wasmtime+wasi using the instructions here: https://devguide.python.org/getting-started/setup-building/#wasi, but couldn't get it to work. I get the following error:

WebAssembly 'tail-call' feature not enabled

I don't know how to proceed further. But @garrettgu10, hopefully this helps you: you can run a quick and dirty pystones benchmark rather than the full pyperformance suite: https://gist.github.com/Fidget-Spinner/e7bf204bf605680b0fc1540fe3777acf. On Ubuntu 22.04 x86-64 I get a 25% speedup with tail calls (ThinLTO + PGO + clang-19 on both systems). Hopefully you get something even higher.

@hoodmane commented Jan 6, 2025

Presumably you need to pass -mtail-call to the compiler/linker.

@garrettgu10 commented Jan 6, 2025

@Fidget-Spinner If you look at the diff for configure here you'll see where the -mtail-call option can be added for WASI and Emscripten targets.

I tried building your branch just now, but I'm getting stuck in dependency hell while building the bootstrap Python because I haven't built in a few months. Will try again later.

@brettcannon commented:

I tried building on wasmtime+wasi using the instructions here: https://devguide.python.org/getting-started/setup-building/#wasi, but couldn't get it to work. I get the following error:

WebAssembly 'tail-call' feature not enabled

Where did that show up? My guess is you need to tweak the flags used by wasmtime: --wasm tail-call=y turns them on in the runtime. You can tweak the args passed to wasmtime using the --host-runner flag to wasi.py.

@hoodmane commented Jan 6, 2025

I think there's no longer a runtime flag, since WASM tail calls are a stage 4 proposal. The WebAssembly features page says:

Enabled by default when using the Cranelift backend since wasmtime version 22, the s390x architecture supports it since 24

So this error message is presumably from the compiler toolchain.

@Fidget-Spinner (Collaborator) commented:

Yup, it was my compiler. Thanks for the help!

WASI results are disappointing, mainly because preserve_none is not supported:

WASI:

../../Python/generated_cases_tail_call.c.h:10846:16: warning: 'preserve_none' calling convention is not supported for this target [-Wignored-attributes]
 10846 | __attribute__((preserve_none)) static PyObject *
       |                ^

Tail-calling interpreter:

Pystone(1.1) time for 1000000 passes = 8.58486
This machine benchmarks at 116484 pystones/second


Main:
Pystone(1.1) time for 1000000 passes = 4.27952
This machine benchmarks at 233671 pystones/second

Trying Emscripten next.

@hoodmane commented Jan 6, 2025

Unfortunately, I would imagine the results for WASI and Emscripten will be about the same.

@garrettgu10 commented:

Yeah, those results are consistent with what I found when I tried this a few months ago. Both Cranelift and V8 fail to preserve all the function inputs within registers between tail calls, so the code generation ends up being quite suboptimal. It seems difficult (impossible?) to communicate such a requirement through the compiled Wasm code.

I also observed about a 10% slowdown on x64 without the preserve_none calling convention, so that's likely the issue.

@markshannon (Member, Author) commented:

It's done 🎉
