
Investigate "switched-goto" for compilers without computed gotos #537

Open
lpereira opened this issue Jan 12, 2023 · 3 comments

@lpereira

lpereira commented Jan 12, 2023

Recently I came across this blog post, which shows a rather unusual way of getting something in between standard switch dispatch for an eval loop and an eval loop with computed gotos. Now that we're experimenting with generating a lot of that code, maybe we could see whether it makes sense to adopt this strategy?
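
For context, the two endpoints look roughly like this; a minimal standalone toy sketch, not CPython's actual macros or opcodes:

/* Toy two-style interpreter sketch (assumed names, not CPython's code). */
#include <stdio.h>

enum { OP_INCR, OP_PRINT, OP_HALT };

static void run_switch(const unsigned char *next_instr) {
    /* 1. Plain switch dispatch: one shared indirect branch for all opcodes. */
    int acc = 0;
    for (;;) {
        switch (*next_instr++) {
            case OP_INCR:  acc++; break;
            case OP_PRINT: printf("%d\n", acc); break;
            case OP_HALT:  return;
        }
    }
}

static void run_computed_goto(const unsigned char *next_instr) {
    /* 2. Computed gotos (GCC/Clang extension): every opcode body ends with
       its own indirect branch, which the branch predictor can track separately. */
    static void *targets[] = { &&op_incr, &&op_print, &&op_halt };
#define DISPATCH() goto *targets[*next_instr++]
    int acc = 0;
    DISPATCH();
op_incr:  acc++;               DISPATCH();
op_print: printf("%d\n", acc); DISPATCH();
op_halt:  return;
#undef DISPATCH
}

int main(void) {
    const unsigned char code[] = { OP_INCR, OP_INCR, OP_PRINT, OP_HALT };
    run_switch(code);
    run_computed_goto(code);
    return 0;
}

The "switched-goto" from the blog post sits in between: plain standard C like the first variant, but with a per-opcode indirect branch like the second.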

@lpereira lpereira changed the title Investigating "switched-goto" for compilers without computed gotos Investigate "switched-goto" for compilers without computed gotos Jan 12, 2023
@brandtbucher
Member

Seems interesting!

We had a small related discussion here where the timings showed that computed gotos are only giving the eval loop a 1% speedup on Linux these days. (On the other hand, I got a 5-10% speedup when I added computed gotos to the re engine last year.)

@gvanrossum
Collaborator

IIUC the blog post essentially changes the DISPATCH() macro to be a switch on the next opcode whose cases contain gotos. E.g.

...
TARGET(OP1) {
    ...
   DISPATCH();
}
TARGET(OP2) {
    ...
    DISPATCH();
}
...

would expand to

label_OP1: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
label_OP2: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
...

and that switch would be repeated in each instruction. (Exactly where and how next_instr is incremented is more complicated than shown here, but that doesn't affect reasoning about the scheme.)
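
One way to get that expansion, sketching the idea rather than proposing the actual macro, would be to make DISPATCH() itself expand to the full switch, presumably generated from the opcode list:

/* Sketch only: DISPATCH() expands to a full switch, so the compiler emits a
   separate jump table -- and a separate indirect branch -- after every opcode. */
#define DISPATCH()                            \
    do {                                      \
        switch (*next_instr++) {              \
            case OP1: goto label_OP1;         \
            case OP2: goto label_OP2;         \
            /* ... one case per opcode ... */ \
            default: goto unknown_opcode;     /* hypothetical error label */ \
        }                                     \
    } while (0)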

And the theory is that having N copies of the switch (one for each opcode) helps the CPU's branch predictor because it will learn the most likely branch taken at the end of each opcode, so we won't need PREDICT() macros any more. (See #496.)
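
(For reference, PREDICT() boils down to something like the following simplified sketch; the real macros in ceval.c also deal with the oparg, tracing, and the computed-goto case.)

/* Simplified sketch of the PREDICT()/PREDICTED() idea, not the exact
   ceval.c definitions: if the next opcode is the predicted one, jump
   straight to its body with a direct, statically predictable branch. */
#define PREDICTED(op)  PRED_##op:
#define PREDICT(op)                     \
    do {                                \
        if (*next_instr == op) {        \
            next_instr++;               \
            goto PRED_##op;             \
        }                               \
    } while (0)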

It would not be a very complicated experiment to carry out, except we'd need to wait until we have Windows benchmarking infrastructure in place, since on Linux/Mac we already have the computed goto.

One of my worries would be that the compiler sees that you have the same big piece of code in many places and it just unifies that into a single copy that it jumps to from everywhere. Compilers are weird that way.

@neonene

neonene commented Jan 26, 2023

As for Windows, _PyEval_EvalFrameDefault will hit MSVC's hang or C4883 issue if each instruction branch has a big jump table:
https://developercommunity.visualstudio.com/t/vs-155-and-vs-14-c-optimizer-fails-in-a-function-c/201909#T-N218093

The current 3.12 eval function can already end up less optimized, triggering the warning (/w14883) or the error (/we4883), even when just adding something like 35000 if (0); statements. And passing the hidden /d2OptimizeHugeFunctions cl flag does not seem to help PGO builds, in which _PyEval_EvalFrameDefault is not profiled at all.
