
Investigate "switched-goto" for compilers without computed gotos #537

Open
lpereira opened this issue Jan 12, 2023 · 3 comments

@lpereira

lpereira commented Jan 12, 2023

Recently I came across this blog post, which shows a rather unusual way of getting something in between standard switch dispatch for an eval loop and an eval loop with computed gotos. Now that we're experimenting with generating a lot of that code, maybe we could see whether it makes sense to adopt this strategy?
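
For context, the two endpoints look roughly like this; a minimal standalone toy sketch, not CPython's actual macros or opcodes:

/* Toy two-style interpreter sketch (assumed names, not CPython's code). */
#include <stdio.h>

enum { OP_INCR, OP_PRINT, OP_HALT };

static void run_switch(const unsigned char *next_instr) {
    /* 1. Plain switch dispatch: one shared indirect branch for all opcodes. */
    int acc = 0;
    for (;;) {
        switch (*next_instr++) {
            case OP_INCR:  acc++; break;
            case OP_PRINT: printf("%d\n", acc); break;
            case OP_HALT:  return;
        }
    }
}

static void run_computed_goto(const unsigned char *next_instr) {
    /* 2. Computed gotos (GCC/Clang extension): every opcode body ends with
       its own indirect branch, which the branch predictor can track separately. */
    static void *targets[] = { &&op_incr, &&op_print, &&op_halt };
#define DISPATCH() goto *targets[*next_instr++]
    int acc = 0;
    DISPATCH();
op_incr:  acc++;               DISPATCH();
op_print: printf("%d\n", acc); DISPATCH();
op_halt:  return;
#undef DISPATCH
}

int main(void) {
    const unsigned char code[] = { OP_INCR, OP_INCR, OP_PRINT, OP_HALT };
    run_switch(code);
    run_computed_goto(code);
    return 0;
}

The "switched-goto" from the blog post sits in between: plain standard C like the first variant, but with a per-opcode indirect branch like the second.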

@lpereira lpereira changed the title Investigating "switched-goto" for compilers without computed gotos Investigate "switched-goto" for compilers without computed gotos Jan 12, 2023
@brandtbucher
Member

Seems interesting!

We had a small related discussion here where the timings showed that computed gotos are only giving the eval loop a 1% speedup on Linux these days. (On the other hand, I got a 5-10% speedup when I added computed gotos to the re engine last year.)

@gvanrossum
Collaborator

IIUC the blog post essentially changes the DISPATCH() macro to be a switch on the next opcode whose cases contain gotos. E.g.

...
TARGET(OP1) {
    ...
   DISPATCH();
}
TARGET(OP2) {
    ...
    DISPATCH();
}
...

would expand to

label_OP1: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
label_OP2: {
    ...
    switch (*next_instr++) {
        case OP1: goto label_OP1;
        case OP2: goto label_OP2;
        ...
    }
}
...

and that switch would be repeated in each instruction. (Exactly where and how next_instr is incremented is more complicated than shown here, but that doesn't affect reasoning about the scheme.)
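
One way to get that expansion, sketching the idea rather than proposing the actual macro, would be to make DISPATCH() itself expand to the full switch, presumably generated from the opcode list:

/* Sketch only: DISPATCH() expands to a full switch, so the compiler emits a
   separate jump table -- and a separate indirect branch -- after every opcode. */
#define DISPATCH()                            \
    do {                                      \
        switch (*next_instr++) {              \
            case OP1: goto label_OP1;         \
            case OP2: goto label_OP2;         \
            /* ... one case per opcode ... */ \
            default: goto unknown_opcode;     /* hypothetical error label */ \
        }                                     \
    } while (0)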

And the theory is that having N copies of the switch (one for each opcode) helps the CPU's branch predictor because it will learn the most likely branch taken at the end of each opcode, so we won't need PREDICT() macros any more. (See #496.)
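
(For reference, PREDICT() boils down to something like the following simplified sketch; the real macros in ceval.c also deal with the oparg, tracing, and the computed-goto case.)

/* Simplified sketch of the PREDICT()/PREDICTED() idea, not the exact
   ceval.c definitions: if the next opcode is the predicted one, jump
   straight to its body with a direct, statically predictable branch. */
#define PREDICTED(op)  PRED_##op:
#define PREDICT(op)                     \
    do {                                \
        if (*next_instr == op) {        \
            next_instr++;               \
            goto PRED_##op;             \
        }                               \
    } while (0)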

It would not be a very complicated experiment to carry out, except we'd need to wait until we have Windows benchmarking infrastructure in place, since on Linux/Mac we already have the computed goto.

One of my worries would be that the compiler sees that you have the same big piece of code in many places and it just unifies that into a single copy that it jumps to from everywhere. Compilers are weird that way.

@neonene

neonene commented Jan 26, 2023

As for Windows, _PyEval_EvalFrameDefault will hit MSVC's hang or C4883 issue if each instruction branch has a big jump table:
https://developercommunity.visualstudio.com/t/vs-155-and-vs-14-c-optimizer-fails-in-a-function-c/201909#T-N218093

The current 3.12 eval function can already end up less optimized, triggering the warning (/w14883) or the error (/we4883), even when just adding something like 35000 if (0); statements. And passing the hidden /d2OptimizeHugeFunctions cl flag does not seem to help PGO builds, in which _PyEval_EvalFrameDefault is not profiled at all.
