
meta_framegap.lua deopt crash #1571

Closed
vext01 opened this issue Jan 29, 2025 · 5 comments · Fixed by #1582 · May be fixed by #1593
vext01 commented Jan 29, 2025

With today's yklua (e99ff2) and yk (3dc428), an OpenResty test causes a deopt crash:

$ YK_OPT=0 YKD_SERIALISE_COMPILATION=1 ~/research/yklua/src/lua luajit2-test-suite/test/misc/meta_framegap.lua
thread '<unnamed>' panicked at ykrt/src/compile/jitc_yk/codegen/x64/deopt.rs:40:30:
no entry found for key

To get the test file, you can use the branch from this PR

(Once it's merged, you can just use main)

ltratt commented Jan 30, 2025

Now that is a very interesting error: most obviously, it shouldn't ever be possible to reference a guard that doesn't exist, but the most likely way this sort of bug would occur is that we pick up the wrong guard. The fact that we hit a guard that doesn't exist will make fixing this much easier. Does the bug happen deterministically? If so, this is a wonderful shrinkray candidate!

vext01 commented Jan 30, 2025

It's very deterministic, so I have high hopes of reducing it and tracking down the cause.

vext01 commented Jan 31, 2025

More info on this crash.

It is 100% deterministic with YK_OPT=0 YKD_SERIALISE_COMPILATION=1 which helps.

There are no side-traces involved. There are two traces executed, call them T1 and T2.

T1 has 83 deopt exits. T2 has only 11.

T1 is executed once at the beginning, then only T2 is executed after that (many times).

Here is a reduced test case:

A = setmetatable({}, {
	__add = function(A, B)
		if B > 200 then
			for B = 1, B do
			end
		end
	end,
})
function C(A, B)
	do
		return A + B
	end
end
for B = 1, 300 do
	C(A, B)
end

Prints in __yk_guardcheck() suggest the crash is caused by looking up guard index 56 in a trace (T1) whose guard indices are 0..10:

guard index 56
ctr has these guard indices: [4, 0, 7, 10, 1, 2, 8, 9, 3]

Looking at the jit-asm confirms 56 is well out of bounds, but there is another compiled trace for which that would be a valid index (i.e. T2). This made me question whether the thread's idea of the "currently running trace" is correct. Is it using a guard index for T2 when running T1?
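For what it's worth, the panic message is exactly what Rust's standard `HashMap` `Index` impl emits when a key is absent. A minimal, hypothetical sketch (invented names, not yk's real types) of how indexing one trace's deopt table with a guard id belonging to another trace blows up:

```rust
use std::collections::HashMap;

// Hypothetical sketch: each compiled trace owns a map from guard index to
// deopt metadata. Indexing that map with a guard id that belongs to a
// *different* trace panics with the message seen in the crash.
struct CompiledTrace {
    deopt_info: HashMap<usize, &'static str>,
}

impl CompiledTrace {
    fn deopt(&self, gidx: usize) -> &'static str {
        // `HashMap`'s `Index` impl panics with "no entry found for key"
        // if `gidx` is absent from the map.
        self.deopt_info[&gidx]
    }
}

fn main() {
    let t1 = CompiledTrace {
        deopt_info: [(0, "g0"), (8, "g8"), (10, "g10")].into_iter().collect(),
    };
    assert_eq!(t1.deopt(8), "g8"); // a guard id this trace owns: fine
    // `t1.deopt(56)` would panic: 56 is only valid for some other trace.
}
```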

Putting more prints around and pumping up the logging, I can see something else fishy. A common pattern in the output is:

...
yk-jit-event: enter-jit-code
set a running trace: 0x7fa1851d9490
root is None
__yk_guardcheck()
ctr addr: 0x7fa1851d9490
guard index 8
ctr has these guard indices: [4, 0, 7, 10, 1, 2, 8, 9, 3]
yk-jit-event: deoptimise
replace stack
control point
yk-jit-event: enter-jit-code
...

That all makes sense to me.

But when we crash, we deviate from this pattern:

...
yk-jit-event: enter-jit-code
set a running trace: 0x7fa1851d9490
root is None
__yk_guardcheck()
ctr addr: 0x7fa1851d9490
guard index 8
ctr has these guard indices: [4, 0, 7, 10, 1, 2, 8, 9, 3]
yk-jit-event: deoptimise
replace stack
__yk_guardcheck()      <--------------- How does this happen????
ctr addr: 0x7fa1851d9490
guard index 56
ctr has these guard indices: [4, 0, 7, 10, 1, 2, 8, 9, 3]

thread '<unnamed>' panicked at ykrt/src/compile/jitc_yk/codegen/x64/deopt.rs:44:30:
no entry found for key
...

After replace stack we somehow end up in __yk_guardcheck() (which you'd only expect to be called from JITted code), but we've not seen a control point or enter-jit-code since we last deopted, which doesn't match my mental model of how the system should work. It looks to me like the system should be in the interpreting state.

A further reduced input is:

A = setmetatable({}, {
	__add = function(A, B)
		if B > 00 then
			for B = 0, B do
			end
		end
	end,
})
function C(A, B)
	do
		return A + B
	end
end
for B = 0, 200 do
	C(A, B)
end

This one more often than not gives a different error:

thread '<unnamed>' panicked at ykrt/src/compile/jitc_yk/trace_builder.rs:433:69:
no entry found for key

(It must also give the original error from before, or it wouldn't have passed the interestingness test. The test tries 500 times to get the deopt.rs:40 panic, requiring it to show only once for the input to count as interesting.)

Diff between those two inputs:

--- meta_framegap.lua	Fri Jan 31 15:45:00 2025
+++ meta_framegap_werid.lua	Fri Jan 31 15:56:38 2025
@@ -1,7 +1,7 @@
 A = setmetatable({}, {
 	__add = function(A, B)
-		if B > 200 then
-			for B = 1, B do
+		if B > 00 then
+			for B = 0, B do
 			end
 		end
 	end,
@@ -11,6 +11,6 @@
 		return A + B
 	end
 end
-for B = 1, 300 do
+for B = 0, 200 do
 	C(A, B)
 end

vext01 commented Feb 3, 2025

Here's what's happening:

A trace, T1, is compiled and executed, which sets the tstate to Executing. The trace contains a call to luaV_execute (this is likely Lua interpreting a call to a Lua function). We then compile another trace inside this nested luaV_execute and execute it, which sets the tstate to Executing again. The stack now looks something like this:

  • luaV_execute (1)
  • trace_1
  • luaV_execute (2)
  • trace_2

We then deopt from trace_2 back into luaV_execute (2). Note that it appears we DO NOT set tstate back to Interpreting (should we?): the state remains Executing. We then return from luaV_execute (2) back into trace_1. At this point the running trace recorded in tstate = Executing is still trace_2, because we haven't updated tstate. Then a guard fails in trace_1, which looks up the running trace, gets trace_2, and bad things happen.

Question 1) Should we allow compiling traces from within luaV_execute calls that are called from trace?

Question 2) If we do, and our feeling is we do, then how do we make sure that the running trace is always up to date? A simple fix would be to keep a vector, pushing to it when entering a trace and popping when deopting. We probably also want to set tstate back to Interpreting at deopt time.
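The "vector of running traces" idea could be sketched like this (a hedged sketch with invented names; plain `usize` handles stand in for real trace references):

```rust
// Sketch of tracking the running trace as a stack rather than a single slot.
struct ThreadState {
    // Innermost running trace is the top of the stack; empty = interpreting.
    running: Vec<usize>,
}

impl ThreadState {
    fn enter_trace(&mut self, trace: usize) {
        self.running.push(trace);
    }
    fn deopt(&mut self) -> Option<usize> {
        // Popping restores the *outer* trace as "currently running", so a
        // later guard failure in trace_1 no longer sees trace_2.
        self.running.pop()
    }
    fn running_trace(&self) -> Option<usize> {
        self.running.last().copied()
    }
}

fn main() {
    let mut ts = ThreadState { running: Vec::new() };
    ts.enter_trace(1); // trace_1 starts
    ts.enter_trace(2); // nested trace_2 starts inside luaV_execute (2)
    ts.deopt();        // trace_2 deopts...
    // ...and trace_1 is correctly the running trace again:
    assert_eq!(ts.running_trace(), Some(1));
}
```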

vext01 commented Feb 4, 2025

Small test case for this:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <yk.h>
#include <yk_testing.h>

void f(YkMT *mt, int who, YkLocation *loc1, YkLocation *loc2, int i) {
  fprintf(stderr, "enter\n");
  while (i > 0) {
    yk_mt_control_point(mt, loc1);
    if (who) {
      if (i == 1) {
        fprintf(stderr, "a\n");
      }
      if (i == 2) {
        fprintf(stderr, "b\n");
      }
      if (i == 3) {
        fprintf(stderr, "c\n");
      }
    }
    fprintf(stderr, "%d\n", i);
    i -= 1;
    if (loc2 != NULL) {
      f(mt, 0, loc2, NULL, i);
    }
  }
  fprintf(stderr, "exit\n");
}

int main(int argc, char **argv) {
  YkMT *mt = yk_mt_new(NULL);
  yk_mt_hot_threshold_set(mt, 0);
  YkLocation loc1 = yk_location_new();
  YkLocation loc2 = yk_location_new();
  int i = 6;
  NOOPT_VAL(loc1);
  NOOPT_VAL(loc2);
  NOOPT_VAL(i);
  f(mt, 1, &loc1, &loc2, i);
  yk_location_drop(loc1);
  yk_location_drop(loc2);
  yk_mt_shutdown(mt);
  return (EXIT_SUCCESS);
}

ltratt added a commit to ltratt/yk that referenced this issue Feb 4, 2025
Previously we -- well, this one is entirely my fault, so "I"! -- tracked
per-thread meta-tracing state as a single `MTThreadState` that we updated as
necessary. This doesn't work when we have nested execution, tracing and
the like, as the bug in ykjit#1571 highlighted.

This commit moves us to a stack of `MTThreadState`s. The basic idea is
that the stack always has at least one element: `Interpreting`. As we go
through other states, we push / pop as appropriate. The test below (from
Edd and Lukas) fails on `master` but is fixed by this commit.

The implementation is a bit more awkward than one might hope as naive implementations either:

  1. Spread lots of knowledge about the stack around the code. That's a
     disaster waiting to happen.
  2. Run into borrow checker problems.

This commit gets around this in two phases:

  1. We pass closures to `MTThread` which can peek at the current
     `MTThreadState` (which isn't `Copy`!).
  2. Those closures return the "things" we need to update outside that
     context.

This is a bit awkward, and perhaps there's a better API, but this one is
at least safe.

Co-authored-by: Edd Barrett <[email protected]>
Co-authored-by: Lukas Diekmann <[email protected]>
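The closure-based API the commit message describes could look roughly like this (an illustrative sketch; all details beyond the names `MTThread`/`MTThreadState` are invented and the real yk code differs):

```rust
use std::cell::RefCell;

// Sketch: a stack of per-thread states whose bottom entry is always
// `Interpreting`, peeked at via closures so no reference to the
// non-`Copy` state escapes the borrow.
#[derive(Debug, PartialEq)]
enum MTThreadState {
    Interpreting,
    Executing { trace: usize },
}

struct MTThread {
    states: RefCell<Vec<MTThreadState>>,
}

impl MTThread {
    fn new() -> Self {
        MTThread { states: RefCell::new(vec![MTThreadState::Interpreting]) }
    }

    // Callers pass a closure that peeks at the current (top) state; the
    // closure returns the "things" the caller needs, outside the borrow.
    fn with_top<R>(&self, f: impl FnOnce(&MTThreadState) -> R) -> R {
        let states = self.states.borrow();
        f(states.last().unwrap())
    }

    fn push(&self, s: MTThreadState) {
        self.states.borrow_mut().push(s);
    }

    fn pop(&self) {
        let mut states = self.states.borrow_mut();
        // Never pop the base `Interpreting` entry.
        if states.len() > 1 {
            states.pop();
        }
    }
}

fn main() {
    let mtt = MTThread::new();
    mtt.push(MTThreadState::Executing { trace: 1 });
    let t = mtt.with_top(|s| match s {
        MTThreadState::Executing { trace } => Some(*trace),
        MTThreadState::Interpreting => None,
    });
    assert_eq!(t, Some(1));
    mtt.pop();
    assert!(mtt.with_top(|s| *s == MTThreadState::Interpreting));
}
```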
ltratt added a commit to ltratt/yk that referenced this issue Feb 5, 2025