Bug Report: Panic in Concatenate Engine #15434

Closed
GuptaManan100 opened this issue Mar 11, 2024 · 4 comments · Fixed by #15454

Comments

@GuptaManan100 (Member)

Overview of the Issue

This bug was found while investigating the flakiness in TestUnionAll.

It was noticed that running the following query causes the vtgate to panic:

select id1 from t1 where id1 in (1, 2, 3, 4, 5, 6, 7, 8) union all select id1 from t1 where id1 in (1, 2, 3, 4, 5, 6, 7, 8)

The vtgate logs show:

W0308 15:06:30.305024   15472 log.go:39] Failed to read in config : Config File "vtconfig" Not Found in "[/Users/manangupta/vitess/go/test/endtoend/vtgate/queries/union]". This is optional, and can be ignored if you are not using config files. For a detailed explanation, see https://github.com/vitessio/vitess/blob/main/doc/viper/viper.md#config-files.
panic: runtime error: index out of range [0] with length 0

goroutine 396 [running]:
vitess.io/vitess/go/vt/vtgate/engine.(*Concatenate).parallelStreamExec.func1(0x14000f29110, 0x30?)
	vitess.io/vitess/go/vt/vtgate/engine/concatenate.go:266 +0x2a8
vitess.io/vitess/go/vt/vtgate/engine.(*Concatenate).parallelStreamExec.func2.1(0x14000f29110)
	vitess.io/vitess/go/vt/vtgate/engine/concatenate.go:323 +0x18c
vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards.func1(0x0?)
	vitess.io/vitess/go/vt/vtgate/engine/route.go:307 +0x150
vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1.1(0x14000b20ec8?)
	vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:198 +0x34
vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).StreamExecute(0x14000271d00, {0x10639f388?, 0x14000f8b540?}, 0x14000ce13e0, {0x14000055a00, 0x28}, 0x14000ce3ad0, 0x0, 0x0, 0x14000d2f5f0, ...)
	vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:190 +0x168
vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1({0x10639f388, 0x14000f8b540}, 0x14000ce13e0, {0x1063bf1d8, 0x14000271d00})
	vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:196 +0xfc
vitess.io/vitess/go/vt/vtgate.(*TabletGateway).withRetry(0x140002b6690, {0x10639f388, 0x14000f8b540}, 0x14000ce13e0, {0x14000a04a98?, 0x10506a5c8?}, {0x40?, 0x10628f120?}, 0x0, 0x14000ce8940)
	vitess.io/vitess/go/vt/vtgate/tabletgateway.go:339 +0x420
vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute(0x1400079afd8, {0x10639f388, 0x14000f8b540}, 0x14000ce13e0, {0x14000055a00, 0x28}, 0x14000ce3ad0, 0x0, 0x0, 0x14000d2f5f0, ...)
	vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:194 +0x118
vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti.func1(0x14000cd7f08, 0x1, 0x14000cdb1a0)
	vitess.io/vitess/go/vt/vtgate/scatter_conn.go:408 +0x1ec
vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction.func1(0x14000cd7f08, 0x1)
	vitess.io/vitess/go/vt/vtgate/scatter_conn.go:640 +0x138
vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction.func2(0x0?, 0x1400059cea0?)
	vitess.io/vitess/go/vt/vtgate/scatter_conn.go:668 +0x58
created by vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction in goroutine 451
	vitess.io/vitess/go/vt/vtgate/scatter_conn.go:666 +0x188

The Vitess version that was deployed for this test: 96e0a62

Reproduction Steps

  1. Run TestUnionAll over and over until the panic is witnessed

Binary Version

main

Operating System and Environment details

-

Log Fragments

No response

@GuptaManan100 (Member, Author) commented Mar 11, 2024

Looking at the code in the concatenate engine, the panic happens because we end up calling the callback function before the fields list has been populated.

The callback function is called in two places. The first call happens only after we populate the fields list. From the stack trace, however, we know that the panic is caused by the second call, on line 323.
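
To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical names; not the actual concatenate.go code) of what the stack trace implies: a worker goroutine reaches the callback path before the shared field list has been populated, and indexing into the empty slice panics with exactly this message.

package main

import (
	"fmt"
	"sync"
)

// fields stands in for the shared field list that the concatenate
// engine's worker goroutines read; it starts out empty.
var fields []string

// emitRow stands in for the callback path: it assumes fields has already
// been populated, so running it first reproduces the
// "index out of range [0] with length 0" panic from the stack trace.
func emitRow() {
	fmt.Println(fields[0])
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		emitRow() // fires before anyone has assigned fields
	}()
	wg.Wait()
	fields = []string{"id1"} // too late: the panic has already happened
}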

I tried reproducing the problem with extra log lines added so I could see what result sets we are receiving, but in over 1500 runs with the logging in place I couldn't find a single failing run.

So I started speculating on how we could end up in this panic state, and I can only think of two ways:

  1. The result we got back from the underlying primitive had empty fields, and that result got processed before the fields could be populated, causing the panic. To verify that this isn't the case, I looked at the plan for the query, which looks like this:
{
  "OperatorType": "Concatenate",
  "Inputs": [
    {
      "OperatorType": "Route",
      "Variant": "IN",
      "Keyspace": {
        "Name": "ks_union",
        "Sharded": true
      },
      "FieldQuery": "select id1 from t1 where 1 != 1",
      "Query": "select id1 from t1 where id1 in ::__vals",
      "Table": "t1",
      "Values": [
        "(1, 2, 3, 4, 5, 6, 7, 8)"
      ],
      "Vindex": "hash"
    },
    {
      "OperatorType": "Route",
      "Variant": "IN",
      "Keyspace": {
        "Name": "ks_union",
        "Sharded": true
      },
      "FieldQuery": "select id1 from t1 where 1 != 1",
      "Query": "select id1 from t1 where id1 in ::__vals",
      "Table": "t1",
      "Values": [
        "(1, 2, 3, 4, 5, 6, 7, 8)"
      ],
      "Vindex": "hash"
    }
  ]
}

Also, from the stack trace it's evident that the sources of the Concatenate operator are just Routes. I looked at the code on the vttablets, and they do send the fields in the first packet for streaming queries. Since gRPC guarantees that the order of results is preserved in a streaming call, I don't see how we could receive an empty result set from either of the two routes before the fields have been processed and populated.

  2. I checked the implementation of the concatenate engine for any possible issues with the usage of sync.Cond, but that also looks completely correct to me (a minimal sketch of the expected pattern follows below).
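
For reference, here is a minimal sketch (hypothetical names; not the actual engine code) of the sync.Cond pattern the implementation is expected to follow: waiters re-check the guarded condition in a loop, and the flag only changes while the same lock is held.

package main

import "sync"

var (
	mu          sync.Mutex
	cond        = sync.NewCond(&mu)
	fieldsReady bool
)

// waitForFields blocks until publishFields has run; the loop guards
// against spurious wakeups, which cond.Wait allows.
func waitForFields() {
	mu.Lock()
	for !fieldsReady {
		cond.Wait() // atomically unlocks mu, sleeps, relocks on wakeup
	}
	mu.Unlock()
}

// publishFields flips the flag under the lock and wakes every waiter.
func publishFields() {
	mu.Lock()
	fieldsReady = true
	mu.Unlock()
	cond.Broadcast()
}

func main() {
	go publishFields()
	waitForFields()
}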

Other than trying to reproduce the problem with logs to see the result sets we are receiving, I don't see how to proceed.

@GuptaManan100 (Member, Author) commented

There is another concern with this panic. Looking at the stack trace, we can see that it doesn't have any sort of panic handling in it. Since we are creating new goroutines in multiGoTransaction, the panic handling that we have at a higher level (at ComQuery etc.) doesn't help with this panic, so it causes the vtgate to crash completely. That needs to be fixed too; it is, however, a relatively straightforward fix.
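
A minimal sketch of that straightforward fix (a hypothetical helper, not the actual scatter_conn code): every goroutine spawned for multi-shard work installs its own deferred recover, because a recover installed higher up the stack only protects the goroutine it runs on.

package main

import (
	"log"
	"runtime/debug"
	"sync"
)

// safeGo runs f on a new goroutine and converts any panic into a logged
// error instead of letting it crash the whole process. Each goroutine
// needs its own deferred recover: one installed by the caller's
// goroutine does not cover the goroutines it spawns.
func safeGo(wg *sync.WaitGroup, f func()) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			if r := recover(); r != nil {
				log.Printf("uncaught panic in shard goroutine: %v\n%s", r, debug.Stack())
			}
		}()
		f()
	}()
}

func main() {
	var wg sync.WaitGroup
	safeGo(&wg, func() { panic("boom") }) // logged, process survives
	wg.Wait()
}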

@harshit-gangal (Member) commented

There is another concern with this panic. Looking at the stack trace, we can see that it doesn't have any sort of panic handling in it. Since we are creating new goroutines in multiGoTransaction, the panic handling that we have at a higher level (at ComQuery etc.) doesn't help with this panic, so it causes the vtgate to crash completely. That needs to be fixed too; it is, however, a relatively straightforward fix.

This part should be fixed in scatter_conn; it is a known issue.

@systay (Collaborator) commented Mar 12, 2024

This part should be fixed in scatter_conn; it is a known issue.

#15450
