roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695
@andreimatei looks like #34341. Does the previous log line: Should we keep this issue open or fold it into #34341? |
SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/70cd0458348655fb74f4e2cb89aa202b5dd89ed0 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1129807&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/40e403544d60b1a44b8b1ed961a817c77d67aa31 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1131411&tab=buildLog
|
Seems to be happening with some regularity now @andreimatei. |
looking |
I've been able to reproduce the |
So here's what I do know: The I've gotten a repro before #34387 too. In that case, what we see is that a leaf sends a
and seemingly afterwards the root prints this:
I don't understand how the "client rejected" message happens after the leaf has already run a flow... To be continued. |
SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog
|
Like clockwork.
|
I think I got the sucker. I believe the crash happens when there's a race between two retriable errors, out of which exactly one is a Now figuring out what to do about it. |
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
|
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, well, unfortunately we are not well equipped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and I believe that has fixed the majority of crashes that we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also quite possible.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105, but this patch makes it more convincing.)

Release note: None
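To make the failure mode concrete, here is a minimal, hedged Go sketch of the guard the commit message describes. The names (TxnStatus, MetaRecord, coordinator, ingestMeta) are illustrative stand-ins, not the actual TxnCoordSender API: the coordinator simply refuses to ingest a record whose proto is ABORTED, so it never ends up holding an aborted proto while its heartbeat loop is still running.

```go
package main

import "fmt"

// TxnStatus mirrors the idea of a transaction status in the proto.
// These names are illustrative, not the actual CockroachDB definitions.
type TxnStatus int

const (
	Pending TxnStatus = iota
	Aborted
)

// MetaRecord stands in for the trailing TxnCoordMeta that a remote
// DistSQL flow sends back to the root transaction coordinator.
type MetaRecord struct {
	Status    TxnStatus
	ReadSpans []string
}

// coordinator is a toy stand-in for the root TxnCoordSender state.
type coordinator struct {
	status    TxnStatus
	readSpans []string
}

// ingestMeta sketches the behavior described above: refuse to ingest a
// record carrying an ABORTED proto, because doing so would leave the
// coordinator holding an aborted proto while its heartbeat loop is still
// running (the inconsistent state the assertions guard against).
func (c *coordinator) ingestMeta(m MetaRecord) {
	if m.Status == Aborted {
		// Drop the record; the coordinator learns about aborts through a
		// TxnAbortedError or its heartbeat loop instead.
		return
	}
	c.readSpans = append(c.readSpans, m.ReadSpans...)
}

func main() {
	c := &coordinator{status: Pending}
	c.ingestMeta(MetaRecord{Status: Pending, ReadSpans: []string{"a-b"}})
	c.ingestMeta(MetaRecord{Status: Aborted, ReadSpans: []string{"c-d"}}) // ignored
	fmt.Println(c.readSpans) // [a-b]
}
```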
SHA: https://github.com/cockroachdb/cockroach/commits/d888b76df319571afe4d5816f1a1f0f53905653f Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog
|
Hmm I had expected the failures to go away after #35105. Will look at what happened. |
I believe the last failure is #34241 |
@andreimatei I'm trying to repro #34241 but I'm immediately getting this:
on top of this commit:
|
and again, with stack
|
@petermattis should this be fetching core dumps? Because it isn't (likely due to using |
I've probably repro'ed that 10 times out of 40 runs. @andreimatei could you prioritize this? Looks like the bug is still around (or I'm epic failing on my choice of binary, but I've triple-checked). |
We'd have to replace the |
SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog
|
SHA: https://github.com/cockroachdb/cockroach/commits/9d058d53c8a82fceb2205f1827c26f1bf36c32ba Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog
|
Still failing with:
|
SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183678&tab=buildLog
|
^^^ Same as above. |
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
|
Same as the nodes=3 flavor:
@justinj is already on the case. |
Hmm, the augmentMetaLocked error doesn't repro as readily as I'd like. It took around 8 days (= 8 runs) between the last two iterations; we'll see when it shows up for me. |
This commit fixes a problem where the constraint logic could infer that a column was constant, but our method of determining what that constant value was was incomplete, so we ended up with incomplete information. We now correctly infer the value for IS expressions, and we also skip over cases where this happens again so we degrade more gracefully (don't run a zigzag join, as opposed to panicking). A more correct fix here would extract the values directly from the constraints, but I'd like to get this fixed on master to unblock cockroachdb#34695.

Release note: None
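As a rough illustration of the graceful-degradation idea above (decline to plan the zigzag join when a constant value can't be determined, rather than panicking), here is a hedged Go sketch; the constraint type and tryPlanZigzagJoin are hypothetical, not the optimizer's real data structures:

```go
package main

import "fmt"

// constraint is a toy stand-in for an optimizer constraint on a column.
// value == nil models "we know the column is constant but not what the value is".
type constraint struct {
	col   string
	value *string
}

// tryPlanZigzagJoin sketches the graceful-degradation idea: if we cannot
// determine the constant value for some column, decline to plan a zigzag
// join and let the caller fall back to another plan instead of panicking.
func tryPlanZigzagJoin(cs []constraint) (string, bool) {
	for _, c := range cs {
		if c.value == nil {
			return "", false // degrade gracefully
		}
	}
	return fmt.Sprintf("zigzag join over %d fixed columns", len(cs)), true
}

func main() {
	v := "42"
	cs := []constraint{{col: "a", value: &v}, {col: "b", value: nil}}
	if p, ok := tryPlanZigzagJoin(cs); ok {
		fmt.Println(p)
	} else {
		fmt.Println("zigzag join not planned; falling back to another plan")
	}
}
```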
Got myself a repro:
This seems to explain things, though not fully to me. Things happen in this block: cockroach/pkg/sql/distsql_running.go, lines 474 to 506 (at 1352687).
@andreimatei, WDYT? |
Great, that's exactly what #35249 theorized. And now we know what such a rogue error racing with the Btw, a lot of this crap about ingesting errors is pretty dirty. I'm also attempting to improve things in #35224. |
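For intuition, here is a hedged Go sketch of the "first error wins" behavior being discussed: a toy receiver keeps only the first error pushed into it, so a TxnAbortedError that races in behind, say, an RPC error never reaches the transaction coordinator. The receiver type and pushError are hypothetical stand-ins for the logic in pkg/sql/distsql_running.go, which also inspects whether an error is retriable before deciding what to keep.

```go
package main

import (
	"errors"
	"fmt"
)

// receiver is a toy stand-in for the DistSQLReceiver: it remembers only the
// first error it sees.
type receiver struct {
	err error
}

// pushError sketches the "first error wins" behavior: once an error has been
// recorded, any later error is dropped on the floor.
func (r *receiver) pushError(err error) {
	if r.err != nil {
		return // a concurrent error already got here; this one is ignored
	}
	r.err = err
}

func main() {
	r := &receiver{}
	r.pushError(errors.New("rpc error: connection refused")) // arrives first
	r.pushError(errors.New("TransactionAbortedError"))       // races in second, dropped
	fmt.Println(r.err) // rpc error: connection refused
}
```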
35655: exec: add multi column equality to vectorized mergejoiner r=georgeutsin a=georgeutsin

Added the ability to join on multiple columns in the vectorized merge joiner. This was done by inverting the previous method of generating groups: instead of assuming no matches and discovering matches group by group, we assume that everything is a match and filter out the rows that don't match, which lets us use multiple columns. This PR also includes some work on output shaping, to ensure that we still have decently shaped outputs, even when a batch ends on a small run. Also fixed the benchmarks in this PR.

Updated benchmarks for the mergejoiner:

```
BenchmarkMergeJoiner/rows=1024-8                      20000      67231 ns/op     974.77 MB/s
BenchmarkMergeJoiner/rows=4096-8                       5000     257487 ns/op    1018.08 MB/s
BenchmarkMergeJoiner/rows=16384-8                      2000     995437 ns/op    1053.38 MB/s
BenchmarkMergeJoiner/rows=1048576-8                      20   62822638 ns/op    1068.23 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1024-8        20000      67343 ns/op     973.16 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=4096-8         5000     256988 ns/op    1020.06 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=16384-8        2000     997690 ns/op    1051.00 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1048576-8        20   62184741 ns/op    1079.19 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=1024-8      20000      68672 ns/op     954.32 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=4096-8       1000    1336630 ns/op     196.12 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=16384-8       100   18478767 ns/op      56.74 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=32768-8        20   80351272 ns/op      26.10 MB/s
```

Note that the performance degradation in the 'bothSidesRepeat' cases is actually due to incorrect benchmarking in the first place. Regardless, this is still an area of investigation.

Release note: None

35989: opt: fix zigzag joins with IS condition r=justinj a=justinj

This commit fixes a problem where the constraint logic could infer that a column was constant, but our method of determining what that constant value was was incomplete, so we ended up with incomplete information. We now correctly infer the value for IS expressions, and we also skip over cases where this happens again so we degrade more gracefully (don't run a zigzag join, as opposed to panicking). A more correct fix here would extract the values directly from the constraints, but I'd like to get this fixed on master to unblock #34695.

Release note: None

35994: sqlsmith: add x IN (SELECT ...) r=justinj a=justinj

I added this in an attempt to repro a specific bug. I was unsuccessful, but figured I should PR this since I already did the work. This is technically just another instance of a comparison op; it could probably be refactored to be lumped in with those once there's more structure around how those are generated.

Release note: None

35995: sql: fix panic when searching for equivalent renders r=jordanlewis a=jordanlewis

Previously, certain kinds of window functions could generate a panic when trying to reuse equivalent renders when those renders were tuples. This is fixed by comparing the types using the Equivalent method instead of ==.

Release note (bug fix): fix panics caused by certain window functions that operate on tuples

Co-authored-by: George Utsin <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
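For the merge-joiner change above, here is a hedged, row-oriented Go sketch of the inverted approach (assume the whole group matches on the first column, then filter out pairs whose remaining equality columns differ). The real implementation is vectorized and batch-based; these names are illustrative only.

```go
package main

import "fmt"

// row is a toy tuple; the real vectorized merge joiner works over columnar
// batches, not row slices.
type row []int

// equalOnCols reports whether two rows agree on the extra equality columns
// beyond the first one that formed the group.
func equalOnCols(l, r row, cols []int) bool {
	for _, c := range cols {
		if l[c] != r[c] {
			return false
		}
	}
	return true
}

// joinGroup cross-joins a left and right group (already matched on the first
// column) and keeps only pairs that also agree on the extra columns: assume
// everything matches, then filter out the rows that don't.
func joinGroup(left, right []row, extraCols []int) [][2]row {
	var out [][2]row
	for _, l := range left {
		for _, r := range right {
			if equalOnCols(l, r, extraCols) {
				out = append(out, [2]row{l, r})
			}
		}
	}
	return out
}

func main() {
	left := []row{{1, 10}, {1, 20}}
	right := []row{{1, 10}, {1, 30}}
	fmt.Println(joinGroup(left, right, []int{1})) // only the (1,10)-(1,10) pair survives
}
```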
SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4 Parameters: To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
|
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn(CoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, well, unfortunately we are not well equipped at the moment for finding out about an aborted txn this way. The idea is that, if the Root was running along happily and all of a sudden ingests one of these Aborted protos, it would put it in an inconsistent state: with an Aborted proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedErr; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, alas, there's a case where the TxnAbortedError is not passed to the TCS: this is when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since #35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes that we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes - in particular, we've seen RPC errors.)

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/66f13c1d9a12c31e18a198da4ff5ac0bbe2db781
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1127849&tab=buildLog