Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695

Closed
cockroach-teamcity opened this issue Feb 7, 2019 · 31 comments · Fixed by #35249
Closed

roachtest: scaledata/distributed_semaphore/nodes=6 failed #34695

cockroach-teamcity opened this issue Feb 7, 2019 · 31 comments · Fixed by #35249
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/66f13c1d9a12c31e18a198da4ff5ac0bbe2db781

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1127849&tab=buildLog

The test failed on provisional_201902062125_v2.2.0-alpha.20190211:
	test.go:743,cluster.go:1585,scaledata.go:126,scaledata.go:53: unexpected node event: 5: dead

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Feb 7, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. labels Feb 7, 2019
@nvanbenschoten
Copy link
Member

I190207 05:53:43.874729 92446 sql/distsql_running.go:149  [n5,client=10.138.0.23:35640,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=01a3514e key=/Table/53/1/424175199244877825/0 rw=true pri=0.05352502 stat=ABORTED epo=2 ts=1549518823.483367759,0 orig=1549518823.483367759,0 max=1549518823.981082001,0 wto=false seq=2
F190207 05:53:44.488019 93661 kv/txn_interceptor_heartbeat.go:447  txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" id=01a3514e key=/Table/53/1/424175199244877825/0 rw=true pri=0.05352502 stat=ABORTED epo=2 ts=1549518823.483367759,0 orig=1549518823.483367759,0 max=1549518823.981082001,0 wto=false seq=2
goroutine 93661 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc0004ba000, 0xc0004ba060, 0x5188c00, 0x1f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x58f9dc0, 0xc000000004, 0x5188cff, 0x1f, 0x1bf, 0xc0006a4240, 0x114)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:872 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x38cd5c0, 0xc00782c1e0, 0x7f5600000004, 0x2, 0x31d686f, 0x51, 0xc00879fad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x38cd5c0, 0xc00782c1e0, 0x1, 0xc000000004, 0x31d686f, 0x51, 0xc007a09ad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x38cd5c0, 0xc00782c1e0, 0x31d686f, 0x51, 0xc007a09ad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).heartbeat(0xc005ddd988, 0x38cd5c0, 0xc00782c1e0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:447 +0x150
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).heartbeatLoop(0xc005ddd988, 0x38cd5c0, 0xc00782c1e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:411 +0x1f7
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).startHeartbeatLoopLocked.func1(0x38cd5c0, 0xc00782c1e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:370 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001dfc20, 0x38cd5c0, 0xc00782c1e0, 0xc007551260, 0x29, 0x3902420, 0xc00031f760, 0xc00852fd20)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:323 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:318 +0x134

@andreimatei looks like #34341. Does the previous log line: client rejected when attempting to run DistSQL plan make what's going on any clearer?

Should we keep this issue open or fold it into #34341?

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1585,scaledata.go:126,scaledata.go:53: unexpected node event: 2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/70cd0458348655fb74f4e2cb89aa202b5dd89ed0

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1129807&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1206: unexpected node event: 6: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/40e403544d60b1a44b8b1ed961a817c77d67aa31

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1131411&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1206: unexpected node event: 2: dead

@tbg
Copy link
Member

tbg commented Feb 12, 2019

Seems to be happening with some regularity now @andreimatei.

@andreimatei
Copy link
Contributor

looking

@andreimatei
Copy link
Contributor

I've been able to reproduce the unexpected non-pending txn in augmentMetaLocked fatal we saw in #33698.
It's possible that the recently-merged #34387 changed the failure mode from txn committed or aborted but heartbeat loop hasn't been signaled to stop to unexpected non-pending txn in augmentMetaLocked.

@andreimatei
Copy link
Contributor

So here's what I do know:

The unexpected non-pending txn in augmentMetaLocked fatal message has been introduced in #34387. When a remote flow encounters a retriable error, it sends it sends it as metadata, and then it also sends the TxnCoordMeta to the root as trailing metadata. The check assumed that only the error is sent, which isn't true. However, I'm still not completely sure why the check fires; when the root processes the error from the leaf, it's supposed to create a new txn proto, and them it's supposed to discard further metadata (because it checks that the current txn id corresponds to the metadata's txn id).

I've gotten a repro before #34387 too. In that case, what we see is that a leaf sends a TransactionAbortedError(ABORT_REASON_ABORT_SPAN):

2397:I190213 22:28:15.742284 48142 sql/distsqlrun/processors.go:643  !!! MoveToDraining with err: TransactionAbortedError(ABORT_REASON_ABORT_SPAN): "sql txn" id=24ae2a9b key=/Table/53/1/426070369879851014/0 rw=true pri=0.01659421 stat=ABORTED epo=1 ts=1550096895.088786801,2 orig=1550096895.088786801,2 max=1550096895.091239421,0 wto=false seq=0
2398:I190213 22:28:15.742333 48142 sql/distsqlrun/base.go:267  [n3] !!! extracting meta from non-pending txn: "sql txn" id=24ae2a9b key=/Table/53/1/426070369879851014/0 rw=true pri=0.01659421 stat=ABORTED epo=1 ts=1550096895.088786801,2 orig=1550096895.088786801,2 max=1550096895.588786801,0 wto=false seq=0

and seemingly afterwards the root prints this:

189:I190213 22:28:15.743316 218 sql/distsql_running.go:149  [n6,client=10.142.0.43:55902,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=24ae2a9b key=/Table/53/1/426070369879851014/0 rw=true pri=0.01659421 stat=ABORTED epo=2 ts=1550096895.090877187,0 orig=1550096895.090877187,0 max=1550096895.588786801,0 wto=false seq=2
190:F190213 22:28:16.106026 6459 kv/txn_interceptor_heartbeat.go:447  txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" id=24ae2a9b key=/Table/53/1/426070369879851014/0 rw=true pri=0.01659421 stat=ABORTED epo=2 ts=1550096895.090877187,0 orig=1550096895.090877187,0 max=1550096895.588786801,0 wto=false seq=2

I don't understand how the "client rejected" message happens after the leaf has already run a flow...

To be continued.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/acba091f04f3d8ecabf51009bf394951fbd3643c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1137872&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 3: dead

@tbg
Copy link
Member

tbg commented Feb 14, 2019

Like clockwork.

F190214 14:31:13.261904 154517 kv/txn_coord_sender.go:547  [n2,client=10.142.0.35:38250,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=6a5901da key=/Table/53/1/426258863112060931/0 rw=true pri=0.12223283 stat=ABORTED epo=1 ts=1550154672.746287552,1 orig=1550154672.746287552,1 max=1550154673.095495206,0 wto=false seq=0
goroutine 154517 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x51b2600, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5921260, 0xc000000004, 0x51b26e5, 0x16, 0x223, 0xc006d666c0, 0x11e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x38e9280, 0xc005894510, 0x4, 0x2, 0x31be09e, 0x33, 0xc0091a3960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x38e9280, 0xc005894510, 0x1, 0x4, 0x31be09e, 0x33, 0xc0091a3960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x38e9280, 0xc005894510, 0x31be09e, 0x33, 0xc0091a3960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).augmentMetaLocked(0xc009b3bb00, 0x38e9280, 0xc005894510, 0xd04895b2da01596a, 0x9645202943dc15ad, 0xc0063452d0, 0xc, 0x10, 0x1, 0x1583417637f451c0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:547 +0x1d3
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).AugmentMeta(0xc009b3bb00, 0x38e9280, 0xc005894510, 0xd04895b2da01596a, 0x9645202943dc15ad, 0xc0063452d0, 0xc, 0x10, 0x1, 0x1583417637f451c0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:540 +0x101
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).AugmentTxnCoordMeta(0xc007a50240, 0x38e9280, 0xc005894510, 0xd04895b2da01596a, 0x9645202943dc15ad, 0xc0063452d0, 0xc, 0x10, 0x1, 0x1583417637f451c0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:868 +0xce
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLReceiver).Push(0xc00079b080, 0x0, 0x0, 0x0, 0xc00593af60, 0xc005368430)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:463 +0x493
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*accountClearingRowReceiver).Push(0xc005802420, 0x0, 0x0, 0x0, 0xc00593af60, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:266 +0x83
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*copyingRowReceiver).Push(0xc0059ebc00, 0x0, 0x0, 0x0, 0xc00593af60, 0xc000000000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:693 +0x18b
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x38e9280, 0xc005802180, 0x38f5280, 0xc005368000, 0x38da980, 0xc0059ebc00)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:178 +0x5e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc005368000, 0x38e9280, 0xc005802180)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc006396e00, 0x38e9280, 0xc005802180, 0x329ac68, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:649 +0x209
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc000648dc0, 0xc00656c240, 0xc007a50240, 0xc0091a4990, 0xc00079b080, 0xc0075d2518, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8a0
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc000648dc0, 0x38e9280, 0xc005894510, 0xc0075d2518, 0xc00656c240, 0xc007a50240, 0x38eae80, 0xc004fd0e00, 0xc00079b080)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:795 +0x227
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc0075d2000, 0x38e9280, 0xc005894510, 0xc0075d2458, 0x3, 0x7fee015a3bd0, 0xc009946160, 0x1, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1044 +0x26e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc0075d2000, 0x38e9280, 0xc005894510, 0xc0075d2458, 0x7fee015a3bd0, 0xc009946160, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:900 +0x644
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc0075d2000, 0x38e9280, 0xc005894510, 0x38eecc0, 0xc000799ac0, 0xc0075d8e22, 0x5a, 0x1, 0xc008acc000, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:438 +0xe49
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc0075d2000, 0x38e9280, 0xc005894510, 0x38eecc0, 0xc000799ac0, 0xc0075d8e22, 0x5a, 0x1, 0xc008acc000, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:93 +0x36f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc0075d2000, 0x38e91c0, 0xc0094c6280, 0xc0007e27b8, 0x5400, 0x15000, 0xc0007e2850, 0xc008b062f0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1241 +0x1439
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000173860, 0x38e91c0, 0xc0094c6280, 0xc0075d2000, 0x5400, 0x15000, 0xc0007e2850, 0xc008b062f0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:429 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000173860, 0xc00502db00, 0xc009008994, 0xc008b062f0, 0x38e91c0, 0xc0094c6280, 0xc0075d2000, 0x5400, 0x15000, 0xc0007e2850, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:331 +0xcf
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:314 +0x1040

@andreimatei
Copy link
Contributor

I think I got the sucker. I believe the crash happens when there's a race between two retriable errors, out of which exactly one is a TxnAbortedError and the other error wins the race (in the sense that it's handled by the DistSQLReceiver first). If that happens, the DistSQLReceiver will not handle the second error (the TxnAbortedError), but it will still ingest the TxnCoordMeta with an aborted proto sent by the processor that got the TxnAbortedError. Once that happens, the TxnCoordSender is in an inconsistent state - with an ABORTED proto but a heartbeat loop still running.
Since #34387, we crash as soon as that TxnCoordMeta is ingested. Before, we used to crash only once the hb loop ran again (and I guess it was possible that the transaction could be rolled back before that, so perhaps we're more likely to crash than before... I'm not sure).

Now figuring out what to do about it.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 5: dead

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 27, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and I
believe that has fixed the majority of crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError should also be well
possible)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this
patch makes it more convincing)

Release note: None
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d888b76df319571afe4d5816f1a1f0f53905653f

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1159641&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1211: unexpected node event: 6: dead

@andreimatei
Copy link
Contributor

Hmm I had expected the failures to go away after #35105. Will look at what happened.

@andreimatei
Copy link
Contributor

I believe the last failure is #34241

@tbg
Copy link
Member

tbg commented Mar 5, 2019

@andreimatei I'm trying to repro #34241 but I'm immediately getting this:

F190305 09:49:03.407254 13841 kv/txn_coord_sender.go:577  [n1,client=10.142.0.177:55174,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=90facfb2 key=/Table/53/1/431583545172230147/0 rw=true pri=0.04260144 stat=ABORTED epo=1 ts=1551779342.984215841,0 orig=1551779333.979284010,2 max=1551779334.416902041,2 wto=false seq=0
goroutine 13841 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x537d500, 0x16)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b15080, 0xc000000004, 0x537d597, 0x16, 0x241, 0xc00dc03200, 0x11f)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a1f1a0, 0xc009f4c300, 0x4, 0x2, 0x32e3c04, 0x33, 0xc00fcf98a0, 0x1, 0x1)
        /go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5

on top of this commit:

commit 047556f
Merge: 9659281 63448fb
Author: craig[bot] [email protected]
Date: Tue Mar 5 01:58:08 2019 +0000
Merge #35366

@tbg
Copy link
Member

tbg commented Mar 5, 2019

and again, with stack

github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).AugmentMeta(0xc008922d80, 0x3a1f1a0, 0xc007124d80, 0x9449c5c798d2132c, 0x70b47e738bbc6cb1, 0xc007a36180, 0xc, 0x10, 0x1, 0x15890782fe2f25e0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:570 +0x101
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).AugmentTxnCoordMeta(0xc0068a6fc0, 0x3a1f1a0, 0xc007124d80, 0x9449c5c798d2132c, 0x70b47e738bbc6cb1, 0xc007a36180, 0xc, 0x10, 0x1, 0x15890782fe2f25e0, ...)
        /go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:854 +0xce
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLReceiver).Push(0xc000e60b00, 0x0, 0x0, 0x0, 0xc00769b260, 0x0)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:478 +0x4c6
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*copyingRowReceiver).Push(0xc008575c00, 0x0, 0x0, 0x0, 0xc00769b260, 0xc000000000)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:693 +0x18b
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x3a1f1a0, 0xc0071256e0, 0x3a2b3a0, 0xc007bf2480, 0x3a106a0, 0xc008575c00)
        /go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:178 +0x5e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc007bf2480, 0x3a1f1a0, 0xc0071256e0)

@tbg
Copy link
Member

tbg commented Mar 5, 2019

@petermattis should this be fetching Core dumps, because it isn't (likely due to using log.Fatal which calls os.Exit(2)). Should we dump the core in this case? I think we should.

@tbg
Copy link
Member

tbg commented Mar 5, 2019

I've probably repro'ed that 10 times out of 40 runs. @andreimatei could you prioritize this? Looks like the bug is still around (or I'm epic failing no my choice of binary, but I've tripled checked)

@petermattis
Copy link
Collaborator

Should we dump the core in this case? I think we should.

We'd have to replace the os.Exit() call with either a panic or sending ourselves SIGABRT. Seems reasonable to do if GOTRACEBACK=crash is set.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/38bb1e7905b89f911bd74be4f5830217ffb7b343

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1168752&tab=buildLog

The test failed on master:
	cluster.go:1603,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/9d058d53c8a82fceb2205f1827c26f1bf36c32ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1172386&tab=buildLog

The test failed on master:
	cluster.go:1210,chaos.go:92,cluster.go:1582,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1172386-scaledata-distributed-semaphore-nodes-6:2 returned:
		stderr:
		
		stdout:
		teamcity-1172386-scaledata-distributed-semaphore-nodes-6: stopping and waiting: signal: killed
	cluster.go:1603,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 5: dead

@nvanbenschoten
Copy link
Member

Still failing with:

F190312 13:05:53.498427 15810 kv/txn_coord_sender.go:641  [n4,client=10.142.0.54:48654,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=ce3c9e7b key=/Table/53/1/433602889857662979/0 rw=true pri=0.01216643 stat=ABORTED epo=1 ts=1552395953.208717498,3 orig=1552395953.208717498,3 max=1552395953.708054880,0 wto=false seq=0
goroutine 15810 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b01, 0xc000057b60, 0x53a1c00, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x5b3a9e0, 0xc000000004, 0x53a1c0f, 0x16, 0x281, 0xc00840cea0, 0x11e)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:876 +0x93d
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x3a3a180, 0xc009e234d0, 0x4, 0x2, 0x32ef6b5, 0x33, 0xc007ddd870, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x3a3a180, 0xc009e234d0, 0x1, 0x4, 0x32ef6b5, 0x33, 0xc007ddd870, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x3a3a180, 0xc009e234d0, 0x32ef6b5, 0x33, 0xc007ddd870, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).augmentMetaLocked(0xc008c02900, 0x3a3a180, 0xc009e234d0, 0xed46e1367b9e3cce, 0xa13e11e30de99297, 0xc00705f950, 0xc, 0x10, 0x1, 0x158b37e4fdc270ba, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:641 +0x1d3
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).AugmentMeta(0xc008c02900, 0x3a3a180, 0xc009e234d0, 0xed46e1367b9e3cce, 0xa13e11e30de99297, 0xc00705f950, 0xc, 0x10, 0x1, 0x158b37e4fdc270ba, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:634 +0x101
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).AugmentTxnCoordMeta(0xc005b89dd0, 0x3a3a180, 0xc009e234d0, 0xed46e1367b9e3cce, 0xa13e11e30de99297, 0xc00705f950, 0xc, 0x10, 0x1, 0x158b37e4fdc270ba, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:855 +0xce
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLReceiver).Push(0xc000eaa840, 0x0, 0x0, 0x0, 0xc005123080, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:478 +0x4c6
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*copyingRowReceiver).Push(0xc008763c00, 0x0, 0x0, 0x0, 0xc005123080, 0xc000000000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:693 +0x18b
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x3a3a180, 0xc007f6a690, 0x3a46380, 0xc007ecd680, 0x3a2b600, 0xc008763c00)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:178 +0x5e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc007ecd680, 0x3a3a180, 0xc007f6a690)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc009134e00, 0x3a3a180, 0xc007f6a690, 0x33d08d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:624 +0x1ec
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc00004f180, 0xc0059259e0, 0xc005b89dd0, 0xc007dde898, 0xc000eaa840, 0xc000a1cc90, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8ad
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc00004f180, 0x3a3a180, 0xc009e234d0, 0xc000a1cc90, 0xc0059259e0, 0xc005b89dd0, 0x3a3be40, 0xc008dfe1a0, 0xc000eaa840)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:839 +0x227
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc000a1c880, 0x3a3a180, 0xc009e234d0, 0xc000a1cbd0, 0x3, 0x7f8b821757f0, 0xc005657ce0, 0x1, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1100 +0x283
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc000a1c880, 0x3a3a180, 0xc009e234d0, 0xc000a1cbd0, 0x7f8b821757f0, 0xc005657ce0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:945 +0x658
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc000a1c880, 0x3a3a180, 0xc009e234d0, 0x3a3fcc0, 0xc007ed1e00, 0xc009307410, 0x5a, 0x1, 0xc009430ea0, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:457 +0xdd4
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc000a1c880, 0x3a3a180, 0xc009e234d0, 0x3a3fcc0, 0xc007ed1e00, 0xc009307410, 0x5a, 0x1, 0xc009430ea0, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:103 +0x610
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc000a1c880, 0x3a3a0c0, 0xc005b7ce80, 0xc000100338, 0x5400, 0x15000, 0xc0001003d0, 0xc0050b15d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1237 +0x140b
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000ab7ba0, 0x3a3a0c0, 0xc005b7ce80, 0xc000a1c880, 0x5400, 0x15000, 0xc0001003d0, 0xc0050b15d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:429 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000ab7ba0, 0xc00607a000, 0xc006ac57c4, 0xc0050b15d0, 0x3a3a0c0, 0xc005b7ce80, 0xc000a1c880, 0x5400, 0x15000, 0xc0001003d0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:332 +0xcf
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:315 +0x1040

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/a512e390f7f2f2629f3f811bab5866c46e3e5713

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1183678&tab=buildLog

The test failed on provisional_201903122203_v19.1.0-beta.20190318:
	cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 2: dead

@nvanbenschoten
Copy link
Member

^^^ Same as above.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog

The test failed on master:
	cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 5: dead

@tbg
Copy link
Member

tbg commented Mar 20, 2019

Same as the nodes=3 flavor:

E190320 12:20:53.981796 334 util/log/crash_reporting.go:216  [n5,client=10.142.0.184:45916,user=root] a panic has occurred!
panic: non-values node passed as fixed value to zigzag join [recovered]
	panic: non-values node passed as fixed value to zigzag join [recovered]
	panic: panic while executing 1 statements: SELECT * FROM _ WHERE (_._ IS _) AND (((_ = $1) AND (_ = $2))); caused by non-values node passed as fixed value to zigzag join

goroutine 357 [running]:
panic(0x3103ee0, 0xc00560fa70)
	/usr/local/go/src/runtime/panic.go:556 +0x2cb fp=0xc005609408 sp=0xc005609378 pc=0x72b49b
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc000cb8d80, 0x39fc4a0, 0xc0052ca900, 0x2d97f80, 0x39b5110)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:714 +0x36d fp=0xc0056094c8 sp=0xc005609408 pc=0x1fadaad
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc000cb8d80, 0x39fc4a0, 0xc0052ca900)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:426 +0x61 fp=0xc005609500 sp=0xc0056094c8 pc=0x213aa11
runtime.call32(0x0, 0x33f85f0, 0xc0054eff70, 0x1800000018)
	/usr/local/go/src/runtime/asm_amd64.s:522 +0x3b fp=0xc005609530 sp=0xc005609500 pc=0x75a06b
panic(0x2d97f80, 0x39b5110)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9 fp=0xc0056095c0 sp=0xc005609530 pc=0x72b389
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).Build.func1(0xc00560a9a8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:66 +0x9b fp=0xc0056095e8 sp=0xc0056095c0 pc=0x1bcb19b
runtime.call32(0x0, 0x33f9038, 0xc00555f770, 0x800000008)
	/usr/local/go/src/runtime/asm_amd64.s:522 +0x3b fp=0xc005609618 sp=0xc0056095e8 pc=0x75a06b
panic(0x2d97f80, 0x39b5110)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9 fp=0xc0056096a8 sp=0xc005609618 pc=0x72b389
github.com/cockroachdb/cockroach/pkg/sql.(*execFactory).ConstructZigzagJoin(0xc000a6ea48, 0x3a3c160, 0xc000ab5e10, 0x3a2aa80, 0xc005614568, 0x3a3c160, 0xc000ab5e10, 0x3a2aa80, 0xc0056145d0, 0xc000385d78, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt_exec_factory.go:736 +0x8ea fp=0xc005609758 sp=0xc0056096a8 pc=0x206d37a
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildZigzagJoin(0xc005612ae0, 0xc0055e66f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc005609ba0, 0x1aef637)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:1235 +0xe46 fp=0xc005609b10 sp=0xc005609758 pc=0x1bc0626
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildRelational(0xc005612ae0, 0x3a36200, 0xc0055e66f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:190 +0x732 fp=0xc00560a058 sp=0xc005609b10 pc=0x1bb5022
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildLookupJoin(0xc005612ae0, 0xc005523800, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00560a3c0, 0xc00560a3c8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:1128 +0x97 fp=0xc00560a328 sp=0xc00560a058 pc=0x1bbedf7
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildRelational(0xc005612ae0, 0x3a353a0, 0xc005523800, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc00555f740, 0x70b01f)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:187 +0xe31 fp=0xc00560a870 sp=0xc00560a328 pc=0x1bb5721
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).build(0xc005612ae0, 0x3a0a4e0, 0xc005523800, 0x0, 0xc00560a980, 0x70b8a8, 0x60)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:83 +0xab fp=0xc00560a930 sp=0xc00560a870 pc=0x1bb36fb
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).Build(0xc005612ae0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:71 +0x78 fp=0xc00560a990 sp=0xc00560a930 pc=0x1bb3578
github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan(0xc000cb90d0, 0x39fc560, 0xc00560f260, 0x32b8e38, 0x19, 0xc74a0180a, 0x3a798abd)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:162 +0x18b fp=0xc00560aa18 sp=0xc00560a990 pc=0x209988b
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan(0xc000cb8d80, 0x39fc560, 0xc00560f260, 0xc000cb90d0, 0x6, 0x70b01f)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:981 +0x131 fp=0xc00560aad8 sp=0xc00560aa18 pc=0x1fc00e1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc000cb8d80, 0x39fc560, 0xc00560f260, 0xc000cb90d0, 0x7f799378e5c8, 0xc0007bb6b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:868 +0x148 fp=0xc00560abd8 sp=0xc00560aad8 pc=0x1fbf3c8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc000cb8d80, 0x39fc560, 0xc00560f260, 0x3a020a0, 0xc0052d8580, 0xc00559e023, 0x6e, 0x2, 0xc00544cea0, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:459 +0xdf6 fp=0xc00560b3b8 sp=0xc00560abd8 pc=0x1fbb976
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc000cb8d80, 0x39fc560, 0xc00560f260, 0x3a020a0, 0xc0052d8580, 0xc00559e023, 0x6e, 0x2, 0xc00544cea0, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:101 +0x610 fp=0xc00560b660 sp=0xc00560b3b8 pc=0x1fba700
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc000cb8d80, 0x39fc4a0, 0xc0052ca900, 0xc000100e78, 0x5400, 0x15000, 0xc000100f10, 0xc0003d7310, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1237 +0x1418 fp=0xc00560bec8 sp=0xc00560b660 pc=0x1fb04d8
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000772000, 0x39fc4a0, 0xc0052ca900, 0xc000cb8d80, 0x5400, 0x15000, 0xc000100f10, 0xc0003d7310, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:428 +0xce fp=0xc00560bf28 sp=0xc00560bec8 pc=0x1fabe2e
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000772000, 0x39fc4a0, 0xc0052ca900, 0xc00010f600, 0xc00049a8b4, 0xc0003d7310, 0xc000cb8d80, 0x5400, 0x15000, 0xc000100f10, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:338 +0xe6 fp=0xc00560bf88 sp=0xc00560bf28 pc=0x22323b6
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc00560bf90 sp=0xc00560bf88 pc=0x75bd81
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:321 +0x1080

@justinj is already on the case.

@tbg
Copy link
Member

tbg commented Mar 20, 2019

Hmm, the augmentMetaLocked error doesn't repro as readily as I'd like. Took around 8 days = 8 runs between the last two iterations, we'll see when it shows up for me.

justinj pushed a commit to justinj/cockroach that referenced this issue Mar 20, 2019
This commit fixes a problem where the constraint logic could infer that
a column was constant, but our method of determining what that constant
value was was incomplete and so we ended up with incomplete information.
We now correctly infer the value for IS expressions, and also skip over
cases where this happens again so we degrade more gracefully (don't run
a zigzag join, as opposed to panicking).

A more correct fix here would extract the values directly from the
constraints, but I'd like to get this fixed on master to unblock cockroachdb#34695.

Release note: None
@tbg
Copy link
Member

tbg commented Mar 20, 2019

Got myself a repro:

I190320 14:43:49.062897 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:<nil> TraceData:[] TxnCoordMeta:txn:<meta:<id:f0b6ff59-aded-44ba-b911-6383e7cbcda0 timestamp:<wall_time:1553093029060073742 > priority:215514 > name:"sql txn" last_heartbeat:<wall_time:1553093029060073742 > orig_timestamp:<wall_time:1553093029060073742 > max_timestamp:<wall_time:1553093029560073742 > refreshed_timestamp:<> observed_timestamps:<node_id:1 timestamp:<wall_time:1553093029062268602 > > observed_timestamps:<node_id:4 timestamp:<wall_time:1553093029060073742 > > epoch_zero_timestamp:<> > command_count:1 refresh_reads:<key:"\275\211\375\006\014\224\370S\272\000\006\210" end_key:"\275\212" >  RowNum:<nil> SamplerProgress:<nil>}
I190320 14:43:49.062964 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:<nil> TraceData:[] TxnCoordMeta:txn:<meta:<id:f0b6ff59-aded-44ba-b911-6383e7cbcda0 timestamp:<wall_time:1553093029060073742 > priority:215514 > name:"sql txn" last_heartbeat:<wall_time:1553093029060073742 > orig_timestamp:<wall_time:1553093029060073742 > max_timestamp:<wall_time:1553093029560073742 > refreshed_timestamp:<> observed_timestamps:<node_id:4 timestamp:<wall_time:1553093029060073742 > > observed_timestamps:<node_id:6 timestamp:<wall_time:1553093029062276140 > > epoch_zero_timestamp:<> > command_count:1 refresh_reads:<key:"\275\211\375\006\014\224\370P$\000\005" end_key:"\275\211\375\006\014\224\370S\272\000\006\210" >  RowNum:<nil> SamplerProgress:<nil>}
I190320 14:43:49.080441 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:<nil> TraceData:[] TxnCoordMeta:txn:<meta:<id:f0b6ff59-aded-44ba-b911-6383e7cbcda0 timestamp:<wall_time:1553093029060073742 > priority:215514 > name:"sql txn" last_heartbeat:<wall_time:1553093029060073742 > orig_timestamp:<wall_time:1553093029060073742 > max_timestamp:<wall_time:1553093029560073742 > refreshed_timestamp:<> observed_timestamps:<node_id:3 timestamp:<wall_time:1553093029062349585 > > observed_timestamps:<node_id:4 timestamp:<wall_time:1553093029060073742 > > epoch_zero_timestamp:<> > command_count:1 refresh_reads:<key:"\275\211" end_key:"\275\211\375\006\014\224\370P$\000\005" >  RowNum:<nil> SamplerProgress:<nil>}
I190320 14:43:49.122788 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:sql/distsqlrun/inbound.go:99 in func2(): (08006) communication error: rpc error: code = Canceled desc = context canceled TraceData:[] TxnCoordMeta:<nil> RowNum:<nil> SamplerProgress:<nil>}
I190320 14:43:49.355369 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:TransactionAbortedError(ABORT_REASON_ABORT_SPAN): "sql txn" id=f0b6ff59 key=/Table/53/1/435887058280611844/0 rw=true pri=0.04248642 stat=ABORTED epo=1 ts=1553093029.081121839,1 orig=1553093029.081121839,1 max=1553093029.062268602,0 wto=false seq=0 TraceData:[] TxnCoordMeta:<nil> RowNum:<nil> SamplerProgress:<nil>}
I190320 14:43:49.355384 30840 sql/distsql_running.go:512  TSX discarding TransactionAbortedError(ABORT_REASON_ABORT_SPAN): "sql txn" id=f0b6ff59 key=/Table/53/1/435887058280611844/0 rw=true pri=0.04248642 stat=ABORTED epo=1 ts=1553093029.081121839,1 orig=1553093029.081121839,1 max=1553093029.062268602,0 wto=false seq=0
I190320 14:43:49.355487 30840 sql/distsql_running.go:480  TSX meta for f0b6ff59-aded-44ba-b911-6383e7cbcda0: &{Ranges:[] Err:<nil> TraceData:[] TxnCoordMeta:txn:<meta:<id:f0b6ff59-aded-44ba-b911-6383e7cbcda0 key:"\275\211\375\006\014\224\370W\001\200\004\210" epoch:1 timestamp:<wall_time:1553093029081121839 logical:1 > priority:912389 > name:"sql txn" status:ABORTED last_heartbeat:<wall_time:1553093029060073742 > orig_timestamp:<wall_time:1553093029081121839 logical:1 > max_timestamp:<wall_time:1553093029560073742 > refreshed_timestamp:<> observed_timestamps:<node_id:1 timestamp:<wall_time:1553093029062268602 > > observed_timestamps:<node_id:3 timestamp:<wall_time:1553093029062349585 > > observed_timestamps:<node_id:4 timestamp:<wall_time:1553093029060073742 > > observed_timestamps:<node_id:6 timestamp:<wall_time:1553093029062276140 > > deprecated_writing:true epoch_zero_timestamp:<wall_time:1553093029060073742 > > command_count:1  RowNum:<nil> SamplerProgress:<nil>}
F190320 14:43:49.355566 30840 kv/txn_coord_sender.go:641  [n4,client=10.142.0.211:40984,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=f0b6ff59 key=/Table/53/1/435887058280611844/0 rw=true pri=0.04248642 stat=ABORTED epo=1 ts=1553093029.081121839,1 orig=1553093029.081121839,1 max=1553093029.560073742,0 wto=false seq=0

This seems to explain things, though not fully to me. Things happen in this block:

if meta != nil {
if meta.TxnCoordMeta != nil {
if r.txn != nil {
if r.txn.ID() == meta.TxnCoordMeta.Txn.ID {
r.txn.AugmentTxnCoordMeta(r.ctx, *meta.TxnCoordMeta)
}
} else {
r.resultWriter.SetError(
errors.Errorf("received a leaf TxnCoordMeta (%s); but have no root", meta.TxnCoordMeta))
}
}
if meta.Err != nil {
// Check if the error we just received should take precedence over a
// previous error (if any).
if errPriority(meta.Err) > errPriority(r.resultWriter.Err()) {
if r.txn != nil {
if retryErr, ok := meta.Err.(*roachpb.UnhandledRetryableError); ok {
// Update the txn in response to remote errors. In the non-DistSQL
// world, the TxnCoordSender handles "unhandled" retryable errors,
// but this one is coming from a distributed SQL node, which has
// left the handling up to the root transaction.
meta.Err = r.txn.UpdateStateOnRemoteRetryableErr(r.ctx, &retryErr.PErr)
// Update the clock with information from the error. On non-DistSQL
// code paths, the DistSender does this.
// TODO(andrei): We don't propagate clock signals on success cases
// through DistSQL; we should. We also don't propagate them through
// non-retryable errors; we also should.
r.updateClock(retryErr.PErr.Now)
}
}
r.resultWriter.SetError(meta.Err)
}
}

  • First, we ingest an RPC error without a TxnCoordMeta.
  • Second, we observe a TransactionAbortedError without a TxnMeta and the meta is fed to the TxnCoordSender. However, we discard the error because we already saw an error before because that previous error probably got a scoreNonRetriable which beats all else (thanks chaos testing). I guess if we didn't drop the error, we'd feed it into UpdateStateOnRemoteRetryableErr, and would get a new txn.ID() preventing the following:
  • we get an update with no error but the aborted TxnMeta, and since the txn ids match it's fed all the way through to the assertion and panics.

@andreimatei, WDYT?

@andreimatei
Copy link
Contributor

Great, that's exactly what #35249 theorized. And now we know what such a rogue error racing with the TxnAbortedError could be - an RPC error (btw, how did that happen? :P)
I'll write a test in #35249 and merge it. Thanks bro, respect.

Btw, a lot of this crap about ingesting errors is pretty dirty. I'm also attempting to improve things in #35224.

justinj pushed a commit to justinj/cockroach that referenced this issue Mar 20, 2019
This commit fixes a problem where the constraint logic could infer that
a column was constant, but our method of determining what that constant
value was was incomplete and so we ended up with incomplete information.
We now correctly infer the value for IS expressions, and also skip over
cases where this happens again so we degrade more gracefully (don't run
a zigzag join, as opposed to panicking).

A more correct fix here would extract the values directly from the
constraints, but I'd like to get this fixed on master to unblock cockroachdb#34695.

Release note: None
craig bot pushed a commit that referenced this issue Mar 20, 2019
35655: exec: add multi column equality to vectorized mergejoiner r=georgeutsin a=georgeutsin

Added the ability to join on multiple columns in the vectorized
merge joiner. This was done by inverting the previous method of
generating groups; instead of assuming no matches and discovering
matches group by group, if we assume that everything is a match
and filter out the rows that don't match in the join, we can use
multiple columns in this approach.

This PR also includes some work on output shaping, to ensure that
we still have decently shaped outputs, even when a batch ends on
a small run. Also fixed the benchmarks in this PR.

Updated benchmarks for the mergejoiner:

```
BenchmarkMergeJoiner/rows=1024-8         	   20000	     67231 ns/op	 974.77 MB/s
BenchmarkMergeJoiner/rows=4096-8         	    5000	    257487 ns/op	1018.08 MB/s
BenchmarkMergeJoiner/rows=16384-8        	    2000	    995437 ns/op	1053.38 MB/s
BenchmarkMergeJoiner/rows=1048576-8      	      20	  62822638 ns/op	1068.23 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1024-8         	   20000	     67343 ns/op	 973.16 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=4096-8         	    5000	    256988 ns/op	1020.06 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=16384-8        	    2000	    997690 ns/op	1051.00 MB/s
BenchmarkMergeJoiner/oneSideRepeat-rows=1048576-8      	      20	  62184741 ns/op	1079.19 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=1024-8       	   20000	     68672 ns/op	 954.32 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=4096-8       	    1000	   1336630 ns/op	 196.12 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=16384-8      	     100	  18478767 ns/op	  56.74 MB/s
BenchmarkMergeJoiner/bothSidesRepeat-rows=32768-8      	      20	  80351272 ns/op	  26.10 MB/s
```

Note that the performance degradation in the 'bothSidesRepeat'
is actually due to incorrect benchmarking in the first place.
Regardless, this is still an area of investigation.

Release note: None

35989: opt: fix zigzag joins with IS condition r=justinj a=justinj

This commit fixes a problem where the constraint logic could infer that
a column was constant, but our method of determining what that constant
value was was incomplete and so we ended up with incomplete information.
We now correctly infer the value for IS expressions, and also skip over
cases where this happens again so we degrade more gracefully (don't run
a zigzag join, as opposed to panicking).

A more correct fix here would extract the values directly from the
constraints, but I'd like to get this fixed on master to unblock #34695.

Release note: None

35994: sqlsmith: add x IN (SELECT ...) r=justinj a=justinj

I added this in an attempt to repro a specific bug. I was unsuccessful,
but figured I should PR this since I already did the work.

This is technically just another instance of a comparison op, it could
probably be refactored to be lumped in with those once there's more
structure around how those are generated.

Release note: None

35995: sql: fix panic when searching for equivalent renders r=jordanlewis a=jordanlewis

Previously, certain kinds of window functions could generate a panic
when trying to reuse equivalent renders when those renders were tuples.
This is fixed by comparing the types using the Equivalent method instead
of ==.

Release note (bug fix): fix panics caused by certain window functions
that operate on tuples

Co-authored-by: George Utsin <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=6 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
	cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 4: dead

andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
craig bot pushed a commit that referenced this issue Mar 21, 2019
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in #35249 Mar 21, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
justinj pushed a commit to justinj/cockroach that referenced this issue Mar 22, 2019
This commit fixes a problem where the constraint logic could infer that
a column was constant, but our method of determining what that constant
value was was incomplete and so we ended up with incomplete information.
We now correctly infer the value for IS expressions, and also skip over
cases where this happens again so we degrade more gracefully (don't run
a zigzag join, as opposed to panicking).

A more correct fix here would extract the values directly from the
constraints, but I'd like to get this fixed on master to unblock cockroachdb#34695.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants