roachtest: scaledata/distributed_semaphore/nodes=3 failed #33698
Hmm, I can't really read these logs.
is the highest retry count I can find (i.e. there isn't anything that retries 8 times). Also, the log covers approximately 10 minutes, which is how long the workload is supposed to run. So what happened to the other 10 minutes? The latest sqlapp log entry is
which is just about when it was supposed to shut down. It's also when the chaos runner stops. But then the test just hangs for another ~10 minutes. Looking at the code, this must be caused by sqlapp not returning (see the sketch below). It's unclear to me why; the cluster is left intact by the chaos runner (i.e. all nodes are up). The node logs are innocuous; the only thing I don't understand is this error that pops up a small number of times:
Not sure what to make of this. I did poke at the
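To make the hang concrete, here is a minimal Go sketch of the shape of the problem (not the actual roachtest code; `runChaos`/`runSqlapp` are made-up stand-ins): a harness that waits on both the chaos runner and the workload blocks past the intended runtime if the workload never returns, even with all nodes back up.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Made-up stand-ins for the chaos runner and the sqlapp workload.
func runChaos(d time.Duration)  { time.Sleep(d) }              // respects its deadline
func runSqlapp(d time.Duration) { time.Sleep(24 * time.Hour) } // overshoots its deadline

func main() {
	const duration = 10 * time.Minute
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); runChaos(duration) }()
	go func() { defer wg.Done(); runSqlapp(duration) }()
	// The chaos runner stops on schedule, but the harness keeps blocking
	// here because the workload never returns, which is what an extra
	// ~10min hang after the chaos runner stopped would look like.
	wg.Wait()
	fmt.Println("done")
}
```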
Actually, maybe this log means something bad:
It shouldn't, but I'm just going out on a limb. We should edit the source code of this generator to emit better logging that would help in investigating this particular problem.
SHA: https://github.com/cockroachdb/cockroach/commits/82026117d83262e87873aad52b8eca2dd0bea335
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126329&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/b9242838f18107318f53c4e22266fd12223f7651
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126617&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog
Same. cc @andreimatei
#34695 is the failure for the 6-node version of this test. They're probably related. @tbg you say "same", but the fatal you pasted is "unexpected non-pending txn in augmentMetaLocked" which is a new one, isn't it? Although I'm sure it's related to the "txn committed or aborted but heartbeat loop hasn't been signaled to stop" condition we've seen before.
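For reference, here is a toy Go sketch of the invariant that fatal is guarding. None of these types or fields are the real cockroach ones (the only thing taken from the source is the error message): the idea is that once the txn proto is no longer PENDING, the heartbeat loop should already have been told to stop, so merging an ABORTED proto from a remote flow while the loop is still running trips the assertion.

```go
package main

import (
	"errors"
	"fmt"
)

// Toy model only; the types and fields below are invented for illustration.
type txnStatus int

const (
	pending txnStatus = iota
	aborted
)

type coordSender struct {
	status           txnStatus
	heartbeatRunning bool
}

// augmentMeta models ingesting a remote TxnCoordMeta. Merging a non-PENDING
// (e.g. ABORTED) proto while the heartbeat loop is still running produces
// exactly the state the real assertion refuses to accept.
func (c *coordSender) augmentMeta(remote txnStatus) error {
	if remote != pending && c.heartbeatRunning {
		return errors.New("unexpected non-pending txn in augmentMetaLocked")
	}
	c.status = remote
	return nil
}

func main() {
	c := &coordSender{status: pending, heartbeatRunning: true}
	// A remote DistSQL flow reports an ABORTED proto before the root has
	// rolled back and stopped heartbeating: the invariant trips.
	fmt.Println(c.augmentMeta(aborted))
}
```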
I was eyeballing; it's possible they're not the same.
I've been making progress in the meantime.
SHA: https://github.com/cockroachdb/cockroach/commits/4fc4d63ddddeb507564bfe53c6cd79549ff2fd27
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1141650&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog
SHA: https://github.com/cockroachdb/cockroach/commits/763e90b043b5d732856d3ecf1d7b0d6aa33e3b26
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147715&tab=buildLog
^- same as usual
SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog
@nvanbenschoten would you mind taking a look at the last failure here? It's different from the others (and also I think I had fixed the others). This one looks like the semaphores app crashed because it couldn't connect to node 2, and node 2 was dead because of chaos at that time. Do we have sources for these "apps"? I see the test downloads some binaries.
Remote DistSQL flows pass TxnCoordMeta records to the Root Txn (TxnCoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment to find out about an aborted txn this way. If the Root was running along happily and all of a sudden ingests one of these ABORTED protos, it would be put in an inconsistent state: an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since cockroachdb#35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and I believe that has fixed the majority of the crashes we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError should also be entirely possible.)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698
(I believe all the issues above were really fixed by cockroachdb#35105, but this patch makes it more convincing.)

Release note: None
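In code terms, the change described above amounts to an early-out when the incoming proto is not PENDING. A hedged toy sketch follows (not the actual diff; the identifiers below are simplified stand-ins, not cockroach's real ones):

```go
package main

import "fmt"

// Simplified stand-ins for the real types.
type txnStatus int

const (
	statusPending txnStatus = iota
	statusAborted
)

type txnCoordMeta struct {
	status    txnStatus
	readSpans []string
}

type rootSender struct {
	readSpans []string
}

// ingestMeta models the patched behavior: trailing metadata carrying a
// non-PENDING (e.g. ABORTED) txn proto is dropped instead of merged, so the
// "aborted proto + running heartbeat loop" state can no longer be created
// through this path. The root still learns about aborts via TxnAbortedError
// or its heartbeat loop, as the commit message above describes.
func (r *rootSender) ingestMeta(meta txnCoordMeta) {
	if meta.status != statusPending {
		return // ignore ABORTED (and any other non-pending) metadata
	}
	r.readSpans = append(r.readSpans, meta.readSpans...)
}

func main() {
	root := &rootSender{}
	root.ingestMeta(txnCoordMeta{status: statusAborted, readSpans: []string{"a-b"}})
	fmt.Println(len(root.readSpans)) // 0: the aborted metadata was ignored
}
```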
SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog
Looks like this tickled a zigzag bug:
cc @justinj |
SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root Txn (TxnCoordSender) as trailing metadata. The TCS ingests these records and updates its state (mostly for read spans). This patch makes it so that we don't ingest records with an ABORTED txn proto.

Why not? Because, unfortunately, we are not well equipped at the moment to find out about an aborted txn this way. If the Root was running along happily and all of a sudden ingests one of these ABORTED protos, it would be put in an inconsistent state: an ABORTED proto but with the heartbeat loop still running. We don't like that state and we have assertions against it. The expectation is that the TCS finds out about aborted txns in one of two ways: through a TxnAbortedError, in which case it rolls back the txn, or through the heartbeat loop discovering the aborted txn, in which case it again rolls back (and a 3rd way through a remote TxnAbortedError; see below). We have not considered this 4th way of finding out, through a remote TxnCoordMeta, and I don't really want to deal with it because, with the current code, it's already awkward enough to handle the other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow a TxnAbortedError that is passed through DistSQL to the gateway (the DistSQLReceiver) before the TxnCoordMeta. That case we handle: we inject retriable errors into the Root txn and the TCS rolls back. After this rollback, ingesting the ABORTED proto just works (it's a no-op). However, there's a case where the TxnAbortedError is not passed to the TCS: when another concurrent error was received first by the DistSQLReceiver. In that case, the 2nd error is ignored, and so this patch makes it so that we also effectively ignore the upcoming TxnCoordMeta. I'm separately reworking the way error handling happens in the Txn/TCS, and that work should make this unfortunate patch unnecessary.

(Since #35105, not all preceding errors cause the TxnAbortedError to be ignored; other retriable errors no longer cause it to be ignored, and that has fixed some of the crashes we've seen because of the inconsistent state this patch is trying to avoid. However, non-retriable errors racing with a TxnAbortedError are also possible, and we've seen them happen and lead to crashes; in particular, we've seen RPC errors.)

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected non-pending txn in augmentMetaLocked" caused by distributed queries encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
SHA: https://github.com/cockroachdb/cockroach/commits/766a53defc27940d41ae8987792ecd79e76d2b65
Parameters:
To repro, try:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1090948&tab=buildLog