Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: scaledata/distributed_semaphore/nodes=3 failed #33698

Closed
cockroach-teamcity opened this issue Jan 13, 2019 · 24 comments · Fixed by #35249
Closed

roachtest: scaledata/distributed_semaphore/nodes=3 failed #33698

cockroach-teamcity opened this issue Jan 13, 2019 · 24 comments · Fixed by #35249
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

SHA: https://github.com/cockroachdb/cockroach/commits/766a53defc27940d41ae8987792ecd79e76d2b65

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1090948&tab=buildLog

The test failed on master:
	test.go:1101: test timed out (20m0s)
	test.go:696,cluster.go:1511,scaledata.go:126,scaledata.go:53: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1090948-scaledata-distributed-semaphore-nodes-3:4 -- ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.128.0.154:26257,10.128.0.124:26257,10.128.0.155:26257'  returned:
		stderr:
		13 12:19:43.818029251 +0000 UTC m=+574.303702067, now = 2019-01-13 12:19:58.760941404 +0000 UTC m=+589.246614276, took 14.942912209s
		2019/01/13 12:19:58 Aborting Retries because this error of type *crdb.AmbiguousCommitError is not retryable : pq: result is ambiguous (error=initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.128.0.124:26257: connect: connection refused" [exhausted])
		2019/01/13 12:19:58 RobustDB.RandomDB chose DB at index 0
		2019/01/13 12:19:58 [/Users/nathan/Go/src/github.com/scaledata/rksql/src/go/src/rubrik/sqlapp/tx.go:29 pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=4ed1fae5 key=/Table/54/1/417173639025852417/0 rw=true pri=0.00407482 stat=PENDING epo=0 ts=1547381994.115138047,1 orig=1547381983.836350074,0 max=1547381984.336350074,0 wto=true seq=2]
		
		stdout:
		e timestamp 1547381409.525352672,1 within uncertainty interval `t <= 1547381409.538741774,0`; observed timestamps: [{1 1547381409.538741774,0} {2 1547381409.524794522,0}]
		pq: restart transaction: TransactionRetryWithProtoRefreshError: WriteTooOldError: write at timestamp 1547381409.540432066,0 too old; wrote at 1547381409.540432066,2
		pq: restart transaction: TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError: read at time 1547381409.523763080,0 encountered previous write with future timestamp 1547381409.540432066,1 within uncertainty interval `t <= 1547381409.603062510,0`; observed timestamps: [{1 1547381409.603062510,0} {3 1547381409.523763080,0}]
		pq: restart transaction: TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError: read at time 1547381409.524120255,0 encountered previous write with future timestamp 1547381409.540432066,1 within uncertainty interval `t <= 1547381409.603154178,0`; observed timestamps: [{1 1547381409.603154178,0} {3 1547381409.524120255,0}]
		: signal: killed

@cockroach-teamcity cockroach-teamcity added this to the 2.2 milestone Jan 13, 2019
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 13, 2019
@tbg
Copy link
Member

tbg commented Jan 15, 2019

Hmm I can't really read these logs.

ExecuteTx retry attempt 7 failed

is the highest retry count I can find (i.e. there isn't anything that retries 8 times). Also, the log covers approximately 10 minutes, which is for how long the workload is supposed to run for. So what happened to the other 10 minutes? The latest sqlapp log entry is

2019/01/13 12:19:58 [/Users/nathan/Go/src/github.com/scaledata/rksql/src/go/src/rubrik/sqlapp/tx.go:29 pq: restart transaction: TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE): "sql txn" id=4ed1fae5 key=/Table/54/1/417173639025852417/0 rw=true pri=0.00407482 stat=PENDING epo=0 ts=1547381994.115138047,1 orig=1547381983.836350074,0 max=1547381984.336350074,0 wto=true seq=2]

which is just about when it was supposed to shut down. It's also when the chaos runner stops. But then the test just hangs for another ~10min. Looking at the code, this must be caused by sqlapp not returning. Unclear to me why, the cluster is left intact by the chaos runner (i.e. all nodes up). The node logs are innocuous, the only thing I don't understand is this error that pops up a small number of times:

W190113 12:29:48.051981 59542 vendor/google.golang.org/grpc/server.go:666 grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"

CHAOS: 12:20:06 chaos.go:97: restarting :2 (chaos is done)
12:20:06 cluster.go:251: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod start teamcity-1090948-scaledata-distributed-semaphore-nodes-3:2 --encrypt
teamcity-1090948-scaledata-distributed-semaphore-nodes-3: starting..2: 4510

CHAOS: 12:20:08 chaos.go:65: chaos stopping: <nil>
12:29:47 cluster.go:911: fetching debug zip

Not sure what to make of this. I did poke at the debug zip but it doesn't contain SHOW QUERIES or SHOW SESSIONS. We should add both.

@tbg
Copy link
Member

tbg commented Jan 22, 2019

Actually, maybe this log means something bad:

  2019/01/13 12:19:58 Aborting Retries because this error of type *crdb.AmbiguousCommitError is not retryable : pq: result is ambiguous (error=initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.128.0.124:26257: connect: connection refused" [exhausted])

It shouldn't, but just going out on a limb.

We should edit the source code of this generator to emit better logging that would help investigating this particular problem.

@tbg tbg assigned tbg and unassigned andreimatei Jan 22, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/82026117d83262e87873aad52b8eca2dd0bea335

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126329&tab=buildLog

The test failed on provisional_201902052032_v2.2.0-alpha.20190211:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1126329-scaledata-distributed-semaphore-nodes-3 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		Creating cluster teamcity-1126329-scaledata-distributed-semaphore-nodes-3 with 4 nodes
		Unable to create cluster:
		in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20190122a --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script756390223 --project cockroach-ephemeral]
		Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  '(resource type:pd-ssd)'.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/b9242838f18107318f53c4e22266fd12223f7651

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1126617&tab=buildLog

The test failed on master:
	test.go:743,test.go:755: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-1126617-scaledata-distributed-semaphore-nodes-3 -n 4 --gce-machine-type=n1-standard-4 --gce-zones=us-central1-b,us-west1-b,europe-west2-b --local-ssd-no-ext4-barrier returned:
		stderr:
		
		stdout:
		Creating cluster teamcity-1126617-scaledata-distributed-semaphore-nodes-3 with 4 nodes
		Unable to create cluster:
		in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-1604-xenial-v20190122a --image-project ubuntu-os-cloud --boot-disk-size 10 --boot-disk-type pd-ssd --service-account [email protected] --local-ssd interface=SCSI --machine-type n1-standard-4 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script828998365 --project cockroach-ephemeral]
		Output: ERROR: (gcloud.compute.instances.create) Could not fetch resource:
		 - The zone 'projects/cockroach-ephemeral/zones/us-central1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.
		
		: exit status 1
		Cleaning up...
		: exit status 1

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4e2e450ecf5c3f73c7de46e48696528468803db7

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1128035&tab=buildLog

The test failed on master:
	test.go:743,cluster.go:1585,scaledata.go:126,scaledata.go:53: unexpected node event: 2: dead

@tbg
Copy link
Member

tbg commented Feb 11, 2019

@andreimatei

txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" id=2726a642 key=/Table/53/1/424262250136109059/0 rw=true pri=0.04558042 stat=ABORTED epo=2 ts=1549545461.435381416,0 orig=1549545461.435381416,0 max=1549545461.932542083,0

I190207 13:17:41.818247 92503 sql/distsql_running.go:149  [n3,client=10.142.0.26:45770,user=root] client rejected when attempting to run DistSQL plan: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" id=2726a642 key=/Table/53/1/424262250136109059/0 rw=true pri=0.04558042 stat=ABORTED epo=2 ts=1549545461.435381416,0 orig=1549545461.435381416,0 max=1549545461.932542083,0 wto=false seq=2
F190207 13:17:42.443155 92621 kv/txn_interceptor_heartbeat.go:447  txn committed or aborted but heartbeat loop hasn't been signaled to stop. txn: "sql txn" id=2726a642 key=/Table/53/1/424262250136109059/0 rw=true pri=0.04558042 stat=ABORTED epo=2 ts=1549545461.435381416,0 orig=1549545461.435381416,0 max=1549545461.932542083,0 wto=false seq=2
goroutine 92621 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x518a800, 0x1f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x58fadc0, 0xc000000004, 0x518a8f7, 0x1f, 0x1bf, 0xc005ca9560, 0x114)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:872 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x38cdfc0, 0xc007205d40, 0x4, 0x2, 0x31d72f1, 0x51, 0xc006005ad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x38cdfc0, 0xc007205d40, 0x1, 0xc000000004, 0x31d72f1, 0x51, 0xc00618cad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x38cdfc0, 0xc007205d40, 0x31d72f1, 0x51, 0xc00618cad8, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).heartbeat(0xc008976d88, 0x38cdfc0, 0xc007205d40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:447 +0x150
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).heartbeatLoop(0xc008976d88, 0x38cdfc0, 0xc007205d40)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:411 +0x1f7
github.com/cockroachdb/cockroach/pkg/kv.(*txnHeartbeat).startHeartbeatLoopLocked.func1(0x38cdfc0, 0xc007205d40)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_interceptor_heartbeat.go:370 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc0001c4630, 0x38cdfc0, 0xc007205d40, 0xc0083f11a0, 0x29, 0x3902ec0, 0xc0002eac00, 0xc009802120)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:323 +0xe6
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:318 +0x134

@tbg tbg assigned andreimatei and unassigned tbg Feb 11, 2019
@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/10f8010fa5778e740c057905e2d7664b5fd5d647

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1135549&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 2: dead

@tbg
Copy link
Member

tbg commented Feb 13, 2019

Same. cc @andreimatei

F190213 14:08:54.279702 129680 kv/txn_coord_sender.go:547  [n2,client=10.142.0.2:56154,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=a713e91e key=/Table/53/1/425970970990706690/0 rw=true pri=0.05974113 stat=ABORTED epo=1 ts=1550066933.723333955,1 orig=1550066933.723333955,1 max=1550066934.222991287,0 wto=false seq=0
goroutine 129680 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x51af200, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1016 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x591e240, 0xc000000004, 0x51af2fd, 0x16, 0x223, 0xc000a85320, 0x11d)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:872 +0x95a
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x38e6ea0, 0xc005c91440, 0x4, 0x2, 0x31bc6b3, 0x33, 0xc005135960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:85 +0x2d5
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x38e6ea0, 0xc005c91440, 0x1, 0x4, 0x31bc6b3, 0x33, 0xc005135960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:71 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatalf(0x38e6ea0, 0xc005c91440, 0x31bc6b3, 0x33, 0xc005135960, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:182 +0x7e
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).augmentMetaLocked(0xc0053ea000, 0x38e6ea0, 0xc005c91440, 0x6d418db41ee913a7, 0x36fe8481653232bd, 0xc005f94be0, 0xc, 0x12, 0x1, 0x1582f1a9e2a88543, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:547 +0x1d3
github.com/cockroachdb/cockroach/pkg/kv.(*TxnCoordSender).AugmentMeta(0xc0053ea000, 0x38e6ea0, 0xc005c91440, 0x6d418db41ee913a7, 0x36fe8481653232bd, 0xc005f94be0, 0xc, 0x12, 0x1, 0x1582f1a9e2a88543, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/txn_coord_sender.go:540 +0x101
github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).AugmentTxnCoordMeta(0xc009113b00, 0x38e6ea0, 0xc005c91440, 0x6d418db41ee913a7, 0x36fe8481653232bd, 0xc005f94be0, 0xc, 0x12, 0x1, 0x1582f1a9e2a88543, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:868 +0xce
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLReceiver).Push(0xc0004c8dc0, 0x0, 0x0, 0x0, 0xc0099530d8, 0xc00922c8b0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:463 +0x493
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*accountClearingRowReceiver).Push(0xc006d6fd40, 0x0, 0x0, 0x0, 0xc0099530d8, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:266 +0x83
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*copyingRowReceiver).Push(0xc005674700, 0x0, 0x0, 0x0, 0xc0099530d8, 0xc000000000)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:693 +0x18b
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.Run(0x38e6ea0, 0xc006d6fa70, 0x38f2ea0, 0xc00922c480, 0x38d85a0, 0xc005674700)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/base.go:178 +0x5e
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*ProcessorBase).Run(0xc00922c480, 0x38e6ea0, 0xc006d6fa70)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/processors.go:801 +0x92
github.com/cockroachdb/cockroach/pkg/sql/distsqlrun.(*Flow).Run(0xc0089c6380, 0x38e6ea0, 0xc006d6fa70, 0x3299218, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsqlrun/flow.go:649 +0x209
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc000553540, 0xc00690e6c0, 0xc009113b00, 0xc005136990, 0xc0004c8dc0, 0xc0060d3518, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:252 +0x8a0
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc000553540, 0x38e6ea0, 0xc005c91440, 0xc0060d3518, 0xc00690e6c0, 0xc009113b00, 0x38e8aa0, 0xc009654700, 0xc0004c8dc0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:795 +0x227
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc0060d3000, 0x38e6ea0, 0xc005c91440, 0xc0060d3458, 0x3, 0x7f83873dfae0, 0xc006275b80, 0x1, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1044 +0x26e
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc0060d3000, 0x38e6ea0, 0xc005c91440, 0xc0060d3458, 0x7f83873dfae0, 0xc006275b80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:900 +0x644
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc0060d3000, 0x38e6ea0, 0xc005c91440, 0x38ec8e0, 0xc005d2b980, 0xc00573940e, 0x5a, 0x1, 0xc00540e900, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:438 +0xe49
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc0060d3000, 0x38e6ea0, 0xc005c91440, 0x38ec8e0, 0xc005d2b980, 0xc00573940e, 0x5a, 0x1, 0xc00540e900, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:93 +0x36f
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc0060d3000, 0x38e6de0, 0xc0061de940, 0xc0005dd9b8, 0x5400, 0x15000, 0xc0005dda50, 0xc009098510, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1241 +0x1439
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000c381a0, 0x38e6de0, 0xc0061de940, 0xc0060d3000, 0x5400, 0x15000, 0xc0005dda50, 0xc009098510, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:429 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000c381a0, 0xc00616e000, 0xc00906e724, 0xc009098510, 0x38e6de0, 0xc0061de940, 0xc0060d3000, 0x5400, 0x15000, 0xc0005dda50, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:331 +0xcf
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:314 +0x1040

@andreimatei
Copy link
Contributor

andreimatei commented Feb 13, 2019

#34695 is the failure for the 6-node version of this test. They're probably related.

@tbg you say "same", but the fatal you pasted is "unexpected non-pending txn in augmentMetaLocked" which is a new one, isn't it? Although I'm sure it's related to the "txn committed or aborted but heartbeat loop hasn't been signaled to stop" condition we've seen before.

@tbg
Copy link
Member

tbg commented Feb 13, 2019 via email

@andreimatei
Copy link
Contributor

I've been making progress in the meantime.

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/4fc4d63ddddeb507564bfe53c6cd79549ff2fd27

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1141650&tab=buildLog

The test failed on master:
	cluster.go:1192,chaos.go:92,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1141650-scaledata-distributed-semaphore-nodes-3:1 returned:
		stderr:
		
		stdout:
		teamcity-1141650-scaledata-distributed-semaphore-nodes-3: stopping and waiting: signal: killed
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 2: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/8e9a1e310e3e8e37f091b7ca8bd204084ad9e2e5

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1142461&tab=buildLog

The test failed on master:
	cluster.go:1192,chaos.go:92,cluster.go:1564,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-1142461-scaledata-distributed-semaphore-nodes-3:1 returned:
		stderr:
		
		stdout:
		teamcity-1142461-scaledata-distributed-semaphore-nodes-3: stopping and waiting: signal: killed
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 3: dead

@tbg
Copy link
Member

tbg commented Feb 18, 2019

F190217 14:29:10.671394 63133 kv/txn_coord_sender.go:577  [n2,client=10.142.0.5:33496,user=root] unexpected non-pending txn in augmentMetaLocked: "sql txn" id=33909433 key=/Table/53/1/427108402845548547/0 rw=true pri=0.12336164 stat=ABORTED epo=1 ts=1550413750.090045549,17 orig=1550413750.090045549,17 max=1550413750.568148075,7 wto=false seq=0
goroutine 63133 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0xc000057b00, 0xc000057b60, 0x5206300, 0x16)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:1018 +0xd4
github.com/cockroachdb/cockroach/pkg/util/log.(*loggingT).outputLogEntry(0x59774e0, 0xc000000004, 0x52063f6, 0x16, 0x241, 0xc008de0480, 0x11f)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:874 +0x95a

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/bd80a74f882a583d6bb2a04dfdb57b49254bc7ba

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1143393&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1212: unexpected node event: 3: dead

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/763e90b043b5d732856d3ecf1d7b0d6aa33e3b26

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1147715&tab=buildLog

The test failed on provisional_201902192209_v19.1.0-beta.20190225:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1211: unexpected node event: 2: dead

@tbg
Copy link
Member

tbg commented Feb 22, 2019

^- same as usual

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/d0a93d286ee42ceb94f986b6a06a1afd52cf914e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1149020&tab=buildLog

The test failed on master:
	cluster.go:1585,scaledata.go:126,scaledata.go:53,test.go:1211: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1149020-scaledata-distributed-semaphore-nodes-3:4 -- ./distributed_semaphore  --duration_secs=600 --num_workers=16 --cockroach_ip_addresses_csv='10.142.0.10:26257,10.142.0.29:26257,10.142.0.7:26257'  returned:
		stderr:
		2019/02/23 15:19:37 connecting to CockroachDB in insecure mode
		2019/02/23 15:19:37 WARNING: crdbutil: connecting to crdb using root user
		2019/02/23 15:19:37 connecting to CockroachDB in insecure mode
		2019/02/23 15:19:37 WARNING: crdbutil: connecting to crdb using root user
		2019/02/23 15:19:37 connecting to CockroachDB in insecure mode
		2019/02/23 15:19:37 WARNING: crdbutil: connecting to crdb using root user
		2019/02/23 15:19:37 dial tcp 10.142.0.29:26257: connect: connection refused
		Error:  exit status 255
		
		stdout:
		: exit status 1

@tbg
Copy link
Member

tbg commented Feb 27, 2019

#34695 (comment)

@andreimatei
Copy link
Contributor

andreimatei commented Feb 27, 2019

@nvanbenschoten would you mind taking a look at the last failure here? It's different than the other (and also I think I had fixed the others). This one looks like the semaphores app crashed because it couldn't connect to node 2, and node 2 was dead because of chaos at that time. Do we have sources for these "apps"? I see the test downloads some binaries.
One thing I've noticed in the node logs is that schema changes seemed to be running minutes after test start, so I'm thinking perhaps the application started up way later than it usually does in this particular test run and it overlapped with the chaos differently than usually?

andreimatei added a commit to andreimatei/cockroach that referenced this issue Feb 27, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and I
believe that has fixed the majority of crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError should also be well
possible)

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

(I believe all the issues above were really fixed by cockroachdb#35105 but this
patch makes it more convincing)

Release note: None
@tbg
Copy link
Member

tbg commented Feb 28, 2019

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/3a7ea2d8c9d4a3e0d97f8f106fcf95b3f03765ec

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1187480&tab=buildLog

The test failed on master:
	cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 3: dead

@tbg
Copy link
Member

tbg commented Mar 20, 2019

Looks like this tickled a zig zag bug:

E190320 12:17:53.060953 523 util/log/crash_reporting.go:216  [n3,client=10.142.0.179:55746,user=root] a panic has occurred!
panic: non-values node passed as fixed value to zigzag join [recovered]
	panic: non-values node passed as fixed value to zigzag join [recovered]
	panic: panic while executing 1 statements: SELECT * FROM _ WHERE (_._ IS _) AND (((_ = $1) AND (_ = $2))); caused by non-values node passed as fixed value to zigzag join

goroutine 565 [running]:
panic(0x3103ee0, 0xc00644cab0)
	/usr/local/go/src/runtime/panic.go:556 +0x2cb fp=0xc006433408 sp=0xc006433378 pc=0x72b49b
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).closeWrapper(0xc005efd100, 0x39fc4a0, 0xc0062e5c80, 0x2d97f80, 0x39b5110)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:714 +0x36d fp=0xc0064334c8 sp=0xc006433408 pc=0x1fadaad
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn.func1(0xc005efd100, 0x39fc4a0, 0xc0062e5c80)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:426 +0x61 fp=0xc006433500 sp=0xc0064334c8 pc=0x213aa11
runtime.call32(0x0, 0x33f85f0, 0xc005377700, 0x1800000018)
	/usr/local/go/src/runtime/asm_amd64.s:522 +0x3b fp=0xc006433530 sp=0xc006433500 pc=0x75a06b
panic(0x2d97f80, 0x39b5110)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9 fp=0xc0064335c0 sp=0xc006433530 pc=0x72b389
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).Build.func1(0xc0064349a8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:66 +0x9b fp=0xc0064335e8 sp=0xc0064335c0 pc=0x1bcb19b
runtime.call32(0x0, 0x33f9038, 0xc000094a70, 0x800000008)
	/usr/local/go/src/runtime/asm_amd64.s:522 +0x3b fp=0xc006433618 sp=0xc0064335e8 pc=0x75a06b
panic(0x2d97f80, 0x39b5110)
	/usr/local/go/src/runtime/panic.go:513 +0x1b9 fp=0xc0064336a8 sp=0xc006433618 pc=0x72b389
github.com/cockroachdb/cockroach/pkg/sql.(*execFactory).ConstructZigzagJoin(0xc005b09a88, 0x3a3c160, 0xc0064eb040, 0x3a2aa80, 0xc005da5d28, 0x3a3c160, 0xc0064eb040, 0x3a2aa80, 0xc005da5d90, 0xc0064f0bb8, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt_exec_factory.go:736 +0x8ea fp=0xc006433758 sp=0xc0064336a8 pc=0x206d37a
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildZigzagJoin(0xc0064462a0, 0xc0063f71f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc006433ba0, 0x1aef637)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:1235 +0xe46 fp=0xc006433b10 sp=0xc006433758 pc=0x1bc0626
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildRelational(0xc0064462a0, 0x3a36200, 0xc0063f71f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:190 +0x732 fp=0xc006434058 sp=0xc006433b10 pc=0x1bb5022
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildLookupJoin(0xc0064462a0, 0xc0064e0a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc0064343c0, 0xc0064343c8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:1128 +0x97 fp=0xc006434328 sp=0xc006434058 pc=0x1bbedf7
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).buildRelational(0xc0064462a0, 0x3a353a0, 0xc0064e0a00, 0x0, 0x0, 0x0, 0x0, 0x0, 0xc000094a40, 0x70b01f)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/relational_builder.go:187 +0xe31 fp=0xc006434870 sp=0xc006434328 pc=0x1bb5721
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).build(0xc0064462a0, 0x3a0a4e0, 0xc0064e0a00, 0x0, 0xc006434980, 0x70b8a8, 0x60)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:83 +0xab fp=0xc006434930 sp=0xc006434870 pc=0x1bb36fb
github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder.(*Builder).Build(0xc0064462a0, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/opt/exec/execbuilder/builder.go:71 +0x78 fp=0xc006434990 sp=0xc006434930 pc=0x1bb3578
github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan(0xc005efd450, 0x39fc560, 0xc00644c2a0, 0x32b8e38, 0x19, 0xbe52b8db5, 0x391d63e)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:162 +0x18b fp=0xc006434a18 sp=0xc006434990 pc=0x209988b
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan(0xc005efd100, 0x39fc560, 0xc00644c2a0, 0xc005efd450, 0x6, 0x70b01f)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:981 +0x131 fp=0xc006434ad8 sp=0xc006434a18 pc=0x1fc00e1
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc005efd100, 0x39fc560, 0xc00644c2a0, 0xc005efd450, 0x7fc652d8a340, 0xc0059c5d90, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:868 +0x148 fp=0xc006434bd8 sp=0xc006434ad8 pc=0x1fbf3c8
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc005efd100, 0x39fc560, 0xc00644c2a0, 0x3a020a0, 0xc0062e5d80, 0xc00638e023, 0x6e, 0x2, 0xc0064fcd80, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:459 +0xdf6 fp=0xc0064353b8 sp=0xc006434bd8 pc=0x1fbb976
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc005efd100, 0x39fc560, 0xc00644c2a0, 0x3a020a0, 0xc0062e5d80, 0xc00638e023, 0x6e, 0x2, 0xc0064fcd80, 0x7, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:101 +0x610 fp=0xc006435660 sp=0xc0064353b8 pc=0x1fba700
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc005efd100, 0x39fc4a0, 0xc0062e5c80, 0xc000959538, 0x5400, 0x15000, 0xc0009595d0, 0xc0063604b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1237 +0x1418 fp=0xc006435ec8 sp=0xc006435660 pc=0x1fb04d8
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc000c54000, 0x39fc4a0, 0xc0062e5c80, 0xc005efd100, 0x5400, 0x15000, 0xc0009595d0, 0xc0063604b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:428 +0xce fp=0xc006435f28 sp=0xc006435ec8 pc=0x1fabe2e
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl.func4(0xc000c54000, 0x39fc4a0, 0xc0062e5c80, 0xc000012400, 0xc00636aba4, 0xc0063604b0, 0xc005efd100, 0x5400, 0x15000, 0xc0009595d0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:338 +0xe6 fp=0xc006435f88 sp=0xc006435f28 pc=0x22323b6
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1333 +0x1 fp=0xc006435f90 sp=0xc006435f88 pc=0x75bd81
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:321 +0x1080

cc @justinj

@cockroach-teamcity
Copy link
Member Author

SHA: https://github.com/cockroachdb/cockroach/commits/dfa23c01e4ea39b19ca8b2e5c8a4e7cf9b9445f4

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=scaledata/distributed_semaphore/nodes=3 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1189954&tab=buildLog

The test failed on master:
	cluster.go:1626,scaledata.go:126,scaledata.go:53,test.go:1214: unexpected node event: 3: dead

andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
craig bot pushed a commit that referenced this issue Mar 21, 2019
35249: kv: don't ingest aborted TxnCoordMeta r=andreimatei a=andreimatei

Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since #35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes #34695
Fixes #34341
Fixes #33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.

Co-authored-by: Andrei Matei <[email protected]>
@craig craig bot closed this as completed in #35249 Mar 21, 2019
andreimatei added a commit to andreimatei/cockroach that referenced this issue Mar 21, 2019
Remote DistSQL flows pass TxnCoordMeta records to the Root
Txn(CoordSender) as trailing metadata. The TCS ingests these records and
updates its state (mostly for read spans).
This patch makes it so that we don't ingest records with an ABORTED txn
proto. Why not? Because, well, unfortunately we are not well equiped at
the moment for finding out about an aborted txn this way.
The idea is that, if the Root was running along happily and all of a
sudden ingests one of these Aborted protos, it would put it in an
inconsistent state: with an Aborted proto but with the heartbeat loop
still running. We don't like that state and we have assertions against
it. The expectation is that the TCS finds out about aborted txns in one
of two ways: through a TxnAbortedError, in which case it rolls back the
txn, or through the heartbeat loop discovering the aborted txn, in which
case it again rolls back (and a 3rd way through a remote TxnAbortedErr;
see below). We have not considered this 4th way of finding
out, through a remote TxnCoordMeta, and I don't really want to deal with
it because, with current code, it's already awkward enough to handle the
other cases.

In practice, a TxnCoordMeta with an ABORTED proto is expected to follow
a TxnAbortedError that is passed through DistSQL to the gateway (the
DistSQLReceiver) before the TxnCoordMeta. That case we handle; we inject
retriable errors into the Root txn and the TCS rolls back. After this
rollback, injesting the ABORTED proto just works (it's a no-op).
However, alas, there's a case where the TxnAbortedError is not passed to
the TCS: this is when another concurrent error was received first by the
DistSQLReceiver. In that case, the 2nd error is ignored, and so this
patch makes it so that we also effectively ignore the upcoming
TxnCoordMeta.

I'm separately reworking the way error handling happens in the Txn/TCS
and that work should make this unfortunate patch unnecessary.

(since cockroachdb#35105 not all preceding errors cause the TxnAbortedError to be
ignored; other retriable errors no longer cause it to be ignored and
that has fixed the some crashes that we've seen because
of this inconsistent state that this patch is trying to avoid. However,
non-retriable errors racing with a TxnAbortedError are also possible,
and we've seen them happen and leading to crashes - in particular, we've
seen RPC errors).

Fixes cockroachdb#34695
Fixes cockroachdb#34341
Fixes cockroachdb#33698

Release note (bug fix): Fix crashes with the message "unexpected
non-pending txn in augmentMetaLocked" caused by distributed queries
encountering multiple errors.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants