roachtest: tpcc/nodes=3/w=max failed with foreign key violation #35812

Closed
nvanbenschoten opened this issue Mar 15, 2019 · 13 comments
Assignees: nvanbenschoten
Labels: S-1-stability (Severe stability issues that can be fixed by upgrading, but usually don't resolve by restarting)

Comments

@nvanbenschoten
Member

Pulled from #35337 (comment).

SHA:

https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e

Failed test:

https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog

The test failed on master:
	cluster.go:1251,tpcc.go:126,cluster.go:1589,errgroup.go:57: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1176948-tpcc-nodes-3-w-max:4 -- ./workload run tpcc --warehouses=1350 --histograms=logs/stats.json  --ramp=5m0s --duration=2h0m0s {pgurl:1-3} returned:
		stderr:
		
		stdout:
		 243.7          253.9   3489.7  10737.4  28991.0  49392.1 newOrder
		   43m5s        0           25.0           25.6   1208.0   3221.2   3221.2   3221.2 orderStatus
		   43m5s        0          314.6          255.0   3623.9  27917.3  77309.4 103079.2 payment
		   43m5s        0           23.0           25.5   2080.4   3489.7  11274.3  11274.3 stockLevel
		   43m6s        0           15.0           25.5   4563.4  17179.9  34359.7  34359.7 delivery
		   43m6s        0          248.3          253.9   3221.2  13421.8  47244.6 103079.2 newOrder
		   43m6s        0           23.0           25.6    805.3   2952.8   2952.8   2952.8 orderStatus
		   43m6s        0          180.2          255.0   2818.6  15032.4  40802.2 103079.2 payment
		   43m6s        0           22.0           25.5    771.8   3087.0   3087.0   3087.0 stockLevel
		Error: error in newOrder: ERROR: foreign key violation: value [914 3 3050] not found in order@primary [o_w_id o_d_id o_id] (txn=f995c83d-a90d-4856-9b61-72713ca22db5) (SQLSTATE 23503)
		Error:  exit status 1
		: exit status 1
	cluster.go:1610,tpcc.go:136,tpcc.go:160,test.go:1214: Goexit() was called

Artifacts:

https://drive.google.com/open?id=1bQWTo6DOlNj8ie1cFepZdMNMirRANMq8

A few theories we can immediately ignore:

this was a very long running transaction

We see from the workload logs a cumulative rate of 253.9 new_order transactions per second. The load had been running for about 43m5s, so we expect roughly 253.9*(43*60+5) ≈ 656,332 new_order transactions to have been performed.

We see from the error that this was order 3050 in warehouse 914 and district 3. Each district begins the workload with 3000 orders. That means that this was the 50th new_order transaction performed for this warehouse/district.

The test ran with 1350 warehouses, which means it ran with 13500 unique districts. Given a uniform distribution of new_order transactions, we expect the 50th order for a given district to take place after 1350*10*49 = 661,500 completed new_order transactions. Since this is close to our estimate for the total number of new_order transactions performed, we can conclude that the victim transaction began very recently.
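For concreteness, here's a minimal sketch of the back-of-envelope arithmetic above (the rate and elapsed time are simply read off the workload log; nothing here is measured):

package main

import "fmt"

func main() {
    const (
        newOrderRate   = 253.9     // cumulative new_order txns/sec from the workload log
        elapsedSeconds = 43*60 + 5 // ~43m5s of load
        warehouses     = 1350
        districtsPerWH = 10
    )

    // Estimated total new_order transactions performed so far.
    total := newOrderRate * float64(elapsedSeconds)

    // Expected number of completed new_order txns before a given district sees
    // its 50th new order (49 must come first), assuming a uniform distribution
    // across all 13,500 districts.
    expected := warehouses * districtsPerWH * 49

    fmt.Printf("estimated total new_order txns: %.0f\n", total)   // ≈ 656,332
    fmt.Printf("expected txns before 50th order: %d\n", expected) // 661,500
}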

the transaction was aborted and its abort span was already GCed

The abort span is not removed until an hour after the transaction is last active:

// TxnCleanupThreshold is the threshold after which a transaction is
// considered abandoned and fit for removal, as measured by the
// maximum of its last heartbeat and timestamp. Abort spans for the
// transaction are cleaned up at the same time.
//
// TODO(tschottdorf): need to enforce at all times that this is much
// larger than the heartbeat interval used by the coordinator.
const TxnCleanupThreshold = time.Hour

The load was only running for 43 minutes, so no abort spans should have been GCed.

@nvanbenschoten
Member Author

A QueryIntent request being run on a follower replica as a follower read might result in an issue like this. Follower reads were enabled at this SHA:

var FollowerReadsEnabled = settings.RegisterBoolSetting(
    "kv.closed_timestamp.follower_reads_enabled",
    "allow (all) replicas to serve consistent historical reads based on closed timestamp information",
    true,
)

@nvanbenschoten nvanbenschoten self-assigned this Mar 16, 2019
@nvanbenschoten
Member Author

Actually, quite the opposite: if we saw an inconsistency where a write that the transaction performed was dropped before the transaction came back to read it, the QueryIntent prepended to the read should be protecting us and resulting in a RETRY_ASYNC_WRITE_FAILURE error.

So why didn't the QueryIntent throw an error? Perhaps the write was already proven by a previous FK check? I'm not sure whether that is possible though, as the only table with a foreign key on the order table is the new_order table, and there should only be a single batched access to that table, which I assume would result in a single batched fk check. Perhaps this batch got split and the latter group went to a follower who was not sufficiently caught up?

Unrelated note: a new_order transaction will write anywhere from 5 to 15 order_line rows, each of which will point to the same order row. Do we really perform 5-15 separate fk checks? I really hope not.

Unrelated note 2: I think we're missing a foreign key from the new_order table to the order table. The spec says that we should have one.

@tbg
Member

tbg commented Mar 18, 2019

So the sequence of events is something like this?

  • txn inserts into order (pipelined write)
  • txn potentially does other things
  • txn comes back and tries to validate the FK (its earlier insert into order)
  • QueryIntent sent to order (successful)
  • read from order shows key is missing?

If we think this has to do with follower reads, maybe it's worth running with a very aggressive closed timestamp duration? I have a half-baked WIP that I think will allow us to run with much smaller closed timestamp durations than today, without permanently wedging longer txns. I didn't test it in any real scenarios, though, and I know it still has some issues.

Also, this could "just" be a replica inconsistency, though I really hope it isn't (and the test doesn't do anything like restarting nodes or even serious rebalancing).

@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Mar 18, 2019
@nvanbenschoten
Member Author

So the sequence of events is something like this?

Yes, exactly. As Bram pointed out in #26786, each row performs its own fk check, so even though there was only a single SQL statement that should have needed to perform an fk check on the order row, we actually ended up issuing 5-15 individual ScanRequests. This explains how the pipelined order write was proven by a QueryIntent while a later lookup, which had no QueryIntent prefixed to it, was able to miss the row.
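To make that failure mode concrete, here's a rough, hypothetical sketch (the request and batch types are stand-ins, not the real kv API) of how only the first fk check ends up protected by a QueryIntent:

type request struct {
    kind string // "QueryIntent" or "Scan"
    key  string
}

// buildFKChecks sketches the shape of the problem: the insert into order is a
// pipelined write, so the first fk check against order@primary has a
// QueryIntent prepended to prove that write. Each subsequent order_line row
// issues its own fk scan against the same order row with no QueryIntent, so if
// one of those scans is served by a follower that hasn't caught up to the
// txn's write, it can miss the intent and report a spurious fk violation
// instead of a retryable RETRY_ASYNC_WRITE_FAILURE.
func buildFKChecks(orderKey string, numOrderLines int) [][]request {
    batches := [][]request{{
        {kind: "QueryIntent", key: orderKey}, // proves the pipelined write
        {kind: "Scan", key: orderKey},
    }}
    for i := 1; i < numOrderLines; i++ {
        // Unprotected per-row checks: bare scans of the same order row.
        batches = append(batches, []request{{kind: "Scan", key: orderKey}})
    }
    return batches
}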

Interestingly, with only a single fk lookup, this kind of issue would have surfaced as a retryable RETRY_ASYNC_WRITE_FAILURE. I wonder if we've hit this before but had it masked behind an increase in those errors.

The good news is that I dropped kv.closed_timestamp.target_duration down to 5s and hit this failure on 2 of my 4 instances of the test, so I think it's very likely that this is due to a bad interaction with follower reads. It's actually not immediately obvious that follower reads within a writing transaction should always preserve a read-your-writes property. I think the implicit assumption that makes it work is that a transaction will never read at a lower timestamp than it has previously written at. I wonder if we are failing to maintain that invariant somehow, or perhaps this is just a standard closed timestamp bug.
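As a rough illustration of that invariant (a hypothetical helper, not the actual followerreadsccl code), eligibility for a writing transaction has to be judged against the highest timestamp it may have written at, not just the timestamp it originally read at:

// safeForFollowerRead is a hypothetical illustration of the invariant above:
// a follower may serve the read only if its closed timestamp has advanced past
// every timestamp at which the txn may have laid down intents.
func safeForFollowerRead(closedTS, origTS, provisionalCommitTS int64, isWriting bool) bool {
    if !isWriting {
        // Read-only txn: only its original (read) timestamp matters.
        return origTS <= closedTS
    }
    // A writing txn may have written intents up to its provisional commit
    // timestamp, which can be pushed above OrigTimestamp, so that is the
    // timestamp that must be closed out before a follower can safely serve
    // the read without missing the txn's own writes.
    return provisionalCommitTS <= closedTS
}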

cc. @ajwerner do you have any insight into this?

@ajwerner
Contributor

Do you know that a follower read is happening? / Does the repro disappear if you disable it?

We should isolate whether it’s an unsafe follower read vs an interaction with the mpt.

Are there any concurrent splits or lease transfers?

@ajwerner
Contributor

Reading the above, it was likely due to a txn with an old timestamp, so probably a follower read.

@tbg
Member

tbg commented Mar 19, 2019

If a txn got pushed, such that it wrote some of its intents at a higher timestamp than planned and would restart on commit, are the fk checks run anyway? If they used the read timestamp, they might miss the intent via a follower read.

@nvanbenschoten
Member Author

Good point, we do use the OrigTimestamp when determining follower read eligibility, which can absolutely diverge from provisional commit Timestamp:

canUseFollowerRead(clusterID, st, ba.Txn.OrigTimestamp)

That violates the monotonic timestamp requirement mentioned above.

I'm going to make the following change and run again:

+++ b/pkg/ccl/followerreadsccl/followerreads.go
@@ -93,7 +93,7 @@ func canUseFollowerRead(clusterID uuid.UUID, st *cluster.Settings, ts hlc.Timest
 // canSendToFollower implements the logic for checking whether a batch request
 // may be sent to a follower.
 func canSendToFollower(clusterID uuid.UUID, st *cluster.Settings, ba roachpb.BatchRequest) bool {
-       return ba.IsReadOnly() && ba.Txn != nil &&
+       return ba.IsReadOnly() && ba.Txn != nil && !ba.Txn.IsWriting() &&
                canUseFollowerRead(clusterID, st, ba.Txn.OrigTimestamp)
 }

I suspect that doing so will fix this issue immediately. If so, I don't think we have any reason to suspect that this is a more severe closed timestamp issue.

@nvanbenschoten
Member Author

The following diff would also work though:

+++ b/pkg/ccl/followerreadsccl/followerreads.go
@@ -94,7 +94,7 @@ func canUseFollowerRead(clusterID uuid.UUID, st *cluster.Settings, ts hlc.Timest
 // may be sent to a follower.
 func canSendToFollower(clusterID uuid.UUID, st *cluster.Settings, ba roachpb.BatchRequest) bool {
        return ba.IsReadOnly() && ba.Txn != nil &&
-               canUseFollowerRead(clusterID, st, ba.Txn.OrigTimestamp)
+               canUseFollowerRead(clusterID, st, ba.Txn.Timestamp)
 }

We can decide which one we'd like to go with once it's confirmed that this fixes the issue.

@nvanbenschoten
Member Author

This alone didn't fix the issue. I think that makes sense, given that the diff doesn't actually prevent a writing transaction's batch from evaluating as a follower read; it just reduces the chance that one will be sent to a follower. I'm testing again with an additional check here.
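For illustration, a hypothetical sketch of the two layers involved (the names are stand-ins for the DistSender- and Replica-level logic, not the actual code): the routing-side check only influences where a batch is sent, so a replica-side check is needed to refuse to evaluate such a batch as a follower read at all:

// maySendToFollower mirrors the routing-side check from the diff above:
// whether a read-only batch from a non-writing txn below the closed timestamp
// may be routed to a follower. Even with this check, a batch can still end up
// evaluating on a non-leaseholder replica.
func maySendToFollower(readOnly, hasTxn, isWriting, tsIsClosed bool) bool {
    return readOnly && hasTxn && !isWriting && tsIsClosed
}

// mayEvaluateAsFollowerRead sketches the additional replica-side guard: a
// non-leaseholder must refuse to serve the read for a writing txn rather than
// falling back to its closed timestamp.
func mayEvaluateAsFollowerRead(isLeaseholder, isWriting, tsIsClosed bool) bool {
    if isLeaseholder {
        return true
    }
    return !isWriting && tsIsClosed
}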

@nvanbenschoten
Member Author

That seemed to do the trick!

@tbg
Member

tbg commented Mar 19, 2019

Great, one less thing to worry about. Are we going to just file this and disable follower reads for writing txns in the 19.1 time frame?

@nvanbenschoten
Member Author

Are we going to just file this and disable follower reads for writing txns in the 19.1 time frame?

Yes, that's the plan. There isn't much of a pressing reason for getting it to work at the risk of other bugs.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 19, 2019
Fixes cockroachdb#35812.

To avoid missing its own writes, a transaction must not evaluate a read
on a follower who has not caught up to at least its current provisional
commit timestamp. We were violating this both at the DistSender level and
at the Replica level.

Because the ability to perform follower reads in a writing transaction is
fairly unimportant and has these known issues, this commit disallows
follower reads for writing transactions.

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 21, 2019
Fixes cockroachdb#35812.

To avoid missing its own writes, a transaction must not evaluate a read
on a follower who has not caught up to at least its current provisional
commit timestamp. We were violating this both at the DistSender level and
at the Replica level.

Because the ability to perform follower reads in a writing transaction is
fairly unimportant and has these known issues, this commit disallows
follower reads for writing transactions.

Release note: None
craig bot pushed a commit that referenced this issue Mar 21, 2019
35969: kv: disallow follower reads for writing transactions r=nvanbenschoten a=nvanbenschoten

Fixes #35812.

To avoid missing its own writes, a transaction must not evaluate a read
on a follower who has not caught up to at least its current provisional
commit timestamp. We were violating this both at the DistSender level and
at the Replica level.

Because the ability to perform follower reads in a writing transaction is
fairly unimportant and has these known issues, this commit disallows
follower reads for writing transactions.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in #35969 Mar 21, 2019
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 21, 2019
Fixes cockroachdb#35812.

To avoid missing its own writes, a transaction must not evaluate a read
on a follower who has not caught up to at least its current provisional
commit timestamp. We were violating this both at the DistSender level and
at the Replica level.

Because the ability to perform follower reads in a writing transaction is
fairly unimportant and has these known issues, this commit disallows
follower reads for writing transactions.

Release note: None