roachtest: tpcc/nodes=3/w=max failed with foreign key violation #35812
(Embedded snippet: cockroach/pkg/storage/replica_follower_read.go, lines 31 to 35 in 57e825a)
Actually quite the opposite: if we saw an inconsistency where a write that the transaction performed was dropped before the transaction came back to read it, the … So why didn't the … Unrelated note: a … Unrelated note 2: I think we're missing a foreign key from the …
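For context, the check referenced above gates whether a follower replica may evaluate a read at all. Here is a minimal sketch of that kind of gate, with hypothetical types standing in for the real ones (the actual logic lives in pkg/storage/replica_follower_read.go):

```go
package main

import "fmt"

// Hypothetical stand-ins for the real types; this is a sketch of the shape of
// the check, not the actual cockroach implementation.
type Timestamp struct{ WallTime int64 }

func (t Timestamp) LessEq(o Timestamp) bool { return t.WallTime <= o.WallTime }

type BatchRequest struct {
	ReadOnly  bool
	Timestamp Timestamp
}

// canServeFollowerRead returns true if a follower whose closed timestamp is
// closedTS may evaluate ba without checking in with the leaseholder: the
// batch must be read-only and must read at or below the closed timestamp,
// which guarantees that no new writes will land underneath it.
func canServeFollowerRead(ba BatchRequest, closedTS Timestamp) bool {
	return ba.ReadOnly && ba.Timestamp.LessEq(closedTS)
}

func main() {
	closed := Timestamp{WallTime: 100}
	fmt.Println(canServeFollowerRead(BatchRequest{ReadOnly: true, Timestamp: Timestamp{WallTime: 90}}, closed))  // true
	fmt.Println(canServeFollowerRead(BatchRequest{ReadOnly: true, Timestamp: Timestamp{WallTime: 110}}, closed)) // false
}
```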
So the sequence of events is something like this?
If we think this has to do with follower reads, maybe it's worth running with a very aggressive closed timestamp duration? I have a half-baked WIP that I think will allow us to run with much smaller closed timestamp durations than today, without permanently wedging longer txns. I didn't test this in any real scenarios though, and I know that it still has some issues. Also, this could "just" be a replica inconsistency, though I really hope it isn't (and the test doesn't do anything like restarting nodes or even serious rebalancing).
Yes, exactly. As Bram pointed out in #26786, each row performs its own fk check, so even though there was only a single SQL statement that should have needed to perform an fk check on the … Interestingly, with only a single fk lookup, this kind of issue would have been turned into a retryable error. The good news is that I dropped … cc @ajwerner, do you have any insight into this?
Do you know that a follower read is happening? / Does the repro disappear if you disable it? We should isolate whether it's an unsafe follower read vs an interaction with the mpt. Are there any concurrent splits or lease transfers?
Reading the above, it was likely due to a txn with an old timestamp, so probably a follower read.
If a txn got pushed, such that it wrote some of its intents at a higher timestamp than planned and would restart on commit, are the fk checks run anyway? If they used the read timestamp, they might miss the intent via a follower read.
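To make the scenario in that question concrete, a sketch with made-up timestamps (none of these values come from the test run):

```go
package main

import "fmt"

// Illustrative sketch, not cockroach code: a txn began reading at ts=10, its
// write got pushed to a provisional commit ts of 20, and a follower has
// closed out ts=15.
func main() {
	const (
		readTS              = 10 // txn's original read timestamp
		provisionalCommitTS = 20 // timestamp the txn's intent was pushed to
		followerClosedTS    = 15 // highest timestamp the follower has closed
	)

	// The follower-read check only considered the batch (read) timestamp:
	servableOnFollower := readTS <= followerClosedTS // true: 10 <= 15

	// But the follower is only known to have caught up through its closed
	// timestamp, which is below the txn's own write:
	writeVisibleOnFollower := provisionalCommitTS <= followerClosedTS // false: 20 > 15

	if servableOnFollower && !writeVisibleOnFollower {
		fmt.Println("an fk check served by the follower can miss the txn's own intent")
	}
}
```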
Good point, we do use the read timestamp. That violates the monotonic timestamp requirement mentioned above. I'm going to make the following change and run again: …
I suspect that doing so will fix this issue immediately. If so, I don't think we have any reason to suspect that this is a more severe closed timestamp issue.
The following diff would also work, though: …
We can decide which one we'd like to go with once it's confirmed that this fixes the issue.
This alone didn't fix the issue. I think that makes sense, given that the diff doesn't actually prevent a writing transaction's batch from evaluating as a follower read; it just reduces the chance that one will be sent to a follower. I'm testing again with an additional check here.
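The shape of such a Replica-side guard, as a hedged sketch (hypothetical types; this is not the actual change): refuse to evaluate a batch as a follower read when its transaction has written.

```go
package main

import "fmt"

// Pared-down stand-ins for the real types; this sketches the guard's shape,
// not the actual cockroach patch.
type Transaction struct {
	Writing bool // set once the txn has laid down intents
}

type BatchRequest struct {
	ReadOnly bool
	Txn      *Transaction
}

// batchCanBeEvaluatedOnFollower refuses follower-read evaluation for any
// batch whose transaction has written, since the follower may not have caught
// up to the txn's provisional commit timestamp.
func batchCanBeEvaluatedOnFollower(ba BatchRequest) bool {
	return ba.ReadOnly && (ba.Txn == nil || !ba.Txn.Writing)
}

func main() {
	writing := &Transaction{Writing: true}
	fmt.Println(batchCanBeEvaluatedOnFollower(BatchRequest{ReadOnly: true, Txn: writing})) // false
	fmt.Println(batchCanBeEvaluatedOnFollower(BatchRequest{ReadOnly: true}))               // true
}
```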
That seemed to do the trick!
Great, one less thing to worry about. Are we going to just file this and disable follower reads for writing txns in the 19.1 time frame?
Yes, that's the plan. There isn't much of a pressing reason for getting it to work at the risk of other bugs.
35969: kv: disallow follower reads for writing transactions r=nvanbenschoten a=nvanbenschoten

Fixes #35812.

To avoid missing its own writes, a transaction must not evaluate a read on a follower who has not caught up to at least its current provisional commit timestamp. We were violating this both at the DistSender level and at the Replica level. Because the ability to perform follower reads in a writing transaction is fairly unimportant and has these known issues, this commit disallows follower reads for writing transactions.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
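The DistSender half of that description can be pictured as a routing decision; a sketch under assumed names (not the actual patch): only prefer a nearby follower when the transaction has not written, otherwise pin the read to the leaseholder, which is guaranteed to have all of the txn's intents.

```go
package main

import "fmt"

// replica is a hypothetical stand-in; the sketch shows only the routing idea.
type replica struct{ name string }

// chooseTarget routes a read to the nearest replica only for non-writing
// transactions; writing txns go to the leaseholder so they are guaranteed to
// see their own intents.
func chooseTarget(leaseholder, nearest replica, txnIsWriting bool) replica {
	if txnIsWriting {
		return leaseholder
	}
	return nearest
}

func main() {
	lh := replica{"leaseholder (n1)"}
	near := replica{"nearest follower (n3)"}
	fmt.Println(chooseTarget(lh, near, true).name)  // leaseholder (n1)
	fmt.Println(chooseTarget(lh, near, false).name) // nearest follower (n3)
}
```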
Pulled from #35337 (comment).
SHA:
https://github.com/cockroachdb/cockroach/commits/57e825a7940495b67e0cc7213a5fabc24e12be0e
Failed test:
https://teamcity.cockroachdb.com/viewLog.html?buildId=1176948&tab=buildLog
Artifacts:
https://drive.google.com/open?id=1bQWTo6DOlNj8ie1cFepZdMNMirRANMq8
A few theories we can immediately ignore:
this was a very long running transaction
We see from the workload logs a cumulative ops/sec for `new_order` transactions of 253.9. The load ran for `43m6s`, so we expect to have performed about `253.9*(43*60+5) = 656,332` new order transactions.

We see from the error that this was order 3050 in warehouse 914 and district 3. Each district begins the workload with 3000 orders. That means that this was the 50th `new_order` transaction performed for this warehouse/district.

The test ran with 1350 warehouses, which means it ran with 13500 unique districts. Given a uniform distribution of `new_order` transactions, we expect the 50th order for a given district to take place after `1350*10*49 = 661,500` completed `new_order` transactions. Since this is close to our estimate for the total number of `new_order` transactions performed, we can conclude that the victim transaction began very recently.
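The arithmetic above, collected into a runnable sanity check (rate, duration, and counts are copied from the report):

```go
package main

import "fmt"

func main() {
	// Estimated total new_order txns: observed cumulative rate times run time.
	opsPerSec := 253.9
	durationSec := 43*60 + 5
	fmt.Printf("estimated total new_order txns: %.0f\n", opsPerSec*float64(durationSec)) // ~656332

	// The failing order was number 3050 in its district; districts start with
	// 3000 orders, so it was the 50th new_order for that warehouse/district.
	nth := 3050 - 3000

	// With 1350 warehouses x 10 districts and a uniform distribution, the
	// 50th order in one district lands after roughly this many cluster-wide
	// new_order txns:
	fmt.Println("expected txns before the 50th order:", 1350*10*(nth-1)) // 661500
}
```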
the transaction was aborted and its abort span was already GCed

The abort span is not removed until an hour after the transaction is last active:

(Embedded snippet: cockroach/pkg/storage/storagebase/base.go, lines 37 to 44 in 2695531)

The load was only running for 43 minutes, so no abort spans should have been GCed.
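That reasoning as a runnable sketch; the one-hour `TxnCleanupThreshold` here is taken from the text above, not copied from base.go:

```go
package main

import (
	"fmt"
	"time"
)

// TxnCleanupThreshold mirrors the constant referenced above: abort span
// entries survive until a transaction has been inactive for this long.
const TxnCleanupThreshold = time.Hour

func main() {
	loadDuration := 43 * time.Minute
	// Even a txn aborted at the very start of the run is still inside the
	// cleanup threshold, so its abort span entry cannot have been GCed yet.
	fmt.Println("abort span GC possible:", loadDuration > TxnCleanupThreshold) // false
}
```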