-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distsql: TxnCoordSender doesn't allow distribution if transaction changed data #13376
Comments
The solution to this may be
#10511.
…On Thu, Feb 2, 2017 at 1:00 PM, RaduBerinde ***@***.***> wrote:
The TxnCoordSender assumes everything happens on a node, except for
read-only transactions. DistSQL assumes that read-only queries can be
distributed. The problem is in the difference between "transaction" and
"query": if the transaction earlier modified data, a read-only query in
that transaction apparently can't be distributed - i saw the error below
while running TestLogicDistSQL on a 3 node cluster with the fake span
resolver.
statement ok
BEGIN TRANSACTION
statement ok
CREATE TABLE kv (
k CHAR PRIMARY KEY,
v TIMESTAMP WITH TIME ZONE
)
statement ok
INSERT INTO kv (k,v) VALUES ('a', transaction_timestamp())
query T
SELECT k FROM kv <----------
E170202 12:48:14.328709 17874 sql/distsqlrun/tablereader.go:159 [n3,fe9cf6117,TableReader=53] scan error: writing transaction timed out or ran on multiple coordinators
E170202 12:48:14.329505 17855 sql/distsqlrun/server.go:181 [n1] writing transaction timed out or ran on multiple coordinators
The select was simply running the scan on another node:
[image: image]
<https://cloud.githubusercontent.com/assets/16544120/22561644/0e1f14b4-e947-11e6-8d88-e29c3d5ddd3b.png>
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#13376>, or mute the
thread
<https://github.com/notifications/unsubscribe-auth/ABdsPDwDoLGVYK0IZdJl1oe9fqN_MIVoks5rYhmugaJpZM4L1cAM>
.
|
Thanks, I meant to reference that issue but I forgot. |
The fact that read-only transactions can be distributed like this is an accident. If we decide it's a feature we want to support (and I don't see why we wouldn't), we could simply skip the "multiple coordinator" check for read-only batches. No need for anything as intrusive as #10511. |
It used to be like that; the check was made more strict to handle a subtle condition, details in 5fec899 |
Temporary workaround for cockroachdb#13376.
Temporary workaround for cockroachdb#13376.
The scenario addressed by the "multiple coordinator" check (5fec899) is:
The read at the end should error out; without the check it would just run and it won't see anything that was written earlier in that txn (traces of that txn are gone). One way to check for this wile still allowing distribution of the read-only query is to move up the check into sql. If the txn has an anchor key, the sql layer could call into the TxnCoordSender to verify that it still has the txn on file. It can check this after every statement it runs (when it still has a chance to return an error for the statement). Note that this may become problematic if we ever stream results directly to the client (we would stream incorrect results before the error comes). |
FWIW this change that introduced the check in TxnCoordSender was merged last March, a few weeks before the |
The abort cache (like the sequence cache that did the same thing and preceded 5fec899) cannot (reliably) help with timed-out transactions, because the abort cache entry becomes eligible for GC at the same time the transaction times out. So to reliably avoid the anomaly fixed in #5323, we need to either lift the transaction liveness check up to the sql/distsql level like @RaduBerinde suggested, or push it down to each node that handles part of the transaction (basically, allow multiple coordinators as long as only one writes, but each coordinator would be responsible for heartbeating or polling the transaction status). |
Temporary workaround for cockroachdb#13376.
Note to self: currently the "writing transaction timed out or ran on multiple coordinators" check is in the wrong place. It's done in |
…ashion The key idea here is to stop using a single TxnCoordSender but instead provide a factory for stateful instances of TxnCoordSender to be used with each client.DB transaction. The state stored by the TxnCoordSender can be fetched, collected, passed, and then used to augment a parent or sibling TxnCoordSender. This allows DistSQL to use the TxnCoordSender just like normal SQL and paves the way for mutating distributed SQL transactions. With this change, we get the added benefit of observed timestamps being correct returned to the root transaction through distributed sql flows, which should prevent unnecessary restarts from uncertainty interval errors. Fixes cockroachdb#10511 Fixes cockroachdb#13376 Release note: None
…ashion The key idea here is to stop using a single TxnCoordSender but instead provide a factory for stateful instances of TxnCoordSender to be used with each client.DB transaction. The state stored by the TxnCoordSender can be fetched, collected, passed, and then used to augment a parent or sibling TxnCoordSender. This allows DistSQL to use the TxnCoordSender just like normal SQL and paves the way for mutating distributed SQL transactions. With this change, we get the added benefit of observed timestamps being correct returned to the root transaction through distributed sql flows, which should prevent unnecessary restarts from uncertainty interval errors. Fixes cockroachdb#10511 Fixes cockroachdb#13376 Release note (dist sql): Removes the limitation that distributed SQL is not used once a transaction has had writes.
…ashion The key idea here is to stop using a single TxnCoordSender but instead provide a factory for stateful instances of TxnCoordSender to be used with each client.DB transaction. The state stored by the TxnCoordSender can be fetched, collected, passed, and then used to augment a parent or sibling TxnCoordSender. This allows DistSQL to use the TxnCoordSender just like normal SQL and paves the way for mutating distributed SQL transactions. With this change, we get the added benefit of observed timestamps being correct returned to the root transaction through distributed sql flows, which should prevent unnecessary restarts from uncertainty interval errors. Fixes cockroachdb#10511 Fixes cockroachdb#13376 Release note (dist sql): Removes the limitation that distributed SQL is not used once a transaction has had writes.
…ashion The key idea here is to stop using a single TxnCoordSender but instead provide a factory for stateful instances of TxnCoordSender to be used with each client.DB transaction. The state stored by the TxnCoordSender can be fetched, collected, passed, and then used to augment a parent or sibling TxnCoordSender. This allows DistSQL to use the TxnCoordSender just like normal SQL and paves the way for mutating distributed SQL transactions. With this change, we get the added benefit of observed timestamps being correct returned to the root transaction through distributed sql flows, which should prevent unnecessary restarts from uncertainty interval errors. Fixes cockroachdb#10511 Fixes cockroachdb#13376 Release note (dist sql): Removes the limitation that distributed SQL is not used once a transaction has had writes.
The TxnCoordSender assumes everything happens on a node, except for read-only transactions. DistSQL assumes that read-only queries can be distributed. The problem is in the difference between "transaction" and "query": if the transaction earlier modified data, a read-only query in that transaction apparently can't be distributed - i saw the error below while running
TestLogicDistSQL
on a 3 node cluster with the fake span resolver.The select was simply running the scan on another node:
data:image/s3,"s3://crabby-images/607e4/607e4a5deebea885c375924e6016b9ce47a3ad85" alt="image"
The text was updated successfully, but these errors were encountered: