
spanconfigreconciler: use fixed timestamp when reading descriptors #140400

Merged
merged 1 commit into from
Feb 5, 2025

Conversation

rafiss
Collaborator

@rafiss rafiss commented Feb 3, 2025

Recently, when we changed the default value of autocommit_before_ddl to
true, we found that the chance of hitting a retry error while running
schema changes dramatically increased. The reason was that the
background span reconciler would need locks for the same keys that were
being modified by the schema change job itself -- most notably, the
descriptor table and descriptor ID sequence.

This patch addresses the issue by making the spanconfig reconciler use
the checkpoint timestamp from the rangefeed as the fixed timestamp for
the transaction that reads the descriptors whose spans are being
reconciled. This defensive measure helps us avoid any possibility of
contention caused by this background job's transaction. This allows
us to re-enable the autocommit setting for logictests that run in
multitenancy.

fixes #140172
Release note: None
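
For illustration only, here is a rough sketch of how the reconciler's read transaction could be pinned to the rangefeed checkpoint. This is not necessarily the exact change in this PR; it reuses identifiers that appear in the debugging diff later in this thread (r.execCfg.InternalDB.DescsTxn, checkpoint):

// Hypothetical sketch: pin the reconciler's descriptor reads to the
// rangefeed checkpoint so they happen at that fixed timestamp rather than
// at a fresh timestamp that could push concurrent schema-change writers.
if err := r.execCfg.InternalDB.DescsTxn(ctx, func(
	ctx context.Context, txn descs.Txn,
) error {
	// checkpoint is the hlc.Timestamp most recently delivered by the rangefeed.
	if err := txn.KV().SetFixedTimestamp(ctx, checkpoint); err != nil {
		return err
	}
	// ... translate descriptors and build span config records as before ...
	return nil
}); err != nil {
	return err
}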

@rafiss rafiss requested a review from a team as a code owner February 3, 2025 19:29
@cockroach-teamcity
Member

This change is Reviewable

@msbutler msbutler requested a review from arulajmani February 3, 2025 19:35
@msbutler
Collaborator

msbutler commented Feb 3, 2025

tagging @arulajmani who built this system and who may have opinions on this change

Collaborator

@fqazi fqazi left a comment


:lgtm:

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)

Collaborator

@arulajmani arulajmani left a comment


The way this patch is papering over the contention issue smells. The reconciler is running a read-only transaction, so it forms the read half of a read-write conflict. Typically, the write transaction should be able to commit by refreshing to a higher timestamp. If it isn't able to, it's because the read-refresh of its read set is failing because a (third) transaction has performed an overlapping write. So this patch seems like a brittle bandaid to me.

I call it a brittle bandaid because the patch is removing one source of schema change transactions getting their write timestamp bumped. However, there are other ways this can happen -- e.g. the closed timestamp interval, another read of these tables, etc. When such contention issues come up during customer escalations, our recommendation is generally to dig into why the read-refresh is failing. That's because this battle of "ensure write timestamps are never bumped" is fraught. To that end, we should:

  • Figure out if the schema change transaction has a needlessly large read set.
  • Figure out which third transaction is causing the schema change transaction's read-refresh to fail.
  • Figure out whether the schema change transaction can acquire locks to prevent its read-refreshes from failing.

Reviewed 1 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @msbutler, @rafiss, and @yuzefovich)


-- commits line 7 at r1:
The spanconfig reconciler is not writing to any of these tables, is it? What locks is it acquiring?


-- commits line 17 at r1:
Calling this "more correct" implies that the previous behaviour was wrong -- that seems like a mischaracterization. The only guarantee we need here is that reconciliation happens at a consistent read snapshot, which was the case before.


pkg/spanconfig/spanconfigreconciler/reconciler.go line 530 at r1 (raw file):

				// system.span_count. The spanconfig reconciler needs to read these
				// objects also. A fixed timestamp will avoid contention, and is also
				// more correct since this job should read descriptors using the same

We should drop this "more correct" verbiage -- see my comment above.

Collaborator Author

@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)


-- commits line 7 at r1:

Previously, arulajmani (Arul Ajmani) wrote…

The spanconfig reconciler is not writing to any of these tables, is it? What locks is it acquiring?

We can take it into this thread: https://cockroachlabs.slack.com/archives/C04N0AS14CT/p1738595046864469

Or let's sync up in a huddle so I can show you in person.

Collaborator Author

@rafiss rafiss left a comment


Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani, @msbutler, and @yuzefovich)


-- commits line 17 at r1:

Previously, arulajmani (Arul Ajmani) wrote…

Calling this "more correct" implies that the previous behaviour was wrong -- that seems like a mischaracterization. The only guarantee we need here that reconciliation should happen at a consistent read snapshot, which was the case before.

Why is it not incorrect to read the descriptors at a larger timestamp than the checkpoint we receive from the rangefeed? Doesn't using a larger timestamp mean that the descriptors could look different from how they did at the time they were checkpointed in the rangefeed?

@rafiss
Collaborator Author

rafiss commented Feb 4, 2025

I'll post my investigation here for greater visibility.

To try to debug why the transaction retry errors were occurring, I incrementally added logs throughout spanconfig/KV code. I ended up with this diff:

diff --git a/pkg/kv/kvclient/kvcoord/dist_sender.go b/pkg/kv/kvclient/kvcoord/dist_sender.go
index fb9f3a26cff..3ac2f13bc79 100644
--- a/pkg/kv/kvclient/kvcoord/dist_sender.go
+++ b/pkg/kv/kvclient/kvcoord/dist_sender.go
@@ -1657,6 +1657,8 @@ func (ds *DistSender) detectIntentMissingDueToIntentResolution(
                // commit because a committed record cannot be GC'ed and the recreated as
                // ABORTED.
                if resp.TxnRecordExists {
+                       log.Infof(ctx, "detectIntentMissingDueToIntentResolution aborting %s with ABORT_REASON_ABORTED_RECORD_FOUND since TxnRecordExists",
+                               txn.ID)
                        return false, kvpb.NewTransactionAbortedError(kvpb.ABORT_REASON_ABORTED_RECORD_FOUND)
                }
                return false, kvpb.NewAmbiguousResultErrorf("intent missing and record aborted")
diff --git a/pkg/kv/kvclient/kvcoord/txn_coord_sender.go b/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
index fbc39e1ed3c..625e41bcdec 100644
--- a/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
+++ b/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
@@ -789,6 +789,7 @@ func (tc *TxnCoordSender) maybeRejectClientLocked(
                // this was not due to a synchronous transaction commit and transaction
                // record garbage collection.
                // See the comment on txnHeartbeater.mu.finalObservedStatus for more details.
+               log.Infof(ctx, "client rejected txn %s, protoStatus=%s", tc.mu.txn.ID.Short(), protoStatus)
                abortedErr := kvpb.NewErrorWithTxn(
                        kvpb.NewTransactionAbortedError(kvpb.ABORT_REASON_CLIENT_REJECT), &tc.mu.txn)
                return kvpb.NewError(tc.handleRetryableErrLocked(ctx, abortedErr))
diff --git a/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go b/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go
index 03a630650aa..4698fa59721 100644
--- a/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go
+++ b/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go
@@ -501,6 +501,7 @@ func (h *txnHeartbeater) heartbeatLocked(ctx context.Context) bool {
                        // provides more clarity. We do however prevent it from running more
                        // requests in case it isn't aware that the transaction is over.
                        log.VEventf(ctx, 1, "Heartbeat detected aborted txn, cleaning up for %s", h.mu.txn)
+                       log.Infof(ctx, "Heartbeat detected aborted txn, cleaning up for %s; err: %v", h.mu.txn, pErr)
                        h.abortTxnAsyncLocked(ctx)
                        h.mu.finalObservedStatus = roachpb.ABORTED
                        return false
@@ -521,6 +522,7 @@ func (h *txnHeartbeater) heartbeatLocked(ctx context.Context) bool {
                        // Roll back the transaction record to clean up intents and
                        // then shut down the heartbeat loop.
                        log.VEventf(ctx, 1, "Heartbeat detected aborted txn, cleaning up for %s", h.mu.txn)
+                       log.Infof(ctx, "Heartbeat detected aborted txn, cleaning up for %s; respTxn is finalized", h.mu.txn)
                        h.abortTxnAsyncLocked(ctx)
                }
                h.mu.finalObservedStatus = respTxn.Status
diff --git a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
index 1c8f59e75ca..b1857fd8225 100644
--- a/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
+++ b/pkg/kv/kvserver/batcheval/cmd_end_transaction.go
@@ -343,6 +343,8 @@ func EndTxn(
                        // Similarly to above, use alwaysReturn==true. The caller isn't trying
                        // to abort, but the transaction is definitely aborted and its locks
                        // can go.
+                       log.Infof(ctx, "EndTxn aborting %s with ABORT_REASON_ABORTED_RECORD_FOUND; found %d intents",
+                               reply.Txn.ID, len(reply.Txn.LockSpans))
                        reply.Txn.LockSpans = args.LockSpans
                        return result.FromEndTxn(reply.Txn, true /* alwaysReturn */, args.Poison),
                                kvpb.NewTransactionAbortedError(kvpb.ABORT_REASON_ABORTED_RECORD_FOUND)
diff --git a/pkg/kv/kvserver/concurrency/lock_table_waiter.go b/pkg/kv/kvserver/concurrency/lock_table_waiter.go
index 97b0f4dafd2..445a81df9e3 100644
--- a/pkg/kv/kvserver/concurrency/lock_table_waiter.go
+++ b/pkg/kv/kvserver/concurrency/lock_table_waiter.go
@@ -6,10 +6,12 @@
 package concurrency

 import (
+       "bytes"
        "context"
        "math"
        "time"

+       "github.com/cockroachdb/cockroach/pkg/util/debugutil"
        "github.com/cockroachdb/cockroach/pkg/kv/kvpb"
        "github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/isolation"
        "github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/lock"
@@ -465,6 +467,24 @@ func (w *lockTableWaiterImpl) pushLockTxn(
                // adjusting the intent, accepting that the intent would then no longer
                // round-trip and would lose the local timestamp if rewritten later.
                log.VEventf(ctx, 2, "pushing timestamp of txn %s above %s", ws.txn.Short(), h.Timestamp)
+               pusher := ""
+               if h.Txn != nil {
+                       pusher = h.Txn.ID.String()
+               }
+               log.Infof(ctx, "pusher %s pushing timestamp of txn %s for key %s (%v) above %s",
+                       pusher, ws.txn.Short(), roachpb.Key(ws.txn.Key), ws.txn.Key, h.Timestamp,
+               )
        } else {
                pushType = kvpb.PUSH_ABORT
                log.VEventf(ctx, 2, "pushing txn %s to abort", ws.txn.Short())
diff --git a/pkg/kv/kvserver/replica.go b/pkg/kv/kvserver/replica.go
index c4c8ea51719..2395d83566f 100644
--- a/pkg/kv/kvserver/replica.go
+++ b/pkg/kv/kvserver/replica.go
@@ -2460,6 +2460,8 @@ func checkIfTxnAborted(
                // We hit the cache, so let the transaction restart.
                log.VEventf(ctx, 1, "found AbortSpan entry for %s with priority %d",
                        txn.ID.Short(), entry.Priority)
+               log.Infof(ctx, "found AbortSpan entry for %s with priority %d",
+                       txn.ID.Short(), entry.Priority)
                newTxn := txn.Clone()
                if entry.Priority > newTxn.Priority {
                        newTxn.Priority = entry.Priority
diff --git a/pkg/kv/kvserver/replica_rangefeed.go b/pkg/kv/kvserver/replica_rangefeed.go
index 13b4c7005ce..c9c63426345 100644
--- a/pkg/kv/kvserver/replica_rangefeed.go
+++ b/pkg/kv/kvserver/replica_rangefeed.go
@@ -161,6 +161,9 @@ func (tp *rangefeedTxnPusher) PushTxns(
                },
        }

+       for _, txn := range txns {
+               log.Infof(ctx, "rangefeedTxnPusher pushing %s", txn.ID)
+       }
        pushedTxnMap, anyAmbiguousAbort, pErr := tp.ir.MaybePushTransactions(
                ctx, pushTxnMap, h, kvpb.PUSH_TIMESTAMP, false, /* skipIfInFlight */
        )
diff --git a/pkg/kv/kvserver/replica_tscache.go b/pkg/kv/kvserver/replica_tscache.go
index 5262fa05b69..7fd72d0872e 100644
--- a/pkg/kv/kvserver/replica_tscache.go
+++ b/pkg/kv/kvserver/replica_tscache.go
@@ -168,6 +168,7 @@ func (r *Replica) updateTimestampCache(
                        // It inserts the timestamp of the final batch in the transaction. This
                        // timestamp must necessarily be equal to or greater than the
                        // transaction's MinTimestamp, which is consulted in CanCreateTxnRecord.
+                       log.Infof(ctx, "EndTxnRequest %s adding tombstone for key %s", txnID, start)
                        key := transactionTombstoneMarker(start, txnID)
                        addToTSCache(key, nil, ts, txnID)
                        // Additionally, EndTxn requests that release replicated locks for
@@ -182,6 +183,7 @@ func (r *Replica) updateTimestampCache(
                        // HeartbeatTxn requests record a tombstone entry when the record is
                        // initially written. This is used when considering potential 1PC
                        // evaluation, avoiding checking for a transaction record on disk.
+                       log.Infof(ctx, "HeartbeatTxnRequest %s adding tombstone for key %s", txnID, start)
                        key := transactionTombstoneMarker(start, txnID)
                        addToTSCache(key, nil, ts, txnID)
                case *kvpb.RecoverTxnRequest:
@@ -196,6 +198,7 @@ func (r *Replica) updateTimestampCache(
                        // MinTimestamp.
                        recovered := resp.(*kvpb.RecoverTxnResponse).RecoveredTxn
                        if recovered.Status.IsFinalized() {
+                               log.Infof(ctx, "RecoverTxnRequest %s adding tombstone for key %s", recovered.ID, start)
                                key := transactionTombstoneMarker(start, recovered.ID)
                                addToTSCache(key, nil, ts, recovered.ID)
                        }
@@ -257,6 +260,7 @@ func (r *Replica) updateTimestampCache(
                                // local clock.
                                // TODO(nvanbenschoten): confirm that this restriction is still
                                // true, now that we ship the timestamp cache on lease transfers.
+                               log.Infof(ctx, "PushTxnRequest txn %s pushee %s pushType %s adding tombstone for key %s", t.PusherTxn.ID, pushee.ID, t.PushType, start)
                                key = transactionTombstoneMarker(start, pushee.ID)
                                pushTS = pushee.MinTimestamp
                        } else {
@@ -656,6 +660,10 @@ func (r *Replica) CanCreateTxnRecord(
                        // the restriction that it couldn't create a transaction record for
                        // us, so it recorded a tombstone cache instead to prevent us
                        // from ever creating a transaction record.
+                       log.Infof(ctx,
+                               `CanCreateTxnRecord returning ABORT_REASON_ABORTED_RECORD_FOUND for %s due to concurrent abort by %s; key: %s, txnMinTS: %s, tombstoneTimestamp: %s`,
+                               txnID, tombstoneTxnID, roachpb.Key(txnKey), txnMinTS, tombstoneTimestamp,
+                       )
                        return false, kvpb.ABORT_REASON_ABORTED_RECORD_FOUND
                }
        }
diff --git a/pkg/spanconfig/spanconfiglimiter/limiter.go b/pkg/spanconfig/spanconfiglimiter/limiter.go
index 4e912097ad4..be588e30a84 100644
--- a/pkg/spanconfig/spanconfiglimiter/limiter.go
+++ b/pkg/spanconfig/spanconfiglimiter/limiter.go
@@ -17,6 +17,8 @@ import (
        "github.com/cockroachdb/cockroach/pkg/sql/isql"
        "github.com/cockroachdb/cockroach/pkg/sql/sem/tree"
        "github.com/cockroachdb/cockroach/pkg/sql/sessiondata"
+       "github.com/cockroachdb/cockroach/pkg/util/debugutil"
+       "github.com/cockroachdb/cockroach/pkg/util/log"
        "github.com/cockroachdb/errors"
 )

@@ -73,6 +75,7 @@ RETURNING span_count
                sessiondata.NodeUserSessionDataOverride,
                updateSpanCountStmt, delta)
        if err != nil {
+               log.Infof(ctx, "error updating span count: %v\n%s", err, debugutil.Stack())
                return false, err
        }
        if len(datums) != 1 {
diff --git a/pkg/spanconfig/spanconfigreconciler/reconciler.go b/pkg/spanconfig/spanconfigreconciler/reconciler.go
index 122cd92fcab..1061f772618 100644
--- a/pkg/spanconfig/spanconfigreconciler/reconciler.go
+++ b/pkg/spanconfig/spanconfigreconciler/reconciler.go
@@ -9,6 +9,9 @@ import (
        "context"
        "time"

+       "github.com/cockroachdb/cockroach/pkg/util/debugutil"
+       "github.com/cockroachdb/cockroach/pkg/util/uuid"
+
        "github.com/cockroachdb/cockroach/pkg/keys"
        "github.com/cockroachdb/cockroach/pkg/kv"
        "github.com/cockroachdb/cockroach/pkg/roachpb"
@@ -231,6 +234,10 @@ func (f *fullReconciler) reconcile(
                ctx context.Context, txn descs.Txn,
        ) error {
                kvTxn = txn.KV()
+               if err := kvTxn.SetUserPriority(roachpb.MinUserPriority); err != nil {
+                       return err
+               }
+               log.Infof(ctx, "full reconciler txn %s", txn.KV().ID())
                translator := f.sqlTranslatorFactory.NewSQLTranslator(txn)
                records, err = spanconfig.FullTranslate(ctx, translator)
                return err
@@ -498,6 +505,7 @@ func (r *incrementalReconciler) reconcile(
                        if len(sqlUpdates) == 0 {
                                return callback(checkpoint) // nothing to do; propagate the checkpoint
                        }
+                       // time.Sleep(10 * time.Second)

                        // Process the SQLUpdates and identify all descriptor IDs that require
                        // translation. If the SQLUpdates includes ProtectedTimestampUpdates then
@@ -517,10 +525,22 @@ func (r *incrementalReconciler) reconcile(
                        var missingProtectedTimestampTargets []spanconfig.SystemTarget
                        var records []spanconfig.Record

+                       var txnID uuid.UUID
                        if err := r.execCfg.InternalDB.DescsTxn(ctx, func(
                                ctx context.Context, txn descs.Txn,
                        ) error {
                                var err error
+                               err = txn.KV().SetUserPriority(roachpb.MinUserPriority)
+                               if err != nil {
+                                       return err
+                               }
+                               txnID = txn.KV().ID()
+                               log.Infof(ctx, "incremental reconciler txn=%s checkpoint=%s allIDs: %v", txnID, checkpoint, allIDs)
+                               log.Infof(ctx, "%s", debugutil.Stack())

                                // TODO(irfansharif): Instead of these filter methods for missing
                                // tables and system targets that live on the Reconciler, we could

I also removed the override of autocommit_before_ddl in logic tests:

diff --git a/pkg/sql/logictest/logic.go b/pkg/sql/logictest/logic.go
index a68eb878c27..45a1e5688a5 100644
--- a/pkg/sql/logictest/logic.go
+++ b/pkg/sql/logictest/logic.go
@@ -1687,12 +1691,12 @@ func (t *logicTest) newCluster(
                // TODO(rafi): Remove this setting. We're adding it since the 3node-tenant
                // config seem to be flaky with autcommit_before_ddl = true. Disable that
                // setting for multitenant configs while the issue is being investigated.
-               if _, err := conn.Exec(
-                       "ALTER TENANT [$1] SET CLUSTER SETTING sql.defaults.autocommit_before_ddl.enabled = false",
-                       tenantID.ToUint64(),
-               ); err != nil {
-                       t.Fatal(err)
-               }
+               // if _, err := conn.Exec(
+               //      "ALTER TENANT [$1] SET CLUSTER SETTING sql.defaults.autocommit_before_ddl.enabled = false",
+               //      tenantID.ToUint64(),
+               // ); err != nil {
+               //      t.Fatal(err)
+               // }

I can observe a failure reliably by running this test. I used these instructions to run the test on EngFlow.

./dev testlogic ccl --config=3node-tenant --files=vectorize --count=1 --ignore-cache

Here's what I found.

The test fails with:

        /var/lib/engflow/worker/work/0/exec/bazel-out/k8-fastbuild/bin/pkg/ccl/logictestccl/tests/3node-tenant/3node-tenant_test_/3node-tenant_test.runfiles/com_github_cockroachdb_cockroach/pkg/sql/logictest/testdata/logic_test/vectorize:941:
        expected success, but found
        (40001) restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=565db22a key=/Tenant/10/Table/7/1/0/0 iso=Serializable pri=0.01082732 epo=0 ts=1738682097.451969559,0 min=1738682097.451969559,0 seq=1029} lock=true stat=PENDING rts=1738682097.451969559,0 wto=false gul=1738682097.951969559,0

I looked in the CRDB logs for the ABORT_REASON_CLIENT_REJECT message from the logging I added, and found:

I250203 19:10:24.672865 17104 kv/kvserver/replica_tscache.go:663 ⋮ [T1,Vsystem,n1,tenant=10,s1,r75/1:‹/Tenant/1{0-1}›] 1653  CanCreateTxnRecord returning ABORT_REASON_ABORTED_RECORD_FOUND for ‹ee961b2f-8239-436e-b0c4-a4cde096cd72› due to concurrent abort by ‹b61d3056-8eb0-42d0-8450-ac491bb35d55›; key: /Tenant/10/Table/7/1/‹0›/‹0›, txnMinTS: 1738609817.488408647,0, tombstoneTimestamp: 1738609817.488408647,0
I250203 19:10:24.673212 12259 kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:504 ⋮ [T10,Vcluster-10,nsql1,txn-hb=ee961b2f] 1654  Heartbeat detected aborted txn, cleaning up for "sql txn" meta={id=ee961b2f key=/Tenant/10/Table/7/1/‹0›/‹0› iso=Serializable pri=0.03936347 epo=0 ts=1738609817.488408647,0 min=1738609817.488408647,0 seq=1029} lock=true stat=PENDING rts=1738609817.488408647,0 wto=false gul=1738609817.988408647,0; err: TransactionAbortedError(ABORT_REASON_ABORTED_RECORD_FOUND): "sql txn" meta={id=ee961b2f key=/Tenant/10/Table/7/1/‹0›/‹0› iso=Serializable pri=0.03936347 epo=0 ts=1738609817.488408647,0 min=1738609817.488408647,0 seq=0} lock=true stat=PENDING rts=1738609817.488408647,0 wto=false gul=1738609817.988408647,0
I250203 19:10:24.674475 17135 kv/kvclient/kvcoord/txn_coord_sender.go:792 ⋮ [T10,Vcluster-10,nsql1,client=127.0.0.1:49480,hostssl,user=root,intExec=update-span-count] 1655  client rejected txn ‹ee961b2f›, protoStatus=PENDING
I250203 19:10:24.675110 6727 spanconfig/spanconfiglimiter/limiter.go:78 ⋮ [T10,Vcluster-10,nsql1,client=127.0.0.1:49480,hostssl,user=root] 1656  error updating span count: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "sql txn" meta={id=ee961b2f key=/Tenant/10/Table/7/1/‹0›/‹0› iso=Serializable pri=0.03936347 epo=0 ts=1738609817.488408647,0 min=1738609817.488408647,0 seq=1029} lock=true stat=PENDING rts=1738609817.488408647,0 wto=false gul=1738609817.988408647,0

This shows us that the logic test statement (running in transaction ee961b2f) failed while updating span counts. Updating the counts has to happen for every schema change operation in multitenant; it seems like a major hotspot when you have a test that is rapidly performing schema changes:

const updateSpanCountStmt = `
INSERT INTO system.span_count (span_count) VALUES ($1)
ON CONFLICT (singleton)
DO UPDATE SET span_count = system.span_count.span_count + $1
RETURNING span_count
`
datums, err := l.ie.QueryRowEx(ctx, "update-span-count", txn,
	sessiondata.NodeUserSessionDataOverride,
	updateSpanCountStmt, delta)
if err != nil {
	return false, err
}

The key referenced in the error is /Tenant/10/Table/7/1/‹0›/‹0›, which is the system.descriptor_id_seq sequence. The reason it aborted is because of the ABORT_REASON_ABORTED_RECORD_FOUND case here:

// If we find another transaction's ID then that transaction has
// aborted us before our transaction record was written. It obeyed
// the restriction that it couldn't create a transaction record for
// us, so it recorded a tombstone cache instead to prevent us
// from ever creating a transaction record.
return false, kvpb.ABORT_REASON_ABORTED_RECORD_FOUND

The other transaction in this case is b61d3056. Looking in the logs for that ID, we see:

I250203 19:10:20.570431 13603 kv/kvserver/concurrency/lock_table_waiter.go:474 ⋮ [T1,Vsystem,n1,tenant=10,s1,r75/1:‹/Tenant/1{0-1}›] 1635  pusher ‹b61d3056-8eb0-42d0-8450-ac491bb35d55› pushing timestamp of txn ee961b2f for key /Tenant/10/Table/7/1/‹0›/‹0› ([‹254› ‹146› ‹143› ‹137› ‹136› ‹136›]) above 1738609820.570422460,0
...
I250203 19:10:22.488810 13603 kv/kvserver/replica_tscache.go:263 ⋮ [T1,Vsystem,n1,s1,r75/1:‹/Tenant/1{0-1}›] 1642  PushTxnRequest txn ‹b61d3056-8eb0-42d0-8450-ac491bb35d55› pushee ‹ee961b2f-8239-436e-b0c4-a4cde096cd72› pushType PUSH_TIMESTAMP adding tombstone for key /Tenant/10/Table/7/1/‹0›/‹0›

And most importantly, this shows that the creator of that transaction is the AUTO SPAN CONFIG RECONCILIATION job.

I250203 19:10:20.436938 4359 spanconfig/spanconfigreconciler/reconciler.go:538 ⋮ [T10,Vcluster-10,nsql1,job=AUTO SPAN CONFIG RECONCILIATION id=1043789956657971201] 1627  incremental reconciler txn=‹b61d3056-8eb0-42d0-8450-ac491bb35d55› checkpoint=1738609817.435992344,0 allIDs: [4 23 33 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137]

if err := r.execCfg.InternalDB.DescsTxn(ctx, func(
	ctx context.Context, txn descs.Txn,
) error {


Figure out which third transaction is causing the schema change transaction's read-refresh to fail.

It doesn't seem like there's any evidence of a third transaction above.

Figure out whether the schema change transaction can acquire locks to prevent its read-refreshes from failing.

How can we make it do that? Should we use SELECT FOR UPDATE on the system.span_count table earlier in the transaction that performs a schema change?
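
For concreteness, here is a hypothetical sketch of what such a locking read could look like, mirroring the QueryRowEx call from spanconfiglimiter quoted above. The statement name and placement are assumptions for illustration, not a concrete proposal:

// Hypothetical: lock the singleton row early in the schema-change
// transaction so that later writers of system.span_count cannot
// invalidate this transaction's reads and force a refresh failure.
const lockSpanCountStmt = `SELECT span_count FROM system.span_count FOR UPDATE`
if _, err := l.ie.QueryRowEx(ctx, "lock-span-count", txn,
	sessiondata.NodeUserSessionDataOverride,
	lockSpanCountStmt,
); err != nil {
	return false, err
}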

@rafiss rafiss force-pushed the deflake-mt-autocommit branch from 264f443 to 5c5ffc5 on February 4, 2025 23:25
@rafiss
Collaborator Author

rafiss commented Feb 4, 2025

I met with Arul and we concluded that the most likely explanation for the behavior exhibited by these logs is overload.

I tried bumping the CPU allocation for this test up to 3; right now it's 2.

exec_properties = {"test.Pool": "heavy"},
shard_count = 48,
tags = ["cpu:2"],

Unfortunately the same flakes still occurred.

I'd like to go forward with this change. I removed the language around it being more correct, and clarified that this change is purely a defensive programming measure.

@rafiss rafiss requested a review from arulajmani February 4, 2025 23:26
Collaborator

@arulajmani arulajmani left a comment


:lgtm:

Thanks for digging into this and posting the detailed analysis above. It was a breeze to engage with!

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @msbutler, @rafiss, and @yuzefovich)

@rafiss
Collaborator Author

rafiss commented Feb 5, 2025

TFTRs!

bors r+
