Skip to content
This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

DB Backend: report explicit error when transactions are used concurrently #37172

Merged
merged 7 commits into from
Jun 15, 2022

Conversation

camdencheek
Copy link
Member

@camdencheek camdencheek commented Jun 13, 2022

This updates our transaction wrapper to return an explicit error whenever a transaction is used concurrently. Concurrent transaction access is a form of race condition that causes (in my experiments) either a conn busy error, a bad connection error, or a panic. So, instead of getting these errors that are very difficult to debug, this PR throws an error logs aan error with the stack trace that actually describes what's wrong. These errors should get pushed to Sentry, which will allow us to track where this is happening

Stacked on https://github.com/sourcegraph/sourcegraph/pull/37167

Test plan

Added DB tests that exercise the error return.

@cla-bot cla-bot bot added the cla-signed label Jun 13, 2022
@camdencheek camdencheek changed the base branch from main to cc/remove-new-untyped-db June 13, 2022 22:20

"github.com/sourcegraph/sourcegraph/internal/database/dbtest"
"github.com/sourcegraph/sourcegraph/internal/database/dbutil"
"github.com/sourcegraph/sourcegraph/lib/errors"
)

func TestTransaction(t *testing.T) {
db := dbtest.NewDB(t)
db := dbtest.NewRawDB(t)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unrelated, but we don't need a db with a frontend schema in these tests

}

func (t *lockingTx) lock() error {
if !t.mu.TryLock() {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woah, a valid use of the new TryLock()!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have to admit I'm a bit skeptical of using TryLock() followed by Lock() in the same function (and there's caveats around using TryLock). But I also don't see another way around it that doesn't end up doing the same thing under the hood (semaphore, atomic compare and swap, etc.)

Copy link
Member Author

@camdencheek camdencheek Jun 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So...Russ Cox is totally correct, including in this case. It is always incorrect to use a transaction concurrently, including after this PR.

However, my concerns are slightly different than Russ's. I want to do anything I can to prevent a panic in production, which might include implementing some imprecise logic that might avoid a handful of panics.

That said, the more important thing here IMO is that we report on incorrect usage, which this does with basically the same consistency as any race detector. This is what TryLock allows us to do: report when the invariant that a transaction should never be used concurrently is violated. A plain Lock cannot do this.

Copy link
Contributor

@mrnugget mrnugget Jun 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. I was just wondering whether we need the TryLock for example or could use a one-word value instead?

type lockingTx struct {
	tx     *sql.Tx

	mu     sync.Mutex
	inUse bool

	logger log.Logger
}

func (t *lockingTx) lock() {
	if t.inUse {
		// For now, log an error, but try to serialize access anyways to try to
		// keep things slightly safer.
		err := errors.WithStack(ErrConcurrentTransactionAccess)
		t.logger.Error("transaction used concurrently", log.Error(err))
	}
	t.mu.Lock()
	t.inUse = true
}

func (t *lockingTx) unlock() {
	t.inUse = false
	t.mu.Unlock()
}

(Edit: ... and that's what I meant with "it all ends up being the same" 😄 because this is not different than what TryLock does somewhere under the hood, but I guess TryLock is now the officially supported version, since I'm not even 100% sure 1-word vars are safe to read concurrently like that.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not safe to read inUse outside the mutex though, right?

@camdencheek
Copy link
Member Author

camdencheek commented Jun 13, 2022

@sourcegraph/code-intel The failures here appear to be actual problems caught by this change. As an example, a transaction is created here, then writeVisibleUploads is called on that tx here, which starts multiple batch writer goroutines on that transaction here.

@coury-clark
Copy link
Contributor

@camdencheek I think you meant to ping code intel above?

@camdencheek
Copy link
Member Author

camdencheek commented Jun 13, 2022

Whoops, yep, thanks @coury-clark. Does editing a comment actually ping? Pinging @sourcegraph/code-intel because I don't know. (sorry for the noise Insights team)

@camdencheek camdencheek force-pushed the cc/remove-new-untyped-db branch from 173d292 to ab6c232 Compare June 14, 2022 17:48
Base automatically changed from cc/remove-new-untyped-db to main June 14, 2022 17:58
@camdencheek camdencheek force-pushed the cc/safe-transactions branch from f340c94 to 3eb6b8b Compare June 14, 2022 18:05
@camdencheek
Copy link
Member Author

In order to make this PR mergeable without blocking on Code Intel folks (who I think are all at an offsite), I've updated this PR to instead just log an error and attempt to synchronize it (best effort, will not work in many cases). This way, we will at least get sentry errors and log messages that will notify us of other places this is happening.

@camdencheek camdencheek marked this pull request as ready for review June 14, 2022 20:27
@sourcegraph-bot
Copy link
Contributor

sourcegraph-bot commented Jun 14, 2022

Codenotify: Notifying subscribers in CODENOTIFY files for diff ee784c5...61b0a38.

Notify File(s)
@efritz internal/database/basestore/handle.go
internal/database/basestore/store_test.go

internal/database/basestore/handle.go Outdated Show resolved Hide resolved
}
defer func() { err = tx.Done(err) }()

return tx.Exec(ctx, sqlf.Sprintf(`INSERT INTO store_counts_test VALUES (%s, 42)`, i))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we just have the first one sleeps for 5 seconds, and fail on the second start? So we don't need 100 goroutines 😂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, that's a much better idea!

@camdencheek camdencheek added the team/search-product Issues owned by the search product team label Jun 15, 2022
@camdencheek camdencheek merged commit 291a3df into main Jun 15, 2022
@camdencheek camdencheek deleted the cc/safe-transactions branch June 15, 2022 14:46
unclejustin pushed a commit that referenced this pull request Jun 15, 2022
…ntly (#37172)

This updates our transaction wrapper to return an explicit error whenever a transaction is used concurrently. Concurrent transaction access is a form of race condition that causes (in my experiments) either a conn busy error, a bad connection error, or a panic. So, instead of getting these errors that are very difficult to debug, this logs error with the stack trace that actually describes what's wrong. These errors should get pushed to Sentry, which will allow us to track where this is happening
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
cla-signed team/search-product Issues owned by the search product team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants