DB Backend: report explicit error when transactions are used concurrently #37172
	"github.com/sourcegraph/sourcegraph/internal/database/dbtest"
	"github.com/sourcegraph/sourcegraph/internal/database/dbutil"
	"github.com/sourcegraph/sourcegraph/lib/errors"
)

func TestTransaction(t *testing.T) {
-	db := dbtest.NewDB(t)
+	db := dbtest.NewRawDB(t)
unrelated, but we don't need a db with a frontend schema in these tests
}

func (t *lockingTx) lock() error {
	if !t.mu.TryLock() {
woah, a valid use of the new TryLock()!
Have to admit I'm a bit skeptical of using TryLock() followed by Lock() in the same function (and there are caveats around using TryLock). But I also don't see another way around it that doesn't end up doing the same thing under the hood (semaphore, atomic compare-and-swap, etc.).
So...Russ Cox is totally correct, including in this case. It is always incorrect to use a transaction concurrently, including after this PR.
However, my concerns are slightly different than Russ's. I want to do anything I can to prevent a panic in production, which might include implementing some imprecise logic that might avoid a handful of panics.
That said, the more important thing here IMO is that we report on incorrect usage, which this does with basically the same consistency as any race detector. This is what TryLock allows us to do: report when the invariant that a transaction should never be used concurrently is violated. A plain Lock cannot do this.
Sure. I was just wondering whether we need the TryLock, for example, or could use a one-word value instead?
type lockingTx struct {
	tx     *sql.Tx
	mu     sync.Mutex
	inUse  bool
	logger log.Logger
}

func (t *lockingTx) lock() {
	if t.inUse {
		// For now, log an error, but try to serialize access anyways to try to
		// keep things slightly safer.
		err := errors.WithStack(ErrConcurrentTransactionAccess)
		t.logger.Error("transaction used concurrently", log.Error(err))
	}
	t.mu.Lock()
	t.inUse = true
}

func (t *lockingTx) unlock() {
	t.inUse = false
	t.mu.Unlock()
}
(Edit: ... and that's what I meant with "it all ends up being the same" 😄 because this is not different than what TryLock does somewhere under the hood, but I guess TryLock is now the officially supported version, since I'm not even 100% sure 1-word vars are safe to read concurrently like that.)
It's not safe to read inUse outside the mutex though, right?
@camdencheek I think you meant to ping code intel above?
Whoops, yep, thanks @coury-clark. Does editing a comment actually ping? Pinging @sourcegraph/code-intel because I don't know. (sorry for the noise Insights team)
In order to make this PR mergeable without blocking on Code Intel folks (who I think are all at an offsite), I've updated this PR to instead just log an error and attempt to synchronize access (best effort, will not work in many cases). This way, we will at least get Sentry errors and log messages that will notify us of other places this is happening.
Codenotify: Notifying subscribers in CODENOTIFY files for diff ee784c5...61b0a38.
	}
	defer func() { err = tx.Done(err) }()

	return tx.Exec(ctx, sqlf.Sprintf(`INSERT INTO store_counts_test VALUES (%s, 42)`, i))
Couldn't we just have the first one sleep for 5 seconds, and fail on the second start? So we don't need 100 goroutines 😂
Ha, that's a much better idea!
Co-authored-by: Joe Chen <[email protected]>
…ntly (#37172) This updates our transaction wrapper to return an explicit error whenever a transaction is used concurrently. Concurrent transaction access is a form of race condition that causes (in my experiments) either a conn busy error, a bad connection error, or a panic. So, instead of getting these errors that are very difficult to debug, this logs an error with the stack trace that actually describes what's wrong. These errors should get pushed to Sentry, which will allow us to track where this is happening.
This updates our transaction wrapper to return an explicit error whenever a transaction is used concurrently. Concurrent transaction access is a form of race condition that causes (in my experiments) either a conn busy error, a bad connection error, or a panic. So, instead of getting these errors that are very difficult to debug, this PR logs an error with the stack trace that actually describes what's wrong. These errors should get pushed to Sentry, which will allow us to track where this is happening.

Stacked on https://github.com/sourcegraph/sourcegraph/pull/37167
Test plan
Added DB tests that exercise the error return.