-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add semi-sync monitor to unblock primaries blocked on semi-sync ACKs #17763
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Manan Gupta <[email protected]>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17763 +/- ##
==========================================
+ Coverage 67.41% 67.48% +0.06%
==========================================
Files 1592 1593 +1
Lines 258024 258473 +449
==========================================
+ Hits 173948 174424 +476
+ Misses 84076 84049 -27 ☔ View full report in Codecov by Sentry. |
Signed-off-by: Manan Gupta <[email protected]>
func (w *Watcher) stillBlocked() bool { | ||
w.mu.Lock() | ||
defer w.mu.Unlock() | ||
return w.isOpen && w.isBlocked | ||
} | ||
|
||
// checkAndSetIsWriting checks if the watcher is already writing to the DB. | ||
// If it is not, then it sets the isWriting field and signals the caller. | ||
func (w *Watcher) checkAndSetIsWriting() bool { | ||
w.mu.Lock() | ||
defer w.mu.Unlock() | ||
if w.isWriting { | ||
return false | ||
} | ||
w.isWriting = true | ||
return true | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this affects the query hot path, right? If it does, then it might be worth e.g. using 1 byte for the status and using bits in there for isWriting, isBlocked, isOpen etc. so that we can use atomics for reading them, CAS for optional changes, etc. If nothing else, it's probably worth moving these to atomic.Bool so that e.g. checkAndSetIsWriting can be one atomic call:
return w.isWriting(false, true)
It makes the code simpler, clearer, and less contentious / efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think performance is too much of a concern, but the usage of having multiple bool fields behind a mutex vs atomic.Bool I think becomes a matter of preference. I for one, like to have the former because that means that only one boolean value transitions at a point in time, but with atomic bool values it can change even when you've just read that value.
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
…writes Signed-off-by: Manan Gupta <[email protected]>
…c is blocked on the primary Signed-off-by: Manan Gupta <[email protected]>
…d for too long and a add a few logs Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
…ger to only start it when required Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
…ake tablet Signed-off-by: Manan Gupta <[email protected]>
… is blocked Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good! Just had some questions/comments. Let me know what you think.
// be called multiple times in parallel. | ||
func (m *Monitor) checkAndFixSemiSyncBlocked() { | ||
// Check if semi-sync is blocked or not | ||
ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we hardcoding the timeout and not using RemoteOperationTimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use RemoteOperationTimeout here? Its not a remote operation. Its just talking to MySQL. I initially had no timeout whatsoever. Matt suggested I add one, so I added an abritrary one.
if !m.appPool.IsOpen() { | ||
m.appPool.Open(m.config.DB.AppWithDB()) | ||
} | ||
m.clearTicks.Start(m.clearAllData) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if we are in the middle of fixing a blocked semi-sync and this timer goes off?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The write to clear the table would block if there is an outstanding write to the table in progress, and eventually it would succeed. There should be no problems with this triggering in the middle of fixing of semi-sync.
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
Signed-off-by: Manan Gupta <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I only had one minor suggestion. Nice work on this, @GuptaManan100 !
Signed-off-by: Manan Gupta <[email protected]>
Done, I've made that change too! 💕 |
Description
This PR introduces a new component to the vttablet binary to monitor the semi-sync status of primary vttablets. We've observed cases where a brief network disruption can cause the primary to get stuck indefinitely waiting for semi-sync ACKs. In rare scenarios, this can block reparent operations and render the primary unresponsive. More information can be found in the issues #17709 and #17749.
To address this, the new component continuously monitors the semi-sync status. If the primary becomes stuck on semi-sync ACKs, it generates writes to unblock it. If this fails, VTOrc is notified of the issue and initiates an emergency reparent operation.
A metric for the number of outstanding writes from the semi-sync monitor has also been added. Unfortunately, it's not easy to reproduce the problem in an end-to-end test since it requires setting port forward rules (on Mac) or iptable changes (on Linux), both of which require sudo access. So I've added a test but that test doesn't run on CI. It can only be run locally and the password for the root user has to entered when prompted for it.
The semi-sync monitor is used such that the users who aren't running semi-sync, will not see the monitor open at all. It will only run when semi-sync on the primary is turned on.
Related Issue(s)
Checklist
Deployment Notes