-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[3.5] etcdserver: call the OnPreCommitUnsafe in unsafeCommit #14733
Conversation
`unsafeCommit` is called by both `(*batchTxBuffered) commit` and `(*backend) defrag`. When users perform the defragmentation operation, etcd doesn't update the consistent index. If etcd crashes(e.g. panicking) in the process for whatever reason, then etcd replays the WAL entries starting from the latest snapshot, accordingly it may re-apply entries which might have already been applied, eventually the revision isn't consistent with other members. Refer to discussion in etcd-io#14685 Signed-off-by: Benjamin Wang <[email protected]>
This is pretty low level change, how sure we are that no other things are affected by this change? I think we should validate it before we merge the PR. My suggestion, we should add more gofailpoints into the code and use linearizability tests to validate the fix. @ahrtr Can you propose more failpoint location that would thoroughly test backend transation commit? |
Sure. Will think about it in separate discussion session. |
This is a simple & safe change to me. I do not see any risk. Which part you do not have confidence? There might be other issues which could be discovered by other failpoints, but this fix is straightforward and safe to me. We already two important fixes, this one and the auth one. I think we need to release 3.5.6 soon. Of course, if you want to do more test, It's OK to me. But we shouldn't wait until all failpoints test are done. |
This is what worrying to me. It means that our rudiments were wrong, the fundamentals that we built a large part of code. This means that it might have far reaching impact as the change influences a large part of the codebase. I just want us to be careful and consider what this change could be also impacting. Thus the idea to add more failpoints. |
I do not get your point. What's rudiments were wrong? This is just an corner case we missed previously. |
I mean that we should take similar steps as we did in #13885. That time we introduced validation by checking stacktrace during runtime. I would like to double check if we can roll out similar validation. For this issue I would like to propose adding more gofailpoints in processes touching backend like defrag. |
I agree that we can think about adding more failpoints, but it should be discussed and tracked in separate session (e.g. #14735 ). So all discussion in this PR should be focusing on what's the impact this PR might cause.
The unsafeCommit is only called in two places (see below). I just move the code of calling preCommitHood from
I am afraid there is no similar validation this time. The defragment failpoints successfully discovered this issue, and it's good. But anyway, I added two other related failpoints in #14746. |
Backport #14730 to 3.5.
unsafeCommit
is called by both(*batchTxBuffered) commit
and(*backend) defrag
. When users perform the defragmentation operation, etcd doesn't update the consistent index. If etcd crashes(e.g. panicking) in the process for whatever reason, then etcd replays the WAL entries starting from the latest snapshot, accordingly it may re-apply entries which might have already been applied, eventually the revision isn't consistent with other members.Refer to discussion in #14685
Signed-off-by: Benjamin Wang [email protected]
cc @mitake @ptabor @serathius @spzala