Missing delete event on watch opened on same revision as compaction request #19179
List of PRs that were merged that day: https://github.com/etcd-io/etcd/pulls?q=is%3Apr+is%3Aclosed+merged%3A2025-01-09
Both 409 and 410 are deletion (tombstone) revisions, and 410 was compacted (see log below), so it's expected that the first watched event was 410 instead of 409. Probably a test issue?
I don't follow. Compaction is expected to be exclusive (compaction on rev 409 should preserve the event with rev 409), while watch is expected to be inclusive (watch from rev 409 should include the event on rev 409). This can be validated by running the following commands.
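Independently of those commands, the two bounds can be illustrated with a toy model. This is only a sketch of the contract as stated in this comment: `toyStore`, `event`, `compact`, `watchFrom`, and `errCompacted` are hypothetical names, not etcd's mvcc API, and the model ignores that real compaction also retains the latest version of each live key.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a simplified stand-in for a watch event (hypothetical type).
type event struct {
	rev     int64
	key     string
	deleted bool // true for a tombstone (delete) event
}

// errCompacted mimics the rejection a watcher receives when it asks for
// a revision that is older than the compaction revision.
var errCompacted = errors.New("required revision has been compacted")

// toyStore is a hypothetical in-memory model, not etcd's mvcc store.
type toyStore struct {
	events     []event // ordered by ascending revision
	compactRev int64
}

// compact drops events strictly below rev: the bound is exclusive,
// so the event at rev itself is preserved.
func (s *toyStore) compact(rev int64) {
	var kept []event
	for _, e := range s.events {
		if e.rev >= rev {
			kept = append(kept, e)
		}
	}
	s.events = kept
	s.compactRev = rev
}

// watchFrom returns every event with revision >= rev: the bound is
// inclusive. A start revision below compactRev is rejected.
func (s *toyStore) watchFrom(rev int64) ([]event, error) {
	if rev < s.compactRev {
		return nil, errCompacted
	}
	var out []event
	for _, e := range s.events {
		if e.rev >= rev {
			out = append(out, e)
		}
	}
	return out, nil
}

func main() {
	s := &toyStore{events: []event{
		{rev: 408, key: "a"},
		{rev: 409, key: "a", deleted: true},
		{rev: 410, key: "b", deleted: true},
	}}
	s.compact(409)

	evs, err := s.watchFrom(409)
	fmt.Println("watch from 409:", evs, err)

	_, err = s.watchFrom(408)
	fmt.Println("watch from 408:", err)
}
```

Under this model, a watch opened on the same revision as the compaction request still starts with the delete event at rev 409, while a watch opened one revision earlier is rejected.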
I was saying actually the compacted revision was 410 instead of 409. So it's expected behaviour that the first watched event was 410. Somehow robustness test was expecting to see 409. To clarify, I got the log from the first link (pasted below) you provided.
Yes, I am aware of it. It's exactly what we fixed in #18274. See also my previous summary #18089 (comment)
Trying to reproduce this case https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-amd64/1877585036438409216 ("--experimental-compaction-batch-limit=300")
The first compaction batch will delete the created revision, and the second round will trigger a panic.
etcd/server/storage/mvcc/kvstore.go Line 476 in ebb2b06
I think I reproduced it.
cc @ahrtr @serathius I think we should update the logic at restore:
etcd/server/storage/mvcc/kvstore.go Line 476 in ebb2b06
Awesome finding @fuweid, any guess why the issue started showing only recently (9 Jan) and was not visible before? It would be good to know in order to improve the robustness test.
Right, but the fact that etcd didn't reject the watch on the compacted revision means that at that time this revision was not yet compacted.
It's a valid point. Thanks for the catch! I think it's an edge case that we missed in #18274. The fix is easy, but it might take some time to add tests to prevent any potential impact and to fix any broken test cases. This isn't a regression, and it's unlikely to happen in K8s (see #18089 (comment)). So I suggest that we don't block the release of 3.5.18. I see another related potential issue. When etcdserver gets started, it will finish the previous uncompleted compaction, but it's executed asynchronously. In theory, there is a very small window that the
etcd/server/storage/mvcc/kvstore.go Line 410 in ebb2b06
Signed-off-by: Marek Siarkowicz <[email protected]>
Signed-off-by: Wei Fu <[email protected]>
CHANGELOG: update backport info for #19179
All done. thx
Bug report criteria
What happened?
Starting from 9 January we started getting failures on presubmit tests.
Presubmit history goes back to December 31, with failures only starting on 9 January, implying the issue is new.
Failures are due to the resumable guarantee being broken.
From the history visualizations I have seen, it follows this pattern:
What did you expect to happen?
Resumable guarantee should not be broken.
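A minimal sketch of what this guarantee implies, assuming it means that a watch resumed from revision R must deliver, as its first event, the earliest event with revision >= R. `checkResumable` is a hypothetical helper written for illustration, not the robustness test's actual validation code, and it assumes `history` is sorted by ascending revision.

```go
package main

import "fmt"

// checkResumable reports an error when a watch resumed at startRev did
// not begin with the first event that the full history contains at or
// after startRev. Revisions stand in for whole events here.
func checkResumable(history []int64, startRev int64, observed []int64) error {
	for _, rev := range history {
		if rev < startRev {
			continue
		}
		// rev is the first event the watch should have delivered.
		if len(observed) == 0 {
			return fmt.Errorf("watch from rev %d observed nothing, want first event at rev %d", startRev, rev)
		}
		if observed[0] != rev {
			return fmt.Errorf("watch from rev %d started at rev %d, want rev %d", startRev, observed[0], rev)
		}
		return nil
	}
	// No event at or after startRev exists, so nothing can be missing.
	return nil
}

func main() {
	history := []int64{408, 409, 410}

	// The watch skipped the event at rev 409: guarantee broken.
	fmt.Println(checkResumable(history, 409, []int64{410}))

	// The watch started with the event at rev 409: guarantee holds.
	fmt.Println(checkResumable(history, 409, []int64{409, 410}))
}
```

The first call corresponds to the failure discussed in the comments, where a watch opened at rev 409 saw 410 as its first event.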
How can we reproduce it (as minimally and precisely as possible)?
I haven't yet managed to reproduce it locally.
Anything else we need to know?
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-amd64/1877585036438409216
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877364764741472256
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877466683589791744
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877575502907052032
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877585037260492800
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877678423757819904
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877842586459181056
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878101264374435840
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878113522366287872
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878196741560340480
Etcd version (please run commands below)
I was not able to reproduce the issue outside of CI, so I haven't confirmed other versions
Etcd configuration (command line flags or environment variables)
N/A
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
N/A
Relevant log output
No response