server/auth: invalidate range permission cache during recovering from snapshot #13920
Conversation
Codecov Report
```diff
@@            Coverage Diff             @@
##             main   #13920      +/-   ##
==========================================
- Coverage   72.71%   71.91%   -0.81%
==========================================
  Files         469      469
  Lines       38398    38414      +16
==========================================
- Hits        27923    27624     -299
- Misses       8710     9016     +306
- Partials     1765     1774       +9
```
Is this reproducible on isolated members with serializable requests? Could we maybe add an e2e test?

Yeah, let me add e2e test cases.
```diff
@@ -388,6 +388,8 @@ func (as *authStore) Recover(be AuthBackend) {
 		as.tokenProvider.enable()
 	}
 	as.enabledMu.Unlock()
+
+	as.rangePermCache = make(map[string]*unifiedRangePermissions)
```
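For orientation, here is a rough sketch of where the added lines land inside `authStore.Recover`. Only the lines visible in the hunk above are verbatim; the reload step and the condition guarding `enable()` are assumptions about the surrounding code, not a quote of it:

```go
// Sketch of authStore.Recover with the proposed change applied; parts not
// shown in the diff are assumptions and are marked as such.
func (as *authStore) Recover(be AuthBackend) {
	// ... (assumed) reload the auth-enabled flag and revision from the new backend ...

	as.enabledMu.Lock()
	if as.enabled { // (assumed condition around the visible enable() call)
		as.tokenProvider.enable()
	}
	as.enabledMu.Unlock()

	// The added lines: defensively drop every cached range permission so
	// entries computed before the snapshot was applied cannot be served.
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}
```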
Two comments:
- Suggest calling clearCachedPerm;
- There is a potential race condition: some requests (such as v3_server.go#L128 and watch.go#L235) coming from the API (outside of the apply workflow) may be accessing the rangePermCache concurrently. It seems that we need to add a lock to protect it.
+1. Good catch.
In general the existing locking strategy seems weak here:
- Access to this cache is protected by readTx.Lock(), but the readTx is owned by the backend that we are swapping here, so it seems theoretically possible that two transactions would concurrently access/modify the cache: one holding a tx on the old backend, the other on the new one. Closing of the old backend is fully asynchronous:

  etcd/server/etcdserver/server.go, lines 1000 to 1008 at 2e034d2:

  ```go
  go func() {
  	lg.Info("closing old backend file")
  	defer func() {
  		lg.Info("closed old backend file")
  	}()
  	if err := oldbe.Close(); err != nil {
  		lg.Panic("failed to close old backend", zap.Error(err))
  	}
  }()
  ```

- Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock.
> Thus it seems the cache should have its own lock instead of piggybacking on the transaction lock

Exactly! Please note that the cache can't be protected by readTx.Lock(), because a batchTx and a readTx can execute concurrently.
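For illustration, a minimal sketch of the dedicated-lock idea discussed above. The field and method names here (rangePermCacheMu, cachedPerm, invalidateCachedPerm) are assumptions made for this sketch, not necessarily the identifiers used in the follow-up PR:

```go
package auth

import "sync"

// unifiedRangePermissions stands in for the existing cache value type;
// it is stubbed here only so the sketch is self-contained.
type unifiedRangePermissions struct{ /* ... */ }

// Sketch: the cache gets its own RWMutex instead of piggybacking on a
// transaction lock that belongs to a backend being swapped out.
type authStore struct {
	// ... other fields elided ...
	rangePermCacheMu sync.RWMutex
	rangePermCache   map[string]*unifiedRangePermissions
}

// Read path (API-side permission checks) takes the read lock, so many
// readers can consult the cache concurrently.
func (as *authStore) cachedPerm(userName string) (*unifiedRangePermissions, bool) {
	as.rangePermCacheMu.RLock()
	defer as.rangePermCacheMu.RUnlock()
	p, ok := as.rangePermCache[userName]
	return p, ok
}

// Write path (Recover, permission grants/revokes) takes the write lock
// before swapping or mutating the map, closing the race described above.
func (as *authStore) invalidateCachedPerm() {
	as.rangePermCacheMu.Lock()
	defer as.rangePermCacheMu.Unlock()
	as.rangePermCache = make(map[string]*unifiedRangePermissions)
}
```

With a lock owned by the cache itself, its consistency no longer depends on which backend's transaction lock a caller happens to hold.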
Thanks for pointing that out! I think it's an independent issue; let me open a dedicated PR for it.
Opened an independent PR here: #13954. It would be great if you could review it.
#13954 was merged, so I'll resume this PR.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
@ahrtr Yes, I think we can close this PR. The most reliable way to cause this issue was a membership change, as shown in #14571.
Thanks @mitake |
Fix #13883

The above issue reports a problem that authStore.Recover() doesn't invalidate rangePermCache, so an etcd node which is isolated from its cluster might not invalidate its stale permission cache after the network partition is resolved. This PR fixes the issue by invalidating the cache in a defensive manner.

cc @ptabor
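To make the guaranteed behavior concrete, here is a hypothetical unit-test-style sketch. The helpers newTestAuthStore and warmRangePermCache are invented for illustration and are not part of the etcd test suite:

```go
package auth

import "testing"

// Hypothetical test sketch: after Recover, no stale cached permissions
// may survive. Helper functions here are assumptions, not real etcd APIs.
func TestRecoverInvalidatesRangePermCache(t *testing.T) {
	as, be := newTestAuthStore(t)   // assumed helper: returns store and its AuthBackend
	warmRangePermCache(as, "alice") // assumed helper: populates as.rangePermCache

	// Recover simulates applying a snapshot, e.g. after a partition heals.
	as.Recover(be)

	// The cache must now be empty, so the next permission check recomputes
	// entries from the recovered backend state instead of serving stale ones.
	if len(as.rangePermCache) != 0 {
		t.Fatal("expected rangePermCache to be invalidated by Recover")
	}
}
```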