Merge pull request #4608 from wojtek-t/consistent_reads_mitigation

Consistent reads mitigation
kubernetes · May 9, 2024 · e2c97c0 · e2c97c0
2 parents ad0a1d5 + 9afa3e0
commit e2c97c0
Show file tree

Hide file tree

Showing 2 changed files with 106 additions and 29 deletions.
diff --git a/keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md b/keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md
@@ -49,6 +49,7 @@ read from etcd.
   - [Troubleshooting](#troubleshooting)
 - [Implementation History](#implementation-history)
 - [Alternatives](#alternatives)
+  - [Per-request override](#per-request-override)
 <!-- /toc -->
 
 ## Summary
@@ -253,36 +254,94 @@ Since falling back to etcd won't work, we should fail the requests and rely on
 rate limiting to prevent cascading failure.  I.e. `Retry-After` HTTP header (for
 well-behaved clients) and [Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).
 
-For such situations we will provide users with following tools:
+In order to mitigate such problems, let's present how the system currently works
+in different cases. In addition to that, we add column indicating whether a given
+case will change how watchcache implementation will be handling the request.
+
+| ResourceVersion | ResourceVersionMatch | Continuation      | Limit         | etcd implementation                     | watchcache implementation                          | changed  |
+|-----------------|----------------------|-------------------|---------------|-----------------------------------------|----------------------------------------------------|----------|
+| _unset_         | _unset_              | _unset_           | _unset_ / _N_ | Quorum read request                     | Delegated to etcd                                  | Yes      |
+| _unset_         | _unset_              | _token_           | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd                                  |          |
+| _unset_         | _Exact_              | _unset_ / _token_ | _unset_ / _N_ | Fails [validation]                      | Fails [validation]                                 |          |
+| _unset_         | _NotOlderThan_       | _unset_           | _unset_ / _N_ | Quorum read request                     | Delegated to etcd                                  | Yes      |
+| _unset_         | _NotOlderThan_       | _token_           | _unset_ / _N_ | Fails [validation]                      | Fails [validation]                                 |          |
+| _0_             | _unset_              | _unset_           | _unset_ / _N_ | Quorum read request                     | List from cache ignoring _limit_                   |          |
+| _0_             | _unset_              | _token_           | _unset_ / _N_ | Quorum read request                     | Delegated to etcd                                  |          |
+| _0_             | _Exact_              | _unset_ / _token_ | _unset_ / _N_ | Fails [validation]                      | Fails [validation]                                 |          |
+| _0_             | _NotOlderThan_       | _unset_           | _unset_ / _N_ | Quorum read request                     | List from cache ignoring _limit_                   |          |
+| _0_             | _NotOlderThan_       | _token_           | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd                                  |          |
+| _RV_            | _unset_              | _unset_           | _unset_       | Quorum read request                     | Wait for cache synced to _RV_+ and list from cache |          |
+| _RV_            | _unset_              | _unset_           | _N_           | Read request from RV=_RV_               | Delegated to etcd                                  |          |
+| _RV_            | _unset_              | _token_           | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd                                  | Deferred |
+| _RV_            | _Exact_              | _unset_           | _unset_ / _N_ | Read request from RV=_RV_               | Delegated to etcd                                  |          |
+| _RV_            | _Exact_              | _token_           | _unset_ / _N_ | Fails [validation]                      | Fails [validation]                                 |          |
+| _RV_            | _NotOlderThan_       | _unset_           | _unset_       | Quorum read request + check for _RV_    | Wait for cache synced to _RV_+ and list from cache |          |
+| _RV_            | _NotOlderThan_       | _unset_           | _N_           | Quorum read request + check for _RV_    | Delegated to etcd                                  | Deferred |
+| _RV_            | _NotOlderThan_       | _token_           | _unset_/ _N_  | Fails [validation]                      | Fails [validation]                                 |          |
+
+For watch requests both `Continuation` and `Limit` parameters are ignored (we should
+have added validation rules for them in the past), but we have `SendInitialEvents` one.
+The table for watch requests look like the following
+
+| ResourceVersion | ResourceVersionMatch | SendInitialEvents      | etcd implementation                            | watchcache implementation               | changed  |
+|-----------------|----------------------|------------------------|------------------------------------------------|-----------------------------------------|----------|
+| _unset_         | _unset_              | _unset_                | Quorum list + watch stream                     | Delegate to etcd                        | Deferred |
+| _unset_         | _unset_              | false / true           | Fails [validation]                             | Fails [validation]                      |          |
+| _unset_         | _NotOlderThan_       | _unset_                | Fails [validation]                             | Fails [validation]                      |          |
+| _unset_         | _NotOlderThan_       | false                  | Watch stream from etcd RV                      | Read etcd RV. Watch stream from it      |          |
+| _unset_         | _NotOlderThan_       | true                   | Quorum list + watch stream                     | Wait RV > etcd RV. List + watch stream  |          |
+| _unset_         | _Exact_              | _unset_ / false / true | Fails [validation]                             | Fails [validation]                      |          |
+| _0_             | _unset_              | _unset_                | Quorum list + watch stream                     | List + watch stream                     |          |
+| _0_             | _unset_              | false / true           | Fails [validation]                             | Fails [validation]                      |          |
+| _0_             | _NotOlderThan_       | _unset_                | Fails [validation]                             | Fails [validation]                      |          |
+| _0_             | _NotOlderThan_       | false                  | Watch stream from etcd RV                      | Watch stream from current watchcache RV |          |
+| _0_             | _NotOlderThan_       | true                   | Quorum list + watch stream                     | List + watch stream                     |          |
+| _0_             | _Exact_              | _unset_ / false / true | Fails [validation]                             | Fails [validation]                      |          |
+| _RV_            | _unset_              | _unset_                | Watch stream from RV                           | Watch stream from RV                    |          |
+| _RV_            | _unset_              | false / true           | Fails [validation]                             | Fails [validation]                      |          |
+| _RV_            | _NotOlderThan_       | _unset_                | Fails [validation]                             | Fails [validation]                      |          |
+| _RV_            | _NotOlderThan_       | false                  | Check RV > etcd RV. Watch stream from RV       | Watch stream from RV                    |          |
+| _RV_            | _NotOlderThan_       | true                   | Check RV > etcd RV. Quorum list + watch stream | Wait for RV. List + watch stream        |          |
+| _RV_            | _Exact_              | _unset_ / false / true | Fails [validation]                             | Fails [validation]                      |          |
+
+[validation]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/validation/validation.go#L28
+[etcd resolution]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L589-L627
+
+As presented in the above tables, the semantics for a given request server from
+etcd and watchcache is a little bit different. It's a consequence of the fact that:
+* etcd design supports only `Exact` semantics - it allows for consistent list
+  from a given resource version (either specific value or "now").
+  The semantics of `NotOlderThan` is implemented as getting consistent list from
+  "now" and checking if it satisfies the condition.
+* watchcache design supports only `NotOlderThan` semantics - it always waits
+  until its resource version is at least as fresh as requested resource version
+  and then returns the result from its current state
+
+For the above reason, sending the same request to etcd and watchcache, especially
+when cluster state is changing, may legitimately return different results.
+
+In order to allow debugging results returned from watchcache in a runnning cluster,
+the only reasonable procedure is:
+* send a request that is served from watchcache
+* send a request setting `ResourceVersionMatch=Exact` and `ResourceVersions` to value
+  returned from the request returned in a previous point
+* compare the two results
+
+The existing API already allows us to achieve it.
+
+To further allow debugging and improve confidence we will provide users with the
+following tools:
 * a dedicated `apiserver_watch_cache_read_wait` metric to detect a problem with
   watch cache.
-* a per-request override to disable watch cache to allow debugging.
+* a `inconsistency detector` that for requests served from watchcache will be able
+  to send a request to etcd (as described above) and compare the results
 
 Metric `apiserver_watch_cache_read_wait` will measure wait time experienced by 
 reads for watch cache to become fresh. If user notices a latency request in
 they can use this metric to confirm that the issue is caused by watch cache.
 
-Per request override should allow user to compare request results without
-impacting other requests or requiring to redeploy whole cluster. The exact
-details of override API will be clarified during API review. In healthy
-situation, using this override should not cause any impact on the response,
-however it might increase resource usage. In our tests cpu load could increase
-tenfold. To prevent abuse access to it should be limited to users with
-`cluster-admin` role, rejecting the request otherwise.
-
-In case of issues with watch cache users can use the `ConsistentListFromCache`
-feature flag to disable the feature or the existing `--watch-cache` flag to
-disable the whole watch cache.
-
-We prefer to provide users an explicit flag and per-request override over an
-automatic fallback. It gives users full control and visibility into how request
-are handled and ensures accurate APF cost estimates. We expect watch being
-starved to happen very rarely, meaning its logic needs to be very simple to
-ensure it works properly. A simple fallback will not bring much benefit over
-what user can do manually. It will just make the harder to understand and
-predict behavior. APF estimates cost just based on request parameters,
-before it is passed to storage. If fallback was based on state of watch cache,
-cost of request would change after the APF decision increasing the risk of overload.
+The `inconsistency detector` will get enabled in our CI to detect issues with
+the introduced mechanism.
 
 ## Design Details
 
@@ -379,7 +438,7 @@ Comparing resource usage and latency with and without consistent list from watch
 
 - Feature is enabled by default.
 - Metric `apiserver_watch_cache_read_wait` is implemented.
-- Per-request watch cache opt-out is implemented.
+- Inconsistency detector is implemented and enabled in CI
 - Deprecate support of etcd v3.3.X, v3.4.24 and v3.5.7
 
 #### GA
@@ -529,7 +588,7 @@ Use per-request override to compare latency when reading from watch cache vs etc
 ## Implementation History
 
 * 1.28 - Alpha
-* 1.30 - Beta
+* 1.31 - Beta
 
 ## Alternatives
 
@@ -547,4 +606,22 @@ Allow clients to manage the initial resource version they provide to reflectors,
 Do a dynamic fallback based on watch cache wait time.
 
 - We expect watch being starved to happen very rarely, meaning its logic needs to be very simple to ensure it works properly.
-- Simple fallback will rather not do a better job then just a manual fallback.
+- Simple fallback will rather not do a better job then just a manual fallback.
+
+### Per-request override
+
+To enable debugging, we considered introducing per-request override to disable
+watchcache to force the request to be served from etcd. This would allow us
+to compare request results without impacting other requests or requiring to
+redeploy the whole cluster. However, as described in the KEP itself, the results
+of the same requests served from watchcache and etcd may legitimately return
+different results. As a result, the proposed debugging mechanism was decided
+to better serve its purpose.
+
+We also considered automatic fallback. However, we expect watch being
+starved to happen very rarely, meaning its logic needs to be very simple to
+ensure it works properly. A simple fallback will not bring much benefit over
+what user can do manually. It will just make the harder to understand and
+predict behavior. APF estimates cost just based on request parameters,
+before it is passed to storage. If fallback was based on state of watch cache,
+cost of request would change after the APF decision increasing the risk of overload.
diff --git a/keps/sig-api-machinery/2340-Consistent-reads-from-cache/kep.yaml b/keps/sig-api-machinery/2340-Consistent-reads-from-cache/kep.yaml
@@ -16,17 +16,17 @@ approvers:
   - "@wojtek-t"
 editor: TBD
 creation-date: 2019-12-10
-last-updated: 2023-06-15
+last-updated: 2024-05-09
 status: implementable
 see-also:
   -  "/keps/sig-api-machinery/3157-watch-list"
 replaces:
 superseded-by:
 stage: beta
-latest-milestone: "v1.30"
+latest-milestone: "v1.31"
 milestone:
   alpha: "v1.28"
-  beta: "v1.30"
+  beta: "v1.31"
 feature-gates:
 - name: ConsistentListFromCache
   components: