Skip to content

Commit

Permalink
Merge pull request #4608 from wojtek-t/consistent_reads_mitigation
Browse files Browse the repository at this point in the history
Consistent reads mitigation
  • Loading branch information
k8s-ci-robot authored May 9, 2024
2 parents ad0a1d5 + 9afa3e0 commit e2c97c0
Show file tree
Hide file tree
Showing 2 changed files with 106 additions and 29 deletions.
129 changes: 103 additions & 26 deletions keps/sig-api-machinery/2340-Consistent-reads-from-cache/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -49,6 +49,7 @@ read from etcd.
- [Troubleshooting](#troubleshooting)
- [Implementation History](#implementation-history)
- [Alternatives](#alternatives)
- [Per-request override](#per-request-override)
<!-- /toc -->

## Summary
Expand Down Expand Up @@ -253,36 +254,94 @@ Since falling back to etcd won't work, we should fail the requests and rely on
rate limiting to prevent cascading failure. I.e. `Retry-After` HTTP header (for
well-behaved clients) and [Priority and Fairness](https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/20190228-priority-and-fairness.md).

For such situations we will provide users with following tools:
In order to mitigate such problems, let's present how the system currently works
in different cases. In addition to that, we add column indicating whether a given
case will change how watchcache implementation will be handling the request.

| ResourceVersion | ResourceVersionMatch | Continuation | Limit | etcd implementation | watchcache implementation | changed |
|-----------------|----------------------|-------------------|---------------|-----------------------------------------|----------------------------------------------------|----------|
| _unset_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
| _unset_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
| _unset_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
| _unset_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | Yes |
| _unset_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
| _0_ | _unset_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
| _0_ | _unset_ | _token_ | _unset_ / _N_ | Quorum read request | Delegated to etcd | |
| _0_ | _Exact_ | _unset_ / _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
| _0_ | _NotOlderThan_ | _unset_ | _unset_ / _N_ | Quorum read request | List from cache ignoring _limit_ | |
| _0_ | _NotOlderThan_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | |
| _RV_ | _unset_ | _unset_ | _unset_ | Quorum read request | Wait for cache synced to _RV_+ and list from cache | |
| _RV_ | _unset_ | _unset_ | _N_ | Read request from RV=_RV_ | Delegated to etcd | |
| _RV_ | _unset_ | _token_ | _unset_ / _N_ | Read request from RV encoded in _token_ | Delegated to etcd | Deferred |
| _RV_ | _Exact_ | _unset_ | _unset_ / _N_ | Read request from RV=_RV_ | Delegated to etcd | |
| _RV_ | _Exact_ | _token_ | _unset_ / _N_ | Fails [validation] | Fails [validation] | |
| _RV_ | _NotOlderThan_ | _unset_ | _unset_ | Quorum read request + check for _RV_ | Wait for cache synced to _RV_+ and list from cache | |
| _RV_ | _NotOlderThan_ | _unset_ | _N_ | Quorum read request + check for _RV_ | Delegated to etcd | Deferred |
| _RV_ | _NotOlderThan_ | _token_ | _unset_/ _N_ | Fails [validation] | Fails [validation] | |

For watch requests both `Continuation` and `Limit` parameters are ignored (we should
have added validation rules for them in the past), but we have `SendInitialEvents` one.
The table for watch requests look like the following

| ResourceVersion | ResourceVersionMatch | SendInitialEvents | etcd implementation | watchcache implementation | changed |
|-----------------|----------------------|------------------------|------------------------------------------------|-----------------------------------------|----------|
| _unset_ | _unset_ | _unset_ | Quorum list + watch stream | Delegate to etcd | Deferred |
| _unset_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
| _unset_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
| _unset_ | _NotOlderThan_ | false | Watch stream from etcd RV | Read etcd RV. Watch stream from it | |
| _unset_ | _NotOlderThan_ | true | Quorum list + watch stream | Wait RV > etcd RV. List + watch stream | |
| _unset_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
| _0_ | _unset_ | _unset_ | Quorum list + watch stream | List + watch stream | |
| _0_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
| _0_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
| _0_ | _NotOlderThan_ | false | Watch stream from etcd RV | Watch stream from current watchcache RV | |
| _0_ | _NotOlderThan_ | true | Quorum list + watch stream | List + watch stream | |
| _0_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |
| _RV_ | _unset_ | _unset_ | Watch stream from RV | Watch stream from RV | |
| _RV_ | _unset_ | false / true | Fails [validation] | Fails [validation] | |
| _RV_ | _NotOlderThan_ | _unset_ | Fails [validation] | Fails [validation] | |
| _RV_ | _NotOlderThan_ | false | Check RV > etcd RV. Watch stream from RV | Watch stream from RV | |
| _RV_ | _NotOlderThan_ | true | Check RV > etcd RV. Quorum list + watch stream | Wait for RV. List + watch stream | |
| _RV_ | _Exact_ | _unset_ / false / true | Fails [validation] | Fails [validation] | |

[validation]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apimachinery/pkg/apis/meta/internalversion/validation/validation.go#L28
[etcd resolution]: https://github.com/kubernetes/kubernetes/blob/release-1.30/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L589-L627

As presented in the above tables, the semantics for a given request server from
etcd and watchcache is a little bit different. It's a consequence of the fact that:
* etcd design supports only `Exact` semantics - it allows for consistent list
from a given resource version (either specific value or "now").
The semantics of `NotOlderThan` is implemented as getting consistent list from
"now" and checking if it satisfies the condition.
* watchcache design supports only `NotOlderThan` semantics - it always waits
until its resource version is at least as fresh as requested resource version
and then returns the result from its current state

For the above reason, sending the same request to etcd and watchcache, especially
when cluster state is changing, may legitimately return different results.

In order to allow debugging results returned from watchcache in a runnning cluster,
the only reasonable procedure is:
* send a request that is served from watchcache
* send a request setting `ResourceVersionMatch=Exact` and `ResourceVersions` to value
returned from the request returned in a previous point
* compare the two results

The existing API already allows us to achieve it.

To further allow debugging and improve confidence we will provide users with the
following tools:
* a dedicated `apiserver_watch_cache_read_wait` metric to detect a problem with
watch cache.
* a per-request override to disable watch cache to allow debugging.
* a `inconsistency detector` that for requests served from watchcache will be able
to send a request to etcd (as described above) and compare the results

Metric `apiserver_watch_cache_read_wait` will measure wait time experienced by
reads for watch cache to become fresh. If user notices a latency request in
they can use this metric to confirm that the issue is caused by watch cache.

Per request override should allow user to compare request results without
impacting other requests or requiring to redeploy whole cluster. The exact
details of override API will be clarified during API review. In healthy
situation, using this override should not cause any impact on the response,
however it might increase resource usage. In our tests cpu load could increase
tenfold. To prevent abuse access to it should be limited to users with
`cluster-admin` role, rejecting the request otherwise.

In case of issues with watch cache users can use the `ConsistentListFromCache`
feature flag to disable the feature or the existing `--watch-cache` flag to
disable the whole watch cache.

We prefer to provide users an explicit flag and per-request override over an
automatic fallback. It gives users full control and visibility into how request
are handled and ensures accurate APF cost estimates. We expect watch being
starved to happen very rarely, meaning its logic needs to be very simple to
ensure it works properly. A simple fallback will not bring much benefit over
what user can do manually. It will just make the harder to understand and
predict behavior. APF estimates cost just based on request parameters,
before it is passed to storage. If fallback was based on state of watch cache,
cost of request would change after the APF decision increasing the risk of overload.
The `inconsistency detector` will get enabled in our CI to detect issues with
the introduced mechanism.

## Design Details

Expand Down Expand Up @@ -379,7 +438,7 @@ Comparing resource usage and latency with and without consistent list from watch

- Feature is enabled by default.
- Metric `apiserver_watch_cache_read_wait` is implemented.
- Per-request watch cache opt-out is implemented.
- Inconsistency detector is implemented and enabled in CI
- Deprecate support of etcd v3.3.X, v3.4.24 and v3.5.7

#### GA
Expand Down Expand Up @@ -529,7 +588,7 @@ Use per-request override to compare latency when reading from watch cache vs etc
## Implementation History

* 1.28 - Alpha
* 1.30 - Beta
* 1.31 - Beta

## Alternatives

Expand All @@ -547,4 +606,22 @@ Allow clients to manage the initial resource version they provide to reflectors,
Do a dynamic fallback based on watch cache wait time.

- We expect watch being starved to happen very rarely, meaning its logic needs to be very simple to ensure it works properly.
- Simple fallback will rather not do a better job then just a manual fallback.
- Simple fallback will rather not do a better job then just a manual fallback.

### Per-request override

To enable debugging, we considered introducing per-request override to disable
watchcache to force the request to be served from etcd. This would allow us
to compare request results without impacting other requests or requiring to
redeploy the whole cluster. However, as described in the KEP itself, the results
of the same requests served from watchcache and etcd may legitimately return
different results. As a result, the proposed debugging mechanism was decided
to better serve its purpose.

We also considered automatic fallback. However, we expect watch being
starved to happen very rarely, meaning its logic needs to be very simple to
ensure it works properly. A simple fallback will not bring much benefit over
what user can do manually. It will just make the harder to understand and
predict behavior. APF estimates cost just based on request parameters,
before it is passed to storage. If fallback was based on state of watch cache,
cost of request would change after the APF decision increasing the risk of overload.
Original file line number Diff line number Diff line change
Expand Up @@ -16,17 +16,17 @@ approvers:
- "@wojtek-t"
editor: TBD
creation-date: 2019-12-10
last-updated: 2023-06-15
last-updated: 2024-05-09
status: implementable
see-also:
- "/keps/sig-api-machinery/3157-watch-list"
replaces:
superseded-by:
stage: beta
latest-milestone: "v1.30"
latest-milestone: "v1.31"
milestone:
alpha: "v1.28"
beta: "v1.30"
beta: "v1.31"
feature-gates:
- name: ConsistentListFromCache
components:
Expand Down

0 comments on commit e2c97c0

Please sign in to comment.