-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add dependency filter #555
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the wrong approach. The approach should be three separate efforts to address the three completely different issues.
#528 is the result of overzealous client-side validation. As I mentioned, we should not be doing client-side validation. The solution is to rollback the overzealous validation.
#554 should be a simple fix with a better test to prove the fix.
For #526, it is not clear this is a bug as much as an optional feature. If we are solving this, we are not going to do it a couple days before a release without tests.
/hold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment, but I think this approach looks fine. As for the concerns raised by @seans3:
For #554 and #526: We need more tests and I think the PR also mentions that those will be added before this is merged. So I think we are all in agreement there.
I don't think the timing related to a release matters. We have to trust that the combination of code reviews and tests are sufficient to catch most bugs, and when they don't (which will happen from time to time), we have to add better tests. It is unclear to me if waiting would make it safer as we don't have any clients (to my knowledge) who are running unreleased code. We can leverage the tests in https://github.com/GoogleContainerTools/kpt to do an additional check before we make a release. I have done that in the past, but it not something we have automated.
For #526 specifically, I think that is a bug since it seems strange to handle dependencies, but also ignore them in the face of failures. Skipping them seems to align with the general approach of the library, where we continue with as much as possible when something fails, but also make it safe to retry to "make things right" after a failure.
I think we are all in agreement that we want to avoid as much client-side validation as possible. But in this case, we have always restricted dependencies to only include resources within the apply set. This just changes how we report the error.
// DependencyFilter implements ValidationFilter interface to determine if an | ||
// object can be applied or deleted based on the status of it's dependencies. | ||
type DependencyFilter struct { | ||
TaskContext *taskrunner.TaskContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not something we need to address in this PR, but it feels like the TaskContext should be passed in as a parameter to the Filter
function rather than being a property on the struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but was trying to limit the amount of unrelated changes. That can be done in a separate PR.
IMO we should not launch depends-on in Config Sync without this bug fixed, and it was probably a mistake to launch the feature in kpt with this bug in the first place. Prior to this PR, depends-on was only half implemented (sorted ordering, but no enforcement). |
For CRDs this is probably fine, because there is a service-side error that prevents deletion of CRDs with CRs in use. For namespaces tho, this is catastrophic, because namespace deletion causes garbage collection of everything else in the namespace. For all other explicit dependencies there is no server-side dependency enforcement. So the client-side enforcement cannot be rolled back without completely removing the dependency ordering feature. |
This is fair. I can extract this to another PR. |
308cf9b
to
01bf108
Compare
@seans3 Removing the validation for external dependencies won't fix this bug. It would stop the confusing validation error, but it wouldn't fix the behavior. It wouldn't fail server-side, because there's no server-side impl. And without the DependencyFilter, neither the apply nor the prune would be skipped. So the apply would go through and then its dependency would be deleted. And we would lose the validation error for otherwise legitimately external dependencies. Combining the graph for both applies and prunes allows for detection of implicit dependencies (CRDs & Namespaces) which would have otherwise been missed. And for explicit dependencies (depends-on & apply-time-mutation), it detects them as "internal" dependencies instead of "external". Then both implicit and explicit deps can be checked by the DependencyFilter. And to keep the behavior otherwise the same, the graph only needs to be sorted once and then filtered down to include either the apply objects or prune objects. I was planning to add more unit tests, but the e2e tests and applier/destroyer tests already show that the existing behavior is otherwise preserved, and I wanted to make sure the approach was otherwise acceptable before writing more tests. |
01bf108
to
84d8cdf
Compare
/hold |
- Pass TaskContext into TaskBuilder.Build - Combine dependency graph for apply and prune objects. This is required to catch dependencies that would have been deleted. - Replace graph.SortObjs into DependencyGraph + Sort + HydrateSetList - Replace graph.ReverseSortObjs with ReverseSetList to perform on the combined (apply + prune) set list. - Add planned pending applies and prune to the InventoryManager before executing the task queue. This allows the DependencyFilter to validate against the planned actuation strategy of objects that haven't been applied/pruned yet. - Add the dependency graph to the TaskContext, for the DependencyFilter to use. This can be removed in the future if the filters are managed by the solver. - Make Graph.Sort non-destructive, so the graph can be re-used by the DependencyFilter. - Add Graph.EdgesFrom and EdgesTo for the DependencyFilter to use. This requires storing the reverse edge list. - Add an e2e test for the DependencyFilter - Add an e2e test for the LocalNamespaceFilter Fixes kubernetes-sigs#526 Fixes kubernetes-sigs#528
84d8cdf
to
4599e4f
Compare
/unhold Added unit tests:
Discussed approach with @seans3 offline. His primary complaint was that the dependency ordering was never intended to be a guarantee, just an expedient to produce fewer errors. However, while this may have been true when dependencies were first added, there are a number of use cases that have been discovered that are not solved by just ordering, which are better solved by an ordering guarantee. Most of these use cases come down to improving UX such that the error is obvious and once fixed in the local config, a re-apply is sufficient to resolve the problem. One example worth calling out is that this helps expedite deployment of chains of dependencies that would otherwise be applied all up front and fall into retry-back-off, causing end-to-end reconciliation to take orders of magnitude longer to complete compared to waiting to apply resources until their dependencies are reconciled. This is especially true for infrastructure resources which take a notoriously long time to reconcile (like GKE clusters). Plus, some retry loops (like Jobs) have a finite number of retries configured before failing, after which a reapply of the initial spec is insufficient. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
The tests look good. Thanks for all the quick work. I will leave the LGTM to Jingfang.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: karlkfi, seans3 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks for adding the unit tests. /lgtm |
This is required to catch dependencies that would have been deleted.
combined (apply + prune) set list.
before executing the task queue.
This allows the DependencyFilter to validate against the planned
actuation strategy of objects that haven't been applied/pruned yet.
DependencyFilter to use.
This can be removed in the future if the filters are managed by the
solver.
DependencyFilter.
This requires storing the reverse edge list.
Fixes #526
Fixes #528
Replaces #523