Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add dependency filter #555

Merged
merged 1 commit into from
Mar 1, 2022

Conversation

karlkfi
Copy link
Contributor

@karlkfi karlkfi commented Feb 25, 2022

  • Pass TaskContext into TaskBuilder.Build
  • Combine dependency graph for apply and prune objects.
    This is required to catch dependencies that would have been deleted.
  • Replace graph.SortObjs into DependencyGraph + Sort + HydrateSetList
  • Replace graph.ReverseSortObjs with ReverseSetList to perform on the
    combined (apply + prune) set list.
  • Add planned pending applies and prune to the InventoryManager
    before executing the task queue.
    This allows the DependencyFilter to validate against the planned
    actuation strategy of objects that haven't been applied/pruned yet.
  • Add the dependency graph to the TaskContext, for the
    DependencyFilter to use.
    This can be removed in the future if the filters are managed by the
    solver.
  • Make Graph.Sort non-destructive, so the graph can be re-used by the
    DependencyFilter.
  • Add Graph.EdgesFrom and EdgesTo for the DependencyFilter to use.
    This requires storing the reverse edge list.
  • Add an e2e test for the DependencyFilter
  • Add an e2e test for the LocalNamespaceFilter

Fixes #526
Fixes #528

Replaces #523

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 25, 2022
@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 25, 2022
Copy link
Contributor

@seans3 seans3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the wrong approach. The approach should be three separate efforts to address the three completely different issues.

#528 is the result of overzealous client-side validation. As I mentioned, we should not be doing client-side validation. The solution is to rollback the overzealous validation.

#554 should be a simple fix with a better test to prove the fix.

For #526, it is not clear this is a bug as much as an optional feature. If we are solving this, we are not going to do it a couple days before a release without tests.

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 25, 2022
Copy link
Member

@mortent mortent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small comment, but I think this approach looks fine. As for the concerns raised by @seans3:

For #554 and #526: We need more tests and I think the PR also mentions that those will be added before this is merged. So I think we are all in agreement there.
I don't think the timing related to a release matters. We have to trust that the combination of code reviews and tests are sufficient to catch most bugs, and when they don't (which will happen from time to time), we have to add better tests. It is unclear to me if waiting would make it safer as we don't have any clients (to my knowledge) who are running unreleased code. We can leverage the tests in https://github.com/GoogleContainerTools/kpt to do an additional check before we make a release. I have done that in the past, but it not something we have automated.

For #526 specifically, I think that is a bug since it seems strange to handle dependencies, but also ignore them in the face of failures. Skipping them seems to align with the general approach of the library, where we continue with as much as possible when something fails, but also make it safe to retry to "make things right" after a failure.

I think we are all in agreement that we want to avoid as much client-side validation as possible. But in this case, we have always restricted dependencies to only include resources within the apply set. This just changes how we report the error.

// DependencyFilter implements ValidationFilter interface to determine if an
// object can be applied or deleted based on the status of it's dependencies.
type DependencyFilter struct {
TaskContext *taskrunner.TaskContext
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not something we need to address in this PR, but it feels like the TaskContext should be passed in as a parameter to the Filter function rather than being a property on the struct.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, but was trying to limit the amount of unrelated changes. That can be done in a separate PR.

@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 26, 2022

For #526, it is not clear this is a bug as much as an optional feature. If we are solving this, we are not going to do it a couple days before a release without tests.

IMO we should not launch depends-on in Config Sync without this bug fixed, and it was probably a mistake to launch the feature in kpt with this bug in the first place. Prior to this PR, depends-on was only half implemented (sorted ordering, but no enforcement).

@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 26, 2022

#528 is the result of overzealous client-side validation. As I mentioned, we should not be doing client-side validation. The solution is to rollback the overzealous validation.

For CRDs this is probably fine, because there is a service-side error that prevents deletion of CRDs with CRs in use.

For namespaces tho, this is catastrophic, because namespace deletion causes garbage collection of everything else in the namespace.

For all other explicit dependencies there is no server-side dependency enforcement. So the client-side enforcement cannot be rolled back without completely removing the dependency ordering feature.

@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 26, 2022

#554 should be a simple fix with a better test to prove the fix.

This is fair. I can extract this to another PR.

@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 26, 2022

Pulled out fix for #554: #556

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2022
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 26, 2022
@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 26, 2022

#528 is the result of overzealous client-side validation. As I mentioned, we should not be doing client-side validation. The solution is to rollback the overzealous validation.

@seans3 Removing the validation for external dependencies won't fix this bug. It would stop the confusing validation error, but it wouldn't fix the behavior. It wouldn't fail server-side, because there's no server-side impl. And without the DependencyFilter, neither the apply nor the prune would be skipped. So the apply would go through and then its dependency would be deleted. And we would lose the validation error for otherwise legitimately external dependencies.

Combining the graph for both applies and prunes allows for detection of implicit dependencies (CRDs & Namespaces) which would have otherwise been missed. And for explicit dependencies (depends-on & apply-time-mutation), it detects them as "internal" dependencies instead of "external". Then both implicit and explicit deps can be checked by the DependencyFilter. And to keep the behavior otherwise the same, the graph only needs to be sorted once and then filtered down to include either the apply objects or prune objects.

I was planning to add more unit tests, but the e2e tests and applier/destroyer tests already show that the existing behavior is otherwise preserved, and I wanted to make sure the approach was otherwise acceptable before writing more tests.

@karlkfi
Copy link
Contributor Author

karlkfi commented Feb 28, 2022

/hold

- Pass TaskContext into TaskBuilder.Build
- Combine dependency graph for apply and prune objects.
  This is required to catch dependencies that would have been deleted.
- Replace graph.SortObjs into DependencyGraph + Sort + HydrateSetList
- Replace graph.ReverseSortObjs with ReverseSetList to perform on the
  combined (apply + prune) set list.
- Add planned pending applies and prune to the InventoryManager
  before executing the task queue.
  This allows the DependencyFilter to validate against the planned
  actuation strategy of objects that haven't been applied/pruned yet.
- Add the dependency graph to the TaskContext, for the
  DependencyFilter to use.
  This can be removed in the future if the filters are managed by the
  solver.
- Make Graph.Sort non-destructive, so the graph can be re-used by the
  DependencyFilter.
- Add Graph.EdgesFrom and EdgesTo for the DependencyFilter to use.
  This requires storing the reverse edge list.
- Add an e2e test for the DependencyFilter
- Add an e2e test for the LocalNamespaceFilter

Fixes kubernetes-sigs#526
Fixes kubernetes-sigs#528
@karlkfi karlkfi force-pushed the karl-dep-filter2 branch from 84d8cdf to 4599e4f Compare March 1, 2022 01:31
@karlkfi
Copy link
Contributor Author

karlkfi commented Mar 1, 2022

/unhold

Added unit tests:

  • DependencyFilter.Filter
  • solver.Build (apply, prune, apply & prune, inventory status)
  • graph.ReverseSetList
  • graph.HydrateSetList
  • graph.DependencyGraph
  • Graph.Dependencies
  • Graph.Dependents
  • Graph.Sort (non-destructive)

Discussed approach with @seans3 offline. His primary complaint was that the dependency ordering was never intended to be a guarantee, just an expedient to produce fewer errors. However, while this may have been true when dependencies were first added, there are a number of use cases that have been discovered that are not solved by just ordering, which are better solved by an ordering guarantee. Most of these use cases come down to improving UX such that the error is obvious and once fixed in the local config, a re-apply is sufficient to resolve the problem.

One example worth calling out is that this helps expedite deployment of chains of dependencies that would otherwise be applied all up front and fall into retry-back-off, causing end-to-end reconciliation to take orders of magnitude longer to complete compared to waiting to apply resources until their dependencies are reconciled. This is especially true for infrastructure resources which take a notoriously long time to reconcile (like GKE clusters). Plus, some retry loops (like Jobs) have a finite number of retries configured before failing, after which a reapply of the initial spec is insufficient.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2022
@karlkfi karlkfi requested review from Liujingfang1 and seans3 March 1, 2022 01:52
Copy link
Contributor

@seans3 seans3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

The tests look good. Thanks for all the quick work. I will leave the LGTM to Jingfang.

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: karlkfi, seans3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2022
@Liujingfang1
Copy link
Contributor

Thanks for adding the unit tests.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 1, 2022
@k8s-ci-robot k8s-ci-robot merged commit 5c6134a into kubernetes-sigs:master Mar 1, 2022
@karlkfi karlkfi deleted the karl-dep-filter2 branch March 1, 2022 18:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
5 participants