NodeConfig needs to aggregate status conditions so they can be queried #1557

Closed
Tracked by #1593
tnozicka opened this issue Nov 10, 2023 · 15 comments · Fixed by #1918
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tnozicka
Contributor

What should the feature do?

There needs to be a way to assess whether a NodeConfig has succeeded, but we only have conditions like RaidControllerNodeubuntu-2204Degraded, where the node name is variable, which makes them impractical or impossible to query.

What is the use case behind this feature?

Visibility and debuggability. When, say, a RAID setup or a mount fails, it needs to fail at the correct place, where we wait for the NodeConfig to apply, and not 5 commands later when the manager or the ScyllaCluster breaks.
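For illustration, with top-level conditions of fixed types (e.g. Available, Progressing, Degraded, as discussed in the comments below), a consumer could gate on the NodeConfig status without knowing any node names; a minimal Go sketch assuming standard metav1.Condition entries (the helper below is hypothetical):

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // nodeConfigApplied is a hypothetical helper: with only node-scoped types like
    // RaidControllerNodeubuntu-2204Degraded a caller would have to know every node
    // name, while fixed aggregated types can be checked generically.
    func nodeConfigApplied(conditions []metav1.Condition) bool {
        return meta.IsStatusConditionTrue(conditions, "Available") &&
            meta.IsStatusConditionFalse(conditions, "Progressing") &&
            meta.IsStatusConditionFalse(conditions, "Degraded")
    }

    func main() {
        conditions := []metav1.Condition{
            {Type: "Available", Status: metav1.ConditionTrue},
            {Type: "Progressing", Status: metav1.ConditionFalse},
            {Type: "Degraded", Status: metav1.ConditionFalse},
        }
        fmt.Println(nodeConfigApplied(conditions)) // true
    }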

Anything else we need to know?

No response

@tnozicka tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 10, 2023
@tnozicka
Contributor Author

Ideally add a test with a broken NodeConfig
a) on RAID
b) on mounts
so we know the final conditions also work

@tnozicka
Contributor Author

Once we have the aggregated conditions, we should update the printers as well; see #1730.

@rzetelskik
Member

Ideally add a test with a broken NodeConfig
a) on RAID
b) on mounts
so we know the final conditions also work

@tnozicka do you know of any simple reproducers for the two cases?

@tnozicka
Contributor Author

tnozicka commented May 6, 2024

I suppose:
a) referencing a nonexistent or broken device
b) mounting over an existing mount

@rzetelskik
Member

rzetelskik commented May 13, 2024

@tnozicka I need some insight into the intended semantics of the current conditions.

  1. There's NodeConfigReconciledConditionType now:
    NodeConfigReconciledConditionType NodeConfigConditionType = "Reconciled"

    What is the exact meaning of it? The comment says Reconciled indicates that the NodeConfig is fully deployed and available - what does "fully deployed" mean exactly?
    func (ncc *Controller) calculateStatus(nc *scyllav1alpha1.NodeConfig, daemonSets map[string]*appsv1.DaemonSet, scyllaUtilsImage string) (*scyllav1alpha1.NodeConfigStatus, error) {

    The status calculation suggests that it refers to the required DaemonSets being deployed - I assume that with DaemonSet controller conditions this will no longer be needed?
  2. What does Available condition mean in the context of NodeConfig? It's not really intuitive and I don't see any of the controllers setting it in the current implementation.
  3. Should the aggregated node conditions be taken into account when aggregating the general conditions? In other words, should the aggregated NodeConfig conditions also reflect conditions of particular node setups?

@tnozicka
Contributor Author

The status calculation suggests that it refers to the required DaemonSets being deployed

Updated + available (i.e., the DaemonSets being both up to date and available).

I assume that with DaemonSet controller conditions this will no longer be needed?
What does Available condition mean in the context of NodeConfig? It's not really intuitive and I don't see any of the controllers setting it in the current implementation.

The thing with NodeConfig is that it technically may not have a long-running workload to call "available", so I chose "Reconciled" initially, before we had the generic sync functions or the other patterns. I suppose we can use Available here as well, in a broader sense. I guess implementation-wise we'll always have to have something long-running to make it level-based, unless Kubernetes exposes those through an API.

Should the aggregated node conditions be taken into account when aggregating the general conditions? In other words, should the aggregated NodeConfig conditions also reflect conditions of particular node setups?

Can you elaborate here or give me an example? I'm not sure I got the point.

@rzetelskik
Member

I suppose we can use Available here as well, in a broader sense.

Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.

Can you elaborate here or give me an example? I'm not sure I got the point.

For example, should RaidControllerNodeubuntu-2204Degraded be aggregated into general Degraded? Same for Available/Progressing.

@tnozicka
Contributor Author

Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.

Progressing and Available are given. I thought we were talking about Reconciled vs. Available.

For example, should RaidControllerNodeubuntu-2204Degraded be aggregated into general Degraded?

RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded

But there is also the aggregation per node. It's all a bit weird since some conditions come from the workload, but for aggregation purposes, I'd just do:
*Available => Available
*Progressing => Progressing
*Degraded => Degraded
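
For illustration, a minimal sketch of such a suffix-based ("wildcard") aggregation, using plain metav1.Condition types (illustrative only, not the operator's actual implementation; Available would use the inverse all-true polarity, Progressing mirrors Degraded):

    package conditions

    import (
        "strings"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // aggregateDegraded folds every condition whose type ends with "Degraded"
    // (e.g. RaidControllerNodeubuntu-2204Degraded) into a single top-level
    // Degraded condition: any source condition that is not False degrades the
    // aggregate and propagates its reason and message.
    func aggregateDegraded(conditions []metav1.Condition, generation int64) metav1.Condition {
        agg := metav1.Condition{
            Type:               "Degraded",
            Status:             metav1.ConditionFalse,
            Reason:             "AsExpected",
            ObservedGeneration: generation,
        }

        for _, c := range conditions {
            if c.Type == "Degraded" || !strings.HasSuffix(c.Type, "Degraded") {
                continue
            }
            if c.Status != metav1.ConditionFalse {
                agg.Status = metav1.ConditionTrue
                agg.Reason = c.Reason
                agg.Message = c.Message
            }
        }

        return agg
    }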

@rzetelskik
Member

Thanks for all the responses so far.

RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded

With how this is implemented now, I was thinking this should be more of a RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded => Degraded flow. The node conditions are already aggregated per node (RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded), so a more natural approach to me would be to aggregate all the NodeConfig controller conditions (like DaemonSetControllerDegraded, which are not set up now at all) plus all the node aggregates (Nodeubuntu-2204Degraded etc.). In other words, the NodeSetup controller would do the per-node aggregation, and the NodeConfig controller would do the top-level aggregation. Otherwise the NodeConfig controller would have to aggregate partial conditions coming from independent controllers.

@tnozicka
Contributor Author

I don't find the "mid" aggregations particularly useful (and there are a lot of options), so I'd likely only generically aggregate *Degraded => Degraded and leave the rest for when we need it. The only aggregation case I see as useful at this point is to reasonably wait for the NodeConfig to be successfully done, which is covered by the wildcard aggregation without any intermediate steps.

@rzetelskik
Member

I don't find the "mid" aggregations particularly useful (and there are a lot of options)

I think it's useful because the NodeConfig controller has to read these conditions from the status, since they come from different controllers (NodeSetup), so if they're not there, we have to assume a worst-case default. I guess it's easier to do this per matching node than per matching node AND per each NodeSetup controller. But OK, I understand the conclusion is that the intermediate conditions are an implementation detail.

*Degraded => Degraded

I don't think I follow what this is about.

@tnozicka
Contributor Author

I was thinking about just taking the existing conditions from the status and doing:

	// Aggregate all existing *Degraded conditions from the status into a single
	// top-level Degraded condition, defaulting to False (AsExpected) when none
	// of the sources is degraded.
	degradedCondition, err := AggregateStatusConditions(
		FindStatusConditionsWithSuffix(*allConditions, "Degraded"),
		metav1.Condition{
			Type:               "Degraded",
			Status:             metav1.ConditionFalse,
			Reason:             internalapi.AsExpectedReason,
			Message:            "",
			ObservedGeneration: generation,
		},
	)

We'd probably also have to fix the logic to check observed generations, since the conditions come from different places.
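
For illustration, a minimal sketch of such an observed-generation check, assuming plain metav1.Condition types (the helper below is hypothetical, not existing operator code):

    package conditions

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // staleConditionTypes returns the types of all conditions whose
    // ObservedGeneration lags behind the object's current generation, e.g. so the
    // caller can keep the aggregate Progressing/Unknown until every reporter has
    // caught up with the latest spec.
    func staleConditionTypes(conditions []metav1.Condition, generation int64) []string {
        var stale []string
        for _, c := range conditions {
            if c.ObservedGeneration < generation {
                stale = append(stale, c.Type)
            }
        }
        return stale
    }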

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2024
@rzetelskik
Member

/remove-lifecycle stale
/triage accepted

@scylla-operator-bot scylla-operator-bot bot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2024
@tnozicka
Contributor Author

We also need a way of figuring out whether all the asynchronous statuses have already been reported (from the node setup pods).
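
For illustration, one option would be to compare the nodes the NodeConfig targets against the node-scoped aggregates that have actually been reported; a minimal sketch assuming the Node<name>Degraded naming discussed above (the helper below is hypothetical):

    package conditions

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // unreportedNodes returns the matching nodes whose node setup pods have not yet
    // reported their node-scoped aggregate, so the top-level aggregation can stay
    // Progressing instead of prematurely reporting success.
    func unreportedNodes(conditions []metav1.Condition, matchingNodes []string) []string {
        var missing []string
        for _, node := range matchingNodes {
            if meta.FindStatusCondition(conditions, fmt.Sprintf("Node%sDegraded", node)) == nil {
                missing = append(missing, node)
            }
        }
        return missing
    }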
