NodeConfig needs to aggregate status conditions so they can be queried #1557

Closed
Tracked by #1593
tnozicka opened this issue Nov 10, 2023 · 15 comments · Fixed by #1918
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@tnozicka
Contributor

What should the feature do?

There needs to be a way to assess whether a NodeConfig has succeeded, but we only have conditions like RaidControllerNodeubuntu-2204Degraded, where the node name is variable, which makes them impractical or impossible to query.

What is the use case behind this feature?

Visibility and debuggability. When, say, a RAID setup or a mount fails, it needs to fail at the correct place, where we wait for the NodeConfig to apply, and not 5 commands later when the manager or the ScyllaCluster breaks.
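For illustration, with top-level conditions of fixed types (e.g. Available, Progressing, Degraded, as discussed in the comments below), a consumer could gate on the NodeConfig status without knowing any node names; a minimal Go sketch assuming standard metav1.Condition entries (the helper below is hypothetical):

    package main

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // nodeConfigApplied is a hypothetical helper: with only node-scoped types like
    // RaidControllerNodeubuntu-2204Degraded a caller would have to know every node
    // name, while fixed aggregated types can be checked generically.
    func nodeConfigApplied(conditions []metav1.Condition) bool {
        return meta.IsStatusConditionTrue(conditions, "Available") &&
            meta.IsStatusConditionFalse(conditions, "Progressing") &&
            meta.IsStatusConditionFalse(conditions, "Degraded")
    }

    func main() {
        conditions := []metav1.Condition{
            {Type: "Available", Status: metav1.ConditionTrue},
            {Type: "Progressing", Status: metav1.ConditionFalse},
            {Type: "Degraded", Status: metav1.ConditionFalse},
        }
        fmt.Println(nodeConfigApplied(conditions)) // true
    }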

Anything else we need to know?

No response

@tnozicka tnozicka added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 10, 2023
@tnozicka
Contributor Author

Ideally add a test with a broken NodeConfig
a) on RAID
b) on mounts
so we know the final conditions also work

@tnozicka
Contributor Author

Once we have the aggregated conditions, we should update the printers as well; see #1730.

@rzetelskik
Member

Ideally add a test with a broken NodeConfig
a) on RAID
b) on mounts
so we know the final conditions also work

@tnozicka do you know of any simple reproducers for the two cases?

@tnozicka
Contributor Author

tnozicka commented May 6, 2024

I suppose:
a) referencing a nonexistent or broken device
b) mounting over an existing mount

@rzetelskik
Member

rzetelskik commented May 13, 2024

@tnozicka I need some insight into the intended semantics of the current conditions.

  1. There's NodeConfigReconciledConditionType now:
    NodeConfigReconciledConditionType NodeConfigConditionType = "Reconciled"

    What is the exact meaning of it? The comment says Reconciled indicates that the NodeConfig is fully deployed and available - what does "fully deployed" mean exactly?
    func (ncc *Controller) calculateStatus(nc *scyllav1alpha1.NodeConfig, daemonSets map[string]*appsv1.DaemonSet, scyllaUtilsImage string) (*scyllav1alpha1.NodeConfigStatus, error) {

    The status calculation suggests that it refers to the required DaemonSets being deployed - I assume that with DaemonSet controller conditions this will no longer be needed?
  2. What does Available condition mean in the context of NodeConfig? It's not really intuitive and I don't see any of the controllers setting it in the current implementation.
  3. Should the aggregated node conditions be taken into account when aggregating the general conditions? In other words, should the aggregated NodeConfig conditions also reflect conditions of particular node setups?

@tnozicka
Contributor Author

The status calculation suggests that it refers to the required DaemonSets being deployed

Updated + available (i.e., the DaemonSets being both up to date and available).

I assume that with DaemonSet controller conditions this will no longer be needed?
What does Available condition mean in the context of NodeConfig? It's not really intuitive and I don't see any of the controllers setting it in the current implementation.

The thing with NodeConfig is that it technically may not have a long-running workload to call "available", so I chose "Reconciled" initially, before we had the generic sync functions or the other patterns. I suppose we can use Available here as well, in a broader sense. I guess implementation-wise we'll always have to have something long-running to make it level-based, unless Kubernetes exposes those through an API.

Should the aggregated node conditions be taken into account when aggregating the general conditions? In other words, should the aggregated NodeConfig conditions also reflect conditions of particular node setups?

Can you elaborate here or give me an example? I'm not sure I got the point.

@rzetelskik
Member

I suppose we can use Available here as well, in a broader sense.

Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.

Can you elaborate here or give me an example? I'm not sure I got the point.

For example, should RaidControllerNodeubuntu-2204Degraded be aggregated into general Degraded? Same for Available/Progressing.

@tnozicka
Contributor Author

Wouldn't Progressing make more sense? It seems better aligned with ScyllaCluster and StatefulSets, for example.

Progressing and Available are given. I thought we were talking about Reconciled vs. Available.

For example, should RaidControllerNodeubuntu-2204Degraded be aggregated into general Degraded?

RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded

But there is also the aggregation per node. It's all a bit weird since some conditions come from the workload, but for aggregation purposes, I'd just do:
*Available => Available
*Progressing => Progressing
*Degraded => Degraded
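
For illustration, a minimal sketch of such a suffix-based ("wildcard") aggregation, using plain metav1.Condition types (illustrative only, not the operator's actual implementation; Available would use the inverse all-true polarity, Progressing mirrors Degraded):

    package conditions

    import (
        "strings"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // aggregateDegraded folds every condition whose type ends with "Degraded"
    // (e.g. RaidControllerNodeubuntu-2204Degraded) into a single top-level
    // Degraded condition: any source condition that is not False degrades the
    // aggregate and propagates its reason and message.
    func aggregateDegraded(conditions []metav1.Condition, generation int64) metav1.Condition {
        agg := metav1.Condition{
            Type:               "Degraded",
            Status:             metav1.ConditionFalse,
            Reason:             "AsExpected",
            ObservedGeneration: generation,
        }

        for _, c := range conditions {
            if c.Type == "Degraded" || !strings.HasSuffix(c.Type, "Degraded") {
                continue
            }
            if c.Status != metav1.ConditionFalse {
                agg.Status = metav1.ConditionTrue
                agg.Reason = c.Reason
                agg.Message = c.Message
            }
        }

        return agg
    }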

@rzetelskik
Member

Thanks for all the responses so far.

RaidControllerNodeubuntu-2204Degraded => RaidControllerDegraded => Degraded

With how this is implemented now, I was thinking this should be more of a RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded => Degraded flow. The node conditions are already aggregated per node (RaidControllerNodeubuntu-2204Degraded => Nodeubuntu-2204Degraded), so a more natural approach to me would be to aggregate all the NodeConfig controller conditions (like DaemonSetControllerDegraded, which are not set up now at all) plus all the node aggregates (Nodeubuntu-2204Degraded etc.). In other words, the NodeSetup controller would do the per-node aggregation, and the NodeConfig controller would do the top-level aggregation. Otherwise the NodeConfig controller would have to aggregate partial conditions coming from independent controllers.

@tnozicka
Contributor Author

I don't find the "mid" aggregations particularly useful (and there are a lot of options), so I'd likely only generically aggregate *Degraded => Degraded and leave the rest for when we need it. The only aggregation case I see as useful at this point is to reasonably wait for the NodeConfig to be successfully done, which is covered by the wildcard aggregation without any intermediate steps.

@rzetelskik
Member

I don't find the "mid" aggregations particularly useful (and there are a lot of options)

I think it's useful because the NodeConfig controller has to read these conditions from the status, since they come from different controllers (NodeSetup), so if they're not there, we have to assume a worst-case default. I guess it's easier to do this per matching node than per matching node AND per each NodeSetup controller. But OK, I understand the conclusion is that the intermediate conditions are an implementation detail.

*Degraded => Degraded

I don't think I follow what this is about.

@tnozicka
Contributor Author

I was thinking about just taking the existing conditions from the status and doing:

	// Aggregate all existing *Degraded conditions from the status into a single
	// top-level Degraded condition, defaulting to False (AsExpected) when none
	// of the sources is degraded.
	degradedCondition, err := AggregateStatusConditions(
		FindStatusConditionsWithSuffix(*allConditions, "Degraded"),
		metav1.Condition{
			Type:               "Degraded",
			Status:             metav1.ConditionFalse,
			Reason:             internalapi.AsExpectedReason,
			Message:            "",
			ObservedGeneration: generation,
		},
	)

We'd probably also have to fix the logic to check observed generations, since the conditions come from different places.
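
For illustration, a minimal sketch of such an observed-generation check, assuming plain metav1.Condition types (the helper below is hypothetical, not existing operator code):

    package conditions

    import (
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // staleConditionTypes returns the types of all conditions whose
    // ObservedGeneration lags behind the object's current generation, e.g. so the
    // caller can keep the aggregate Progressing/Unknown until every reporter has
    // caught up with the latest spec.
    func staleConditionTypes(conditions []metav1.Condition, generation int64) []string {
        var stale []string
        for _, c := range conditions {
            if c.ObservedGeneration < generation {
                stale = append(stale, c.Type)
            }
        }
        return stale
    }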

@scylla-operator-bot
Contributor

The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 9, 2024
@rzetelskik
Member

/remove-lifecycle stale
/triage accepted

@scylla-operator-bot scylla-operator-bot bot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 9, 2024
@tnozicka
Contributor Author

We also need a way of figuring out whether all the asynchronous statuses have already been reported (from the node setup pods).
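
For illustration, one option would be to compare the nodes the NodeConfig targets against the node-scoped aggregates that have actually been reported; a minimal sketch assuming the Node<name>Degraded naming discussed above (the helper below is hypothetical):

    package conditions

    import (
        "fmt"

        "k8s.io/apimachinery/pkg/api/meta"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // unreportedNodes returns the matching nodes whose node setup pods have not yet
    // reported their node-scoped aggregate, so the top-level aggregation can stay
    // Progressing instead of prematurely reporting success.
    func unreportedNodes(conditions []metav1.Condition, matchingNodes []string) []string {
        var missing []string
        for _, node := range matchingNodes {
            if meta.FindStatusCondition(conditions, fmt.Sprintf("Node%sDegraded", node)) == nil {
                missing = append(missing, node)
            }
        }
        return missing
    }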
