Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-909 fix servicemonitor & prom rule reconcile #290

Merged
merged 3 commits into from
Mar 16, 2023

Conversation

jotak
Copy link
Member

@jotak jotak commented Mar 8, 2023

bug tld;dr:
We need to be careful about not tying lifecycle of different objects together, as this would be assuming a certain state, which creates "holes" in the reconcile logic - we must avoid as much as possible any assumption about the current state

@codecov
Copy link

codecov bot commented Mar 8, 2023

Codecov Report

Merging #290 (34f0fbb) into main (4e15d42) will increase coverage by 2.11%.
The diff coverage is 65.93%.

@@            Coverage Diff             @@
##             main     #290      +/-   ##
==========================================
+ Coverage   47.57%   49.69%   +2.11%     
==========================================
  Files          43       43              
  Lines        5019     4984      -35     
==========================================
+ Hits         2388     2477      +89     
+ Misses       2420     2305     -115     
+ Partials      211      202       -9     
Flag Coverage Δ
unittests 49.69% <65.93%> (+2.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ollers/flowlogspipeline/flp_monolith_reconciler.go 63.28% <50.00%> (+6.79%) ⬆️
...trollers/flowlogspipeline/flp_ingest_reconciler.go 62.39% <55.55%> (+7.99%) ⬆️
...rollers/flowlogspipeline/flp_transfo_reconciler.go 58.86% <55.55%> (+6.51%) ⬆️
...trollers/consoleplugin/consoleplugin_reconciler.go 63.73% <70.00%> (+3.73%) ⬆️
...trollers/reconcilers/namespaced_objects_manager.go 80.51% <87.50%> (+0.80%) ⬆️
controllers/flowcollector_controller.go 58.99% <100.00%> (+6.26%) ⬆️
controllers/flowlogspipeline/flp_reconciler.go 56.60% <100.00%> (+4.14%) ⬆️
pkg/discover/apis.go 90.32% <100.00%> (+1.03%) ⬆️
pkg/helper/comparators.go 77.46% <100.00%> (ø)

... and 3 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jotak jotak force-pushed the fix-reconcile-on-upgrade branch 2 times, most recently from 3f819d6 to 3b6c395 Compare March 8, 2023 15:53
KalmanMeth
KalmanMeth previously approved these changes Mar 9, 2023
@openshift-ci openshift-ci bot added the lgtm label Mar 9, 2023
@jotak jotak requested a review from memodi March 9, 2023 14:00
@memodi
Copy link
Contributor

memodi commented Mar 9, 2023

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 9, 2023
@github-actions
Copy link

github-actions bot commented Mar 9, 2023

New image: ["quay.io/netobserv/network-observability-operator:a35d384"]. It will expire after two weeks.

@memodi
Copy link
Contributor

memodi commented Mar 9, 2023

@jotak - I am seeing an error still this time with console-plugin reconcilation, if I tried procedure as described in ticket:

$ oc get flowcollector -w
NAME      AGENT   SAMPLING (EBPF)   DEPLOYMENT MODEL   STATUS
cluster   EBPF    1                 DIRECT             ReconcileConsolePluginFailed

2023-03-09T17:15:18.262Z	ERROR	controller.flowcollector	Failed to create new *v1.ClusterRoleBinding	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "Namespace": "", "Name": "netobserv-plugin", "error": "clusterroles.rbac.authorization.k8s.io \"netobserv-plugin\" not found"}
github.com/netobserv/network-observability-operator/controllers/reconcilers.(*ClientHelper).ReconcileClusterRoleBinding
	/opt/app-root/controllers/reconcilers/client_helper.go:92
github.com/netobserv/network-observability-operator/controllers/consoleplugin.(*CPReconciler).reconcilePermissions
	/opt/app-root/controllers/consoleplugin/consoleplugin_reconciler.go:151
github.com/netobserv/network-observability-operator/controllers/consoleplugin.(*CPReconciler).Reconcile
	/opt/app-root/controllers/consoleplugin/consoleplugin_reconciler.go:97
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:156
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2023-03-09T17:15:18.262Z	ERROR	controller.flowcollector	Failed to reconcile Console plugin: clusterroles.rbac.authorization.k8s.io "netobserv-plugin" not found	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "error": "clusterroles.rbac.authorization.k8s.io \"netobserv-plugin\" not found"}
github.com/netobserv/network-observability-operator/controllers.(*FlowCollectorReconciler).Reconcile
	/opt/app-root/controllers/flowcollector_controller.go:158
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2023-03-09T17:15:18.293Z	ERROR	controller.flowcollector	Reconciler error	{"reconciler group": "flows.netobserv.io", "reconciler kind": "FlowCollector", "name": "cluster", "namespace": "", "error": "clusterroles.rbac.authorization.k8s.io \"netobserv-plugin\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/opt/app-root/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

@jotak jotak force-pushed the fix-reconcile-on-upgrade branch from 3b6c395 to e0e9d0d Compare March 10, 2023 12:34
@openshift-ci openshift-ci bot removed the lgtm label Mar 10, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 10, 2023
@jotak
Copy link
Member Author

jotak commented Mar 10, 2023

@OlivierCazade can you take a second look please? @memodi pointed out an issue that turned out needing more changes, cf my last commit e0e9d0d

@memodi
Copy link
Contributor

memodi commented Mar 13, 2023

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 13, 2023
@memodi
Copy link
Contributor

memodi commented Mar 13, 2023

/ok-to-test

@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:075326e
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-075326e
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-075326e

They will expire after two weeks.

Catalog source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-075326e
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jotak
Copy link
Member Author

jotak commented Mar 15, 2023

@memodi I retried and cannot reproduce the initial issue with this fix applied. Maybe we need to check if our reproduction scenario differ. What I do is:

  1. Install downstream operator 1.1
  2. Create a default flowcollector (without changing any value - also deploying 0-click loki)
  3. Ensure everything works as expected
  4. Delete the operator, keeping the FlowCollector installed
  5. Deploy this PR operator using IMG=quay.io/netobserv/network-observability-operator:075326e make deploy
  6. Check logs & readiness with oc get flowcollector

@jotak
Copy link
Member Author

jotak commented Mar 15, 2023

Just retried using loki-operator instead of zero-click, to make sure it isn't related to that : still no issue

@jotak jotak requested a review from KalmanMeth March 15, 2023 09:58
jotak added 3 commits March 15, 2023 18:04
We need to be careful about not tying lifecycle of different objects
together, as this would be assuming a certain state, which creates "holes"
in the reconcile logic - we must avoid as much as possible any
assumption about the current state
Codecov fails too often to upload. Make it non mandatory
Also increase the timeout for some tests
They are also flawed: we should not assume they are created once for all
when flowcollector is created.
E.g. of how it could go wrong:
1. an old version of an operator is installed, with a flowcollector
2. an upgrade is done, which introduces a new "static" resource
=> that new resource would not be created, since the FlowCollector
object was already present hence did not trigger these "static init"
things
@jotak jotak force-pushed the fix-reconcile-on-upgrade branch from e0e9d0d to 34f0fbb Compare March 15, 2023 17:04
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 15, 2023
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Mar 15, 2023
@github-actions
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:7a17fa1
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-7a17fa1
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-7a17fa1

They will expire after two weeks.

Catalog source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-7a17fa1
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@jotak
Copy link
Member Author

jotak commented Mar 15, 2023

@memodi the catalog issue should be fixed, I regenerated a new one

@memodi
Copy link
Contributor

memodi commented Mar 15, 2023

/label qe-approved

thanks @jotak - confirming both the issues - Reconciliation during upgrade and NOO image with pre-merge catalog image - are fixed with most recent changes.

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Mar 15, 2023
@jotak jotak requested a review from jpinsonneau March 16, 2023 08:12
@@ -29,7 +29,7 @@ jobs:
with:
files: ./cover.out
flags: unittests
fail_ci_if_error: true
fail_ci_if_error: false
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that the solution to flaky test coverage ? Will it skip if it fails to upload for example ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah well, IMO coverage upload is not something critical enough to make whole CI fail, given how unstable it is

Copy link
Contributor

@jpinsonneau jpinsonneau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in terms of code (not tested)

@jotak
Copy link
Member Author

jotak commented Mar 16, 2023

/approve

@openshift-ci
Copy link

openshift-ci bot commented Mar 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 725e26f into netobserv:main Mar 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. qe-approved QE has approved this pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants