Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1092 Move CRD fields to advanced #467

Merged
merged 4 commits into from
Jan 22, 2024

Conversation

jpinsonneau
Copy link
Contributor

@jpinsonneau jpinsonneau commented Oct 20, 2023

Description

This PR move unused API fields to debug sections under each component.
See NETOBSERV-1092 for the list of affected fields.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@jpinsonneau jpinsonneau added ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. testing-breaking-change labels Oct 20, 2023
@jpinsonneau
Copy link
Contributor Author

@msherif1234 Should we consider the eBPF cache fields as debug ?

	// `cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
	// Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
	// however you can expect higher memory consumption and an increased latency in the flow collection.
	//+kubebuilder:validation:Pattern:=^\d+(ns|ms|s|m)?$
	//+kubebuilder:default:="5s"
	CacheActiveTimeout string `json:"cacheActiveTimeout,omitempty"`

	// `cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
	// Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
	// however you can expect higher memory consumption and an increased latency in the flow collection.
	//+kubebuilder:validation:Minimum=1
	//+kubebuilder:default:=100000
	CacheMaxFlows int32 `json:"cacheMaxFlows,omitempty"`

@msherif1234
Copy link
Contributor

@msherif1234 Should we consider the eBPF cache fields as debug ?

	// `cacheActiveTimeout` is the max period during which the reporter aggregates flows before sending.
	// Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
	// however you can expect higher memory consumption and an increased latency in the flow collection.
	//+kubebuilder:validation:Pattern:=^\d+(ns|ms|s|m)?$
	//+kubebuilder:default:="5s"
	CacheActiveTimeout string `json:"cacheActiveTimeout,omitempty"`

	// `cacheMaxFlows` is the max number of flows in an aggregate; when reached, the reporter sends the flows.
	// Increasing `cacheMaxFlows` and `cacheActiveTimeout` can decrease the network traffic overhead and the CPU load,
	// however you can expect higher memory consumption and an increased latency in the flow collection.
	//+kubebuilder:validation:Minimum=1
	//+kubebuilder:default:=100000
	CacheMaxFlows int32 `json:"cacheMaxFlows,omitempty"`

I don't think so those knob if user decided to tune will be because resource limitation or excess and will change and stay so they are more of config operation knobs not debug IMHO

@@ -635,37 +591,10 @@ type FlowCollectorLoki struct {
// Set `enable` to `true` to store flows in Loki. It is required for the OpenShift Console plugin installation.
Enable *bool `json:"enable,omitempty"`

//+kubebuilder:default:="1s"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like for the ebpf agent, I think batchWait and batchSize could be good to keep not in Debug, for performance tuning?
Especially BatchSize is something users need to keep consistent with the msg max size defined on loki server-side.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I based that on the Jira you initially wrote 🤔
If you changed your mind I can revert. Just let me know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my side I would feel better as a user to have that elsewhere but debug is a strong word here.
Maybe we could just concider renaming the debug sections to something more generic such as advanced ?

@jotak
Copy link
Member

jotak commented Oct 25, 2023

Just a remark that we should probably take back loki batch settings , other than that LGTM!

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Nov 7, 2023
@jpinsonneau
Copy link
Contributor Author

jpinsonneau commented Nov 7, 2023

Just a remark that we should probably take back loki batch settings , other than that LGTM!

Sure, I moved loki batch wait, batch size and timeout to FLP since these are only related to processor.
Since we now have 3 loki related fields and 4 kafka ones I can create subsections under processor. WDYT ?

https://github.com/jpinsonneau/network-observability-operator/blob/1092/api/v1beta2/flowcollector_types.go#L392-L425

Copy link

codecov bot commented Nov 7, 2023

Codecov Report

Attention: 238 lines in your changes are missing coverage. Please review.

Comparison is base (da4bdd3) 56.27% compared to head (83166a8) 57.60%.

Files Patch % Lines
.../flowcollector/v1alpha1/zz_generated.conversion.go 0.00% 58 Missing ⚠️
...s/flowcollector/v1beta1/zz_generated.conversion.go 35.06% 50 Missing ⚠️
...pis/flowcollector/v1beta2/zz_generated.deepcopy.go 75.15% 37 Missing and 2 partials ⚠️
pkg/helper/crd.go 77.86% 23 Missing and 6 partials ⚠️
...is/flowcollector/v1alpha1/flowcollector_webhook.go 0.00% 27 Missing ⚠️
...pis/flowcollector/v1beta1/flowcollector_webhook.go 80.00% 18 Missing and 4 partials ⚠️
controllers/consoleplugin/consoleplugin_objects.go 66.66% 2 Missing and 4 partials ⚠️
pkg/helper/flowcollector.go 93.75% 4 Missing and 2 partials ⚠️
controllers/ovs/flowsconfig_ovnk_reconciler.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #467      +/-   ##
==========================================
+ Coverage   56.27%   57.60%   +1.32%     
==========================================
  Files          69       70       +1     
  Lines        9104     9446     +342     
==========================================
+ Hits         5123     5441     +318     
- Misses       3648     3668      +20     
- Partials      333      337       +4     
Flag Coverage Δ
unittests 57.60% <66.75%> (+1.32%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 885 to 901
//+kubebuilder:default="1s"
// `minBackoff` is the initial backoff time for client connection between retries.
MinBackoff *metav1.Duration `json:"minBackoff,omitempty"` // Warning: keep as pointer, else default is ignored

//+kubebuilder:default="5s"
// `maxBackoff` is the maximum backoff time for client connection between retries.
MaxBackoff *metav1.Duration `json:"maxBackoff,omitempty"` // Warning: keep as pointer, else default is ignored

//+kubebuilder:validation:Minimum=0
//+kubebuilder:default:=2
// `maxRetries` is the maximum number of retries for client connections.
MaxRetries *int32 `json:"maxRetries,omitempty"`

//+kubebuilder:default:={"app":"netobserv-flowcollector"}
// +optional
// `staticLabels` is a map of common labels to set on each flow.
StaticLabels map[string]string `json:"staticLabels"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with your suggestion to move flp-related loki config under processor;
For those 4 fields I guess they should be prefixed with "Loki" like you did for the others

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -29,7 +29,6 @@ func TestBeta1ConversionRoundtrip_Loki(t *testing.T) {
Enable: true,
InsecureSkipVerify: true,
},
BatchSize: 1000,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For beta1 -> beta2 -> beta1 cycle, we loose BatchSize value since this field moved from Loki to Processor spec.

If we want to manage such, we'll need to manually override Convert_v1beta1_FlowCollector_To_v1beta2_FlowCollector and Convert_v1beta2_FlowCollector_To_v1beta1_FlowCollector to copy the fields.

Is it worth the price ? @jotak @msherif1234

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's necessary to do it, else we loose the ability to set this field, since the conversions always happen between the served version (like v1beta2) and the stored version (v1beta1) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and same for all the other fields that are moved)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also I tested deploying the sample flowcollector doesn't work, I think it still has all the old fields

Copy link
Contributor Author

@jpinsonneau jpinsonneau Nov 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's annoying since it always runs the conversion webhook, you end with debug fields in the yaml all the time even if you directly apply v1beta2 sample:

spec:
  agent:
    ebpf:
      ...
      debug: {}
  consolePlugin:
    ...
    debug:
      port: 9001
      register: true
  processor:
    ...
    debug:
      port: 2055
      lokiMaxRetries: 2
      lokiMaxBackoff: 5s
      conversationTerminatingTimeout: 5s
      conversationEndTimeout: 10s
      lokiStaticLabels:
        app: netobserv-flowcollector
      enableKubeProbes: true
      lokiMinBackoff: 1s
      healthPort: 8080
      dropUnusedFields: true
      conversationHeartbeatInterval: 30s

I would prefer to omit these when default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I don't see a perfect way to managing that ...
having in-code defaults, and fill these values only when they don't match the default? Not perfect because it would force us to keep defaults defined in two places...

Copy link
Member

@jotak jotak Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to have a codegen tool like kubebuilder that automatically generates defaults as go consts from the kubebuilder annotations or from the CRD openAPI .... sounds like a hack'n'hustle project?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a way to load defaults from CRD and show custom values only when set.
That required a bit of refactoring and creating a new crd helper but it works very well and the user still have autocompletion in the yaml.

For simplicity, I have created getters in flowcollector helper that sets defaults and load custom values in their related config so we can rely on these everywhere in the code and we don't need to check for nils.

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Nov 21, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Nov 22, 2023
@jpinsonneau
Copy link
Contributor Author

jpinsonneau commented Nov 22, 2023

  • Added v1beta1 <> v1beta2 FlowCollector manual conversion to move Loki / Processor fields
  • Added webhook tests
  • Updated v1beta2 sample

https://github.com/netobserv/network-observability-operator/compare/93e67291a0ab7be328dbe506037dd39c8e5b852a..d814d996a12b1345fe8e567acd92e46a3a80867c

@jpinsonneau jpinsonneau requested a review from jotak November 22, 2023 11:13
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Nov 24, 2023
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 4, 2023
@jpinsonneau
Copy link
Contributor Author

Found a bug on conversion webhook: 04269b8

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 17, 2024
Copy link

New images:

  • quay.io/netobserv/network-observability-operator:8b85581
  • quay.io/netobserv/network-observability-operator-bundle:v0.0.0-8b85581
  • quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8b85581

They will expire after two weeks.

To deploy this build:

# Direct deployment, from operator repo
IMAGE=quay.io/netobserv/network-observability-operator:8b85581 make deploy

# Or using operator-sdk
operator-sdk run bundle quay.io/netobserv/network-observability-operator-bundle:v0.0.0-8b85581

Or as a Catalog Source:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: netobserv-dev
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  image: quay.io/netobserv/network-observability-operator-catalog:v0.0.0-8b85581
  displayName: NetObserv development catalog
  publisher: Me
  updateStrategy:
    registryPoll:
      interval: 1m

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 18, 2024
Copy link

openshift-ci bot commented Jan 18, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jotak

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jotak jotak removed the approved label Jan 18, 2024
@jpinsonneau jpinsonneau changed the title NETOBSERV-1092 Move CRD fields to debug NETOBSERV-1092 Move CRD fields to advanced Jan 18, 2024
@memodi
Copy link
Contributor

memodi commented Jan 18, 2024

/test e2e-operator

jpinsonneau and others added 4 commits January 19, 2024 17:00
read defaults from CRD

fix merge

Feedback on Debug crd sections

- As suggested by Julien, rename Debug to Advanced
- Merge back loki fields into Loki section
- Add webhook tests on Advanced sections
- Fix loki.advanced not being overwritten from anotations on conversion
- StaticLabel isn't a pointer to map anymore, as maps are already
  nillable
- fix alm sample with renamed fields
- generalize CRD setup for all suite_tests
- Some loki settings, when provided by a v1beta1 CR, were ignored, such as
  batchWait/Size
- Found a minor day-0 bug in console: when the port setting is changed,
  console plugin was unreachable because the pod didn't use that port
(only service did). It seems nobody ever changed that setting >_<
@jpinsonneau
Copy link
Contributor Author

I've rebased again since some changes were introduced in #501 requiring 83166a8 update

@openshift-ci openshift-ci bot added the lgtm label Jan 22, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit bf1562e into netobserv:main Jan 22, 2024
12 checks passed
@nathan-weinberg
Copy link
Contributor

/ok-to-test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants