-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Target Allocator implementation (Part 1 - OTEL Operator Enhancements) #351
Target Allocator implementation (Part 1 - OTEL Operator Enhancements) #351
Conversation
Pl review @dashpole @Aneurysm9 ty! |
@jpkrohling @Aneurysm9 - please review. Look forward to any comments you may have. ty! |
I'm planning on reviewing this by the end of this week. |
I left a few comments in the design doc. Let me know if you still want me to review this, or whether we should clarify the points there first. |
Hi @jpkrohling, we have tried to address most of your concerns in the document. Please let us know if you feel more clarification is required. |
Hi @jpkrohling thanks for your detailed review and comments on the design doc. The v3 design is what we have implemented currently. If you're good with the design, please review the code else we're happy to walk you through any grey areas / questions you may have. |
I'm mostly happy with the design, there's just one point to clarify or adapt, related to how changes are handled by the allocator. I think the configmap might not be necessary, neither a restart of the allocator process. |
internal/config/main.go
Outdated
@@ -70,15 +74,22 @@ func New(opts ...Option) Config { | |||
o.collectorImage = fmt.Sprintf("otel/opentelemetry-collector:%s", o.version.OpenTelemetryCollector) | |||
} | |||
|
|||
if len(o.targetAllocatorImage) == 0 { | |||
// will be replcaed with target allocator image once approved and available | |||
o.targetAllocatorImage = "raul9595/otel-loadbalancer:0.0.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be updated to an appropriate image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to fetch to the default TA image version from versions.txt
file similar to how we fetch the latest version of otel-collector image. This helps us to easily manage version upgrades of TA.
config/rbac/role.yaml
Outdated
- apiGroups: | ||
- rbac.authorization.k8s.io | ||
resources: | ||
- rolebindings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to create role bindings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason to create a rolebinding is to allow the target allocator pod to view the collector pods in the same namespace. I am basically automating the below command. If I dont create a appropriate rolebinding, it will not be able to query for the collector pods. The logic of this part is in Part2.
kubectl create clusterrolebinding default-view --clusterrole=view --serviceaccount={namespace}:default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The operator should definitely not be able to create cluster roles/bindings by default, as it would require the operator to have cluster-admin permissions. There should be a separate role/binding YAML file that users can opt in to use, giving the operator the extra permissions. At runtime, as part of the "auto detect" package, the operator can then detect if it has been given this extra permission and self-adjust, enabling/disabling this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this makes sense. I would remove that part then and add a check instead just to ensure the appropriate permissions are granted. This would remove the role/rolebinding files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having it in the auto detect during which we do not have access to the request namespace while initializing the auto detect, would it be fine to create a helper method in targetallocator/reconcile which would use the k8s client to check the service account for appropriate permissions and return an error if permissions are not available?
Another option is when the rest.inClusterConfig() is called inside the logic (Part of PR2), we can just return an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. The operator then should indeed provision the role/service account/binding. Are you sure it requires cluster roles, though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually role and role bindings are enough but for the operator to create a role with the appropriate permissions, the operator needs to have these permissions thmeselves. Thus, the operator needs to have a cluster role defined since the collector namespaces are dynamic.
One solution can be just to let the user create the role and rolebindings as defined in the kuttl tests (latest update).
volumes: | ||
- configMap: | ||
items: | ||
- key: targetallocator.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please add an assertion with the contents of the configmap?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One reason I have not included the configmap contents is because it creates some of the the content dynamically (from OTEL collector namespace). This is still not supported in KUTTL and I am not sure how to proceed with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the contents of the configmap are being tested in the Go unit tests. Would it be sufficient to just check if it is created (as being checked in the latest update)?
@@ -0,0 +1,8 @@ | |||
apiVersion: apps/v1 | |||
kind: Deployment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it be a statefulset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The OpenTelemetryCollector target allocation pod would be spawned when the CR mode is StatefulSet but the target allocation kind would be a Deployment. This KUTTL test is for testing the target allocation deployment alone.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add extra assert files, outlining what's expected in terms of deployments/statefulsets/configmaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have added more assert cases. Please let me know if these look fine.
The e2e tests failed probably due to intermittent issues with the cert-manager, and have been restarted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@Raul9595 before merging this PR make sure to update the default TA load balancer image. :)
internal/config/main.go
Outdated
@@ -70,15 +74,22 @@ func New(opts ...Option) Config { | |||
o.collectorImage = fmt.Sprintf("otel/opentelemetry-collector:%s", o.version.OpenTelemetryCollector) | |||
} | |||
|
|||
if len(o.targetAllocatorImage) == 0 { | |||
// will be replcaed with target allocator image once approved and available | |||
o.targetAllocatorImage = "raul9595/otel-loadbalancer:0.0.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to fetch to the default TA image version from versions.txt
file similar to how we fetch the latest version of otel-collector image. This helps us to easily manage version upgrades of TA.
Pull request has been modified.
Co-authored-by: Anthony Mirabella <[email protected]>
Pull request has been modified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. There are a couple of minor changes that would be needed, but they can be addressed in a follow-up PR. If you prefer to do that, please create an issue tracking those changes and I'll merge this.
I'll also give a few hours for other folks to review (@VineethReddy02?), in case they are interested. If I don't hear any objections, I'll merge this today.
@@ -55,6 +116,54 @@ func TestExpectedDeployments(t *testing.T) { | |||
assert.Equal(t, int32(2), *actual.Spec.Replicas) | |||
}) | |||
|
|||
t.Run("should not update target allocator deployment when the config is updated", func(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please also change, say, the number of replicas? This is to demonstrate that you thought about the case where users are changing the deployment manually and you are OK with that.
Pull request has been modified.
I'm really excited about this change! Thank you for your persistence and patience, @Raul9595! This PR is set to auto-merge on green. I restarted the failing tests and will check it again later today. |
…open-telemetry#351) Co-authored-by: Alexis Perez <[email protected]> Co-authored-by: Anthony Mirabella <[email protected]>
…open-telemetry#351) Co-authored-by: Alexis Perez <[email protected]> Co-authored-by: Anthony Mirabella <[email protected]>
Description
This PR addresses issue open-telemetry/prometheus-interoperability-spec#60. This is Part 1 of a 3 part update in which enhancements are presented to the OTel Operator to support the deployment and reconciliation support for a target allocator resource. This update serves as an addition to StatefulSet support in which the target allocator resource can utilize multiple collector instances to split the Prometheus targets among them evenly.
Part 2 will contain the internal target allocator logic which will live as an image and used by the OTel operator.
Part 3 will have the logic to modify the ConfigMap to allow the collectors to find the scrape targets using http_sd_config.
The Design document for Part 1 can be found here: Target Allocator Design Document
Type of change
New feature & Enhancement
Testing
Implemented Unit tests and KUTTL end to end testing for assuring that deployment, reconciliation & target allocator internal logic are working as expected.
A more detailed implementation/testing document for Part 1 of the update is atatched below: Implementation Details
@alexperez52