-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update routine for migration of jaeger remote sampling in version 0.61.0 #1116
Update routine for migration of jaeger remote sampling in version 0.61.0 #1116
Conversation
I have created a first draft. A quick question, does it make sense to import the collector packages to port the configurations or more to avoid the dependency? I assume avoiding is better here? cc @kevinearls |
03967af
to
8380614
Compare
IIRC right now we don't import OTEL col config in the operator. |
8380614
to
ea75e5e
Compare
i wonder if it wouldn't make more sense to let the upgrade fail if remote sampling is enabled and then refer the user to the OTEL documentation? The problem I see is that we can never be sure that the specified port is correct. Furthermore, it could become a problem if external collectors or similar are not reconfigured. |
311340a
to
fa03c7d
Compare
5c4e52e
to
f9d1de5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Could you also raise an issue with the collector so that we can discuss using one port for multiple receivers? It would allow the jaeger receiver and the remote sampling endpoints to be served from the same port again. The way it is now, the jaeger collector isn't interchangeable with the otel collector.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have some questions about the upgrade
|
||
const issueID = "https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707" | ||
return nil, fmt.Errorf( | ||
"jaegerremotesampling is no longer available as receiver configuration. "+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that the message points to the GH issue, but the root cause is not obvious. I would suggest putting either here a sentence that the ports should be changed for receiver and extension or make it super clear in the GH issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second question:
This is the first time when we actually expect the upgrade to fail. How is this error propagated to the end-user? It would be good to have this in the operator logs but as well recoded as k8s evet.
Third question: What happens if the upgrade fails? The instance version in the status will not be upgraded/the image will not be upgraded? If the user fixes the CR by splitting the ports will the upgrade kick in again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make it super clear in the GH issue
i tried to make it clear by saying: The jaeger remote sampling extension collides when converting old jaeger receiver settings. The problem can be solved by manually switching to another port, but this leads to potentially invalid configurations for reporters.
What would you recommend instead? Or what error message do you have in mind?
This is the first time when we actually expect the upgrade to fail. How is this error propagated to the end-user? It would be good to have this in the operator logs but as well recoded as k8s evet.
Good point. All actions are simply ignored. Right now there is only a single log entry.
What happens if the upgrade fails? The instance version in the status will not be upgraded/the image will not be upgraded? If the user fixes the CR by splitting the ports will the upgrade kick in again?
Your CR status will not change. If you fix this issue manually, its all fine. There is no other upgrade routine for 0.61.0
. Everytime you restart or deploy a newer operator version, all upgrade routines are triggered again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your CR status will not change. If you fix this issue manually, its all fine. There is no other upgrade routine for 0.61.0. Everytime you restart or deploy a newer operator version, all upgrade routines are triggered again.
After the fix is done, will the operator upgrade the instance by bumping the OTELcol version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to double check that. But i assume when you modified your CR, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please report back if the upgrade was tested when Jaeger receiver was used with enabled remote sampling. We need to make sure the instance will be upgraded after the CR is fixed (e.g. extension mapped to a different port)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pavolloffay i have now been able to test it locally. after a configuration update, the version is not automatically raised. To achieve this, either the operator must be restarted or the collector configuration must be deleted and recreated.
Details...
Operator log message, once the config is detected.
1.665433033457368e+09 ERROR collector-upgrade failed to upgrade managed otelcol instances {"name": "simplest", "namespace": "test", "error": "jaegerremotesampling is no longer available as receiver configuration. Please use the extension instead with a different remote sampling port. See: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707"}
procedure:
# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest
NAME MODE VERSION AGE
simplest deployment 0.60.0 11m
# fix config
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml
opentelemetrycollector.opentelemetry.io/simplest configured
# status again
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest
NAME MODE VERSION AGE
simplest deployment 0.60.0 12m
# delete
$ kubectl delete -n test -f v0_60_0_jaeger_cfg.yaml
opentelemetrycollector.opentelemetry.io "simplest" deleted
# create
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml
opentelemetrycollector.opentelemetry.io/simplest created
# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest
NAME MODE VERSION AGE
simplest deployment 0.61.0 3s
Should we simply add that to the issue description, add a auto update flag or do you have other ideas?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good finding, we should definitely document this in the codebase that if the error is thrown during the upgrade then the version is never updated.
There are several ways how we can proceed:
- document the breaking change and how to resolve it
- trigger the upgrade routine during the reconciliation logic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of a failing update, we will now create a log entry and event. See: 557a3ff
38157b0
to
e5c15ac
Compare
e5c15ac
to
23dd804
Compare
Signed-off-by: Benedikt Bongartz <[email protected]>
Signed-off-by: Benedikt Bongartz <[email protected]>
Signed-off-by: Benedikt Bongartz <[email protected]>
23dd804
to
0bbe115
Compare
Signed-off-by: Benedikt Bongartz <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR looks good, just some minor comments
Signed-off-by: Benedikt Bongartz <[email protected]>
…1.0 (open-telemetry#1116) * upgrade routine for jaeger remote sampling Signed-off-by: Benedikt Bongartz <[email protected]> * upgrade routine to fail if jaeger receiver uses remote sapmling Signed-off-by: Benedikt Bongartz <[email protected]> * improve error message and create event when upgrade fails Signed-off-by: Benedikt Bongartz <[email protected]> * always inform users about failed upgrades Signed-off-by: Benedikt Bongartz <[email protected]> * follow recommendations Signed-off-by: Benedikt Bongartz <[email protected]> Signed-off-by: Benedikt Bongartz <[email protected]>
Closes #1104