Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update routine for migration of jaeger remote sampling in version 0.61.0 #1116

Merged

Conversation

frzifus
Copy link
Member

@frzifus frzifus commented Sep 21, 2022

Closes #1104

@frzifus
Copy link
Member Author

frzifus commented Sep 21, 2022

I have created a first draft. A quick question, does it make sense to import the collector packages to port the configurations or more to avoid the dependency? I assume avoiding is better here?

cc @kevinearls

@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch 4 times, most recently from 03967af to 8380614 Compare September 22, 2022 15:14
@pavolloffay
Copy link
Member

I have created a first draft. A quick question, does it make sense to import the collector packages to port the configurations or more to avoid the dependency? I assume avoiding is better here?

IIRC right now we don't import OTEL col config in the operator.

pkg/collector/upgrade/v0_61_0_test.go Outdated Show resolved Hide resolved
pkg/collector/upgrade/v0_61_0_test.go Outdated Show resolved Hide resolved
@pavolloffay pavolloffay mentioned this pull request Sep 30, 2022
3 tasks
@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch from 8380614 to ea75e5e Compare October 4, 2022 09:06
@frzifus
Copy link
Member Author

frzifus commented Oct 4, 2022

i wonder if it wouldn't make more sense to let the upgrade fail if remote sampling is enabled and then refer the user to the OTEL documentation? The problem I see is that we can never be sure that the specified port is correct. Furthermore, it could become a problem if external collectors or similar are not reconfigured.

@pavolloffay @jpkrohling @yuriolisa

@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch 2 times, most recently from 311340a to fa03c7d Compare October 4, 2022 16:52
@frzifus frzifus marked this pull request as ready for review October 4, 2022 16:53
@frzifus frzifus requested a review from a team October 4, 2022 16:53
@frzifus
Copy link
Member Author

frzifus commented Oct 4, 2022

@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch 2 times, most recently from 5c4e52e to f9d1de5 Compare October 4, 2022 16:55
@frzifus frzifus requested a review from pavolloffay October 4, 2022 16:55
Copy link
Member

@jpkrohling jpkrohling left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you also raise an issue with the collector so that we can discuss using one port for multiple receivers? It would allow the jaeger receiver and the remote sampling endpoints to be served from the same port again. The way it is now, the jaeger collector isn't interchangeable with the otel collector.

@frzifus
Copy link
Member Author

frzifus commented Oct 6, 2022

Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some questions about the upgrade


const issueID = "https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707"
return nil, fmt.Errorf(
"jaegerremotesampling is no longer available as receiver configuration. "+
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that the message points to the GH issue, but the root cause is not obvious. I would suggest putting either here a sentence that the ports should be changed for receiver and extension or make it super clear in the GH issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second question:

This is the first time when we actually expect the upgrade to fail. How is this error propagated to the end-user? It would be good to have this in the operator logs but as well recoded as k8s evet.

Third question: What happens if the upgrade fails? The instance version in the status will not be upgraded/the image will not be upgraded? If the user fixes the CR by splitting the ports will the upgrade kick in again?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make it super clear in the GH issue

i tried to make it clear by saying: The jaeger remote sampling extension collides when converting old jaeger receiver settings. The problem can be solved by manually switching to another port, but this leads to potentially invalid configurations for reporters. What would you recommend instead? Or what error message do you have in mind?

This is the first time when we actually expect the upgrade to fail. How is this error propagated to the end-user? It would be good to have this in the operator logs but as well recoded as k8s evet.

Good point. All actions are simply ignored. Right now there is only a single log entry.

What happens if the upgrade fails? The instance version in the status will not be upgraded/the image will not be upgraded? If the user fixes the CR by splitting the ports will the upgrade kick in again?

Your CR status will not change. If you fix this issue manually, its all fine. There is no other upgrade routine for 0.61.0. Everytime you restart or deploy a newer operator version, all upgrade routines are triggered again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your CR status will not change. If you fix this issue manually, its all fine. There is no other upgrade routine for 0.61.0. Everytime you restart or deploy a newer operator version, all upgrade routines are triggered again.

After the fix is done, will the operator upgrade the instance by bumping the OTELcol version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to double check that. But i assume when you modified your CR, yes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please report back if the upgrade was tested when Jaeger receiver was used with enabled remote sampling. We need to make sure the instance will be upgraded after the CR is fixed (e.g. extension mapped to a different port)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pavolloffay i have now been able to test it locally. after a configuration update, the version is not automatically raised. To achieve this, either the operator must be restarted or the collector configuration must be deleted and recreated.

Details...

Operator log message, once the config is detected.

1.665433033457368e+09	ERROR	collector-upgrade	failed to upgrade managed otelcol instances	{"name": "simplest", "namespace": "test", "error": "jaegerremotesampling is no longer available as receiver configuration. Please use the extension instead with a different remote sampling port. See: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707"}

procedure:

# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest        
NAME       MODE         VERSION   AGE
simplest   deployment   0.60.0    11m

# fix config
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml                      
opentelemetrycollector.opentelemetry.io/simplest configured

# status again
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest      
NAME       MODE         VERSION   AGE
simplest   deployment   0.60.0    12m

# delete
$ kubectl delete -n test -f v0_60_0_jaeger_cfg.yaml                
opentelemetrycollector.opentelemetry.io "simplest" deleted

# create
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml 
opentelemetrycollector.opentelemetry.io/simplest created

# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest 
NAME       MODE         VERSION   AGE
simplest   deployment   0.61.0    3s

Should we simply add that to the issue description, add a auto update flag or do you have other ideas?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good finding, we should definitely document this in the codebase that if the error is thrown during the upgrade then the version is never updated.

There are several ways how we can proceed:

  • document the breaking change and how to resolve it
  • trigger the upgrade routine during the reconciliation logic

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of a failing update, we will now create a log entry and event. See: 557a3ff

@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch from 38157b0 to e5c15ac Compare October 6, 2022 19:07
@frzifus frzifus requested a review from pavolloffay October 6, 2022 19:08
@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch from e5c15ac to 23dd804 Compare October 7, 2022 07:55
@frzifus frzifus force-pushed the remote_sampling_upgrade_routine branch from 23dd804 to 0bbe115 Compare October 7, 2022 08:00
Copy link
Member

@pavolloffay pavolloffay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good, just some minor comments

pkg/collector/upgrade/upgrade.go Outdated Show resolved Hide resolved
pkg/collector/upgrade/v0_61_0-invalid.yaml Outdated Show resolved Hide resolved
Signed-off-by: Benedikt Bongartz <[email protected]>
@frzifus frzifus requested a review from pavolloffay October 12, 2022 08:21
@pavolloffay pavolloffay merged commit 6d4a9a1 into open-telemetry:main Oct 12, 2022
@frzifus frzifus deleted the remote_sampling_upgrade_routine branch October 12, 2022 12:06
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…1.0 (open-telemetry#1116)

* upgrade routine for jaeger remote sampling

Signed-off-by: Benedikt Bongartz <[email protected]>

* upgrade routine to fail if jaeger receiver uses remote sapmling

Signed-off-by: Benedikt Bongartz <[email protected]>

* improve error message and create event when upgrade fails

Signed-off-by: Benedikt Bongartz <[email protected]>

* always inform users about failed upgrades

Signed-off-by: Benedikt Bongartz <[email protected]>

* follow recommendations

Signed-off-by: Benedikt Bongartz <[email protected]>

Signed-off-by: Benedikt Bongartz <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add upgrade routine for jaeger remote sampling from 0.60 to 0.61.
4 participants