Update routine for migration of jaeger remote sampling in version 0.61.0 #1116

Merged
8 changes: 8 additions & 0 deletions pkg/collector/upgrade/testdata/v0_61_0-invalid.yaml
@@ -0,0 +1,8 @@
---
receivers:
  jaeger:
    protocols:
      grpc:
    remote_sampling:
      strategy_file: "/etc/strategy.json"
      strategy_file_reload_interval: 10s
5 changes: 5 additions & 0 deletions pkg/collector/upgrade/testdata/v0_61_0-valid.yaml
@@ -0,0 +1,5 @@
---
receivers:
  jaeger:
    protocols:
      grpc:
4 changes: 3 additions & 1 deletion pkg/collector/upgrade/upgrade.go
@@ -61,7 +61,9 @@ func (u VersionUpgrade) ManagedInstances(ctx context.Context) error {
 		}
 		upgraded, err := u.ManagedInstance(ctx, original)
 		if err != nil {
-			// nothing to do at this level, just go to the next instance
+			const msg = "automated update not possible. Configuration must be corrected manually and CR instance must be re-created."
+			itemLogger.Info(msg)
+			u.Recorder.Event(&original, "Error", "Upgrade", msg)
 			continue
 		}

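The hunk above changes the per-instance loop so that a failed upgrade is logged and recorded as an event before moving on. A minimal, self-contained sketch of that pattern follows. The `eventRecorder` interface, `memoryRecorder`, `upgradeAll`, and `failingFor` are hypothetical stand-ins invented for illustration; the operator itself uses the Kubernetes event recorder and its own logger.

```go
package main

import (
	"errors"
	"fmt"
)

// eventRecorder is a hypothetical, reduced stand-in for a Kubernetes event
// recorder; the real one takes a runtime object rather than a name.
type eventRecorder interface {
	Event(object, eventType, reason, message string)
}

// memoryRecorder keeps events in a slice so they can be inspected afterwards.
type memoryRecorder struct{ events []string }

func (m *memoryRecorder) Event(object, eventType, reason, message string) {
	m.events = append(m.events, fmt.Sprintf("%s %s/%s: %s", object, eventType, reason, message))
}

// upgradeAll tries to upgrade every named instance. A failure is logged and
// recorded as an event, then the loop continues with the next instance.
func upgradeAll(names []string, upgrade func(name string) error, rec eventRecorder) (upgraded []string) {
	for _, name := range names {
		if err := upgrade(name); err != nil {
			const msg = "automated update not possible"
			fmt.Printf("instance %q: %s: %v\n", name, msg, err)
			rec.Event(name, "Error", "Upgrade", msg)
			continue
		}
		upgraded = append(upgraded, name)
	}
	return upgraded
}

// failingFor returns a hypothetical upgrade routine that fails for one name only.
func failingFor(bad string) func(string) error {
	return func(name string) error {
		if name == bad {
			return errors.New("remote_sampling is still configured on the jaeger receiver")
		}
		return nil
	}
}

func main() {
	rec := &memoryRecorder{}
	fmt.Println(upgradeAll([]string{"simplest", "legacy", "other"}, failingFor("legacy"), rec))
	fmt.Println(rec.events)
}
```

In this sketch one failing instance does not block the others, and the failure is visible both in the log output and in the recorded events.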
69 changes: 69 additions & 0 deletions pkg/collector/upgrade/v0_61_0.go
@@ -0,0 +1,69 @@
// Copyright The OpenTelemetry Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package upgrade

import (
	"errors"
	"fmt"
	"strings"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1alpha1"
	"github.com/open-telemetry/opentelemetry-operator/pkg/collector/adapters"
)

func upgrade0_61_0(u VersionUpgrade, otelcol *v1alpha1.OpenTelemetryCollector) (*v1alpha1.OpenTelemetryCollector, error) {
	if len(otelcol.Spec.Config) == 0 {
		return otelcol, nil
	}

	otelCfg, err := adapters.ConfigFromString(otelcol.Spec.Config)
	if err != nil {
		return otelcol, fmt.Errorf("couldn't upgrade to v0.61.0, failed to parse configuration: %w", err)
	}

	// Search for removed Jaeger remote sampling settings. (https://github.com/open-telemetry/opentelemetry-collector-contrib/pull/14163)
	receiversConfig, ok := otelCfg["receivers"].(map[any]any)
	if !ok {
		// In case there is no receivers config.
		return otelcol, nil
	}

	for key, rc := range receiversConfig {
		k, ok := key.(string)
		if !ok {
			continue
		}
		cfg, ok := rc.(map[any]any)
		// check if jaeger is configured
		if !ok || !strings.HasPrefix(k, "jaeger") {
			continue
		}

		// check if remote sampling settings exist
		if _, ok := cfg["remote_sampling"]; !ok {
			continue
		}

		const issueID = "https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707"
		errStr := fmt.Sprintf(
			"jaegerremotesampling is no longer available as receiver configuration. "+
Member

I like that the message points to the GH issue, but the root cause is not obvious. I would suggest either adding a sentence here saying that the receiver and the extension need different ports, or making it super clear in the GH issue.

Member

Second question:

This is the first time we actually expect an upgrade to fail. How is this error propagated to the end user? It would be good to have it in the operator logs and also recorded as a k8s event.

Third question: What happens if the upgrade fails? Will the instance version in the status and the image both stay on the old version? If the user fixes the CR by splitting the ports, will the upgrade kick in again?

Member Author

> make it super clear in the GH issue

I tried to make it clear by saying: The jaeger remote sampling extension collides when converting old jaeger receiver settings. The problem can be solved by manually switching to another port, but this leads to potentially invalid configurations for reporters. What would you recommend instead? Or what error message do you have in mind?

> This is the first time we actually expect an upgrade to fail. How is this error propagated to the end user? It would be good to have it in the operator logs and also recorded as a k8s event.

Good point. All actions are simply ignored. Right now there is only a single log entry.

> What happens if the upgrade fails? Will the instance version in the status and the image both stay on the old version? If the user fixes the CR by splitting the ports, will the upgrade kick in again?

Your CR status will not change. If you fix this issue manually, it's all fine. There is no other upgrade routine for 0.61.0. Every time you restart or deploy a newer operator version, all upgrade routines are triggered again.

Member

> Your CR status will not change. If you fix this issue manually, it's all fine. There is no other upgrade routine for 0.61.0. Every time you restart or deploy a newer operator version, all upgrade routines are triggered again.

After the fix is done, will the operator upgrade the instance by bumping the OTELcol version?

Member Author

I need to double-check that, but I assume yes, once you have modified your CR.

Member

Please report back once the upgrade has been tested with a Jaeger receiver that has remote sampling enabled. We need to make sure the instance is upgraded after the CR is fixed (e.g. the extension is mapped to a different port).

Member Author

@pavolloffay I have now been able to test it locally. After a configuration update, the version is not raised automatically. To achieve this, either the operator must be restarted or the collector configuration must be deleted and recreated.

Details...

Operator log message, once the config is detected.

1.665433033457368e+09	ERROR	collector-upgrade	failed to upgrade managed otelcol instances	{"name": "simplest", "namespace": "test", "error": "jaegerremotesampling is no longer available as receiver configuration. Please use the extension instead with a different remote sampling port. See: https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/14707"}

procedure:

# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest        
NAME       MODE         VERSION   AGE
simplest   deployment   0.60.0    11m

# fix config
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml                      
opentelemetrycollector.opentelemetry.io/simplest configured

# status again
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest      
NAME       MODE         VERSION   AGE
simplest   deployment   0.60.0    12m

# delete
$ kubectl delete -n test -f v0_60_0_jaeger_cfg.yaml                
opentelemetrycollector.opentelemetry.io "simplest" deleted

# create
$ kubectl apply -n test -f v0_60_0_jaeger_cfg.yaml 
opentelemetrycollector.opentelemetry.io/simplest created

# status
$ kubectl get opentelemetrycollectors.opentelemetry.io -n test simplest 
NAME       MODE         VERSION   AGE
simplest   deployment   0.61.0    3s

Should we simply add that to the issue description, add an auto-update flag, or do you have other ideas?

Member

That is a good finding. We should definitely document in the codebase that if an error is returned during the upgrade, the version is never updated.

There are several ways we can proceed:

  • document the breaking change and how to resolve it
  • trigger the upgrade routine during the reconciliation logic

Member Author

If an update fails, we now create a log entry and an event. See: 557a3ff

"Please use the extension instead with a different remote sampling port. See: %s",
			issueID,
		)
		u.Recorder.Event(otelcol, "Error", "Upgrade", errStr)
		return nil, errors.New(errStr)
	}
	return otelcol, nil
}
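The check in `upgrade0_61_0` can be tried in isolation on an already-parsed configuration map. The sketch below uses only the standard library and is not the PR's code: `receiversWithRemoteSampling` is a hypothetical helper that collects every matching receiver name, whereas the routine above returns an error on the first match. It also illustrates that the prefix test matches named receivers such as `jaeger/second` and ignores `remote_sampling` keys on non-Jaeger receivers.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// receiversWithRemoteSampling returns the names of all Jaeger receivers in a
// parsed collector configuration that still carry a remote_sampling block.
// Hypothetical helper for illustration only.
func receiversWithRemoteSampling(cfg map[any]any) []string {
	receivers, ok := cfg["receivers"].(map[any]any)
	if !ok {
		return nil // no receivers section, nothing to check
	}
	var found []string
	for key, rc := range receivers {
		name, ok := key.(string)
		if !ok || !strings.HasPrefix(name, "jaeger") {
			continue
		}
		settings, ok := rc.(map[any]any)
		if !ok {
			continue // receiver declared without a settings map
		}
		if _, ok := settings["remote_sampling"]; ok {
			found = append(found, name)
		}
	}
	sort.Strings(found) // map iteration order is random; sort for stable output
	return found
}

func main() {
	cfg := map[any]any{
		"receivers": map[any]any{
			"jaeger": map[any]any{
				"protocols":       map[any]any{"grpc": nil},
				"remote_sampling": map[any]any{"strategy_file": "/etc/strategy.json"},
			},
			"jaeger/second": map[any]any{"protocols": map[any]any{"grpc": nil}},
			"otlp":          map[any]any{"remote_sampling": map[any]any{}},
		},
	}
	fmt.Println(receiversWithRemoteSampling(cfg)) // prints [jaeger]
}
```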
86 changes: 86 additions & 0 deletions pkg/collector/upgrade/v0_61_0_test.go
@@ -0,0 +1,86 @@
// Copyright The OpenTelemetry Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package upgrade_test

import (
	"context"
	_ "embed"
	"testing"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/record"

	"github.com/open-telemetry/opentelemetry-operator/apis/v1alpha1"
	"github.com/open-telemetry/opentelemetry-operator/internal/version"
	"github.com/open-telemetry/opentelemetry-operator/pkg/collector/upgrade"
)

var (
	//go:embed testdata/v0_61_0-valid.yaml
	valid string
	//go:embed testdata/v0_61_0-invalid.yaml
	invalid string
)

func Test0_61_0Upgrade(t *testing.T) {

	collectorInstance := v1alpha1.OpenTelemetryCollector{
		TypeMeta: metav1.TypeMeta{
			Kind:       "OpenTelemetryCollector",
			APIVersion: "v1alpha1",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "otel-my-instance",
			Namespace: "somewhere",
		},
		Spec: v1alpha1.OpenTelemetryCollectorSpec{},
	}

	tt := []struct {
		name      string
		config    string
		expectErr bool
	}{
		{
			name:      "no remote sampling config", // valid
			config:    valid,
			expectErr: false,
		},
		{
			name:      "has remote sampling config", // invalid
			config:    invalid,
			expectErr: true,
		},
	}

	for _, tc := range tt {
		t.Run(tc.name, func(t *testing.T) {
			collectorInstance.Spec.Config = tc.config
			collectorInstance.Status.Version = "0.60.0"

			versionUpgrade := &upgrade.VersionUpgrade{
				Log:      logger,
				Version:  version.Get(),
				Client:   k8sClient,
				Recorder: record.NewFakeRecorder(upgrade.RecordBufferSize),
			}

			_, err := versionUpgrade.ManagedInstance(context.Background(), collectorInstance)
			if (err != nil) != tc.expectErr {
				t.Errorf("expect err: %t but got: %v", tc.expectErr, err)
			}
		})
	}
}
4 changes: 4 additions & 0 deletions pkg/collector/upgrade/versions.go
@@ -81,6 +81,10 @@ var (
 			Version: *semver.MustParse("0.57.2"),
 			upgrade: upgrade0_57_2,
 		},
+		{
+			Version: *semver.MustParse("0.61.0"),
+			upgrade: upgrade0_61_0,
+		},
 	}
 
 	// Latest represents the latest version that we need to upgrade. This is not necessarily the latest known version.
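To illustrate why a failing routine leaves the reported version at 0.60.0, as observed in the review thread, here is a rough standard-library model of an ordered upgrade table. It is a sketch under assumptions, not the operator's implementation: `version`, `instance`, `step`, `applyUpgrades`, and `demoSteps` are all hypothetical, and the real code uses semver values and the `OpenTelemetryCollector` type.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// version is a simplified stand-in for a semantic version (major, minor, patch).
type version [3]int

func (v version) greaterThan(o version) bool {
	for i := range v {
		if v[i] != o[i] {
			return v[i] > o[i]
		}
	}
	return false
}

func (v version) String() string { return fmt.Sprintf("%d.%d.%d", v[0], v[1], v[2]) }

// instance is a hypothetical, reduced model of a collector custom resource.
type instance struct {
	Config  string
	Version version
}

// step pairs a target version with the routine that migrates to it.
type step struct {
	target  version
	upgrade func(instance) (instance, error)
}

// applyUpgrades runs every step newer than the instance's version, in order.
// On the first error it returns the original instance unchanged, so the
// recorded version stays where it was.
func applyUpgrades(in instance, steps []step) (instance, error) {
	out := in
	for _, s := range steps {
		if !s.target.greaterThan(out.Version) {
			continue
		}
		next, err := s.upgrade(out)
		if err != nil {
			return in, fmt.Errorf("upgrade to %s failed: %w", s.target, err)
		}
		next.Version = s.target
		out = next
	}
	return out, nil
}

// demoSteps builds a two-entry table; the second entry rejects configurations
// that still mention remote_sampling (a crude string check, for the demo only).
func demoSteps() []step {
	noop := func(in instance) (instance, error) { return in, nil }
	reject := func(in instance) (instance, error) {
		if strings.Contains(in.Config, "remote_sampling") {
			return in, errors.New("remote_sampling must be migrated manually")
		}
		return in, nil
	}
	return []step{
		{target: version{0, 57, 2}, upgrade: noop},
		{target: version{0, 61, 0}, upgrade: reject},
	}
}

func main() {
	broken := instance{Config: "remote_sampling: {}", Version: version{0, 60, 0}}
	got, err := applyUpgrades(broken, demoSteps())
	fmt.Println(got.Version, err)

	fixed := instance{Config: "", Version: version{0, 60, 0}}
	got, err = applyUpgrades(fixed, demoSteps())
	fmt.Println(got.Version, err)
}
```

In this model the version only advances when the table is run again against a corrected configuration, which matches the thread's observation that an operator restart or re-creation of the CR is needed.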