
Issues during upgrade 4.5 to 4.6 #397

Closed · llomgui opened this issue Nov 29, 2020 · 13 comments
Labels: lifecycle/rotten, triage/needs-information

Comments

@llomgui

llomgui commented Nov 29, 2020

Hello,

I tried to update 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2020-11-27-200126.
I had multiple issues due to being behind a proxy.

First, I allowed storage.googleapis.com through the proxy so the 4.6 image could be validated. The error was: Unable to apply 4.6.0-0.okd-2020-11-27-200126: the image may not be safe to use.

Then I had an issue due to rpm-ostree getting a timeout; it was not using the proxy.
I followed coreos/rpm-ostree#762 (comment) to solve it.
Is it possible to manage this via Ignition files?
Can we specify a mirror to use instead of trying them all twice (ports 80/443)?
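
For anyone else behind a proxy, here is a minimal sketch of the drop-in from coreos/rpm-ostree#762, assuming a hypothetical proxy at http://proxy.example.com:3128 (replace with your own):

# systemd drop-in so rpm-ostreed picks up the proxy environment
sudo mkdir -p /etc/systemd/system/rpm-ostreed.service.d
sudo tee /etc/systemd/system/rpm-ostreed.service.d/proxy.conf <<'EOF'
[Service]
Environment="http_proxy=http://proxy.example.com:3128"
Environment="https_proxy=http://proxy.example.com:3128"
EOF
sudo systemctl daemon-reload
sudo systemctl restart rpm-ostreed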

The last issue was GCP rewriting the hostname.
Fix: sudo hostnamectl set-hostname okdmaster1a.example.com
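
If the hostname keeps getting rewritten, a rough check, assuming the culprit is a unit named gcp-hostname.service (a guess; the exact unit name may differ on your release):

systemctl status gcp-hostname.service     # hypothetical unit name; check whether it ran or failed
sudo systemctl mask gcp-hostname.service  # keep it from rewriting the hostname on the next boot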

@thurcombe

Confirmed: 4.5 to 4.6 is a problem with rpm-ostree, so a workaround is required. But I guess once your cluster is at 4.6 the Fedora repos should be disabled, and therefore this won't be a problem going forward.

@vrutkovs
Member

rpm-ostree getting timeout

Which timeout is it? We're disabling all available repos before proceeding to update, as all necessary RPMs are already included in machine-os-content. Please collect a must-gather.
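
For reference, a must-gather can be collected and attached with something like the following (the destination directory name is just an example):

oc adm must-gather --dest-dir=./must-gather-upgrade      # dumps cluster state into the given directory
tar czf must-gather-upgrade.tar.gz must-gather-upgrade   # archive it for attaching to the issue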

@vrutkovs added the triage/needs-information label on Nov 29, 2020
@thurcombe

thurcombe commented Nov 29, 2020

My 4.5.0-0.okd-2020-10-15-235428 to 4.6.0-0.okd-2020-11-27-200126 upgrade failed with the same issue. I added the drop-in to set the proxy environment variables, which allowed my upgrade to proceed.

Just poked at one of my masters post-upgrade; these repos are still enabled:

sh-5.0# grep enabled=1 /etc/yum.repos.d/*
/etc/yum.repos.d/fedora-cisco-openh264.repo:enabled=1
/etc/yum.repos.d/fedora-updates-archive.repo:enabled=1
/etc/yum.repos.d/fedora-updates.repo:enabled=1
/etc/yum.repos.d/fedora.repo:enabled=1

As per our discussion in #389, this didn't seem to be an issue on a fresh 4.6 UPI install.

Same check on my 4.6 cluster that was built from scratch:

sh-5.0# grep enabled=1 /etc/yum.repos.d/*
sh-5.0# 

@bobby0724

Is the 4.5 to 4.6 upgrade really supported?

@vrutkovs
Copy link
Member

Just poked at one of my masters post-upgrade; these repos are still enabled:

Right, that seems to be a bug: OKD doesn't need enabled repos when updating, as it ships all RPMs with it. This is resolved during a fresh install (we disable all Fedora repos by default), but it is still open on update, and the proxy env is not set there. @thurcombe, mind filing a separate bug for that?
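
Until that is fixed, a rough per-node workaround (my own assumption, not an official fix) is to flip the leftover repos off by hand on each upgraded node:

# disable any Fedora repos left enabled after the upgrade
sudo sed -i 's/^enabled=1/enabled=0/' /etc/yum.repos.d/*.repo
grep enabled=1 /etc/yum.repos.d/*    # should now return nothing, matching a fresh 4.6 install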

@vrutkovs
Member

First, I allowed storage.googleapis.com through the proxy so the 4.6 image could be validated. The error was: Unable to apply 4.6.0-0.okd-2020-11-27-200126: the image may not be safe to use.

That's expected: CVO checks image signatures stored on GCS. Seems to be a docs bug.
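
For anyone checking whether they hit the same signature-verification failure, the message shows up on the ClusterVersion object; a quick sketch, assuming the default object name version:

oc adm upgrade                                                      # shows the current update and any error
oc get clusterversion version -o jsonpath='{.status.conditions}'    # raw conditions, where the failure message appears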

The last issue was GCP rewriting the hostname.
Fix: sudo hostnamectl set-hostname okdmaster1a.example.com

Is the cluster installed on GCP, or is this service being run mistakenly (#396)?

@thurcombe

Is the cluster installed on GCP or this service being run mistakenly (#396)?

Just for info: after our previous discussion regarding a fresh 4.6 install, I noted that gcp hostname was listed as a failed unit in my UPI. I figured it was a non-issue but am mentioning it here in case it helps. I'll raise a new defect for the repo problem.

@llomgui
Author

llomgui commented Nov 30, 2020

Another issue is MountVolume.SetUp failed for volume "var-lib-tuned-profiles-data" : stat /var/lib/kubelet/pods/687ac4e3-cb54-41d2-a31c-6c7d36d4be74/volumes/kubernetes.io~configmap/var-lib-tuned-profiles-data: no such file or directory

Apparently the tuned pods cannot mount their configmap volume after an upgrade.
The configmap is the default one, empty.

The solution is to delete all the pods; then they will be running again.
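
For completeness, a sketch of that cleanup, assuming the tuned pods live in the openshift-cluster-node-tuning-operator namespace:

oc -n openshift-cluster-node-tuning-operator delete pods --all    # the operator recreates them
oc -n openshift-cluster-node-tuning-operator get pods             # they should come back Running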

@giatule

giatule commented Jan 5, 2021

I got the same issues and tried to delete all pods, but the openshift-console pods are still in CrashLoopBackOff.

(screenshots attached)

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Apr 5, 2021
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on May 5, 2021
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci

openshift-ci bot commented Jun 5, 2021

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci bot closed this as completed on Jun 5, 2021