Migration to v24.1 issues #694

Closed
adrienyhuel opened this issue Feb 6, 2024 · 9 comments · Fixed by #701

Comments

@adrienyhuel

adrienyhuel commented Feb 6, 2024

Hello,

We had some Vertica clusters on OpenShift, running version v23.x, and we wanted to update them to v24.1.

We encountered some issues that I want to share here:

  • The operator is installed with OLM (via the OpenShift OperatorHub), and we had to uninstall operator v1 first in every namespace, then install v2 cluster-wide.
    On OpenShift, the operator installs in the "openshift-operators" namespace, and the 3 services created (controller-manager, metrics and webhooks) use the label selector "control-plane=controller-manager".
    However, some other pods (like ArgoCD) also use this label, so the services route to both the Vertica and ArgoCD controllers, which results in errors in the Vertica controller.
    I had to manually update those services to add the label selector "app.kubernetes.io/name=vertica-db-operator" (see the Service sketch after this list).

  • Our clusters were installed with httpServerMode=Disabled, which meant that Vertica never created HTTPS certs for the HTTP server.
    With v23 this is not a problem because the controller uses admintools, but with v24 the controller expects the HTTP server to be running in order to send commands and to check Vertica liveness (in the K8s probes).
    I had to first enable the HTTP server with v23 and operator v1 to make it generate certificates; only then could I update to operator v2, and then to Vertica v24 (the CR change is sketched after this list).
    Could the operator generate the certs and the httpstls.json file if they are missing during migration? Currently I can see in the code that it only generates them during DB bootstrap, not during migration.

  • In order to run a Vertica v23 cluster using the v1beta1 API with operator v2, the operator converts the CR to v1 and generates a ServiceAccount with a randomized name, since we can't set ServiceAccountName in the v1beta1 CRD.
    On OpenShift, we had to wait for the operator to generate the ServiceAccount, then create a RoleBinding to give the privileged SCC to the Vertica pods.
    So this can't be done with infrastructure-as-code tools like ArgoCD, as the ServiceAccount name is not predictable.
    Can the v1beta1 CRD be updated to add ServiceAccountName, so it can be taken into account during conversion?
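For reference, this is roughly what the patched service selector looks like after my manual fix (a sketch, not the exact manifest OLM ships; the service name and webhook port are illustrative):

```
# Sketch of one of the manually patched Services (names are illustrative).
# Adding the operator-specific label keeps other pods that also carry
# control-plane=controller-manager (e.g. ArgoCD) out of the endpoints.
apiVersion: v1
kind: Service
metadata:
  name: verticadb-operator-webhook-service   # assumed name; check your install
  namespace: openshift-operators
spec:
  selector:
    control-plane: controller-manager
    app.kubernetes.io/name: vertica-db-operator
  ports:
    - port: 443
      targetPort: 9443   # typical operator-sdk webhook port; adjust as needed
```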
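And this is the kind of change we applied to the v1beta1 CR while still on operator v1, so that the HTTP server (and its certificates) existed before the migration (a minimal sketch; I'm assuming the field sits directly under spec as it does in our manifests, with all other required fields omitted):

```
# Sketch: enable the embedded HTTP server on a v1beta1 VerticaDB so that
# its certificates are generated before moving to operator v2 / Vertica v24.
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: vertdb              # illustrative name
spec:
  httpServerMode: Enabled   # was Disabled in our clusters
```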

And I have one last question:

  • Is the config directory (containing the certs for the HTTPS server) backed up to communal storage?
    Because if we lose a node and its local volume, the config dir would be lost, and I don't think the operator would recreate the httpstls.json file if it is missing.
@adrienyhuel
Author

For certificates, I tested creating a brand new Vertica cluster, and it seems Vertica no longer needs httpstls.json, because DB creation generates certs inside the database catalog, like the commands described here:
https://docs.vertica.com/24.1.x/en/security-and-authentication/tls-protocol/tls-overview/generating-tls-certificates-and-keys/

So I assume we had better run those commands on Vertica v23 before the migration, to make the cluster fully compatible with v24.

@spilchen
Collaborator

spilchen commented Feb 7, 2024

Thank you for your valuable feedback regarding the upgrade process. I appreciate you taking the time to bring this issue to our attention.

> The operator is installed with OLM (via the OpenShift OperatorHub), and we had to uninstall operator v1 first in every namespace, then install v2 cluster-wide.

Unfortunately, we lacked an effective method to migrate the operator from namespace scope to cluster scope. In retrospect, starting with a cluster scope would have been a wiser decision. Operationally, cluster-scoped deployment offers greater convenience. This change had been on our wishlist for some time, but it became imperative when we altered the API version of our CRs. The conversion webhook, which handles the conversion from v1beta1 to v1, is configured in a cluster-scoped resource (CustomResourceDefinition). I intend to enhance the documentation around this process.

> On OpenShift, the operator installs in the "openshift-operators" namespace, and the 3 services created (controller-manager, metrics and webhooks) use the label selector "control-plane=controller-manager".
> However, some other pods (like ArgoCD) also use this label, so the services route to both the Vertica and ArgoCD controllers, which results in errors in the Vertica controller.
> I had to manually update those services to add the label selector "app.kubernetes.io/name=vertica-db-operator"

I will look at using a different label so that you don't have to change the label selector manually.

> In order to run a Vertica v23 cluster using the v1beta1 API with operator v2, the operator converts the CR to v1 and generates a ServiceAccount with a randomized name, since we can't set ServiceAccountName in the v1beta1 CRD.
> On OpenShift, we had to wait for the operator to generate the ServiceAccount, then create a RoleBinding to give the privileged SCC to the Vertica pods.
> So this can't be done with infrastructure-as-code tools like ArgoCD, as the ServiceAccount name is not predictable.
> Can the v1beta1 CRD be updated to add ServiceAccountName, so it can be taken into account during conversion?

The ServiceAccountName was added to the v1beta1 CRD. Can you try using that? You can see the v1beta1 version of any vdb by specifying the long form of the CR. This will be the CR after it has been sent through the conversion webhook:

```
kubectl get verticadb.v1beta1.vertica.com -o yaml
```

As for your questions around the HTTPS service, I need to do more investigation into the scenarios you brought up.

@adrienyhuel
Author

Thank you for your answers!

> The ServiceAccountName was added to the v1beta1 CRD. Can you try using that? You can see the v1beta1 version of any vdb by specifying the long form of the CR. This will be the CR after it has been sent through the conversion webhook: `kubectl get verticadb.v1beta1.vertica.com -o yaml`

I just checked: we use "apiVersion: vertica.com/v1beta1" and it doesn't recognize the ServiceAccountName field (moreover, it is not present in the CRD documentation for Vertica v23.4).
I only see this field in the function that converts the CR from v1beta1 to v1.

@spilchen
Collaborator

spilchen commented Feb 7, 2024

> Thank you for your answers!

> > The ServiceAccountName was added to the v1beta1 CRD. Can you try using that? You can see the v1beta1 version of any vdb by specifying the long form of the CR. This will be the CR after it has been sent through the conversion webhook: `kubectl get verticadb.v1beta1.vertica.com -o yaml`

> I just checked: we use "apiVersion: vertica.com/v1beta1" and it doesn't recognize the ServiceAccountName field (moreover, it is not present in the CRD documentation for Vertica v23.4). I only see this field in the function that converts the CR from v1beta1 to v1.

Were you running the new operator when you checked? It's only present once you move to the 2.0.x operator (in both v1 and v1beta1 versions of the CR). The old 1.x.x operator won't have it. Note, you can run the v2 operator and continue to stay on 23.4.0 for the server.
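A minimal sketch of what that would look like in a v1beta1 CR (communal and storage settings omitted; I'm assuming the field name spec.serviceAccountName, matching the v1 API):

```
# Sketch: pin the ServiceAccount on a v1beta1 VerticaDB so its name is
# predictable for GitOps tooling (other required fields omitted).
apiVersion: vertica.com/v1beta1
kind: VerticaDB
metadata:
  name: vertdb                     # illustrative name
spec:
  serviceAccountName: vertica-sa   # pre-created ServiceAccount
```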

@adrienyhuel
Author

> Were you running the new operator when you checked? It's only present once you move to the 2.0.x operator (in both v1 and v1beta1 versions of the CR). The old 1.x.x operator won't have it. Note, you can run the v2 operator and continue to stay on 23.4.0 for the server.

Yes, the problem is that we can't set serviceAccountName with operator v1.
And the v1beta1 CR doesn't have this field.
When we install operator v2, it immediately picks up the installed v1beta1 CR and converts it to v1.
But, as serviceAccountName is not defined, it generates a random one during conversion.

I had to update the CR after installing operator v2, so that it changes the service account from the generated one to our custom one.

Until I updated the CR, the Vertica pods didn't start because the generated service account doesn't grant the privileged or anyuid SCC.

I know you can't do anything about that, unless:

  • the operator provides a way to create a RoleBinding (the one we currently create by hand is sketched below)
  • an update to the v1beta1 CRD is released with the v1 operator
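For context, the RoleBinding our CI/CD creates looks roughly like this (a sketch; the ClusterRole name is the one recent OpenShift releases ship for the privileged SCC, and the ServiceAccount/namespace names are illustrative — on older OpenShift versions `oc adm policy add-scc-to-user privileged -z <sa>` achieves the same result):

```
# Sketch of the RoleBinding that lets the Vertica pods' ServiceAccount
# use the privileged SCC (swap in anyuid if that is sufficient).
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vertica-privileged-scc
  namespace: vertica              # namespace of the VerticaDB; illustrative
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:openshift:scc:privileged
subjects:
  - kind: ServiceAccount
    name: vertica-sa              # must match the ServiceAccount used by the pods
    namespace: vertica
```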

@spilchen
Collaborator

spilchen commented Feb 8, 2024

Okay, I understand your situation. Is this severely blocking you? In 24.2, which is coming out in April, we will be able to run Vertica in OpenShift with the restricted SCC, so you shouldn't have to create a role in the future.

@adrienyhuel
Author

> Okay, I understand your situation. Is this severely blocking you? In 24.2, which is coming out in April, we will be able to run Vertica in OpenShift with the restricted SCC, so you shouldn't have to create a role in the future.

No, it is not really blocking us. I have already upgraded 4 of our 6 clusters; I only have the 2 production clusters left to do.
I just have to manually update the CR after updating the operator, to force the service account name, and let our CI/CD create the RoleBinding.

Anyway thank you for looking at my issues :)

@spilchen
Collaborator

> Is the config directory (containing the certs for the HTTPS server) backed up to communal storage?
> Because if we lose a node and its local volume, the config dir would be lost, and I don't think the operator would recreate the httpstls.json file if it is missing.

I created a PR to fix this issue (#698). It will be in the next version of the operator.

@spilchen
Collaborator

> On OpenShift, the operator installs in the "openshift-operators" namespace, and the 3 services created (controller-manager, metrics and webhooks) use the label selector "control-plane=controller-manager".
> However, some other pods (like ArgoCD) also use this label, so the services route to both the Vertica and ArgoCD controllers, which results in errors in the Vertica controller.
> I had to manually update those services to add the label selector "app.kubernetes.io/name=vertica-db-operator"

This will be addressed in issue #701

spilchen pushed a commit that referenced this issue Feb 14, 2024
Any operator that is built with the operator-sdk framework will have
default selector labels added for the operator like this:
```
  control-plane: controller-manager
```

When the operator is deployed in the same namespace as another operator
that also continues to use the default label, the service object for the
operator fails to work correctly. The service object is used by the
webhook, so it can route webhook requests to the wrong pod.

This change is to use the following label instead:
```
  control-plane: verticadb-operator
```

I also took the opportunity to rename the operator objects. The
deployment object was renamed from
`verticadb-operator-controller-manager` to `verticadb-operator-manager`.

Closes #694