[BUG] Unable to use terraform-provider-rancher2 to provision RKE2 cluster well on Rancher v2.9.0-alpha5 #6098

Closed
TachunLin opened this issue Jul 1, 2024 · 10 comments
Labels
area/terraform-provider-rancher2 Terraform Provider for Rancher v2 and Harvester node driver kind/bug Issues that are defects reported by users or that we know have reached a real release reproduce/always Reproducible 100% of the time severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)

@TachunLin

TachunLin commented Jul 1, 2024

Describe the bug

With Harvester v1.3.1 imported into Rancher v2.9.0-alpha5, we tried to provision a v1.29.5 RKE2 cluster with terraform-provider-rancher2 (3.1.1 and latest) in two ways:

  1. Backend automation test script
  2. Manual execution

In both cases, the RKE2 cluster gets stuck in "waiting for cluster agent to connect"
image

image

Many pods remain in the Pending state:

ubuntu@rke2-terraform-pool1-ea67ef05-hm945:/etc/rancher/rke2$ sudo kubectl --kubeconfig=rke2.yaml get pods -A
NAMESPACE         NAME                                                          READY   STATUS      RESTARTS   AGE
calico-system     calico-kube-controllers-86bd9c4fb4-m4bl8                      0/1     Pending     0          3h59m
calico-system     calico-node-q79j7                                             0/1     Running     0          3h59m
calico-system     calico-typha-8d75c767d-phmcc                                  0/1     Pending     0          3h59m
cattle-system     cattle-cluster-agent-78c64dbc79-7kxc2                         0/1     Pending     0          4h
kube-system       etcd-rke2-terraform-pool1-ea67ef05-hm945                      1/1     Running     0          3h59m
kube-system       harvester-cloud-provider-86bb5648b4-4xrlk                     1/1     Running     0          3h59m
kube-system       harvester-csi-driver-controllers-56877cb574-9cdkk             0/3     Pending     0          3h59m
kube-system       harvester-csi-driver-controllers-56877cb574-hdfdc             0/3     Pending     0          3h59m
kube-system       harvester-csi-driver-controllers-56877cb574-xbbwv             0/3     Pending     0          3h59m
kube-system       helm-install-harvester-cloud-provider-sr5pk                   0/1     Completed   0          4h
kube-system       helm-install-harvester-csi-driver-mk67w                       0/1     Completed   0          4h
kube-system       helm-install-rke2-calico-crd-xp99l                            0/1     Completed   0          4h
kube-system       helm-install-rke2-calico-k9j6d                                0/1     Completed   2          4h
kube-system       helm-install-rke2-coredns-xt5jr                               0/1     Completed   0          4h
kube-system       helm-install-rke2-ingress-nginx-xpqhd                         0/1     Pending     0          4h
kube-system       helm-install-rke2-metrics-server-8wv5n                        0/1     Pending     0          4h
kube-system       helm-install-rke2-snapshot-controller-crd-vm8nq               0/1     Pending     0          4h
kube-system       helm-install-rke2-snapshot-controller-qxn92                   0/1     Pending     0          4h
kube-system       helm-install-rke2-snapshot-validation-webhook-t47sd           0/1     Pending     0          4h
kube-system       kube-apiserver-rke2-terraform-pool1-ea67ef05-hm945            1/1     Running     0          4h
kube-system       kube-controller-manager-rke2-terraform-pool1-ea67ef05-hm945   1/1     Running     0          4h
kube-system       kube-proxy-rke2-terraform-pool1-ea67ef05-hm945                1/1     Running     0          4h
kube-system       kube-scheduler-rke2-terraform-pool1-ea67ef05-hm945            1/1     Running     0          4h
kube-system       rke2-coredns-rke2-coredns-5b7d84d764-mrn7r                    0/1     Pending     0          3h59m
kube-system       rke2-coredns-rke2-coredns-autoscaler-b49765765-8jxvr          0/1     Pending     0          3h59m
tigera-operator   tigera-operator-795545875-ph9wt                               1/1     Running     0          3h59m
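
To see at a glance how many pods are stuck, the listing above can be summarized by status. This is a minimal sketch, assuming the default `kubectl get pods -A` column layout shown above (STATUS in the fourth column); the function name `count_pods_by_status` is hypothetical and not part of any tool mentioned in this issue.

```shell
#!/bin/sh
# Summarize `kubectl get pods -A` output read from stdin.
# Prints one "STATUS COUNT" line per status, sorted by status name.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
count_pods_by_status() {
  # Skip the header row (NR > 1) and tally the fourth column.
  awk 'NR > 1 && NF >= 4 { n[$4]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example, run on the RKE2 node:
#   sudo kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml get pods -A | count_pods_by_status
```

A large Pending count next to Running control-plane pods, as in the listing above, points at scheduling rather than at the API server itself.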

To Reproduce
Steps to reproduce the behavior:

Pre-requisite

  1. Use Helm to install Rancher v2.9.0-alpha5
  2. Create a v1.3.1 Harvester cluster
  3. Import Harvester to Rancher

Using automation script

  1. Clone the harvester tests repo
    git clone https://github.com/harvester/tests
    cd tests
  2. Trigger the terraform Rancher automation test suite
    tox -e py36 --result-json=test_result_terraform.json -- -l harvester_e2e_tests/integrations --json-report --html=test_report/test_result_terraform.html -m 'not delete_host and not upgrade' -k 'test_z_terraform_rancher'
    
  3. Check the RKE2 cluster state

Using manual install

  1. Prepare the terraform rancher provider.tf file
  2. Get the kubeconfig file of Harvester and place in the local path
  3. Update the provider.tf file with the correct value
    providers.tf.txt
  4. Execute the following commands to have terraform provision the RKE2 cluster from Rancher
    $ terraform init
    $ terraform plan
    $ terraform apply
    
  5. Check the RKE2 cluster state

Expected behavior

The downstream v1.29.5 RKE2 guest cluster should provision successfully using terraform-provider-rancher2.

Support bundle

supportbundle_a94a8926-f7b3-4f92-937f-4cbcb56bc9a1_2024-07-01T11-54-02Z.zip

Environment

  • Harvester ISO version: v1.3.1
  • Rancher version: v2.9.0-alpha5
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): Single node on a bare-metal machine
  • RKE2 version: v1.29.5+rke2r1
  • terraform-provider-rancher2: v3.3.1 and latest

Additional context
Provisioning the v1.29.5 RKE2 cluster through the regular steps in the Rancher UI works.

@TachunLin TachunLin added kind/bug Issues that are defects reported by users or that we know have reached a real release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) reproduce/always Reproducible 100% of the time area/terraform-provider-rancher2 Terraform Provider for Rancher v2 and Harvester node driver labels Jul 1, 2024
@khushboo-rancher

@TachunLin Could you try with the latest terraform provider version, v4.1.0?

@bk201 bk201 added this to the v1.4.0 milestone Jul 2, 2024
@TachunLin
Author

I tried terraform provider rancher2 v4.1.0, both manually and through the automation.

Both runs also hit "waiting for cluster agent to connect" while provisioning the v1.29.5 RKE2 cluster.
image

  • For manual execution, I use the following provider.tf file

  • For backend automation, I updated the tests/config.yml file to use terraform provider rancher2 v4.1.0

    # script location for terraform related test cases
    terraform-scripts-location: 'terraform_test_artifacts'
    # Rancher provider version, leave empty for the latest version. e.g. '' or '3.1.1'.
    terraform-provider-rancher: '4.1.0'
    
    

@TachunLin
Author

I tried a different way of testing RKE2 cluster provisioning with the terraform provider.

Compared with the previous test, this time I removed the cloud provider part from the providers.tf file:

    # ########## Generate Manually ##########
    # gen_harvester_cloud_provider_kubeconfig.sh
    # #######################################
    machine_selector_config {
      config = {
        cloud-provider-config = file("${path.module}/harvester131-kubeconfig")
        cloud-provider-name = "harvester"
      }
    }

This time the RKE2 v1.29.5 cluster provisions successfully on Rancher v2.9.0-alpha5.
image

The main.tf file that succeeded is attached for reference:
main.tf.txt

I think the problem may be that the cloud-provider-config file was not generated correctly.
I will investigate further and try again.

@khushboo-rancher

@TachunLin If it is working with v4.1.0, let's close this. Also, if needed update the automation script.

@FrankYang0529
Member

Hi David, thanks for reporting the issue. I reopened the issue because there is a bug in harvester-cloud-provider. We should be able to provision an RKE2 server without modifying providers.tf. Thank you.

Using manual install

  1. Prepare the terraform rancher provider.tf file
  2. Get the kubeconfig file of Harvester and place in the local path
  3. Update the provider.tf file with the correct value
    providers.tf.txt
  4. Execute the following commands to have terraform provision the RKE2 cluster from Rancher
    $ terraform init
    $ terraform plan
    $ terraform apply
    
  5. Check the RKE2 cluster state

@harvesterhci-io-github-bot
Collaborator

harvesterhci-io-github-bot commented Jul 4, 2024

Pre Ready-For-Testing Checklist

  • If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted?

  • Where is the reproduce steps/test steps documented?
    The reproduce steps/test steps are at: Use default namespace if it's empty cloud-provider-harvester#43

  • Is there a workaround for the issue? If so, where is it documented?

  • Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)?
    The PR is at: Use default namespace if it's empty cloud-provider-harvester#43

  • If labeled: area/ui Has the UI issue filed or ready to be merged?

  • If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged?

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility?
    The compatibility issue is filed at:

@harvesterhci-io-github-bot
Collaborator

Automation e2e test issue: harvester/tests#1359

@TachunLin
Author

Thanks @FrankYang0529 for the information.

I just tried provisioning the RKE2 cluster manually with terraform rancher2 v4.1.0.

We used the gen_harvester_cloud_provider_kubeconfig.sh script to generate the cloud config file:

#!/bin/bash

RANCHER_SERVER_URL="https://rancher.192.168.122.162.sslip.io"
RANCHER_ACCESS_KEY="xxx"
RANCHER_SECRET_KEY="xxx"
# Refer to https://192.168.123.1/dashboard/harvester/c/c-m-f9zsbp9t/harvesterhci.io.dashboard#vm
HARVESTER_CLUSTER_ID="c-m-9dvfsnj8"
CLUSTER_NAME="harv-local"
curl -k -X POST ${RANCHER_SERVER_URL}/k8s/clusters/${HARVESTER_CLUSTER_ID}/v1/harvester/kubeconfig \
   -H 'Content-Type: application/json' \
   -u ${RANCHER_ACCESS_KEY}:${RANCHER_SECRET_KEY} \
   -d '{"clusterRoleName": "harvesterhci.io:cloudprovider", "namespace": "default", "serviceAccountName": "'${CLUSTER_NAME}'"}' | xargs | sed 's/\\n/\n/g' > ${CLUSTER_NAME}-kubeconfig
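
The fix linked in this thread (cloud-provider-harvester#43, "Use default namespace if it's empty") suggests that the relevant difference between kubeconfig files is whether a namespace is set. As a rough pre-flight check before running terraform, the file can be inspected for a non-empty `namespace:` entry. This is a sketch under that assumption only; the helper name `kubeconfig_has_namespace` is hypothetical, and the check is a plain text match rather than a real YAML parse.

```shell
#!/bin/sh
# Return success if the given kubeconfig file contains a non-empty "namespace:" value.
# Assumption: an empty or missing namespace is what triggers the cloud provider bug.
kubeconfig_has_namespace() {
  # Extract everything after "namespace:", then drop quotes and whitespace.
  ns=$(sed -n 's/^[[:space:]]*namespace:[[:space:]]*//p' "$1" | tr -d "\"'" | tr -d '[:space:]')
  [ -n "$ns" ]
}

# Example, using the file name produced by the script above:
#   kubeconfig_has_namespace harv-local-kubeconfig || echo "no namespace set in kubeconfig"
```

A kubeconfig that fails this check is not necessarily wrong; with the fixed cloud provider image, an empty namespace is expected to fall back to `default`.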

We then used the correct provider tf file content, attached here:
main.tf.txt

This time we can also include the cloud provider and correctly provision the RKE2 cluster on v4.1.0

image

I will check again later with the existing automation script for a cross-comparison.

We also need to test again once the PR that handles the empty namespace has been merged.

@TachunLin
Author

Double checked on Rancher v2.8.5 with Harvester v1.3.1

The backend automation script correctly provisions the v1.28.10 RKE2 cluster.
image

@TachunLin TachunLin self-assigned this Jul 12, 2024
@TachunLin
Author

Verified fixed on Harvester master-8c709ba-head with Rancher v2.9-alpha5, using the harvester-cloud-provider image rancher/harvester-cloud-provider:master-head. Closing this issue.

Result

$\color{green}{\textsf{PASS}}$ The Harvester kubeconfig can be used with terraform provider rancher2 to provision an RKE2 cluster correctly
  1. We use a provider file with only the kubeconfig file generated from the Harvester cluster.

  2. When we use terraform-provider-rancher2 4.1.0 to provision the RKE2 cluster, it gets stuck in waiting for cluster to connect
    image

  3. We then change the harvester-cloud-provider container image to rancher/harvester-cloud-provider:master-head

  4. Provisioning continues and the RKE2 cluster reaches the Running state
    image

Test Information

  • Test Environment: Single-node Harvester on a local KVM machine
  • Harvester version: master-8c709ba-head (24/07/14)
  • Rancher version: v2.9-alpha5
  • Harvester cloud provider: master-head

Verify Steps

Use terraform rancher2 to provision RKE2 cluster
  1. Prepare Harvester imported in Rancher

  2. Get the Harvester kubeconfig file

  3. Prepare the provider.tf file and set machine_selector_config.cloud-provider-config to the kubeconfig file path

  4. Modify the necessary settings in the provider file

  5. Execute the following commands to have terraform provision the RKE2 cluster from Rancher

    $ terraform init
    $ terraform plan
    $ terraform apply
    
    
  6. Wait until the RKE2 cluster gets stuck in waiting for cluster to connect

  7. Access the RKE2 cluster

  8. Use the following command to install k9s
    ```
    curl -kL https://github.com/derailed/k9s/releases/download/v0.31.7/k9s_Linux_amd64.tar.gz > k9s.tar.gz
    tar -zxvf k9s.tar.gz
    sudo mv k9s /usr/local/bin/

    mkdir ~/.kube
    sudo cp /etc/rancher/rke2/rke2.yaml ~/.kube/config
    sudo chmod 444 ~/.kube/config
    export PATH="$PATH:/var/lib/rancher/rke2/bin"
    ```

  9. Find the harvester-cloud-provider deployment

  10. Change the container image to rancher/harvester-cloud-provider:master-head

```
      containers:
      - args:
        - --cloud-config=/etc/kubernetes/cloud-config
        command:
        - harvester-cloud-provider
        image: rancher/harvester-cloud-provider:master-head
        imagePullPolicy: IfNotPresent
        name: harvester-cloud-provider
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /etc/kubernetes/cloud-config

```
  11. Delete the pending pod of the Harvester cloud provider

  12. Check that all pods are created and running well
    image

  13. Check that the RKE2 cluster is provisioned and in the Running state
    image
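
The image change in the verify steps can also be made from the command line instead of k9s. This is a sketch only. It assumes the deployment is named `harvester-cloud-provider` and lives in `kube-system` (inferred from the pod listing earlier in this issue, not confirmed), and it takes the container name from the YAML snippet above. The function name `patch_cloud_provider_image` and the 300s timeout are arbitrary choices for illustration.

```shell
#!/bin/sh
# Assumed values: namespace and deployment name are inferred from the pod listing.
NAMESPACE="kube-system"
DEPLOYMENT="harvester-cloud-provider"
CONTAINER="harvester-cloud-provider"
IMAGE="rancher/harvester-cloud-provider:master-head"

patch_cloud_provider_image() {
  # Point the container at the new image, then wait for the rollout to finish.
  kubectl -n "$NAMESPACE" set image "deployment/$DEPLOYMENT" "$CONTAINER=$IMAGE" &&
    kubectl -n "$NAMESPACE" rollout status "deployment/$DEPLOYMENT" --timeout=300s
}

# Run on the RKE2 node after the kubeconfig and PATH setup from the k9s step:
#   patch_cloud_provider_image
```

Because `set image` triggers a new rollout, the replacement pod is created by the deployment itself; a pod that stays Pending may still need to be deleted by hand, as in the steps above.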
