Add more retries to resource group deletion #5537

Closed
ocofaigh opened this issue Jul 30, 2024 · 8 comments
Labels
service/Resource Management: Issues related to Resource Manager or Resource controller

Comments

@ocofaigh
Contributor

ocofaigh commented Jul 30, 2024

A common use case is to provision a resource group and an OCP VPC cluster as part of the same Terraform configuration.
When you provision an OCP VPC cluster, it automatically provisions a VPC load balancer. Terraform does not know about this load balancer (it's not in the state file).
So when you run a terraform destroy, it almost always fails on the first attempt with the error:

 2024/07/22 13:36:19 Terraform destroy |     "Result": {
 2024/07/22 13:36:19 Terraform destroy |         "errors": [
 2024/07/22 13:36:19 Terraform destroy |             {
 2024/07/22 13:36:19 Terraform destroy |                 "code": "NOT_EMPTY",
 2024/07/22 13:36:19 Terraform destroy |                 "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/07/22 13:36:19 Terraform destroy |                 "more_info": "n/a"
 2024/07/22 13:36:19 Terraform destroy |             }
 2024/07/22 13:36:19 Terraform destroy |         ],

By running the command ibmcloud resource service-instances --type all -g <resource-group>, I can confirm that the group does indeed still contain a VPC load balancer, for example:

[
  {
    "guid": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "id": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "created_at": "2024-07-29T15:38:22Z",
    "updated_at": "2024-07-29T15:38:22Z",
    "deleted_at": null,
    "name": "nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "region_id": "us-south",
    "account_id": "abac0df06b644a9cabc6e44f55b3880e",
    "resource_plan_id": "containers.kubernetes.multizone.load.balancer",
    "resource_group_id": "0ed9fc69d01c48a092dd1600f63de2fa",
    "crn": "crn:v1:bluemix:public:containers-kubernetes:us-south:a/abac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud",
    "create_time": 1722267502000,
    "created_by": "iam-ServiceId-1829dcf6-eb99-4760-81ad-6ca95cbab194",
    "state": "active",
    "type": "service_instance",
    "resource_id": "containers-kubernetes",
    "dashboard_url": null,
    "allow_cleanup": false,
    "locked": false,
    "last_operation": {
      "type": "create",
      "state": "succeeded",
      "description": "Instance provisioning is completed.",
      "updated_at": null,
      "cancelable": false
    },
    "account_url": "",
    "resource_plan_url": "",
    "resource_bindings_url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud/resource_bindings",
    "resource_aliases_url": "/v2/resource_instances/crn:v1:bluemix:public:containers-kubernetes:us-south:a%2Fabac0df06b644a9cabc6e44f55b3880e:cqjqkvvd0c64fpr2h9j0:nlb:nlb-con2-workload-cluster-3b5bf5f75003778663c521c8c35ad277-i000.us-south.containers.appdomain.cloud/resource_aliases",
    "siblings_url": "",
    "target_crn": "crn:v1:bluemix:public:globalcatalog::::deployment:containers.kubernetes.multizone.load.balancer%3Aus-south"
  }
]

If I wait some time, this instance eventually gets deleted and the resource group deletion passes. I propose that the Terraform provider be updated to add more retries when attempting to delete a resource group, to cover this use case.
An even nicer enhancement would be to output the contents remaining in the resource group that are preventing the deletion.
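
For illustration, here is a minimal sketch of what such a retry could look like on the provider side, using the terraform-plugin-sdk retry helper; the deleteResourceGroup stub, the error matching, and the timeout value are assumptions for this sketch, not the provider's actual code:

package provider

import (
	"context"
	"errors"
	"strings"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/resource"
)

// deleteResourceGroup stands in for the real Resource Manager delete call
// (hypothetical stub for illustration only).
func deleteResourceGroup(ctx context.Context, groupID string) error {
	return errors.New("NOT_EMPTY: resource groups with active instances can't be deleted")
}

// deleteWithRetry keeps retrying the delete while the group still reports
// active instances (the NOT_EMPTY error) instead of failing on the first attempt.
func deleteWithRetry(ctx context.Context, groupID string) error {
	return resource.RetryContext(ctx, 20*time.Minute, func() *resource.RetryError {
		err := deleteResourceGroup(ctx, groupID)
		if err == nil {
			return nil
		}
		// NOT_EMPTY means instances (such as the VPC load balancer) are still
		// being cleaned up asynchronously, so keep retrying until the timeout.
		if strings.Contains(err.Error(), "NOT_EMPTY") {
			return resource.RetryableError(err)
		}
		return resource.NonRetryableError(err)
	})
}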

Affected Resource(s)

  • ibm_resource_group

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please share a link to the ZIP file.

Debug Output

Panic Output

Expected Behavior

Actual Behavior

Steps to Reproduce

  1. terraform apply

@github-actions github-actions bot added the service/Resource Management label Jul 30, 2024
@hkantare
Collaborator

hkantare commented Jul 30, 2024

@ocofaigh
As part of cluster deletion, we already have a check that waits for the load balancer to be deleted.

We need to analyze why, even after this wait, the resource group still can't be disassociated from that particular instance.

@hkantare
Collaborator

Second approach: as part of resource group deletion, add conditional logic to check for any remaining instance associations and wait for a certain amount of time.
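
For illustration, a minimal sketch of that conditional check, assuming the Resource Controller v2 Go SDK's ListResourceInstances call filtered by resource group; the client wiring and the poll interval are assumptions, not actual provider code:

package provider

import (
	"context"
	"time"

	"github.com/IBM/platform-services-go-sdk/resourcecontrollerv2"
)

// waitForGroupEmpty polls the Resource Controller until no instances remain
// in the resource group, or the context is cancelled.
func waitForGroupEmpty(ctx context.Context, controller *resourcecontrollerv2.ResourceControllerV2, groupID string) error {
	opts := controller.NewListResourceInstancesOptions()
	opts.SetResourceGroupID(groupID)
	for {
		list, _, err := controller.ListResourceInstancesWithContext(ctx, opts)
		if err != nil {
			return err
		}
		if len(list.Resources) == 0 {
			return nil // group is empty, safe to attempt the delete
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(30 * time.Second): // illustrative poll interval
		}
	}
}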

@ocofaigh
Contributor Author

@hkantare Thanks for the feedback. It sounds like isWaitForLBDeleted is not working as expected, so that should probably be debugged. I'm able to reproduce very easily using this code (which is the same as the Red Hat OpenShift Container Platform on VPC landing zone tile in the IBM Cloud catalog).

+1 for the second approach too, as I have seen other resources with similar issues. PAG is another one, as it provisions an sDNLB that the Terraform state does not know about.

@ocofaigh
Contributor Author

@hkantare Do you think this is something that could be prioritised?

> As part of resource group deletion, add conditional logic to check for any remaining instance associations and wait for a certain amount of time.

It's something that consumers keep hitting, especially since most of the Deployable Architectures available in the IBM Cloud catalog support creating a resource group. When people run a destroy (especially when an OCP cluster is destroyed), the resource group delete fails very frequently with:

 2024/08/27 11:40:06 Terraform destroy |       "Result": {
 2024/08/27 11:40:06 Terraform destroy |           "errors": [
 2024/08/27 11:40:06 Terraform destroy |               {
 2024/08/27 11:40:06 Terraform destroy |                   "code": "NOT_EMPTY",
 2024/08/27 11:40:06 Terraform destroy |                   "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/08/27 11:40:06 Terraform destroy |                   "more_info": "n/a"
 2024/08/27 11:40:06 Terraform destroy |               }
 2024/08/27 11:40:06 Terraform destroy |           ],
 2024/08/27 11:40:06 Terraform destroy |           "trace": "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |       },
 2024/08/27 11:40:06 Terraform destroy |       "RawResult": null
 2024/08/27 11:40:06 Terraform destroy |   }

@hkantare
Collaborator

@ocofaigh
We plan to add some retry logic for resource group deletion. Can you share the status code associated with the above error?

@ocofaigh
Contributor Author

@hkantare "StatusCode": 500

Full output:

2024/08/27 11:40:06 Terraform destroy | Error: [ERROR] Error Deleting resource group: Resource groups with active instances can't be deleted. Use the CLI command "ibmcloud resource service-instances --type all -g <resource-group>" to check for remaining instances, then delete the instances and try again. with response code  {
 2024/08/27 11:40:06 Terraform destroy |     "StatusCode": 500,
 2024/08/27 11:40:06 Terraform destroy |     "Headers": {
 2024/08/27 11:40:06 Terraform destroy |         "Cache-Control": [
 2024/08/27 11:40:06 Terraform destroy |             "max-age=0, no-cache, no-store"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Content-Length": [
 2024/08/27 11:40:06 Terraform destroy |             "332"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Content-Type": [
 2024/08/27 11:40:06 Terraform destroy |             "application/json; charset=utf-8"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Date": [
 2024/08/27 11:40:06 Terraform destroy |             "Tue, 27 Aug 2024 11:40:06 GMT"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Etag": [
 2024/08/27 11:40:06 Terraform destroy |             "W/\"14c-POn/BpsPEJ94sjfRFJOtr4bZwxc\""
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Expires": [
 2024/08/27 11:40:06 Terraform destroy |             "Tue, 27 Aug 2024 11:40:06 GMT"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Pragma": [
 2024/08/27 11:40:06 Terraform destroy |             "no-cache"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Server": [
 2024/08/27 11:40:06 Terraform destroy |             "istio-envoy"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Strict-Transport-Security": [
 2024/08/27 11:40:06 Terraform destroy |             "max-age=31536000; includeSubDomains"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Transaction-Id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "Vary": [
 2024/08/27 11:40:06 Terraform destroy |             "Accept-Encoding"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Content-Type-Options": [
 2024/08/27 11:40:06 Terraform destroy |             "nosniff"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Envoy-Upstream-Service-Time": [
 2024/08/27 11:40:06 Terraform destroy |             "169"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Limit": [
 2024/08/27 11:40:06 Terraform destroy |             "60"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Remaining": [
 2024/08/27 11:40:06 Terraform destroy |             "59"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Ratelimit-Reset": [
 2024/08/27 11:40:06 Terraform destroy |             "0"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Request-Id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "X-Response-Time": [
 2024/08/27 11:40:06 Terraform destroy |             "166.360ms"
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "_request_id": [
 2024/08/27 11:40:06 Terraform destroy |             "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |         ]
 2024/08/27 11:40:06 Terraform destroy |     },
 2024/08/27 11:40:06 Terraform destroy |     "Result": {
 2024/08/27 11:40:06 Terraform destroy |         "errors": [
 2024/08/27 11:40:06 Terraform destroy |             {
 2024/08/27 11:40:06 Terraform destroy |                 "code": "NOT_EMPTY",
 2024/08/27 11:40:06 Terraform destroy |                 "message": "Resource groups with active instances can't be deleted. Use the CLI command \"ibmcloud resource service-instances --type all -g \u003cresource-group\u003e\" to check for remaining instances, then delete the instances and try again.",
 2024/08/27 11:40:06 Terraform destroy |                 "more_info": "n/a"
 2024/08/27 11:40:06 Terraform destroy |             }
 2024/08/27 11:40:06 Terraform destroy |         ],
 2024/08/27 11:40:06 Terraform destroy |         "trace": "80e645c8-e323-4893-b0c1-b0d8a82ee0b6"
 2024/08/27 11:40:06 Terraform destroy |     },
 2024/08/27 11:40:06 Terraform destroy |     "RawResult": null
 2024/08/27 11:40:06 Terraform destroy | }

@hkantare
Collaborator

@ocofaigh Added retry logic for deletion of the resource group, with a default timeout of 20 minutes.
This should mostly address the deletion of cluster ALBs and PAG.

@ocofaigh
Contributor Author

ocofaigh commented Sep 4, 2024

Thanks, I see it was released in 1.69.0, so I'm going to close this issue. If I see any problems, I'll let you know.

@ocofaigh ocofaigh closed this as completed Sep 4, 2024