Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sagemaker module updates don't apply changes in the correct order when using autoscaling group and changing sagemaker model instance sizes. #14337

Closed
SRserves85 opened this issue Jul 24, 2020 · 5 comments
Labels
service/sagemaker Issues and PRs that pertain to the sagemaker service.

Comments

@SRserves85
Copy link

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform CLI and Terraform AWS Provider Version

Affected Resource(s)

  • aws_XXXXX

Terraform Configuration Files

resource "aws_s3_bucket" "bucket" {
  bucket = "videoblocks-ml"
  versioning {
    enabled = true
  }
}

resource "aws_sagemaker_model" "model" {
  name                     = "${var.environment}-${var.service_name}-${var.model_training_datetime}"
  execution_role_arn       = aws_iam_role.sm_execution.arn
  enable_network_isolation = true

  vpc_config {
    subnets = var.subnet_ids
    security_group_ids = var.vpc_security_group_ids
  }

  primary_container {
    image          = "${var.ecr_container_id}:${var.docker_container_tag}"
    model_data_url = "s3://${aws_s3_bucket.bucket.bucket}/models/${var.service_name}/${var.site}/${var.environment}/${var.model_training_datetime}/model.tar.gz"

    environment = {
      SAGEMAKER_BIND_TO_PORT       = "8080"
      SM_LOG_LEVEL                 = "40"
      SAGEMAKER_PROGRAM            = "entrypoint"
    }
  }
}

resource "aws_sagemaker_endpoint_configuration" "model" {
  name = "${var.service_name}-${var.site}-${var.environment}-${var.model_training_datetime}-${replace(var.instance_type, ".", "")}"

  production_variants {
    variant_name           = "${var.service_name}-${var.site}-${var.environment}"
    model_name             = aws_sagemaker_model.model.name
    initial_instance_count = var.initial_instance_count
    instance_type          = var.instance_type
  }

  # from: https://discuss.hashicorp.com/t/sagemaker-endpoint-not-updating-with-configuration-change/1727
  # By default Terraform destroys resources before creating the new one. However, in this case we want to force Terraform to create a
  # new resource first. If we do not enforce the order of: Create new endpoint config -> update sagemaker endpoint -> Destroy old endpoint config
  # Sagemaker will error when it tries to update from the old (destroyed) config to the new one.  This has no impact on runtime or uptime,
  # Sagemaker endpoints can function even if you destroy a config and do not give it a new one.
  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_sagemaker_endpoint" "endpoint" {
  name                 = "${var.service_name}-${var.site}-${var.environment}"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.model.name
}

resource "aws_appautoscaling_target" "sagemaker_target" {
  max_capacity       = var.max_instance_count
  min_capacity       = var.min_instance_count
  resource_id        = "endpoint/${aws_sagemaker_endpoint.endpoint.name}/variant/${var.service_name}-${var.site}-${var.environment}"
  role_arn           = aws_iam_role.sm_execution.arn
  scalable_dimension = "sagemaker:variant:DesiredInstanceCount"
  service_namespace  = "sagemaker"
}

resource "aws_appautoscaling_policy" "sagemaker_policy" {
  name               = "${var.service_name}-${var.site}-${var.environment}-target-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.sagemaker_target.resource_id
  scalable_dimension = aws_appautoscaling_target.sagemaker_target.scalable_dimension
  service_namespace  = aws_appautoscaling_target.sagemaker_target.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "SageMakerVariantInvocationsPerInstance"
    }
    target_value       = var.target_invocations
    scale_in_cooldown  = var.target_scale_in_cooldown
    scale_out_cooldown = var.target_scale_out_cooldown
  }
}

Debug Output

Error: error updating SageMaker Endpoint (predictive-ba-xgb-videoblocks-prod): ValidationException: The variant(s) "[predictive-ba-xgb-videoblocks-prod]" must be deregistered as scalable targets with Application Auto Scaling before they can be removed or have their instance type updated.
        status code: 400, request id: cbef2671-32a7-4318-ab22-5b14aadf4aa4
  on modules/sagemaker/main.tf line 50, in resource "aws_sagemaker_endpoint" "endpoint"

Panic Output

Expected Behavior

Terraform should be able to handle removing the autoscaling group from a sagemaker endpoint if this is needed to be done to apply changes to the underlying model instances. In this case was trying to adjust instance type.

Actual Behavior

The autoscaling group was not removed before the changes to the endpoint were made, and an error was thrown.

Steps to Reproduce

  1. Create a sagemaker endpoint with autoscaling using the above terraform module.
  2. terraform apply
  3. Change your sagemaker endpoint to use a different instance type
  4. terraform apply
    5 Should receive the same error.

Important Factoids

References

  • #0000
@ghost ghost added service/applicationautoscaling service/s3 Issues and PRs that pertain to the s3 service. service/sagemaker Issues and PRs that pertain to the sagemaker service. labels Jul 24, 2020
@github-actions github-actions bot added the needs-triage Waiting for first response or review from a maintainer. label Jul 24, 2020
@justinretzolk
Copy link
Member

Hi @SRserves85 👋 Thank you for reporting this. Given that there's been a few Terraform and AWS provider releases between when you filed this and now, can you confirm whether you're still experiencing this?

@justinretzolk justinretzolk added waiting-response Maintainers are waiting on response from community or contributor. and removed needs-triage Waiting for first response or review from a maintainer. labels Sep 22, 2021
@github-actions github-actions bot removed the waiting-response Maintainers are waiting on response from community or contributor. label Sep 22, 2021
@DrFaust92
Copy link
Collaborator

It seems that adding a name_prefix argument to both endpoint and endpoint config would help here. the config recreation with the same name does not trigger an update for the endpoint resource. and as the endpoint does not recreate, the target wont recreate causing this lock. ill take a crack at this to see if my solution is applicable

@DrFaust92
Copy link
Collaborator

Update, a potentially easier solution is to omit name argument for both aws_sagemaker_endpoint and aws_sagemaker_endpoint_configuration. it will generate random names and it should handle dependencies in this case.

@DrFaust92 DrFaust92 removed the service/s3 Issues and PRs that pertain to the s3 service. label Oct 12, 2021
@DrFaust92
Copy link
Collaborator

Closing for lack of response, if issue still persists please re-open with all relevant details.

@github-actions
Copy link

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 29, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
service/sagemaker Issues and PRs that pertain to the sagemaker service.
Projects
None yet
Development

No branches or pull requests

3 participants