
fix: add health probes and ingress #1408

Merged
Merged 7 commits into main from fix/add-probes-ingress on Mar 7, 2025

Conversation

tjololo
Member

@tjololo tjololo commented Mar 7, 2025

Description

Related Issue(s)

  • #{issue number}

Verification

  • Your code builds clean without any errors or warnings
  • Manual testing done (required)
  • Relevant automated test added (if you find this hard, leave it and we'll help out)
  • All tests run green

Documentation

  • User documentation is updated with a separate linked PR in altinn-studio-docs. (if applicable)

Summary by CodeRabbit

  • New Features
    • Introduced enhanced traffic management to improve control over incoming requests.
    • Added multiple health monitoring checks to ensure improved startup, readiness, and ongoing application performance.

@tjololo tjololo requested a review from a team as a code owner March 7, 2025 13:34
Contributor

coderabbitai bot commented Mar 7, 2025

📝 Walkthrough

The pull request introduces a new ingress block to the azurerm_container_app resource, specifying parameters for managing incoming traffic. Additionally, it adds three probe configurations—startup_probe, readiness_probe, and liveness_probe—to monitor the application's health and readiness by targeting the /swagger/swagger.json endpoint on port 8080.
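For orientation, here is a minimal sketch of the added blocks as they appear in the Terraform plan output further down in this thread; the exact layout in backend.tf may differ slightly.

```hcl
ingress {
  allow_insecure_connections = false
  external_enabled           = true
  target_port                = 8080
  transport                  = "http"

  traffic_weight {
    latest_revision = true
    percentage      = 100
  }
}

# Inside template { container { ... } }; readiness_probe and liveness_probe
# follow the same pattern with their own thresholds.
startup_probe {
  transport               = "HTTP"
  port                    = 8080
  path                    = "/swagger/swagger.json"
  initial_delay           = 0
  interval_seconds        = 1
  timeout                 = 1
  failure_count_threshold = 10
}
```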

Changes

File: infrastructure/.../backend.tf
Change Summary: Added a new ingress block with parameters (allow_insecure_connections, target_port, transport, external_enabled, traffic_weight) and three probe blocks (startup_probe, readiness_probe, liveness_probe) with detailed health-check configuration.

Suggested reviewers

  • bengtfredh


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (1)

95-100: 💡 Verification agent

❓ Verification inconclusive

Role Assignment and Secret Retrieval: Address Authorization Failure

The pipeline failure indicates an authorization error when retrieving secrets from the Container App: the client does not have permission to perform Microsoft.App/containerApps/listSecrets/action. The current role assignment (configured in infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf, lines 95–100) uses the AcrPull role, which is designed for pulling container images from ACR and typically lacks the secret-listing permission needed by Container Apps.

Key points to address:

  • Review the required permissions: Confirm that the action Microsoft.App/containerApps/listSecrets/action is not included in the AcrPull role.
  • Adjust role assignments: Assign an additional or alternative role (for example, one that includes secret-retrieval permissions at the Container App's scope), or create a custom role covering this action.
  • Verify the scope: Ensure that the role assignment's scope aligns with the Container App resource if secret retrieval is needed there rather than on the container registry.

Please update the permissions accordingly and verify that the identity has the necessary actions allowed; a minimal sketch of one possible approach follows.
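As a minimal sketch of one possible fix, assuming a custom role scoped to the resource group: the resource names, resource-group reference, and principal variable below are illustrative assumptions, not taken from this repository.

```hcl
# Hypothetical sketch: a custom role granting only the secret-listing action.
resource "azurerm_role_definition" "containerapp_secret_reader" {
  name  = "ContainerApp Secret Reader"
  scope = azurerm_resource_group.rg.id # assumed resource-group reference

  permissions {
    actions = ["Microsoft.App/containerApps/listSecrets/action"]
  }

  assignable_scopes = [azurerm_resource_group.rg.id]
}

resource "azurerm_role_assignment" "pipeline_secret_reader" {
  scope              = azurerm_resource_group.rg.id
  role_definition_id = azurerm_role_definition.containerapp_secret_reader.role_definition_resource_id
  principal_id       = var.pipeline_principal_id # assumed: object ID of the client hitting the 403
}
```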

🧹 Nitpick comments (3)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (3)

56-64: Startup Probe: Consider Adjusting the Initial Delay

The startup probe is configured with an initial_delay of 0 seconds and an interval of 1 second. While this may help detect startup issues quickly, it could also lead to premature failure detections if the application inherently requires more time to initialize. Please verify that your application can reliably start within 0 seconds; if not, consider increasing the initial delay. For example:

-        initial_delay           = 0
+        initial_delay           = 5  # Adjust based on observed application startup timing

65-74: Readiness Probe: Verify Health Check Parameters

The readiness probe is similarly configured with an initial_delay of 0 seconds and a 1‑second interval. Although the thresholds (failure count of 3 and success count of 1) may be sufficient, ensure that these values are appropriate for your application's normal startup and steady-state behavior. If your app needs a short warm-up period before it can reliably answer readiness checks, consider adjusting initial_delay accordingly.


75-83: Liveness Probe: Confirm Adequate Delay Settings

The liveness probe also starts checking immediately with an initial_delay of 0 seconds. This configuration is quite aggressive and might lead to unintended container restarts if the application needs a few seconds to stabilize after startup. Validate these values against your application’s behavior, and if necessary, adjust the initial_delay and related thresholds to prevent false positives.
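As a hedged illustration, a less aggressive liveness probe could look like the sketch below; the delay, interval, and timeout values are assumptions to be tuned against the application's observed startup behaviour:

```hcl
liveness_probe {
  transport               = "HTTP"
  port                    = 8080
  path                    = "/swagger/swagger.json"
  initial_delay           = 5  # assumed warm-up window; tune to observed startup time
  interval_seconds        = 10
  timeout                 = 2
  failure_count_threshold = 3
}
```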

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 33bccf2 and b985fad.

📒 Files selected for processing (1)
  • infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (2 hunks)
🧰 Additional context used
🪛 GitHub Actions: altinn-apim-test-rg
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf

[error] 22-22: Error: retrieving secrets for Container App: unexpected status 403 (403 Forbidden) with error: AuthorizationFailed: The client '02b17d28-7ec8-4084-92ce-d26051fa4998' does not have authorization to perform action 'Microsoft.App/containerApps/listSecrets/action'.

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (1)

37-45: Ingress Block: Validate Configuration and Future Flexibility

The new ingress block is configured with explicit parameters (e.g., allow_insecure_connections = false, target_port = 8080, transport = "http", and external_enabled = true) along with a 100% traffic weight. This looks correct for exposing the application on the specified port. However, please ensure that these static values align with your deployment and security requirements. In a future revision, you might consider parameterizing some of these settings (such as traffic_weight) if dynamic traffic routing or blue/green deployments become necessary.
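If dynamic routing does become necessary, a minimal sketch of such parameterization might look like this; the variable name and default are assumptions, not part of this PR:

```hcl
variable "latest_revision_weight" {
  description = "Percentage of traffic routed to the latest revision"
  type        = number
  default     = 100
}

# Inside the azurerm_container_app resource:
ingress {
  allow_insecure_connections = false
  external_enabled           = true
  target_port                = 8080
  transport                  = "http"

  traffic_weight {
    latest_revision = true
    percentage      = var.latest_revision_weight
  }
}
```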

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (3)

37-45: Ingress Block Configuration Review

The newly added ingress block correctly configures external access with a target port of 8080 and specifies a full (100%) traffic weight. One minor suggestion is to verify consistent casing for the transport property. Here, it's defined as "http" while health probes use "HTTP", which might lead to confusion or unintended behavior if the provider is case-sensitive.


56-64: Startup Probe Configuration

The startup_probe is set to access the /swagger/swagger.json endpoint on port 8080. While the configuration is syntactically correct, using an initial_delay of 0 and a very short interval_seconds of 1 might be too aggressive depending on the actual startup time of the application. Please verify that the application can reliably serve the endpoint immediately after container launch.


75-83: Liveness Probe Configuration

The liveness_probe uses an aggressive schedule – a 1-second interval with a failure threshold of 3 – which will quickly mark the container as unhealthy if issues occur. This approach may lead to frequent restarts if transient delays occur. It may be beneficial to use a dedicated health endpoint instead of the /swagger/swagger.json used across all probes, to better isolate health-check concerns from API documentation availability.
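For instance, if the application exposed a dedicated health endpoint, the probe could target it instead of the Swagger document; the /healthz path below is hypothetical and assumes such an endpoint exists:

```hcl
liveness_probe {
  transport               = "HTTP"
  port                    = 8080
  path                    = "/healthz"  # hypothetical dedicated health endpoint
  initial_delay           = 5
  interval_seconds        = 10
  timeout                 = 2
  failure_count_threshold = 3
}
```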

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b985fad and ee15e75.

📒 Files selected for processing (1)
  • infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (1)

65-74: Readiness Probe Configuration

The readiness_probe is configured similarly to the startup probe, with parameters ensuring checks occur very quickly after startup. It's important to confirm that invoking the /swagger/swagger.json endpoint immediately (with an initial delay of 0) does not lead to false negatives before the app is fully ready. Additionally, consider whether using the Swagger endpoint is the most reliable indicator of application readiness.


github-actions bot commented Mar 7, 2025

Terraform environment test

Format and Style 🖌 success

Initialization ⚙️ success

Validation 🤖 success

Validation Output

Success! The configuration is valid.


Plan 📖 success

Show Plan

[Lines containing Refreshing state removed]
[Truncated to 120000 bytes! See log output for complete plan]
Acquiring state lock. This may take a few moments...
data.azurerm_container_registry.altinncr: Reading...
data.azurerm_container_registry.altinncr: Read complete after 0s [id=/subscriptions/a6e9ee7d-2b65-41e1-adfb-0c8c23515cf9/resourceGroups/acr/providers/Microsoft.ContainerRegistry/registries/altinncr]

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform since the
last "terraform apply" which may have affected this plan:

  # azurerm_container_app_environment.container_app_environment has been deleted
  - resource "azurerm_container_app_environment" "container_app_environment" {
      - id                                          = "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/resourceGroups/altinn-apim-test-rg/providers/Microsoft.App/managedEnvironments/altinn-apim-test-hcpyxw-acaenv" -> null
        name                                        = "altinn-apim-test-hcpyxw-acaenv"
        tags                                        = {}
        # (16 unchanged attributes hidden)
    }


Unless you have made equivalent changes to your configuration, or ignored the
relevant attributes using ignore_changes, the following plan may include
actions to undo or respond to these changes.

─────────────────────────────────────────────────────────────────────────────

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # azurerm_container_app.container_app will be created
  + resource "azurerm_container_app" "container_app" {
      + container_app_environment_id  = (known after apply)
      + custom_domain_verification_id = (sensitive value)
      + id                            = (known after apply)
      + latest_revision_fqdn          = (known after apply)
      + latest_revision_name          = (known after apply)
      + location                      = (known after apply)
      + name                          = "altinn-apim-test-hcpyxw-aca"
      + outbound_ip_addresses         = (known after apply)
      + resource_group_name           = "altinn-apim-test-rg"
      + revision_mode                 = "Single"

      + identity {
          + identity_ids = [
              + "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/resourceGroups/altinn-apim-test-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/altinn-apim-test-hcpyxw-aca-mi",
            ]
          + principal_id = (known after apply)
          + tenant_id    = (known after apply)
          + type         = "UserAssigned"
        }

      + ingress {
          + allow_insecure_connections = false
          + client_certificate_mode    = "ignore"
          + custom_domain              = (known after apply)
          + external_enabled           = true
          + fqdn                       = (known after apply)
          + target_port                = 8080
          + transport                  = "auto"

          + traffic_weight {
              + latest_revision = true
              + percentage      = 100
            }
        }

      + registry {
          + identity = "/subscriptions/1ce8e9af-c2d6-44e7-9c5e-099a308056fe/resourceGroups/altinn-apim-test-rg/providers/Microsoft.ManagedIdentity/userAssignedIdentities/altinn-apim-test-hcpyxw-aca-mi"
          + server   = "altinncr.azurecr.io"
        }

      + template {
          + max_replicas                     = 1
          + min_replicas                     = 0
          + revision_suffix                  = (known after apply)
          + termination_grace_period_seconds = 0

          + container {
              + args              = [
                  + "webserver",
                  + "--auth-enabled",
                ]
              + cpu               = 0.5
              + ephemeral_storage = (known after apply)
              + image             = "altinncr.azurecr.io/dis-hackaton/dis-demo-pgsql:latest"
              + memory            = "1Gi"
              + name              = "dis-demo-pgsql"

              + liveness_probe {
                  + failure_count_threshold          = 3
                  + initial_delay                    = 0
                  + interval_seconds                 = 1
                  + path                             = "/swagger/swagger.json"
                  + port                             = 8080
                  + termination_grace_period_seconds = (known after apply)
                  + timeout                          = 1
                  + transport                        = "HTTP"
                }

              + readiness_probe {
                  + failure_count_threshold = 3
                  + initial_delay           = 0
                  + interval_seconds        = 1
                  + path                    = "/swagger/swagger.json"
                  + port                    = 8080
                  + success_count_threshold = 1
                  + timeout                 = 1
                  + transport               = "HTTP"
                }

              + startup_probe {
                  + failure_count_threshold          = 10
                  + initial_delay                    = 0
                  + interval_seconds                 = 1
                  + path                             = "/swagger/swagger.json"
                  + port                             = 8080
                  + termination_grace_period_seconds = (known after apply)
                  + timeout                          = 1
                  + transport                        = "HTTP"
                }
            }

          + http_scale_rule {
              + concurrent_requests = "1000"
              + name                = "http-scale-rule"
            }
        }
    }

  # azurerm_container_app_environment.container_app_environment will be created
  + resource "azurerm_container_app_environment" "container_app_environment" {
      + custom_domain_verification_id    = (known after apply)
      + default_domain                   = (known after apply)
      + docker_bridge_cidr               = (known after apply)
      + id                               = (known after apply)
      + internal_load_balancer_enabled   = false
      + location                         = "norwayeast"
      + logs_destination                 = (known after apply)
      + mutual_tls_enabled               = false
      + name                             = "altinn-apim-test-hcpyxw-acaenv"
      + platform_reserved_cidr           = (known after apply)
      + platform_reserved_dns_ip_address = (known after apply)
      + resource_group_name              = "altinn-apim-test-rg"
      + static_ip_address                = (known after apply)
      + zone_redundancy_enabled          = false
    }

Plan: 2 to add, 0 to change, 0 to destroy.

─────────────────────────────────────────────────────────────────────────────

Saved the plan to: tfplan.out

To perform exactly these actions, run the following command to apply:
    terraform apply "tfplan.out"
Releasing state lock. This may take a few moments...

Context Values

  • Pusher: @tjololo
  • Action: workflow_dispatch
  • Working Directory: ./infrastructure/adminservices-test/altinn-apim-test-rg
  • State File: github.com/altinn/altinn-platform/environments/test/altinn-apim-test-rg.tfstate
  • Plan File: github.com_altinn_altinn-platform_environments_test_altinn-apim-test-rg.tfstate.tfplan

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (1)

76-84: Liveness Probe: Evaluate Endpoint Choice and Robustness

The liveness probe also targets /swagger/swagger.json. While this may work for verifying service availability, consider whether using a dedicated health or heartbeat endpoint might provide a more reliable indication of the application's true liveness, especially under load or during partial failures.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee15e75 and e46441d.

📒 Files selected for processing (1)
  • infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (3)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (3)

37-46: Ingress Block: Validate Consistency and Correctness

The new ingress block is correctly configured to expose the container app on port 8080 with secure settings. One item to verify is the value of the transport attribute which is set to "http" here—while in the probe blocks the transport is denoted as "HTTP". Ensure that this discrepancy is intentional and conforms to the expected enumeration in the provider documentation.


57-65: Startup Probe: Revisit Timing Parameters

The startup probe is configured to target /swagger/swagger.json with an initial_delay of 0 seconds, an interval of 1 second, and a timeout of 1 second. This aggressive timing might trigger false negatives if the application takes slightly longer to start up. Please verify that these values match your application's startup characteristics.


66-75: Readiness Probe: Confirm Threshold Settings

The readiness probe settings (using an interval of 1 second, a failure threshold of 3, and a success threshold of 1) are designed for rapid feedback. Confirm that these thresholds are appropriate for your application's typical response times so that temporary delays or minor hiccups do not result in the app being marked as unready.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (4)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (4)

58-66: Startup Probe Configuration Verification

The startup probe block is syntactically correct and appears to target the intended endpoint. However, an initial_delay of 0 and an interval_seconds of 1 combined with a failure_count_threshold of 10 might be a bit aggressive depending on your container's startup time. Consider verifying these values in a staging environment to ensure they do not lead to premature failures.


67-76: Readiness Probe Configuration Evaluation

The readiness probe uses the same endpoint and aggressive timing (0-second delay, 1-second interval). While this configuration can work for a very responsive service, it’s worth confirming that the application is ready to handle requests immediately at startup. If there’s any delay in becoming fully ready, you might see false negatives.


77-85: Liveness Probe Configuration Evaluation

The liveness probe is consistently configured with the same endpoint and similar aggressive timings (0-second initial delay, 1-second interval). Ensure that these settings do not inadvertently cause the container to restart during transient issues. Consider whether a slightly longer initial delay or timeout would provide a more stable assessment of container health.


37-85: Overall Health Probes and Ingress Configuration Consideration

All added configurations integrate well with the existing resource definition and align with the PR objectives. One point to confirm: all probes target /swagger/swagger.json. This endpoint is typically used for API documentation rather than a dedicated health check. Verify that this is intentional and adequately reflects your application's health status.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e46441d and 35912a6.

📒 Files selected for processing (1)
  • infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Analyze (go)
🔇 Additional comments (1)
infrastructure/adminservices-test/altinn-apim-test-rg/backend.tf (1)

37-47: Ingress Block Configuration Review

The new ingress block correctly sets key parameters—such as disabling insecure connections, specifying the target port (8080), and enabling external traffic—with a nested traffic_weight that routes 100% of traffic to the latest revision. Please double-check that these values meet the security and traffic distribution requirements for your deployment.

@Herskis
Collaborator

Herskis commented Mar 7, 2025

LGTM

@tjololo tjololo merged commit e0ebab6 into main Mar 7, 2025
8 checks passed
@tjololo tjololo deleted the fix/add-probes-ingress branch March 7, 2025 13:53