Add upgrade_settings block for default nodepool #391

Merged

Conversation

CiucurDaniel (Contributor)

Describe your changes

This PR gives the user the option to configure the upgrade_settings block on the default node pool. Until now, this was possible only on the additional node pools.
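For illustration, a minimal sketch of the option as it is consumed after this PR (the input ends up named agents_pool_max_surge following the review below; the surrounding argument values are hypothetical):

module "aks" {
  source = "Azure/aks/azurerm"

  prefix                = "example"     # hypothetical
  resource_group_name   = "example-rg"  # hypothetical
  agents_pool_max_surge = "10%"         # surfaces as upgrade_settings { max_surge = "10%" } on the default node pool
}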

Issue number

#388

Checklist before requesting a review

  • The PR title can be used to describe what this PR did in the CHANGELOG.md file
  • I have executed pre-commit on my machine
  • I have passed pr-check on my machine

@CiucurDaniel (Contributor, Author)

@microsoft-github-policy-service agree

@zioproto (Collaborator)

@lonegunmanb can we please enable the CI for this PR?

@CiucurDaniel have you tested pre-commit and pr-check locally, as explained at https://github.com/Azure/terraform-azurerm-aks#pre-commit--pr-check--test ?

@CiucurDaniel (Contributor, Author)

@zioproto I added a second commit with the README updated after running pre-commit. I confirm that I have now run both pre-commit and pr-check as documented in the README file.

@lonegunmanb (Member) left a comment

Thanks @CiucurDaniel for opening this PR! One comment on the variable's type.

variables.tf Outdated
@@ -57,6 +57,12 @@ variable "agents_min_count" {
description = "Minimum number of nodes in a pool"
}

variable "max_surge" {
type = number
lonegunmanb (Member)

Since we can set a percentage value for this variable, should we use string as the type?

Thanks to your PR, I found that the current variable type for the additional node pools is also incorrect; would you please correct it in this PR as well? Thanks!

CiucurDaniel (Contributor, Author)

Yes, I will update it as well. Having string makes perfect sense, so we can pass either a number or a percentage.
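As a sketch of the two value forms: AKS accepts max_surge either as an absolute node count or as a percentage, and only a string type can carry both:

upgrade_settings = {
  max_surge = "5"   # absolute number of extra nodes during an upgrade
}

upgrade_settings = {
  max_surge = "33%" # percentage of the node pool size
}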

variables.tf Outdated
@@ -57,6 +57,12 @@ variable "agents_min_count" {
description = "Minimum number of nodes in a pool"
}

variable "max_surge" {
lonegunmanb (Member)

Another thought: since this variable is used for the default node pool, and the other variables for the default node pool all have the "agents_pool_" prefix in their names, could we rename this variable to agents_pool_max_surge?

CiucurDaniel (Contributor, Author)

Sure, I totally agree.
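Putting the two review comments together, a sketch of the declaration the PR converges on (the description wording here is illustrative, not necessarily the merged text):

variable "agents_pool_max_surge" {
  type        = string
  default     = null
  description = "The maximum number or percentage of nodes that will be added to the default node pool during an upgrade."
}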

@CiucurDaniel (Contributor, Author)

I updated the type on both node pools and ran both the pre-commit and pr-check scripts again ✅

@github-actions (Contributor)

Potential Breaking Changes in 0273da6:
[update] "Variables.node_pools.Type" from 'map(object({
name = string
node_count = optional(number)
tags = optional(map(string))
vm_size = string
host_group_id = optional(string)
capacity_reservation_group_id = optional(string)
custom_ca_trust_enabled = optional(bool)
enable_auto_scaling = optional(bool)
enable_host_encryption = optional(bool)
enable_node_public_ip = optional(bool)
eviction_policy = optional(string)
kubelet_config = optional(object({
cpu_manager_policy = optional(string)
cpu_cfs_quota_enabled = optional(bool)
cpu_cfs_quota_period = optional(string)
image_gc_high_threshold = optional(number)
image_gc_low_threshold = optional(number)
topology_manager_policy = optional(string)
allowed_unsafe_sysctls = optional(set(string))
container_log_max_size_mb = optional(number)
container_log_max_files = optional(number)
pod_max_pid = optional(number)
}))
linux_os_config = optional(object({
sysctl_config = optional(object({
fs_aio_max_nr = optional(number)
fs_file_max = optional(number)
fs_inotify_max_user_watches = optional(number)
fs_nr_open = optional(number)
kernel_threads_max = optional(number)
net_core_netdev_max_backlog = optional(number)
net_core_optmem_max = optional(number)
net_core_rmem_default = optional(number)
net_core_rmem_max = optional(number)
net_core_somaxconn = optional(number)
net_core_wmem_default = optional(number)
net_core_wmem_max = optional(number)
net_ipv4_ip_local_port_range_min = optional(number)
net_ipv4_ip_local_port_range_max = optional(number)
net_ipv4_neigh_default_gc_thresh1 = optional(number)
net_ipv4_neigh_default_gc_thresh2 = optional(number)
net_ipv4_neigh_default_gc_thresh3 = optional(number)
net_ipv4_tcp_fin_timeout = optional(number)
net_ipv4_tcp_keepalive_intvl = optional(number)
net_ipv4_tcp_keepalive_probes = optional(number)
net_ipv4_tcp_keepalive_time = optional(number)
net_ipv4_tcp_max_syn_backlog = optional(number)
net_ipv4_tcp_max_tw_buckets = optional(number)
net_ipv4_tcp_tw_reuse = optional(bool)
net_netfilter_nf_conntrack_buckets = optional(number)
net_netfilter_nf_conntrack_max = optional(number)
vm_max_map_count = optional(number)
vm_swappiness = optional(number)
vm_vfs_cache_pressure = optional(number)
}))
transparent_huge_page_enabled = optional(string)
transparent_huge_page_defrag = optional(string)
swap_file_size_mb = optional(number)
}))
fips_enabled = optional(bool)
kubelet_disk_type = optional(string)
max_count = optional(number)
max_pods = optional(number)
message_of_the_day = optional(string)
mode = optional(string, "User")
min_count = optional(number)
node_network_profile = optional(object({
node_public_ip_tags = optional(map(string))
}))
node_labels = optional(map(string))
node_public_ip_prefix_id = optional(string)
node_taints = optional(list(string))
orchestrator_version = optional(string)
os_disk_size_gb = optional(number)
os_disk_type = optional(string, "Managed")
os_sku = optional(string)
os_type = optional(string, "Linux")
pod_subnet_id = optional(string)
priority = optional(string, "Regular")
proximity_placement_group_id = optional(string)
spot_max_price = optional(number)
scale_down_mode = optional(string, "Delete")
ultra_ssd_enabled = optional(bool)
vnet_subnet_id = optional(string)
upgrade_settings = optional(object({
max_surge = number
}))
windows_profile = optional(object({
outbound_nat_enabled = optional(bool, true)
}))
workload_runtime = optional(string)
zones = optional(set(string))
}))' to an otherwise identical type in which the only change is inside upgrade_settings:

upgrade_settings = optional(object({
max_surge = string
}))

Every other attribute of the map(object({ ... })) type is unchanged.

@zioproto (Collaborator)

I tested it and it is not a breaking change.

I first added upgrade_settings to our existing multiple_node_pools example and ran terraform apply:

diff --git a/examples/multiple_node_pools/main.tf b/examples/multiple_node_pools/main.tf
index c2c4c96..4ce7563 100644
--- a/examples/multiple_node_pools/main.tf
+++ b/examples/multiple_node_pools/main.tf
@@ -38,6 +38,9 @@ locals {
       vm_size        = "Standard_D2s_v3"
       node_count     = 1
       vnet_subnet_id = azurerm_subnet.test.id
+      upgrade_settings = {
+        max_surge       = 1
+      }
     }
   }
 }

Then I applied again with this module change:

diff --git a/variables.tf b/variables.tf
index bd6dbcc..0f0a0e1 100644
--- a/variables.tf
+++ b/variables.tf
@@ -784,7 +784,7 @@ variable "node_pools" {
     ultra_ssd_enabled            = optional(bool)
     vnet_subnet_id               = optional(string)
     upgrade_settings = optional(object({
-      max_surge = number
+      max_surge = string
     }))
     windows_profile = optional(object({
       outbound_nat_enabled = optional(bool, true)
@@ -881,7 +881,7 @@ variable "node_pools" {
     ultra_ssd_enabled            = (Optional) Used to specify whether the UltraSSD is enabled in the Node Pool. Defaults to `false`. See [the documentation](https://docs.microsoft.com/azure/aks/use-ultra-disks) for more information. Changing this forces a new resource to be created.
     vnet_subnet_id               = (Optional) The ID of the Subnet where this Node Pool should exist. Changing this forces a new resource to be created. A route table must be configured on this Subnet.
     upgrade_settings = optional(object({
-      max_surge = number
+      max_surge = string
     }))
     windows_profile = optional(object({
       outbound_nat_enabled = optional(bool, true)

The second terraform apply detected no changes, which is expected: when the declared type is string, Terraform automatically converts the number literal 1 to the string "1", so the value reaching the provider is identical.
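The underlying conversion rule can be checked in isolation with terraform console (illustrative transcript); Terraform applies the same number-to-string conversion implicitly whenever a value is assigned to a string-typed variable or attribute:

$ terraform console
> tostring(1)
"1"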

@CiucurDaniel (Contributor, Author)

How may I be of further help here? @zioproto

@zioproto (Collaborator)

@lonegunmanb the e2e tests failed here:

TestExampleUpgrade_named_cluster 2023-06-14T16:28:54Z retry.go:99: Returning due to fatal error: FatalError{Underlying: error while running command: exit status 1; ╷
│ Error: local error: tls: bad record MAC
│ 
│   with data.curl.public_ip[0],
│   on key_vault.tf line 10, in data "curl" "public_ip":
│   10: data "curl" "public_ip" {
│ 
╵}
=== NAME  TestExampleUpgrade_named_cluster
    plan.go:85: 
        	Error Trace:	/src/test/vendor/github.com/gruntwork-io/terratest/modules/terraform/plan.go:85
        	            				/src/test/vendor/github.com/Azure/terraform-module-test-helper/upgradetest.go:150
        	            				/src/test/vendor/github.com/Azure/terraform-module-test-helper/upgradetest.go:143
        	            				/src/test/vendor/github.com/Azure/terraform-module-test-helper/upgradetest.go:135
        	            				/src/test/vendor/github.com/Azure/terraform-module-test-helper/upgradetest.go:45
        	            				/src/test/upgrade/upgrade_test.go:72
        	Error:      	Received unexpected error:
        	            	FatalError{Underlying: error while running command: exit status 1; ╷
        	            	│ Error: local error: tls: bad record MAC
        	            	│ 
        	            	│   with data.curl.public_ip[0],
        	            	│   on key_vault.tf line 10, in data "curl" "public_ip":
        	            	│   10: data "curl" "public_ip" {
        	            	│ 
        	            	╵}
        	Test:       	TestExampleUpgrade_named_cluster
TestExampleUpgrade_named_cluster 2023-06-14T16:28:54Z retry.go:91: terraform [destroy -auto-approve -input=false -refresh=false -var managed_identity_principal_id=579b2585-3109-4a7a-b61c-5248a547d90b -lock=false]

Is this failure related to the change we are testing?

@zioproto (Collaborator)

zioproto commented Jun 16, 2023

I am running the end-to-end tests on my laptop on the main branch to make sure the e2e tests are working, but I get an unexpected error:

TestExamplesWithoutAssertion/examples/multiple_node_pools 2023-06-16T14:18:57Z test_structure.go:130: Copied terraform folder ../../examples/multiple_node_pools to /tmp/multiple_node_pools3917474392/src/examples/multiple_node_pools
TestExamplesWithoutAssertion/examples/multiple_node_pools 2023-06-16T14:18:57Z retry.go:91: terraform [init -upgrade=false -no-color]
TestExamplesWithoutAssertion/examples/multiple_node_pools 2023-06-16T14:19:11Z retry.go:91: terraform [version]
--- PASS: TestExamplesWithoutAssertion (0.00s)
    --- PASS: TestExamplesWithoutAssertion/examples/with_acr (876.62s)
    --- PASS: TestExamplesWithoutAssertion/examples/multiple_node_pools (1130.41s)
=== RUN   TestExamples_differentLocationForLogAnalyticsSolution
--- FAIL: TestExamples_differentLocationForLogAnalyticsSolution (0.00s)
panic: assignment to entry in nil map [recovered]
	panic: assignment to entry in nil map

goroutine 58 [running]:
testing.tRunner.func1.2({0x127b820, 0x18b0a80})
	/root/go/src/testing/testing.go:1526 +0x1c8
testing.tRunner.func1()
	/root/go/src/testing/testing.go:1529 +0x364
panic({0x127b820, 0x18b0a80})
	/root/go/src/runtime/panic.go:884 +0x1f4
github.com/Azure/terraform-azurerm-aks/e2e.TestExamples_differentLocationForLogAnalyticsSolution(0x18d11c0?)
	/src/test/e2e/terraform_aks_test.go:115 +0xd0
testing.tRunner(0x4000502b60, 0x159f0f8)
	/root/go/src/testing/testing.go:1576 +0x104
created by testing.(*T).Run
	/root/go/src/testing/testing.go:1629 +0x370
FAIL	github.com/Azure/terraform-azurerm-aks/e2e	1130.448s
FAIL
make: *** [tfmod-scaffold/GNUmakefile:54: e2e-test] Error 1

@lonegunmanb (Member)

lonegunmanb commented Jun 16, 2023

Thanks @CiucurDaniel and @zioproto. The error was triggered by incorrect test code; I'm curious why we didn't find it in the first place.

The incorrect test code:

func TestExamples_differentLocationForLogAnalyticsSolution(t *testing.T) {
	var vars map[string]any // declared but nil: no map is allocated here
	managedIdentityId := os.Getenv("MSI_ID")
	if managedIdentityId != "" {
		vars = map[string]any{
			"managed_identity_principal_id": managedIdentityId,
		}
	}
	// When MSI_ID is unset, vars is still nil and this line panics with
	// "assignment to entry in nil map".
	vars["log_analytics_workspace_location"] = "eastus2"
	test_helper.RunE2ETest(t, "../../", "examples/named_cluster", terraform.Options{
		Upgrade: true,
		Vars:    vars,
	}, nil)
}

We need to initialize vars before using it. I'll submit a PR to fix the test. This bug only occurs on our local machines: in the CI pipeline, MSI_ID is set, so the map gets initialized and we never hit the error.
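A minimal sketch of the fix, assuming a plain map literal is all that's needed (the actual follow-up PR may differ):

func TestExamples_differentLocationForLogAnalyticsSolution(t *testing.T) {
	vars := map[string]any{} // always allocated, so the assignments below cannot panic
	if managedIdentityId := os.Getenv("MSI_ID"); managedIdentityId != "" {
		vars["managed_identity_principal_id"] = managedIdentityId
	}
	vars["log_analytics_workspace_location"] = "eastus2"
	test_helper.RunE2ETest(t, "../../", "examples/named_cluster", terraform.Options{
		Upgrade: true,
		Vars:    vars,
	}, nil)
}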

@lonegunmanb (Member)

> @lonegunmanb the e2e tests failed here: [the TestExampleUpgrade_named_cluster "tls: bad record MAC" log, quoted in full above] Is this failure related to the change we are testing?

I believe it's an intermittent error. We use https://api.ipify.org to retrieve our public IP; given the error message Error: local error: tls: bad record MAC, maybe we hit a temporary network issue? Never mind, I've restarted the e2e test.

@zioproto (Collaborator)

It seems api.ipify.org is having issues and it is failing our CI :(

TestExampleUpgrade_named_cluster 2023-06-16T16:19:02Z retry.go:144: 'terraform [apply -input=false -auto-approve -var managed_identity_principal_id=579b2585-3109-4a7a-b61c-5248a547d90b -lock=false]' failed with the error 'error while running command: exit status 1; ╷
│ Error: Get "https://api.ipify.org?format=json": read tcp 10.1.28.0:58082->173.231.16.76:443: read: connection reset by peer
│ 
│   with data.curl.public_ip[0],
│   on key_vault.tf line 10, in data "curl" "public_ip":
│   10: data "curl" "public_ip" {
│ 
╵' but this error was expected and warrants a retry. Further details: Failed to reach helm charts repository.

@CiucurDaniel temporarily deployed to acctests, June 19, 2023 00:35, with GitHub Actions.
@lonegunmanb (Member) left a comment

Thanks @CiucurDaniel for the update, LGTM! 🚀

@lonegunmanb lonegunmanb merged commit ab31e96 into Azure:main Jun 19, 2023
skolobov pushed a commit to skolobov/terraform-azurerm-aks that referenced this pull request Oct 29, 2023
* Add upgrade_settings block for default nodepool