add support for gke node group (and update waiter) #16

Merged: 7 commits, Jan 7, 2024

2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ and **Merged pull requests**. Critical items to know are:
The versions coincide with releases on pip. Only major versions will be released as tags on Github.

## [0.0.x](https://github.com/converged-computing/kubescaler/tree/main) (0.0.x)
+- do not use the waiter for nodegroup_active, as it does not work! (0.0.18)
+- support for Google Cloud instance group creation, etc.
- support adding one-off node groups to a cluster (0.0.17)
- allow manual customization and timing of nodegroup (e.g., for spot) (0.0.16)
- extensive changes to aws client (thanks to @rajibhossen!) (0.0.15)
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

-Copyright (c) 2022-2023 LLNS, LLC and other HPCIC DevTools Developers.
+Copyright (c) 2022-2024 LLNS, LLC and other HPCIC DevTools Developers.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
2 changes: 1 addition & 1 deletion kubescaler/cluster.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
2 changes: 1 addition & 1 deletion kubescaler/decorators.py
@@ -1,4 +1,4 @@
-# Copyright 2022-2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
2 changes: 1 addition & 1 deletion kubescaler/defaults.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
2 changes: 1 addition & 1 deletion kubescaler/logger.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
2 changes: 1 addition & 1 deletion kubescaler/scaler/aws/ami.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
43 changes: 28 additions & 15 deletions kubescaler/scaler/aws/cluster.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
@@ -306,13 +306,33 @@ def _generate_configuration(self):
             token["status"]["expirationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
         )

+    def waiter_wait_for_nodes(self, nodegroup_name):
+        """
+        Use the "waiter" provided by eks to wait for nodes.
+
+        It is not recommended to use this function as it is flaky.
+        We are keeping it here to preserve the code to try again,
+        as perhaps the flakiness might improve!
+        """
+        try:
+            print(f"Waiting for {nodegroup_name} nodegroup...")
+            waiter = self.eks.get_waiter("nodegroup_active")
+            # MaxAttempts defaults to 120, and Delay 30 seconds
+            waiter.wait(clusterName=self.cluster_name, nodegroupName=nodegroup_name)
+        except Exception as e:
+            # Allow waiting 3 more minutes
+            print(f"Waiting for nodegroup creation exceeded wait time: {e}")
+            time.sleep(180)
+
     @timed
     def wait_for_nodes(self):
         """
         Wait for the nodes to be ready.

-        We do this separately to allow timing. This function would be improved if
-        we didn't need subprocess, but the waiter doesn't seem to work.
+        We do this separately to allow timing. This function
+        can't get a perfectly accurate timing given the sleep, but the
+        waiter doesn't work. But I suspect the waiter has a sleep too, so
+        maybe not so bad.
         """
         start = time.time()
         kubectl = self.get_k8s_client()
@@ -329,10 +349,9 @@ def wait_for_nodes(self):
             if ready_count >= self.node_count:
                 break
         print(f"Time for kubernetes to get nodes - {time.time()-start}")
-        return ready_count
         # The waiter doesn't seem to work - so we call kubectl until it's ready
-        # waiter = self.eks.get_waiter("nodegroup_active")
-        # waiter.wait(clusterName=self.cluster_name, nodegroupName=self.node_autoscaling_group_name)
+        # self.waiter_wait_for_nodes(self.node_autoscaling_group_name)
+        return ready_count

     @timed
     def watch_for_nodes_in_k8s(self, count):
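
Review note: only the tail of the polling loop in `wait_for_nodes` is visible in this hunk, so it is not shown whether the loop has an upper bound on how long it waits. A minimal sketch of the same poll-and-sleep idea with a deadline follows. The name `wait_for_ready`, its parameters, and the injected `clock` and `sleep` callables are hypothetical and not part of kubescaler; `get_ready_count` stands in for whatever counts Ready nodes through the Kubernetes client.

```python
import time
from typing import Callable


def wait_for_ready(
    get_ready_count: Callable[[], int],
    expected: int,
    timeout: float = 600.0,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """
    Poll get_ready_count until it reports at least `expected` nodes.

    Hypothetical helper (not kubescaler API). Raises TimeoutError if the
    deadline passes before enough nodes are ready.
    """
    deadline = clock() + timeout
    while True:
        ready = get_ready_count()
        if ready >= expected:
            return ready
        # Check the deadline after polling so we always poll at least once
        if clock() >= deadline:
            raise TimeoutError(
                f"only {ready} of {expected} nodes ready after {timeout} seconds"
            )
        sleep(interval)
```

Passing `clock` and `sleep` as arguments is only a convenience for testing the loop without real delays; in normal use the defaults apply and the caller supplies just the counting function and the expected node count.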
@@ -778,15 +797,9 @@ def _create_nodegroup(self, node_group, nodegroup_name):
         if node_group is None:
             raise ValueError("Could not create nodegroup")

-        try:
-            print(f"Waiting for {nodegroup_name} nodegroup...")
-            waiter = self.eks.get_waiter("nodegroup_active")
-            # MaxAttempts defaults to 120, and Delay 30 seconds
-            waiter.wait(clusterName=self.cluster_name, nodegroupName=nodegroup_name)
-        except Exception as e:
-            # Allow waiting 3 more minutes
-            print(f"Waiting for nodegroup creation exceeded wait time: {e}")
-            time.sleep(180)
+        # DO NOT USE THE WAITER, it is buggy and does not work.
+        # self.waiter_wait_for_nodes(nodegroup_name)
+        self.wait_for_nodes()

         # Retrieve the same metadata if we had retrieved it
         return self.eks.describe_nodegroup(
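
Review note: the waiter call removed from `_create_nodegroup` relied on the waiter's built-in limits and caught every exception before sleeping another 180 seconds. If the waiter is tried again later, one option is to pass the limits explicitly through the `WaiterConfig` argument that boto3 waiters accept, and to let a failure propagate to the caller. The sketch below assumes `eks` is a boto3 EKS client. The function name is hypothetical, and the default values are copied from the comment in the diff rather than verified against the library.

```python
def wait_for_nodegroup_active(
    eks, cluster_name, nodegroup_name, delay=30, max_attempts=120
):
    """
    Block until the nodegroup is active, with explicit waiter limits.

    Hypothetical helper (not kubescaler API). `eks` is assumed to be a
    boto3 EKS client. Any error raised by the waiter is not caught here,
    so the caller decides how to handle a timeout.
    """
    waiter = eks.get_waiter("nodegroup_active")
    waiter.wait(
        clusterName=cluster_name,
        nodegroupName=nodegroup_name,
        # Delay is the number of seconds between polls and MaxAttempts
        # is the number of polls before the waiter gives up
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
```

Stating the limits at the call site keeps the total wait budget (at most `delay * max_attempts` seconds of sleeping) visible in the code instead of depending on a default.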
2 changes: 1 addition & 1 deletion kubescaler/scaler/aws/template.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)
2 changes: 1 addition & 1 deletion kubescaler/scaler/aws/token.py
@@ -1,4 +1,4 @@
-# Copyright 2023 Lawrence Livermore National Security, LLC and other
+# Copyright 2023-2024 Lawrence Livermore National Security, LLC and other
# HPCIC DevTools Developers. See the top-level COPYRIGHT file for details.
#
# SPDX-License-Identifier: (MIT)