work in progress - remote run #3986

Closed
wants to merge 14 commits into from
Conversation

@Qmando (Member) commented Nov 2, 2024

Work in progress, not fully functional!

Create a new API, /service/{service}/{instance}/remote_run

Currently accepts a username, whether the run is interactive, and whether it should recreate the deployment

Launches a deployment suffixed with -remote-run-{username}. If interactive, replaces the command with sleep. If the deployment already exists, optionally terminates it and relaunches. Returns the name of the pod and the namespace.

Created a CLI command to interact with this API. It gives you a kubectl command to run to get a shell in the container.
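For illustration only, the printed command might look like the following; the namespace and pod name here are made-up placeholders, not values from the PR:

```shell
# Hypothetical values standing in for what the API would return
NAMESPACE="paasta-remote-run"
POD_NAME="example-service-remote-run-qlo-abc123"
# The kind of command the CLI could print for the user to run
echo "kubectl exec -it --namespace ${NAMESPACE} ${POD_NAME} -- /bin/bash"
```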

def remote_run(args) -> int:
    """Run stuff, but remotely!"""
    system_paasta_config = load_system_paasta_config(
        "/nail/home/qlo/paasta_config/paasta/"
    )
Qmando (Member, Author):

I was using this to override the API endpoint

Comment on lines 58 to 64
def wait_until_deployment_gone(kube_client, namespace, deployment_name):
    for retry in range(10):
        pod = find_pod(kube_client, namespace, deployment_name, 1)
        if not pod:
            return
        sleep(5)
    raise Exception("Pod still exists!")
Qmando (Member, Author):

This doesn't work. The request times out before the deployment terminates. I think we'd need to return early and have the client wait and retry.
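A client-side version of this wait could be as simple as the sketch below; `wait_for` is a hypothetical helper, not code from this PR:

```python
import time


def wait_for(check, attempts=10, delay=5, sleep=time.sleep):
    """Poll `check` until it returns truthy; give up after `attempts` tries."""
    for _ in range(attempts):
        if check():
            return True
        sleep(delay)
    return False
```

The CLI would pass a `check` that calls the API and reports whether the old deployment is gone, keeping the long wait out of the API worker.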


# Create the app with a new name
formatted_application = deployment.format_kubernetes_app()
formatted_application.metadata.name += f"-remote-run-{user}"
Contributor:

there's the risk of the pod name being truncated, so we'll definitely also want indicators of this being a remote-run in the pod's labels
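As a sketch of why labels are safer than name suffixes: Kubernetes caps label values and pod hostnames at 63 characters, so a long service name can push the suffix out. The label keys below are assumptions modeled on the existing paasta.yelp.com/service convention, not keys from this PR:

```python
MAX_LABEL_LEN = 63  # Kubernetes limit for label values (and pod hostnames)


def remote_run_labels(user: str) -> dict:
    """Labels that mark a pod as remote-run regardless of name truncation."""
    # Hypothetical keys; only the paasta.yelp.com prefix is taken from the PR.
    return {
        "paasta.yelp.com/remote_run": "true",
        "paasta.yelp.com/remote_run_user": user[:MAX_LABEL_LEN],
    }
```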

Comment on lines 2049 to 2053
selector=V1LabelSelector(
    match_labels={
        "paasta.yelp.com/service": self.get_service(),
        "paasta.yelp.com/instance": self.get_instance(),
    }
Qmando (Member, Author):

todo: This was copied over from the Deployment function below and is causing errors.

Contributor:

mmm, yeah, I don't see why you would need a selector at this stage

kind="Job",
metadata=self.get_kubernetes_metadata(git_sha),
spec=V1JobSpec(
    active_deadline_seconds=3600,
Contributor:

I guess we'll need to propagate this somehow from a parameter
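One way to propagate it, sketched with made-up names and limits (neither `resolve_deadline` nor the bounds appear in the PR):

```python
from typing import Optional

DEFAULT_DEADLINE_S = 3600   # hypothetical default, matching the hardcoded value above
MAX_DEADLINE_S = 8 * 3600   # hypothetical upper bound


def resolve_deadline(requested: Optional[int]) -> int:
    """Turn an optional API parameter into active_deadline_seconds."""
    if requested is None:
        return DEFAULT_DEADLINE_S
    return max(1, min(requested, MAX_DEADLINE_S))
```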


Comment on lines 2074 to 2087
# DO NOT ADD LABELS AFTER THIS LINE
config_hash = get_config_hash(
    self.sanitize_for_config_hash(complete_config),
    force_bounce=self.get_force_bounce(),
)
complete_config.metadata.labels["yelp.com/paasta_config_sha"] = config_hash
complete_config.metadata.labels["paasta.yelp.com/config_sha"] = config_hash

complete_config.spec.template.metadata.labels[
    "yelp.com/paasta_config_sha"
] = config_hash
complete_config.spec.template.metadata.labels[
    "paasta.yelp.com/config_sha"
] = config_hash
Contributor:

this stuff is very related to deployment bounces, so I'd say we don't need it in this case

audience = "remote-run-" + user
service_account = "default"
token_spec = V1TokenRequestSpec(
    expiration_seconds=600,  # minimum allowed by k8s
    audiences=[audience],
)
Contributor:

The audience will need to be the kubernetes one, otherwise it won't be usable for doing execs. You should be able to just leave it unset to achieve that.

def create_temp_exec_token(kube_client: KubeClient, namespace: str, user: str):
    """Create a short lived token for exec"""
    audience = "remote-run-" + user
    service_account = "default"
Contributor:

This will need to come from the main paasta config. We'll have to create a user that can just exec in to places.

@@ -4354,6 +4452,19 @@ def get_kubernetes_secret_env_variables(
return decrypted_secrets


def create_temp_exec_token(kube_client: KubeClient, namespace: str, user: str):
Contributor:

The user parameter is not used here. Is the plan to include it in the service account name if we go for creating ephemeral service accounts?
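If ephemeral per-user service accounts are the plan, the username would have to be folded into a DNS-1123-safe name; a hypothetical sketch (the naming scheme is an assumption, not from this PR):

```python
import re

SA_NAME_LIMIT = 63  # DNS-1123 label length limit


def remote_run_service_account(user: str) -> str:
    """Derive a per-user service account name for remote-run (hypothetical scheme)."""
    sanitized = re.sub(r"[^a-z0-9-]+", "-", user.lower()).strip("-")
    return f"remote-run-{sanitized}"[:SA_NAME_LIMIT].rstrip("-")
```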

@piax93 (Contributor) commented Mar 19, 2025

This has been fixed and refactored in more sensible chunks

@piax93 piax93 closed this Mar 19, 2025
@piax93 piax93 mentioned this pull request Mar 19, 2025
nemacysts pushed a commit that referenced this pull request Mar 24, 2025
This is the largest portion of #3986, implementing the new API for remote-run (look at the second commit, the first is mostly codegen stuff).

I deviated a bit from the original design PoC code: it waited for the remote-run pods to become available as part of the `remote_run/.../start` endpoint, which is not a great idea since it would keep one API worker busy doing nothing. So I modified that endpoint to return as soon as the job resource is created, and then added another one to allow the CLI client to poll for updates until the pod becomes available.

So in summary the logic is the following:
* remote-run/start
  * load all the necessary config for a service instance, create a job resource, and launch it
  * return name of the job
* remote-run/poll
  * look if a pod for a remote-run job is ready
* remote-run/token
  * get temporary credentials to exec into the remote-run pod
* remote-run/stop
  * explicitly end the remote-run job (which will eventually end anyway, since we set a deadline)
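Put together, a client could drive the four endpoints roughly like this sketch; the `api` callable, response keys, and timings are assumptions, not the actual CLI code:

```python
import time


def remote_run_session(api, service, instance, user,
                       max_polls=30, interval=2, sleep=time.sleep):
    """Start a remote-run job, wait for its pod, then fetch an exec token."""
    job = api("start", service=service, instance=instance, user=user)
    for _ in range(max_polls):
        status = api("poll", job=job["job_name"])
        if status.get("ready"):
            token = api("token", job=job["job_name"])
            return status["pod_name"], token["token"]
        sleep(interval)
    api("stop", job=job["job_name"])  # give up and clean up the job
    raise TimeoutError("remote-run pod never became ready")
```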

I also refused to add more stuff to the `kubernetes_tools` module as that already has **4500+ lines in it**, which is too much for me not to get triggered about, so I grouped all the new methods needed in a new module under `paasta_tools.kubernetes`
nemacysts pushed a commit that referenced this pull request Mar 24, 2025
Last portion of #3986, with the updated logic to reflect the updated API conventions introduced in #4022.
piax93 added a commit that referenced this pull request Mar 25, 2025
Last portion of #3986, with the updated logic to reflect the updated API conventions introduced in #4022.
nemacysts pushed a commit that referenced this pull request Mar 25, 2025
This was meant to be included in the last release but I (luisp) don't know how to read and merged things in the "wrong" order. 

This correctly merges in the following PRs:

* cleanup for remote-run resources (#4024)

Clean up logic for the (meant to be) ephemeral resources which are created in remote-run invocations (#4022).

* new remote-run cli (#4025)

Last portion of #3986, with the updated logic to reflect the updated API conventions introduced in #4022.