
[GCP] skip MIG termination for only CPU instances #4654

Open

luyaor wants to merge 5 commits into master
Conversation

@luyaor (Author) commented on Feb 6, 2025

Resolves #4594.

References:

https://cloud.google.com/compute/docs/reference/rest/beta/instanceGroupManagerResizeRequests/get

https://cloud.google.com/compute/docs/reference/rest/beta/instanceGroupManagerResizeRequests/cancel

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@Michaelvll (Collaborator) commented:

Is this the case for the issue in #4594? IIRC, for CPU machines, we will skip the MIG code path for creation, so the termination should also correctly skip that?

@luyaor (Author) commented on Feb 6, 2025

sky down will run this line, which requires extra permissions:

operation = compute.instanceGroupManagerResizeRequests().list(
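
For context, here is a minimal standalone sketch of that call (and the corresponding cancel) via the google-api-python-client discovery interface; the project, zone, and group names are placeholders, and the parameters follow the REST reference linked above rather than the exact SkyPilot code:

import googleapiclient.discovery

# The resize-request methods live in the beta Compute API.
compute = googleapiclient.discovery.build('compute', 'beta')

# Listing resize requests is the call that needs the extra permission.
resp = compute.instanceGroupManagerResizeRequests().list(
    project='my-project',            # placeholder
    zone='us-central1-a',            # placeholder
    instanceGroupManager='my-mig',   # placeholder
).execute()

# Cancelling pending requests uses the same resource, so it touches the
# same permission surface.
for req in resp.get('items', []):
    compute.instanceGroupManagerResizeRequests().cancel(
        project='my-project',
        zone='us-central1-a',
        instanceGroupManager='my-mig',
        resizeRequest=req['name'],
    ).execute()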

It seems that this path is skipped in my reproduction: the logs show "Skip resize request cancellation" and "Skip deletion".

(Screenshot 2025-02-05 22:09:11: terminal output showing the "Skip resize request cancellation" and "Skip deletion" messages)

I am not sure whether this is expected. I have tried running sky stop first and then sky down, which terminates the cluster successfully.

If this is not expected, that would mean the permission is not the root cause, right? If so, let me figure it out.

@Michaelvll (Collaborator) commented:

Please double-check the code path for launching a CPU instance with provision_timeout set. I think MIG should only be effective for GPU instances, i.e. we skip the MIG creation code path when provisioning a CPU instance, and we should do the same in the termination path as well, instead of going through the MIG path and having its steps skipped.

@luyaor (Author) commented on Feb 6, 2025

Hi Michaelvll,

Thanks for the explanation.

> Is this the case for the issue in #4594? IIRC, for CPU machines, we will skip the MIG code path for creation, so the termination should also correctly skip that?

Just want to make sure:

  • for launching a CPU instance with the following provision_timeout set:
managed_instance_group:
    run_duration: 30
    provision_timeout: 600

I find that:

  • in Provision config, "use_managed_instance_group": true
  • in System Setup After Provision, in "provider_config", "use_managed_instance_group": true
  • in System Setup After Provision, in "instances", "use-managed-instance-group": 1

Do we need to keep the above tags/attributes as true or 1 but skip the MIG-related code path, or should we just mark them as false or 0 in this situation?

For creation, I found that the current code path checks both the MIG config and the availability of accelerators to decide whether the GCPNodeType should be MIG (see here).

If the node does not have TPUs, the node type will be GCPNodeType.COMPUTE, create_instances will follow the compute-node path, and the MIG code path is skipped.

So, for termination, we could also rely on (1) the MIG config and (2) the existence of TPUs to decide whether to skip the MIG path. I am not sure if this is correct; please check it. A minimal sketch of this idea follows below.

Consequently, in the situation above, the termination code path would not make any API requests to the GCP ManagedInstanceGroup endpoints.
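
This is only a sketch of the decision described above, not the actual SkyPilot code; the function names, config keys (use_managed_instance_group, guestAccelerators), and the terminate entry point are illustrative assumptions:

def _is_mig_cluster(provider_config: dict, node_config: dict) -> bool:
    # Mirror the creation-side decision: only treat the cluster as a MIG
    # when MIG is enabled in the config AND the node has accelerators.
    mig_enabled = provider_config.get('use_managed_instance_group', False)
    has_accelerators = bool(node_config.get('guestAccelerators'))
    return mig_enabled and has_accelerators


def terminate(provider_config: dict, node_config: dict) -> None:
    if _is_mig_cluster(provider_config, node_config):
        # MIG branch: cancel pending resize requests and delete the MIG.
        # This is the only branch that needs the instanceGroupManager*
        # permissions.
        ...
    else:
        # CPU-only instance: delete it as a regular compute instance and
        # never call the MIG APIs.
        ...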

  • Do the permissions still need to be added somewhere in provision/gcp/constants.py?

    If the user needs MIG, I think the following permissions are required (a rough sketch of the addition follows below):

     'compute.instanceGroupManagers.get',
     'compute.instanceGroupManagers.update',
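
Purely as a sketch of what I mean (the constant name below and the way it would be merged into the existing permission list in provision/gcp/constants.py are my assumptions, not the current code):

# Hypothetical addition to sky/provision/gcp/constants.py: group the
# MIG-only permissions so they are only granted/checked when
# managed_instance_group is enabled in the user's config.
MIG_PERMISSIONS = [
    'compute.instanceGroupManagers.get',
    'compute.instanceGroupManagers.update',
]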

@luyaor changed the title from "update permission for instanceGroupManagerResizeRequests." to "[GCP] skip MIG termination for only CPU instances" on Feb 8, 2025