[SDK] Get the correct TrainJob components using get_job() API #2348

Open
andreyvelich opened this issue Dec 11, 2024 · 11 comments

Comments

@andreyvelich
Member

andreyvelich commented Dec 11, 2024

What you would like to be added?

As we discussed, the get_job() API can currently return multiple Pods for every TrainJob component, such as initializer or trainer-node-0: #2324 (comment). That can happen when Pods are re-created based on Batch/Job restart policies.
Therefore, users can see unexpected logs while using the Kubeflow Training SDK.

We should improve this API to show the correct TrainJob components to users.
For example, when we list all of the Pods, we can select the most recently created Pod with the same role (e.g. dataset-initializer).
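
A minimal sketch of that selection, assuming each Pod exposes a role and a creation timestamp. `PodInfo` and `latest_pod_per_role` are hypothetical names used only for illustration; they are not part of the Kubeflow SDK or the Kubernetes client:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical stand-in for the Pod fields this sketch needs."""
    name: str
    role: str  # e.g. "dataset-initializer" or "trainer-node-0"
    creation_timestamp: datetime


def latest_pod_per_role(pods: Iterable[PodInfo]) -> Dict[str, PodInfo]:
    """Keep only the most recently created Pod for each role."""
    latest: Dict[str, PodInfo] = {}
    for pod in pods:
        current = latest.get(pod.role)
        if current is None or pod.creation_timestamp > current.creation_timestamp:
            latest[pod.role] = pod
    return latest


# Illustrative data: the initializer Pod was re-created once.
pods = [
    PodInfo("initializer-abc", "dataset-initializer",
            datetime(2024, 12, 11, 10, 0, tzinfo=timezone.utc)),
    PodInfo("initializer-xyz", "dataset-initializer",
            datetime(2024, 12, 11, 10, 5, tzinfo=timezone.utc)),
    PodInfo("trainer-0-def", "trainer-node-0",
            datetime(2024, 12, 11, 10, 6, tzinfo=timezone.utc)),
]
selected = latest_pod_per_role(pods)
print({role: pod.name for role, pod in selected.items()})
```

In the SDK itself, the role would presumably be derived from Pod labels and the timestamp from the Pod metadata; how exactly those are read is left open here.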

Love this feature?

Give it a 👍 We prioritize the features with most 👍

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member Author

/remove-lifecycle stale
/good-first-issue

@andreyvelich:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/remove-lifecycle stale
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Beihao-Zhou

/assign

Hiiii! I'm new to the Kubeflow community, but I'm very interested in contributing, starting with this good first issue! Let me know if there is anything specific I should be aware of before jumping into the implementation.

@andreyvelich
Member Author

Thank you for your interest @Beihao-Zhou!
Please try to follow the PR threads that are mentioned in this issue, and explore the Kubeflow Trainer SDK.

@Beihao-Zhou

Hi @andreyvelich, I'm currently trying to test the Python SDK locally and ran into a few setup issues. Could you share how you typically configure your environment for testing the SDK? Thanks in advance!!

@tenzen-y
Member

You can check this: https://www.kubeflow.org/docs/components/trainer/getting-started/

@Beihao-Zhou

Thanks!! Just to confirm: the usual way for now is to run Docker containers and then run make <different_tests>, right?

@Garvit-77

Hello @Beihao-Zhou, are you still working on this issue?

@Beihao-Zhou

@Garvit-77 Sorry about the delay. I've been busy lately and won't be able to ship it soon. Feel free to work on it if you want!

@Garvit-77

@Beihao-Zhou Sure, I will be raising my PR shortly.
/assign
