[SDK] Get the correct TrainJob components using get_job() API #2348
Comments
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
@andreyvelich: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign Hiiii! I'm new to the Kubeflow community, but I'm very interested in contributing, starting with this good first issue! Let me know if there is anything specific I should be aware of before jumping into the implementation.
Thank you for your interest @Beihao-Zhou!
Hi @andreyvelich, I'm currently trying to test the Python SDK locally and ran into a few setup issues. Could you share how you typically configure your environment for testing the SDK? Thanks in advance!
You can check this: https://www.kubeflow.org/docs/components/trainer/getting-started/
Thanks!! Just to confirm, the usual way for now is to run the docker containers and run
Hello @Beihao-Zhou, are you still working on this issue?
@Garvit-77 Sorry about the delay. I've been busy these days and won't be able to get to it soon. Feel free to work on it if you want!
@Beihao-Zhou Sure, I will be raising my PR shortly.
What would you like to be added?
As we discussed, the get_job() API can currently return multiple Pods for every TrainJob component, like the initializer or trainer-node-0: #2324 (comment). That can happen when Pods are re-created based on the Batch/Job restart policies. Therefore, users can see unexpected logs while using the Kubeflow Training SDK.
We should improve this API to show the correct TrainJob components to users.
For example, when we list all of the Pods, we can select the most recently created Pod with the same role (e.g. dataset-initializer).
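Below is a minimal sketch of that selection logic, written against the Kubernetes Python client directly. The label keys (trainer.kubeflow.org/trainjob-name, trainer.kubeflow.org/trainjob-role) and the helper name are illustrative assumptions for this issue, not the Trainer controller's actual labels:

```python
from kubernetes import client, config


def get_latest_pod_for_role(namespace: str, trainjob_name: str, role: str):
    """Return the most recently created Pod for one TrainJob component role.

    The label keys below are illustrative assumptions; the actual labels are
    set by the Kubeflow Trainer controller and may differ.
    """
    config.load_kube_config()
    core_api = client.CoreV1Api()

    # Hypothetical label selector for "all Pods of this TrainJob with this role".
    selector = (
        f"trainer.kubeflow.org/trainjob-name={trainjob_name},"
        f"trainer.kubeflow.org/trainjob-role={role}"
    )
    pods = core_api.list_namespaced_pod(namespace, label_selector=selector).items
    if not pods:
        return None

    # If the batch Job restart policy re-created Pods, keep only the newest one
    # so get_job() reports a single Pod per component.
    return max(pods, key=lambda pod: pod.metadata.creation_timestamp)
```

Selecting by creation_timestamp keeps only the Pod from the latest restart attempt, so the logs and statuses surfaced by get_job() would come from the Pod that is actually running rather than from earlier, failed attempts.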
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.