Some Prometheus metrics not being reported properly #2408
Comments
I would like to work on this issue.
@andreyvelich @Electronic-Waste
I am trying to replicate the above behavior locally using a kind cluster. Please tell me the equivalent training operator image and training client image to use with a kind cluster.
I tried with the following configuration and found a similar issue and behavior. @andreyvelich, could you clarify what the exact behavior should be? For example, if I have run 1 PyTorchJob with 3 workers, what should the intended increment be in each of these counters?
Hi @izuku-sds — thanks for investigating further!
Regarding the above, my configuration with all versions is included at the bottom of the issue body.
The increment happens in trainer/pkg/controller.v1/pytorch/pytorchjob_controller.go, lines 390 to 401 at commit 5840e81. In spite of trying different things, I couldn't come up with a proper condition that tells whether we have already seen this master replica before.
There is also a race condition here.
Alternative code:
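Since the controller itself is written in Go, the following Python sketch only illustrates the idea under discussion: increment the counter only when the job's Succeeded condition transitions from absent to present, so that repeated reconciles of the same master replica do not bump it again. All names are hypothetical and this is not the actual training-operator code.

```python
# Illustrative sketch only: the real controller is Go and these names are
# hypothetical. The point is to increment the "successful jobs" counter on the
# first transition to Succeeded, not on every reconcile of the master replica.
from prometheus_client import Counter

# prometheus_client exposes counters with a "_total" suffix, so this shows up
# as training_operator_jobs_successful_total on /metrics.
jobs_successful = Counter(
    "training_operator_jobs_successful",
    "Number of training jobs that completed successfully",
)


def has_condition(job_status: dict, cond_type: str) -> bool:
    """True if the given condition type is already present and set to 'True'."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in job_status.get("conditions", [])
    )


def on_master_replica_succeeded(job_status: dict) -> None:
    # Remember whether the job was already marked Succeeded *before* this
    # reconcile; only the absent -> present transition increments the counter.
    if not has_condition(job_status, "Succeeded"):
        job_status.setdefault("conditions", []).append(
            {"type": "Succeeded", "status": "True"}
        )
        jobs_successful.inc()
```

Whether this matches the actual reconcile flow, and whether it is enough to avoid the race mentioned above, would need to be verified against the Go controller code.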
What happened?
My training-operator pod's /metrics endpoint is not reporting the Prometheus metrics mentioned here properly. To be precise, training_operator_jobs_created_total is being incremented as expected — the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.

I started a PyTorch training job using this code (closely based on the guide here):

This job completed and was successful:
Code:
Output:
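For reference, a minimal sketch of this kind of SDK-based job creation (not the exact code used here; function and parameter names follow the Training Operator Python SDK and may differ between SDK versions):

```python
from kubeflow.training import TrainingClient


def train_func():
    # Placeholder training function; a real job would run PyTorch code here.
    print("training step done")


client = TrainingClient()

# Create a PyTorchJob from the function above and wait until it finishes.
client.create_job(
    name="pytorch-metrics-test",  # assumed job name, reused in the sketches below
    train_func=train_func,
    num_workers=3,
)
client.wait_for_job_conditions(name="pytorch-metrics-test")
print(client.is_job_succeeded(name="pytorch-metrics-test"))
```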
However, when visiting the /metrics endpoint, I could only see that training_operator_jobs_created_total was at 1 — neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (neither one was present on the page).

So, I deleted this job to try again:
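A minimal sketch of this deletion step, assuming the same TrainingClient and job name as in the sketch above:

```python
# Delete the finished PyTorchJob; this is the operation that should eventually
# be reflected in training_operator_jobs_deleted_total on /metrics.
client.delete_job(name="pytorch-metrics-test")
```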
Still, on the /metrics endpoint, training_operator_jobs_deleted_total was not incremented/visible.

I created a job again with the same code and repeated the process. From this point on, training_operator_jobs_successful_total was incremented as expected. However, training_operator_jobs_deleted_total is still failing to update.

I have not had any failed/restarted jobs, so I do not know about the behavior of those two metrics.
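A minimal sketch of how the relevant counters can be inspected, assuming the training-operator metrics port has been forwarded to localhost:8080 with kubectl port-forward (the port number and service name depend on the installation):

```python
import requests

# Fetch the raw Prometheus exposition text from the forwarded metrics port.
metrics_text = requests.get("http://localhost:8080/metrics", timeout=10).text

# Print only the job counters this issue is about. A counter that has never
# been incremented is simply absent from the page, which matches what is
# described above for the successful/deleted counters.
for line in metrics_text.splitlines():
    if line.startswith("training_operator_jobs_"):
        print(line)
```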
What did you expect to happen?

I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.

I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.

Environment
Kubernetes version:
Training Operator version:
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.