
Some job metrics are missing if job has no conditions #2443

Closed
DerRockWolf opened this issue Jul 7, 2024 · 4 comments · Fixed by #2485


DerRockWolf commented Jul 7, 2024

What happened:
Some job metrics (kube_job_status_failed, kube_job_complete, kube_job_failed) are missing for Jobs without conditions.

What you expected to happen:
Metrics are present regardless of whether conditions exist.

How to reproduce it (as minimally and precisely as possible):

  1. Create this job:

job.yaml:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: test
spec:
  template:
    spec:
      containers:
      - image: alpine
        name: test
        command:
          - "sh"
          - "-c"
        args:
          - sleep 5 && exit 1
      restartPolicy: Never
```

  2. Observe that the `.status` is missing a `conditions` object (before the backoffLimit is reached).
  3. Observe that, e.g., the kube_job_status_failed metric is missing (`curl localhost:8080/metrics | grep kube_job_status_failed`).

Anything else we need to know?:

The reason for this bug is that the labels and values are only set inside a for loop over the Job's conditions, i.e. only when a condition of type Failed exists:

```go
for _, c := range j.Status.Conditions {
	condition := c
	if condition.Type == v1batch.JobFailed {
		// labels and values are only appended in here
		// ...
	}
}
```

I can provide a fix that sets the value regardless of whether any condition with type Failed exists.
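
For concreteness, a minimal sketch of what such a fix could look like (hypothetical code, not the actual patch that landed in #2485, and assuming the `ms []*metric.Metric` accumulator used by the surrounding generator):

```go
// Hypothetical sketch: emit the failed-pod count unconditionally,
// and keep only the reason-labeled series behind the condition check.
ms = append(ms, &metric.Metric{
	Value: float64(j.Status.Failed), // present even when .status.conditions is empty
})

for _, c := range j.Status.Conditions {
	if c.Type == v1batch.JobFailed {
		// reason-labeled variants remain condition-dependent
		// ...
	}
}
```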

Environment:

  • kube-state-metrics version: v2.12.0 (master 85d1423)
  • Kubernetes version (use kubectl version): v1.29.6
  • Cloud provider or hardware configuration: homelab
@DerRockWolf DerRockWolf added the kind/bug Categorizes issue or PR as related to a bug. label Jul 7, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Jul 7, 2024
@dgrisonnet (Member)

/assign @richabanker
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 8, 2024
@richabanker (Contributor)

I just tried to reproduce this; so far I am actually able to see the kube_job_status_failed metric being reported:

```
# HELP kube_job_status_failed The number of pods which reached Phase Failed.
# TYPE kube_job_status_failed gauge
kube_job_status_failed{namespace="default",job_name="test"} 0
```

But yes, I don't see the kube_job_complete and kube_job_failed metrics, which I believe is working as intended (WAI), since these metrics should only be reported when the job's Status.Condition.Type changes to Complete or Failed.
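
(For illustration, a simplified paraphrase of how the condition-gated metrics are produced; this is a sketch of the pattern, not a verbatim quote from the job store:)

```go
// Sketch: kube_job_complete only yields series while a Complete
// condition exists on the Job; no condition means no metric.
for _, c := range j.Status.Conditions {
	if c.Type == v1batch.JobComplete {
		// one series per condition status (true/false/unknown)
		// ...
	}
}
```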

@DerRockWolf (Author)

kube_job_status_failed was there for roughly 30 seconds and then disappeared.

From my perspective, the kube_job_status_succeeded, kube_job_status_failed & kube_job_status_active metrics should all be emitted regardless of the overall job conditions. kube_job_status_failed is currently the only one that depends on the condition being present.

I listed kube_job_complete & kube_job_failed because they also depend on the existence of a condition, but I agree that these probably work as designed.

I also found that kube_job_status_failed isn't correctly implemented.
The description states: "The number of pods which reached Phase Failed and the reason for failure", but currently the number of failed pods is only present if reasonKnown is false...
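
(For context, the logic in question looks roughly like this; paraphrased from the job store's generator at the time, with names like `jobFailureReasons`, `failureReason`, and `boolFloat64` taken from that file:)

```go
reasonKnown := false
for _, reason := range jobFailureReasons {
	reasonKnown = reasonKnown || failureReason(&condition, reason)
	// known reasons only ever get a 0/1 value, never the failed-pod count
	ms = append(ms, &metric.Metric{
		LabelKeys:   []string{"reason"},
		LabelValues: []string{reason},
		Value:       boolFloat64(failureReason(&condition, reason)),
	})
}
if !reasonKnown {
	// the failed-pod count from j.Status.Failed is only emitted here,
	// on the unknown-reason fallback path
	ms = append(ms, &metric.Metric{
		LabelKeys:   []string{"reason"},
		LabelValues: []string{""},
		Value:       float64(j.Status.Failed),
	})
}
```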

@richabanker (Contributor)

Discussed briefly with @dgrisonnet; the suggestion to emit kube_job_status_failed even when there are no job conditions seems worthwhile. @DerRockWolf would you be open to creating a PR for that? If not, I can get started on one; please let us know what you prefer. Thanks!
