Some Prometheus metrics not being reported properly #2408
Comments
I would like to work on this issue.
@andreyvelich @Electronic-Waste
I am trying to replicate the above behavior locally using a kind cluster. Please tell me the equivalent training operator image and training client image to use with a kind cluster.
I tried with the following configuration and found a similar issue and behavior. @andreyvelich, could you clarify what the exact behavior should be? For example, if I have run 1 PyTorchJob with 3 workers, what should the intended increment be in each of these counters?
Hi @izuku-sds — thanks for investigating further!
Regarding the above, my configuration with all versions is included at the bottom of the issue body.
The increment happens in trainer/pkg/controller.v1/pytorch/pytorchjob_controller.go, lines 390 to 401 at commit 5840e81. In spite of trying different things, I couldn't come up with a proper condition that tells whether we have already seen this master replica before.
There is also a race condition here.
Alternative code:
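Since the controller itself is written in Go, the following Python sketch only illustrates the idea under discussion: increment the counter only when the job's Succeeded condition transitions from absent to present, so that repeated reconciles of the same master replica do not bump it again. All names are hypothetical and this is not the actual training-operator code.

```python
# Illustrative sketch only: the real controller is Go and these names are
# hypothetical. The point is to increment the "successful jobs" counter on the
# first transition to Succeeded, not on every reconcile of the master replica.
from prometheus_client import Counter

# prometheus_client exposes counters with a "_total" suffix, so this shows up
# as training_operator_jobs_successful_total on /metrics.
jobs_successful = Counter(
    "training_operator_jobs_successful",
    "Number of training jobs that completed successfully",
)


def has_condition(job_status: dict, cond_type: str) -> bool:
    """True if the given condition type is already present and set to 'True'."""
    return any(
        c.get("type") == cond_type and c.get("status") == "True"
        for c in job_status.get("conditions", [])
    )


def on_master_replica_succeeded(job_status: dict) -> None:
    # Remember whether the job was already marked Succeeded *before* this
    # reconcile; only the absent -> present transition increments the counter.
    if not has_condition(job_status, "Succeeded"):
        job_status.setdefault("conditions", []).append(
            {"type": "Succeeded", "status": "True"}
        )
        jobs_successful.inc()
```

Whether this matches the actual reconcile flow, and whether it is enough to avoid the race mentioned above, would need to be verified against the Go controller code.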
What happened?
My training-operator pod's /metrics endpoint is not reporting the Prometheus metrics mentioned here properly. To be precise, training_operator_jobs_created_total is being incremented as expected — the issue (thus far) has been with training_operator_jobs_successful_total and training_operator_jobs_deleted_total.

I started a PyTorch training job using this code (closely based on the guide here):

This job completed and was successful:
Code:
Output:
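For reference, a minimal sketch of this kind of SDK-based job creation (not the exact code used here; function and parameter names follow the Training Operator Python SDK and may differ between SDK versions):

```python
from kubeflow.training import TrainingClient


def train_func():
    # Placeholder training function; a real job would run PyTorch code here.
    print("training step done")


client = TrainingClient()

# Create a PyTorchJob from the function above and wait until it finishes.
client.create_job(
    name="pytorch-metrics-test",  # assumed job name, reused in the sketches below
    train_func=train_func,
    num_workers=3,
)
client.wait_for_job_conditions(name="pytorch-metrics-test")
print(client.is_job_succeeded(name="pytorch-metrics-test"))
```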
However, when visiting the /metrics endpoint, I could only see that training_operator_jobs_created_total was at 1 — neither training_operator_jobs_successful_total nor training_operator_jobs_failed_total had been incremented (neither one was present on the page).

So, I deleted this job to try again:
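A minimal sketch of this deletion step, assuming the same TrainingClient and job name as in the sketch above:

```python
# Delete the finished PyTorchJob; this is the operation that should eventually
# be reflected in training_operator_jobs_deleted_total on /metrics.
client.delete_job(name="pytorch-metrics-test")
```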
Still, on the /metrics endpoint, training_operator_jobs_deleted_total was not incremented/visible.

I created a job again with the same code and repeated the process. From this point on, training_operator_jobs_successful_total was incremented as expected. However, training_operator_jobs_deleted_total is still failing to update.

I have not had any failed/restarted jobs, so I do not know about the behavior of those two metrics.
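A minimal sketch of how the relevant counters can be inspected, assuming the training-operator metrics port has been forwarded to localhost:8080 with kubectl port-forward (the port number and service name depend on the installation):

```python
import requests

# Fetch the raw Prometheus exposition text from the forwarded metrics port.
metrics_text = requests.get("http://localhost:8080/metrics", timeout=10).text

# Print only the job counters this issue is about. A counter that has never
# been incremented is simply absent from the page, which matches what is
# described above for the successful/deleted counters.
for line in metrics_text.splitlines():
    if line.startswith("training_operator_jobs_"):
        print(line)
```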
What did you expect to happen?

I expected training_operator_jobs_successful_total to be incremented if a job is deemed successful by the Python client.

I expected training_operator_jobs_deleted_total to be incremented if a job's resources are deleted successfully and there are no error logs related to job deletion in the training-operator pod logs.

Environment
Kubernetes version:
Training Operator version:
Training Operator Python SDK version:
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.