
Scheduled long running test failed - Run ID: 12755105579 #8238

Closed

rad-ci-bot opened this issue Jan 13, 2025 · 5 comments
Labels
bug (Something is broken or not working as expected) · test-failure (A scheduled test run has failed and needs to be investigated)

Comments

@rad-ci-bot
Collaborator

rad-ci-bot commented Jan 13, 2025

Bug information

This issue is automatically generated when the scheduled long running test fails. The Radius long running test runs every 2 hours, every day. Keep in mind that the test may fail due to workflow infrastructure issues, such as network problems, rather than flakiness in the test itself. For further investigation, please visit here.

AB#14063

rad-ci-bot added the bug and test-failure labels on Jan 13, 2025
@radius-triage-bot

👋 @rad-ci-bot Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process, please visit our triage overview.

@kachawla
Contributor

Name: "ucp", Namespace: "radius-system"'
Retryable error? true
Retrying as current number of retries 0 less than max number of retries 30
Error received when checking status of resource controller. Error: 'Get "[https://radlrtest00-aks-lrkgoivm.hcp.westus3.azmk8s.io:443/api/v1/namespaces/radius-system/services/controller](https://radlrtest00-aks-lrkgoivm.hcp.westus3.azmk8s.io/api/v1/namespaces/radius-system/services/controller)": context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"```
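
As an aside (not part of the original comment), a minimal sketch of the kind of checks one might run against the cluster for a client-side timeout like this, using the service name and namespace from the log above; everything else is an assumption about the next step:

```
# Sketch only: confirm the controller Service exists and the radius-system
# pods are healthy, since the error above is a timeout calling that Service.
kubectl get service controller -n radius-system
kubectl get pods -n radius-system -o wide
```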

@kachawla
Contributor

Looking into this further, I'm seeing this error on the long running test AKS cluster:

    "message": "0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..

@kachawla
Contributor

kachawla commented Jan 13, 2025

This is related to the PR that added Postgres database: #8072. I'm going to revert the commit for now to unblock long running tests.
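
For reference, a revert along those lines could be prepared roughly as below (a sketch only; the branch name is hypothetical, and the hash is the one named in the revert commit further down):

```
# Sketch: revert the Postgres database change on a topic branch and open a PR.
# "revert-postgres-database" is a hypothetical branch name.
git checkout -b revert-postgres-database
git revert --no-edit 9e74e73
git push origin revert-postgres-database
```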

[Screenshot 2025-01-13 at 3.20.48 PM]

Pod details:

```
% kubectl describe pod database-0 -n radius-system
Name:             database-0
Namespace:        radius-system
Priority:         0
Service Account:  database
Node:             <none>
Labels:           app.kubernetes.io/name=database
                  app.kubernetes.io/part-of=radius
                  apps.kubernetes.io/pod-index=0
                  control-plane=database
                  controller-revision-hash=database-77fc8687f8
                  statefulset.kubernetes.io/pod-name=database-0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/database
Containers:
  database:
    Image:      ghcr.io/radius-project/mirror/postgres:latest
    Port:       5432/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     2
      memory:  512Mi
    Environment Variables from:
      database-secret  ConfigMap  Optional: false
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5h62g (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  database:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  database-database-0
    ReadOnly:   false
  kube-api-access-5h62g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   47m (x10 over 76m)      default-scheduler   0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling..
  Warning  FailedScheduling   7m17s (x8 over 42m)     default-scheduler   0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  3m39s (x840 over 144m)  cluster-autoscaler  pod didn't trigger scale-up: 2 pod has unbound immediate PersistentVolumeClaims
% kubectl get pvc -n radius-system
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
database-database-0   Pending                                      standard       3d11h
```

No matching persistent volume to bind the pvc to -

```
% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                         STORAGECLASS   REASON   AGE
pvc-3bc9bae6-18a1-41ed-b78e-01c347df6f68   1Gi        RWO            Delete           Bound    dapr-system/dapr-scheduler-data-dir-dapr-scheduler-server-0   default                 116d
```
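
As an editorial aside (not from the original comment), a minimal diagnostic sketch for why the claim stays Pending; the PVC and StorageClass names come from the output above, and the rest is an assumption about what one would check next:

```
# Sketch: see what the claim is waiting on and which StorageClasses exist.
# The PVC above requests the "standard" class; if that class has no
# provisioner on this AKS cluster and no pre-created PV matches, the claim
# stays Pending and database-0 cannot schedule.
kubectl describe pvc database-database-0 -n radius-system
kubectl get storageclass
```

On AKS the built-in classes are typically `default` and `managed-csi` (the existing PV above is bound via `default`), so a claim pinned to `standard` would be expected to sit unbound unless that class or a matching volume is added.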

kachawla added a commit that referenced this issue Jan 14, 2025
# Description

This reverts commit 9e74e73.

Reverting this commit as it is breaking long running tests. Needs to be
investigated and checked in again with a fix.

```
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   47m (x10 over 76m)      default-scheduler   0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling..
  Warning  FailedScheduling   7m17s (x8 over 42m)     default-scheduler   0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  3m39s (x840 over 144m)  cluster-autoscaler  pod didn't trigger scale-up: 2 pod has unbound immediate PersistentVolumeClaims
```
```
% kubectl get pvc -n radius-system
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
database-database-0   Pending                                      standard       3d11h
```

No matching persistent volume to bind the pvc to -
```
% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                         STORAGECLASS   REASON   AGE
pvc-3bc9bae6-18a1-41ed-b78e-01c347df6f68   1Gi        RWO            Delete           Bound    dapr-system/dapr-scheduler-data-dir-dapr-scheduler-server-0   default                 116d
```

More details can be found here:
#8238.

Signed-off-by: Karishma Chawla <[email protected]>
@kachawla
Contributor

Reverting the commit has fixed the issue with the long running tests: https://github.com/radius-project/radius/actions/workflows/long-running-azure.yaml
