
Scheduled long running test failed - Run ID: 12755105579 #8238

Closed

rad-ci-bot opened this issue Jan 13, 2025 · 5 comments
Labels
bug (Something is broken or not working as expected) · test-failure (A scheduled test run has failed and needs to be investigated)

Comments

@rad-ci-bot
Collaborator

rad-ci-bot commented Jan 13, 2025

Bug information

This issue is automatically generated when the scheduled long running test fails. The Radius long running test runs every 2 hours, every day. Keep in mind that the test may fail due to workflow infrastructure issues, such as network problems, rather than flakiness in the test itself. For further investigation, please visit here.

AB#14063

rad-ci-bot added the bug and test-failure labels on Jan 13, 2025
@radius-triage-bot

👋 @rad-ci-bot Thanks for filing this bug report.

A project maintainer will review this report and get back to you soon. If you'd like immediate help troubleshooting, please visit our Discord server.

For more information on our triage process, please visit our triage overview.

@kachawla
Contributor

Name: "ucp", Namespace: "radius-system"'
Retryable error? true
Retrying as current number of retries 0 less than max number of retries 30
Error received when checking status of resource controller. Error: 'Get "[https://radlrtest00-aks-lrkgoivm.hcp.westus3.azmk8s.io:443/api/v1/namespaces/radius-system/services/controller](https://radlrtest00-aks-lrkgoivm.hcp.westus3.azmk8s.io/api/v1/namespaces/radius-system/services/controller)": context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"```
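
As an aside (not part of the original comment), a minimal sketch of the kind of checks one might run against the cluster for a client-side timeout like this, using the service name and namespace from the log above; everything else is an assumption about the next step:

```
# Sketch only: confirm the controller Service exists and the radius-system
# pods are healthy, since the error above is a timeout calling that Service.
kubectl get service controller -n radius-system
kubectl get pods -n radius-system -o wide
```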

@kachawla
Contributor

Looking into this further, I'm seeing this error on the long running test AKS cluster:

    "message": "0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..

@kachawla
Contributor

kachawla commented Jan 13, 2025

This is related to the PR that added Postgres database: #8072. I'm going to revert the commit for now to unblock long running tests.
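
For reference, a revert along those lines could be prepared roughly as below (a sketch only; the branch name is hypothetical, and the hash is the one named in the revert commit further down):

```
# Sketch: revert the Postgres database change on a topic branch and open a PR.
# "revert-postgres-database" is a hypothetical branch name.
git checkout -b revert-postgres-database
git revert --no-edit 9e74e73
git push origin revert-postgres-database
```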

[Screenshot 2025-01-13 at 3.20.48 PM]

Pod details:

```
% kubectl describe pod database-0 -n radius-system
Name:             database-0
Namespace:        radius-system
Priority:         0
Service Account:  database
Node:             <none>
Labels:           app.kubernetes.io/name=database
                  app.kubernetes.io/part-of=radius
                  apps.kubernetes.io/pod-index=0
                  control-plane=database
                  controller-revision-hash=database-77fc8687f8
                  statefulset.kubernetes.io/pod-name=database-0
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/database
Containers:
  database:
    Image:      ghcr.io/radius-project/mirror/postgres:latest
    Port:       5432/TCP
    Host Port:  0/TCP
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:     2
      memory:  512Mi
    Environment Variables from:
      database-secret  ConfigMap  Optional: false
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5h62g (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  database:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  database-database-0
    ReadOnly:   false
  kube-api-access-5h62g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   47m (x10 over 76m)      default-scheduler   0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling..
  Warning  FailedScheduling   7m17s (x8 over 42m)     default-scheduler   0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  3m39s (x840 over 144m)  cluster-autoscaler  pod didn't trigger scale-up: 2 pod has unbound immediate PersistentVolumeClaims
% kubectl get pvc -n radius-system
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
database-database-0   Pending                                      standard       3d11h
```

No matching persistent volume to bind the pvc to -

```
% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                         STORAGECLASS   REASON   AGE
pvc-3bc9bae6-18a1-41ed-b78e-01c347df6f68   1Gi        RWO            Delete           Bound    dapr-system/dapr-scheduler-data-dir-dapr-scheduler-server-0   default                 116d
```
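
As an editorial aside (not from the original comment), a minimal diagnostic sketch for why the claim stays Pending; the PVC and StorageClass names come from the output above, and the rest is an assumption about what one would check next:

```
# Sketch: see what the claim is waiting on and which StorageClasses exist.
# The PVC above requests the "standard" class; if that class has no
# provisioner on this AKS cluster and no pre-created PV matches, the claim
# stays Pending and database-0 cannot schedule.
kubectl describe pvc database-database-0 -n radius-system
kubectl get storageclass
```

On AKS the built-in classes are typically `default` and `managed-csi` (the existing PV above is bound via `default`), so a claim pinned to `standard` would be expected to sit unbound unless that class or a matching volume is added.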

kachawla added a commit that referenced this issue Jan 14, 2025
# Description

This reverts commit 9e74e73.

Reverting this commit as it is breaking long running tests. Needs to be
investigated and checked in again with a fix.

```
Events:
  Type     Reason             Age                     From                Message
  ----     ------             ----                    ----                -------
  Warning  FailedScheduling   47m (x10 over 76m)      default-scheduler   0/5 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/5 nodes are available: 5 Preemption is not helpful for scheduling..
  Warning  FailedScheduling   7m17s (x8 over 42m)     default-scheduler   0/4 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/4 nodes are available: 4 Preemption is not helpful for scheduling..
  Normal   NotTriggerScaleUp  3m39s (x840 over 144m)  cluster-autoscaler  pod didn't trigger scale-up: 2 pod has unbound immediate PersistentVolumeClaims
```
```
% kubectl get pvc -n radius-system
NAME                  STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
database-database-0   Pending                                      standard       3d11h
```

No matching persistent volume to bind the pvc to -
```
% kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                                                         STORAGECLASS   REASON   AGE
pvc-3bc9bae6-18a1-41ed-b78e-01c347df6f68   1Gi        RWO            Delete           Bound    dapr-system/dapr-scheduler-data-dir-dapr-scheduler-server-0   default                 116d
```

More details can be found here:
#8238.

Signed-off-by: Karishma Chawla <[email protected]>
@kachawla
Contributor

Reverting the commit has fixed the issue with the long running tests: https://github.com/radius-project/radius/actions/workflows/long-running-azure.yaml
