
Init container gets OOM killed on new cluster POD startup #1566

Open
andrey-dubnik opened this issue Nov 14, 2023 · 10 comments
Labels
kind/bug, priority/important-soon, triage/accepted

Comments

@andrey-dubnik

What happened?

Created a local cluster using k3s (via k3d)
Deployed the operator
Created a ScyllaCluster and got an OOM kill on the init container
Edited the STS and increased the memory limit to 150Mi; the cluster then got created

Unfortunately, the init container limits appear to be hard-coded, so there is no way to influence the allocation.
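For reference, a minimal sketch of the manual STS edit described above, as a one-shot patch. The namespace and StatefulSet name are assumptions based on the operator's usual `<cluster>-<dc>-<rack>` naming, and the operator's reconciliation may revert the change:

```console
# Hypothetical names: namespace "temporal", StatefulSet "temporal-cluster-manager-dc-zone1".
kubectl -n temporal patch statefulset temporal-cluster-manager-dc-zone1 --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/initContainers/0/resources/requests/memory", "value": "150Mi"},
  {"op": "replace", "path": "/spec/template/spec/initContainers/0/resources/limits/memory", "value": "150Mi"}
]'
```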

What did you expect to happen?

No OOM kill, or at least the ability to change the initContainer limits.

How can we reproduce it (as minimally and precisely as possible)?

```console
k3d cluster create --config $(pwd)/config.yaml
```

config.yaml

```yaml
apiVersion: k3d.io/v1alpha4
kind: Simple
metadata:
  name: edge-composition
servers: 1
agents: 3
image: rancher/k3s:v1.22.11-k3s1
kubeAPI: # same as `--api-port localhost:6447`
  host: "localhost"
  hostIP: "127.0.0.1"
  hostPort: "6447"
# expose the ingress controller on local host ports 9980 (HTTP) and 9943 (HTTPS)
ports:
  - port: 9980:80 # same as `--port '9980:80@loadbalancer'`
    nodeFilters:
      - loadbalancer
  - port: 9943:443
    nodeFilters:
      - loadbalancer
options:
  k3s:
    extraArgs:
    - arg: --no-deploy=traefik # do not deploy traefik ingress, we will use a different one
      nodeFilters:
        - server:*
    nodeLabels:
      - label: topology.kubernetes.io/zone=3 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:2
      - label: topology.kubernetes.io/zone=2 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:0
      - label: topology.kubernetes.io/zone=1 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:1
```
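A quick sanity check (assuming the cluster above came up) that the zone labels from config.yaml landed on the agent nodes:

```console
# -L adds the label value as a column in the node listing
kubectl get nodes -L topology.kubernetes.io/zone
```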
Install the operator and wait for it to become ready:

```console
kubectl apply -f https://raw.githubusercontent.com/scylladb/scylla-operator/master/deploy/operator.yaml

kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com
kubectl wait --for condition=established crd/nodeconfigs.scylla.scylladb.com
kubectl wait --for condition=established crd/scyllaoperatorconfigs.scylla.scylladb.com
kubectl -n scylla-operator rollout status deployment.apps/scylla-operator
kubectl -n scylla-operator rollout status deployment.apps/webhook-server
```

Create the cluster:

```yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: temporal-cluster
  namespace: temporal
spec:
  version: 5.2.7
  agentVersion: 3.1.2
  repository: docker.io/scylladb/scylla
  agentRepository: docker.io/scylladb/scylla-manager-agent
  developerMode: true
  cpuset: true
  datacenter:
    name: manager-dc
    racks:
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone1
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "1"
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone2
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "2"
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone3
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "3"

Scylla Operator version

v1.12.0-alpha.0-102-geb68db4
Also reproducible on v1.11.0

Kubernetes platform name and version

Reproduced on 1.21 and 1.25.

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:40:17Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.15+k3s1", GitCommit:"d19260dc59280c5f5a3c6596c653e7cfdbb5f3c8", GitTreeState:"clean", BuildDate:"2023-10-30T21:44:53Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
```

Kubernetes platform info:

Must-gather archive:

scylla-operator-must-gather-hdqcl4psgfqd.zip

Anything else we need to know?

No response

@andrey-dubnik andrey-dubnik added the kind/bug label Nov 14, 2023
@scylla-operator-bot scylla-operator-bot bot added the needs-priority label Nov 14, 2023
@tnozicka
Contributor

Can you please upload the must-gather without changes? The temporal namespace seems to have had its other resources deleted, everything except the ScyllaCluster, which means I can't see pods, pod logs, or events :(

@andrey-dubnik
Author

There are no other resources. I'm only creating a cluster, and it fails on the init container OOM without producing any logs, so nothing actually boots up. Do you expect anything apart from the cluster STS pods to be running?

@andrey-dubnik
Author

collected again

@tnozicka
Contributor

Sorry, my bad, we don't yet collect events related to the ScyllaClusters. Can you please run it with --all-resources? I want to have a closer look at the events and pod definitions. Thanks.
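For anyone reproducing this, a sketch of the requested collection step, assuming the scylla-operator binary (from the operator image) is available and the kubeconfig points at the affected cluster; `--all-resources` is the flag requested above:

```console
# collects all resources in the cluster, not only Scylla-related ones
scylla-operator must-gather --all-resources
```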

@andrey-dubnik
Author
@tnozicka
Contributor

Thanks. It is weird that we don't hit this on other platforms. I think the resources should be fixed, and cp should use only as much memory as it has available; not sure whether k3d behaves differently or it's somehow connected to the hardware.

```yaml
  initContainers:
  - command:
    - /bin/sh
    - -c
    - cp -a /usr/bin/scylla-operator /mnt/shared
    image: docker.io/scylladb/scylla-operator@sha256:942098adb09134460264c7470d06efa2d5ee32e98354bb3929f9a16f12cf8b4a
    name: sidecar-injection
    resources:
      limits:
        cpu: 10m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 50Mi
status:
  initContainerStatuses:
  - containerID: containerd://51ceeb40ef28485c803b983b42e48b0ec12a704cee93b99f79e23ed51edcb2af
    image: sha256:fb3b6307d762133e6e718eb4c61078b423859be3d21d54f294f8e3e3e91095d3
    imageID: docker.io/scylladb/scylla-operator@sha256:942098adb09134460264c7470d06efa2d5ee32e98354bb3929f9a16f12cf8b4a
    lastState:
      terminated:
        containerID: containerd://676594f27b4b1d80d24e8f2171f97f04814da13d45f36b30d22d7c083a29b740
        exitCode: 137
        finishedAt: "2023-11-15T08:25:22Z"
        reason: OOMKilled
        startedAt: "2023-11-15T08:25:17Z"
```
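As a quick way to confirm the OOM kill straight from the pod status (pod name assumed from the `<cluster>-<dc>-<rack>-<ordinal>` naming):

```console
kubectl -n temporal get pod temporal-cluster-manager-dc-zone1-0 \
  -o jsonpath='{.status.initContainerStatuses[0].lastState.terminated.reason}'
# expected: OOMKilled (exit code 137 in lastState.terminated.exitCode)
```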

/assign @rzetelskik
ptal

@tnozicka tnozicka added the priority/important-soon label and removed the needs-priority label Nov 15, 2023
@scylla-operator-bot
Contributor
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added and then removed the lifecycle/stale label Jun 30, 2024
@ylebi
Collaborator

ylebi commented Jun 30, 2024

/remove-lifecycle stale

@tnozicka
Contributor

tnozicka commented Jul 1, 2024

/triage accepted

@scylla-operator-bot scylla-operator-bot bot added the triage/accepted label Jul 1, 2024