
Init container gets OOM killed on new cluster POD startup #1566

Open
andrey-dubnik opened this issue Nov 14, 2023 · 10 comments
Labels
kind/bug, priority/important-soon, triage/accepted

Comments

@andrey-dubnik

What happened?

Created a local cluster using k3s (via k3d)
Deployed the operator
Created a ScyllaCluster and got an OOM kill on the init container
Edited the STS and increased the memory limit to 150Mi; the cluster then got created

Unfortunately, the init container limits appear to be hard-coded, so there is no way to influence the allocation.
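For reference, a minimal sketch of the manual STS edit described above, as a one-shot patch. The namespace and StatefulSet name are assumptions based on the operator's usual `<cluster>-<dc>-<rack>` naming, and the operator's reconciliation may revert the change:

```console
# Hypothetical names: namespace "temporal", StatefulSet "temporal-cluster-manager-dc-zone1".
kubectl -n temporal patch statefulset temporal-cluster-manager-dc-zone1 --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/initContainers/0/resources/requests/memory", "value": "150Mi"},
  {"op": "replace", "path": "/spec/template/spec/initContainers/0/resources/limits/memory", "value": "150Mi"}
]'
```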

What did you expect to happen?

No OOM kill, or at least the ability to change the initContainer limits.

How can we reproduce it (as minimally and precisely as possible)?

```console
k3d cluster create --config $(pwd)/config.yaml
```

config.yaml

```yaml
apiVersion: k3d.io/v1alpha4
kind: Simple
metadata:
  name: edge-composition
servers: 1
agents: 3
image: rancher/k3s:v1.22.11-k3s1
kubeAPI: # same as `--api-port localhost:6447`
  host: "localhost"
  hostIP: "127.0.0.1"
  hostPort: "6447"
# expose the ingress controller on local host ports 9980 (HTTP) and 9943 (HTTPS)
ports:
  - port: 9980:80 # same as `--port '9980:80@loadbalancer'`
    nodeFilters:
      - loadbalancer
  - port: 9943:443
    nodeFilters:
      - loadbalancer
options:
  k3s:
    extraArgs:
    - arg: --no-deploy=traefik # do not deploy traefik ingress, we will use a different one
      nodeFilters:
        - server:*
    nodeLabels:
      - label: topology.kubernetes.io/zone=3 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:2
      - label: topology.kubernetes.io/zone=2 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:0
      - label: topology.kubernetes.io/zone=1 # same as `--k3s-node-label 'foo=bar@agent:1'` -> this results in a Kubernetes node label
        nodeFilters:
          - agent:1
```
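A quick sanity check (assuming the cluster above came up) that the zone labels from config.yaml landed on the agent nodes:

```console
# -L adds the label value as a column in the node listing
kubectl get nodes -L topology.kubernetes.io/zone
```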
Install the operator and wait for it to become ready:

```console
kubectl apply -f https://raw.githubusercontent.com/scylladb/scylla-operator/master/deploy/operator.yaml

kubectl wait --for condition=established crd/scyllaclusters.scylla.scylladb.com
kubectl wait --for condition=established crd/nodeconfigs.scylla.scylladb.com
kubectl wait --for condition=established crd/scyllaoperatorconfigs.scylla.scylladb.com
kubectl -n scylla-operator rollout status deployment.apps/scylla-operator
kubectl -n scylla-operator rollout status deployment.apps/webhook-server
```

Create the cluster:

```yaml
apiVersion: scylla.scylladb.com/v1
kind: ScyllaCluster
metadata:
  name: temporal-cluster
  namespace: temporal
spec:
  version: 5.2.7
  agentVersion: 3.1.2
  repository: docker.io/scylladb/scylla
  agentRepository: docker.io/scylladb/scylla-manager-agent
  developerMode: true
  cpuset: true
  datacenter:
    name: manager-dc
    racks:
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone1
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "1"
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone2
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "2"
      - agentResources:
          requests:
            cpu: 50m
            memory: 80M
        members: 1
        name: zone3
        resources:
          limits:
            cpu: 1
            memory: 200Mi
          requests:
            cpu: 1
            memory: 200Mi
        storage:
          capacity: 1Gi
          # storageClassName: scylla-manager
        placement:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
                - matchExpressions:
                    - key: topology.kubernetes.io/zone
                      operator: In
                      values:
                        - "3"

Scylla Operator version

v1.12.0-alpha.0-102-geb68db4
Also reproducible on v1.11.0

Kubernetes platform name and version

Reproduced on 1.21 and 1.25.

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.3", GitCommit:"9e644106593f3f4aa98f8a84b23db5fa378900bd", GitTreeState:"clean", BuildDate:"2023-03-15T13:40:17Z", GoVersion:"go1.19.7", Compiler:"gc", Platform:"darwin/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.15+k3s1", GitCommit:"d19260dc59280c5f5a3c6596c653e7cfdbb5f3c8", GitTreeState:"clean", BuildDate:"2023-10-30T21:44:53Z", GoVersion:"go1.20.10", Compiler:"gc", Platform:"linux/amd64"}
```

Kubernetes platform info:

Must-gather archive:

scylla-operator-must-gather-hdqcl4psgfqd.zip

Anything else we need to know?

No response

@andrey-dubnik andrey-dubnik added the kind/bug label Nov 14, 2023
@scylla-operator-bot scylla-operator-bot bot added the needs-priority label Nov 14, 2023
@tnozicka
Contributor

Can you please upload the must-gather without changes? The temporal namespace seems to have had its other resources deleted, everything except the ScyllaCluster, which means I can't see pods, pod logs, or events :(

@andrey-dubnik
Author

There are no other resources. I'm only creating a cluster, and it fails on the init container OOM without producing any logs, so nothing actually boots up. Do you expect anything apart from the cluster STS pods to be running?

@andrey-dubnik
Author

collected again

@tnozicka
Contributor

Sorry, my bad, we don't yet collect events related to the ScyllaClusters. Can you please run it with --all-resources? I want to have a closer look at the events and pod definitions. Thanks.
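For anyone reproducing this, a sketch of the requested collection step, assuming the scylla-operator binary (from the operator image) is available and the kubeconfig points at the affected cluster; `--all-resources` is the flag requested above:

```console
# collects all resources in the cluster, not only Scylla-related ones
scylla-operator must-gather --all-resources
```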

@andrey-dubnik
Author
@tnozicka
Contributor

Thanks. It is weird that we don't hit this on other platforms. I think the resources should be fixed, and cp should use only as much memory as it has available; not sure whether k3d behaves differently or it's somehow connected to the hardware.

```yaml
  initContainers:
  - command:
    - /bin/sh
    - -c
    - cp -a /usr/bin/scylla-operator /mnt/shared
    image: docker.io/scylladb/scylla-operator@sha256:942098adb09134460264c7470d06efa2d5ee32e98354bb3929f9a16f12cf8b4a
    name: sidecar-injection
    resources:
      limits:
        cpu: 10m
        memory: 50Mi
      requests:
        cpu: 10m
        memory: 50Mi
status:
  initContainerStatuses:
  - containerID: containerd://51ceeb40ef28485c803b983b42e48b0ec12a704cee93b99f79e23ed51edcb2af
    image: sha256:fb3b6307d762133e6e718eb4c61078b423859be3d21d54f294f8e3e3e91095d3
    imageID: docker.io/scylladb/scylla-operator@sha256:942098adb09134460264c7470d06efa2d5ee32e98354bb3929f9a16f12cf8b4a
    lastState:
      terminated:
        containerID: containerd://676594f27b4b1d80d24e8f2171f97f04814da13d45f36b30d22d7c083a29b740
        exitCode: 137
        finishedAt: "2023-11-15T08:25:22Z"
        reason: OOMKilled
        startedAt: "2023-11-15T08:25:17Z"
```
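As a quick way to confirm the OOM kill straight from the pod status (pod name assumed from the `<cluster>-<dc>-<rack>-<ordinal>` naming):

```console
kubectl -n temporal get pod temporal-cluster-manager-dc-zone1-0 \
  -o jsonpath='{.status.initContainerStatuses[0].lastState.terminated.reason}'
# expected: OOMKilled (exit code 137 in lastState.terminated.exitCode)
```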

/assign @rzetelskik
ptal

@tnozicka tnozicka added the priority/important-soon label and removed the needs-priority label Nov 15, 2023
@scylla-operator-bot
Contributor
The Scylla Operator project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 30d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out

/lifecycle stale

@scylla-operator-bot scylla-operator-bot bot added and then removed the lifecycle/stale label Jun 30, 2024
@ylebi
Collaborator

ylebi commented Jun 30, 2024

/remove-lifecycle stale

@tnozicka
Contributor

tnozicka commented Jul 1, 2024

/triage accepted

@scylla-operator-bot scylla-operator-bot bot added the triage/accepted label Jul 1, 2024