Issue with Cassandra Pods Restarting in Large Kubernetes Clusters #759
Comments
The pod should have the mgmt-api already running (the address it tries to call) as seen here:
The function that determines it is here:
It seems the start of mgmt-api takes longer than 10 seconds for you. Do you happen to have logs from the "cassandra" container when it fails to listen on the port? Why is it taking so long to start? That said, we should probably poll the liveness of that container to detect whether it's ready for a start rather than rely on it having started already. On the other hand, it shouldn't take that long to start, and somehow we should also detect when it needs a kill.
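For illustration, here is a minimal Go sketch of that polling idea, assuming the management API's /api/v0/probes/liveness endpoint on port 8080; this is not the operator's actual code:

```go
// Poll the management API's liveness probe until it answers, instead of
// assuming it is listening a fixed 10 seconds after the container started.
package mgmtprobe

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForMgmtAPI blocks until the management API on podIP answers its
// liveness probe with 200, or the context deadline expires.
func waitForMgmtAPI(ctx context.Context, podIP string) error {
	url := fmt.Sprintf("http://%s:8080/api/v0/probes/liveness", podIP)
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // safe to issue the /start call now
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("mgmt-api on %s never became live: %w", podIP, ctx.Err())
		case <-ticker.C:
			// try again on the next tick
		}
	}
}
```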
@burmanm Thanks for the response. Please find attached the logs from the cassandra container on one of the pods.
These are the cass-operator logs at the time, showing the multiple tries.
Hey, that looks like a correct log (as it should look) with the mgmt-api responding to the /start call.
I was more interested in the cases where this did not happen correctly, i.e., cass-operator calling the mgmt-api 10 seconds after the container was started while the pod wasn't in reality started yet (it was still starting / something else). If the pod answers the Create with 201, we do not delete the Pod. Only in the case of a 500, or for example the connection refused in your case, would that happen, so those are the cases we should investigate. We don't retry by polling the same endpoint, since that would never resolve the real issue; restarting the pod is the safer way to do the retry. But judging from your issues, something prevents the pods from coming up in the expected time, and that's the part that's worrying. In the end there's a balance between detecting that a pod is never going to become alive and accounting for somewhat slower systems. Maybe waiting for more than 10 seconds is simply necessary in your system. But if you happen to catch the connection refused event together with what the mgmt-api has logged, maybe we can see some other reason the network interface didn't come up. Or is there some sort of CNI or equivalent that controls the connections?
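To make that decision concrete, here is a hedged Go sketch (function names are hypothetical, not cass-operator's actual code) of treating 201 as success and a connection error or 500 as a reason to delete the pod:

```go
// Interpret the result of the management API start call: 201 means Cassandra
// is starting, anything else (including "connection refused") means the pod
// gets deleted so the restart acts as the retry.
package startdecision

import (
	"fmt"
	"net/http"
)

// shouldDeletePod returns true when the start call outcome warrants
// restarting the pod.
func shouldDeletePod(resp *http.Response, callErr error) (bool, error) {
	if callErr != nil {
		// e.g. connection refused: the mgmt-api was not listening yet.
		return true, fmt.Errorf("start call failed: %w", callErr)
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusCreated { // 201
		return false, nil
	}
	// 500 or any other unexpected status: restart the pod and try again.
	return true, fmt.Errorf("unexpected status %d from start call", resp.StatusCode)
}
```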
@burmanm Yes, that makes sense. I think the pod takes longer than 10 seconds to start, at least from the time the container is marked running to when the mgmt-api process actually starts. So I modified the operator code to show when the cass-operator retries, and I think the timing of things is what could be interesting. The time the cass-operator detected Cassandra is already up is 2025-02-14T18:51:59.006Z,
the Cassandra pod logs show it starting up at 2025-02-14 18:51:59,267,
and if I describe the pod, the start time on the container is 2025-02-14 18:51:48.
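As an illustration of measuring that gap, here is a hedged Go sketch (hypothetical helper, using client-go types) of logging how long the "cassandra" container has been running before the start call is made:

```go
// Log the elapsed time between the "cassandra" container entering Running
// and the operator's start call, to confirm the roughly 10-second gap above.
package timing

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// elapsedSinceContainerStart returns how long the named container has been
// in the Running state, or zero if it is not running yet.
func elapsedSinceContainerStart(pod *corev1.Pod, container string) time.Duration {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == container && cs.State.Running != nil {
			return time.Since(cs.State.Running.StartedAt.Time)
		}
	}
	return 0
}

// logStartGap would be called just before the operator issues the start call.
func logStartGap(pod *corev1.Pod) {
	log.Printf("cassandra container running for %s before start call",
		elapsedSinceContainerStart(pod, "cassandra"))
}
```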
What happened?
We have observed an issue within our internal Kubernetes clusters (approximately 5000 pods and 400 nodes) where the Cassandra pods are continuously restarting and failing to come up. Specifically, Cassandra itself is unable to start even though the management API process is running.
Upon further investigation, we found that when a pod starts, the cass-operator attempts to make a remote call to initiate the Cassandra process on the specific pod. However, the process fails to start.
To investigate further, I modified the code to log any errors before the pod is deleted. Here is the log that was captured:
The issue seems to be related to a minor timing problem and heavy load on the Kubernetes cluster. Even though the pod itself is up, the cass-operator is unable to connect and make the required request. You can see further details about the implementation here: ReconciliationContext.
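For reference, here is a minimal hedged sketch of what that remote call and the added error logging look like conceptually (assuming the management API's /api/v0/lifecycle/start endpoint on port 8080; the real ReconciliationContext code differs):

```go
// POST to the management API's lifecycle start endpoint and log the error
// before the operator would fall back to deleting the pod.
package startcall

import (
	"fmt"
	"log"
	"net/http"
)

// startCassandra asks the management API on podIP to start the Cassandra
// process and returns an error if the request was not accepted with 201.
func startCassandra(podIP string) error {
	url := fmt.Sprintf("http://%s:8080/api/v0/lifecycle/start", podIP)
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		// The failure mode seen in this issue: connection refused because
		// the mgmt-api is not listening yet.
		log.Printf("start call to %s failed: %v", url, err)
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		log.Printf("start call to %s returned status %d", url, resp.StatusCode)
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}
```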
To temporarily mitigate this issue, I added a retry mechanism with exponential backoff to the cass-operator code. After several attempts, the cass-operator was eventually able to start the Cassandra process on the pod successfully.
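A minimal Go sketch of that mitigation, using apimachinery's wait.ExponentialBackoff around a placeholder start call (the callStart type is hypothetical; the actual patch differs):

```go
// Retry the management API start call with exponential backoff before
// giving up and letting the pod be deleted.
package mitigation

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// callStart stands in for the operator's HTTP call to the management API's
// lifecycle start endpoint.
type callStart func() error

// startWithBackoff retries the start call with delays of 1s, 2s, 4s, ...
// for up to five attempts.
func startWithBackoff(start callStart) error {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // initial delay
		Factor:   2.0,             // double the delay each attempt
		Steps:    5,               // maximum number of attempts
	}
	return wait.ExponentialBackoff(backoff, func() (done bool, err error) {
		if callErr := start(); callErr != nil {
			log.Printf("start call failed, retrying: %v", callErr)
			return false, nil // keep retrying
		}
		return true, nil // Cassandra start accepted
	})
}
```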
Could you kindly advise if this retry mechanism is an appropriate fix for the problem, or if there is a potential issue with how the Cassandra pod is being marked as ready to accept API requests from the operator?
What did you expect to happen?
No response
How can we reproduce it (as minimally and precisely as possible)?
These issues have been intermittent; however, under heavy load on the Kubernetes cluster we have observed them more frequently. We are not sure how to reproduce this locally.
cass-operator version
1.22.4
Kubernetes version
1.30.8
Method of installation
helm
Anything else we need to know?
No response
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-92