Issue with Cassandra Pods Restarting in Large Kubernetes Clusters #759
Comments
The pod should have the mgmt-api already running (the address it tries to call) as seen here:
The function that determines it is here:
It seems the start of mgmt-api takes longer than 10 seconds for you. Do you happen to have logs from the "cassandra" container when it fails to listen on the port? Why is it taking so long to start? That said, we should probably poll the liveness of that container to detect whether it's ready for a start rather than rely on it having started already. On the other hand, it shouldn't take that long to start, and somehow we should also detect when it needs a kill.
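For illustration, here is a minimal Go sketch of that polling idea, assuming the management API's /api/v0/probes/liveness endpoint on port 8080; this is not the operator's actual code:

```go
// Poll the management API's liveness probe until it answers, instead of
// assuming it is listening a fixed 10 seconds after the container started.
package mgmtprobe

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// waitForMgmtAPI blocks until the management API on podIP answers its
// liveness probe with 200, or the context deadline expires.
func waitForMgmtAPI(ctx context.Context, podIP string) error {
	url := fmt.Sprintf("http://%s:8080/api/v0/probes/liveness", podIP)
	ticker := time.NewTicker(2 * time.Second)
	defer ticker.Stop()

	for {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil // safe to issue the /start call now
			}
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("mgmt-api on %s never became live: %w", podIP, ctx.Err())
		case <-ticker.C:
			// try again on the next tick
		}
	}
}
```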
@burmanm Thanks for the response. Please find attached the logs from the cassandra container on one of the pods.
These are the cass-operator logs at the time, showing the multiple tries.
Hey, that looks like a correct log (as it should look) with the mgmt-api responding to the /start call.
I was more interested in the cases where this did not happen correctly, i.e., cass-operator calling the mgmt-api 10 seconds after the container was started while the pod wasn't in reality started yet (it was still starting / something else). If the pod answers the Create with 201, we do not delete the Pod. Only in the case of a 500, or for example the connection refused in your case, would that happen, so those are the cases we should investigate. We don't retry by polling the same endpoint, since that would never resolve the real issue; restarting the pod is the safer way to do the retry. But judging from your issues, something prevents the pods from coming up in the expected time, and that's the part that's worrying. In the end there's a balance between detecting that a pod is never going to become alive and accounting for somewhat slower systems. Maybe waiting for more than 10 seconds is simply necessary in your system. But if you happen to catch the connection refused event together with what the mgmt-api has logged, maybe we can see some other reason the network interface didn't come up. Or is there some sort of CNI or equivalent that controls the connections?
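To make that decision concrete, here is a hedged Go sketch (function names are hypothetical, not cass-operator's actual code) of treating 201 as success and a connection error or 500 as a reason to delete the pod:

```go
// Interpret the result of the management API start call: 201 means Cassandra
// is starting, anything else (including "connection refused") means the pod
// gets deleted so the restart acts as the retry.
package startdecision

import (
	"fmt"
	"net/http"
)

// shouldDeletePod returns true when the start call outcome warrants
// restarting the pod.
func shouldDeletePod(resp *http.Response, callErr error) (bool, error) {
	if callErr != nil {
		// e.g. connection refused: the mgmt-api was not listening yet.
		return true, fmt.Errorf("start call failed: %w", callErr)
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusCreated { // 201
		return false, nil
	}
	// 500 or any other unexpected status: restart the pod and try again.
	return true, fmt.Errorf("unexpected status %d from start call", resp.StatusCode)
}
```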
@burmanm Yes, that makes sense. I think the pod takes longer than 10 seconds to start, at least from the time the container is marked running to when the mgmt-api process actually starts. So I modified the operator code to show when the cass-operator retries, and I think the timing of things is what could be interesting. The time the cass-operator detected Cassandra is already up is 2025-02-14T18:51:59.006Z,
the Cassandra pod logs show it starting up at 2025-02-14 18:51:59,267,
and if I describe the pod, the start time on the container is 2025-02-14 18:51:48.
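As an illustration of measuring that gap, here is a hedged Go sketch (hypothetical helper, using client-go types) of logging how long the "cassandra" container has been running before the start call is made:

```go
// Log the elapsed time between the "cassandra" container entering Running
// and the operator's start call, to confirm the roughly 10-second gap above.
package timing

import (
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
)

// elapsedSinceContainerStart returns how long the named container has been
// in the Running state, or zero if it is not running yet.
func elapsedSinceContainerStart(pod *corev1.Pod, container string) time.Duration {
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name == container && cs.State.Running != nil {
			return time.Since(cs.State.Running.StartedAt.Time)
		}
	}
	return 0
}

// logStartGap would be called just before the operator issues the start call.
func logStartGap(pod *corev1.Pod) {
	log.Printf("cassandra container running for %s before start call",
		elapsedSinceContainerStart(pod, "cassandra"))
}
```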
What happened?
We have observed an issue within our internal Kubernetes clusters (approximately 5000 pods and 400 nodes) where the Cassandra pods are continuously restarting and failing to come up. Specifically, Cassandra itself is unable to start even though the management API process is running.
Upon further investigation, we found that when a pod starts, the cass-operator attempts to make a remote call to initiate the Cassandra process on the specific pod. However, the process fails to start.
To investigate further, I modified the code to log any errors before the pod is deleted. Here is the log that was captured:
The issue seems to be related to a minor timing problem and heavy load on the Kubernetes cluster. Even though the pod itself is up, the cass-operator is unable to connect and make the required request. You can see further details about the implementation here: ReconciliationContext.
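For reference, here is a minimal hedged sketch of what that remote call and the added error logging look like conceptually (assuming the management API's /api/v0/lifecycle/start endpoint on port 8080; the real ReconciliationContext code differs):

```go
// POST to the management API's lifecycle start endpoint and log the error
// before the operator would fall back to deleting the pod.
package startcall

import (
	"fmt"
	"log"
	"net/http"
)

// startCassandra asks the management API on podIP to start the Cassandra
// process and returns an error if the request was not accepted with 201.
func startCassandra(podIP string) error {
	url := fmt.Sprintf("http://%s:8080/api/v0/lifecycle/start", podIP)
	resp, err := http.Post(url, "application/json", nil)
	if err != nil {
		// The failure mode seen in this issue: connection refused because
		// the mgmt-api is not listening yet.
		log.Printf("start call to %s failed: %v", url, err)
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated {
		log.Printf("start call to %s returned status %d", url, resp.StatusCode)
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	return nil
}
```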
To temporarily mitigate this issue, I added a retry mechanism with exponential backoff to the cass-operator code. After several attempts, the cass-operator was eventually able to start the Cassandra process on the pod successfully.
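A minimal Go sketch of that mitigation, using apimachinery's wait.ExponentialBackoff around a placeholder start call (the callStart type is hypothetical; the actual patch differs):

```go
// Retry the management API start call with exponential backoff before
// giving up and letting the pod be deleted.
package mitigation

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// callStart stands in for the operator's HTTP call to the management API's
// lifecycle start endpoint.
type callStart func() error

// startWithBackoff retries the start call with delays of 1s, 2s, 4s, ...
// for up to five attempts.
func startWithBackoff(start callStart) error {
	backoff := wait.Backoff{
		Duration: 1 * time.Second, // initial delay
		Factor:   2.0,             // double the delay each attempt
		Steps:    5,               // maximum number of attempts
	}
	return wait.ExponentialBackoff(backoff, func() (done bool, err error) {
		if callErr := start(); callErr != nil {
			log.Printf("start call failed, retrying: %v", callErr)
			return false, nil // keep retrying
		}
		return true, nil // Cassandra start accepted
	})
}
```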
Could you kindly advise if this retry mechanism is an appropriate fix for the problem, or if there is a potential issue with how the Cassandra pod is being marked as ready to accept API requests from the operator?
What did you expect to happen?
No response
How can we reproduce it (as minimally and precisely as possible)?
These issues have been intermittent; however, under heavy load on the Kubernetes cluster we have observed them more frequently. We are not sure how to reproduce this locally.
cass-operator version
1.22.4
Kubernetes version
1.30.8
Method of installation
helm
Anything else we need to know?
No response
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-92