Reoccurrence of Service does not have any active Endpoint [when it actually does] #9932
Comments
This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/remove-kind bug Hi, this has been reported twice and it is related to the change where the EndpointSlice is now being used. This issue, in its current state, does not contain enough data to hint at an action item. It would help a lot if you could write step-by-step instructions to copy/paste and reproduce the problem on a minikube or kind cluster. It is also possible that there is a reason, so far unknown, why the EndpointSlice does not get populated; that makes it even more important to know a way to reproduce the problem and debug it (just creating a workload with something like the image nginx:alpine does not trigger this problem). Thanks
@scott-kausler please provide
Hi, I am having the same issue as reported in this ticket. I initially created a ticket under Rancher (issue 41584) as I wasn't sure whether it is a Rancher issue or isolated to the Kubernetes ingress-nginx controller. Is it possible to provide some insight into why this can be happening?
Every 25 to 45 minutes the service is available, but then during the next interval the Rancher GUI becomes unavailable with a "404 page not found" error and the controller logs a "Service rancher does not have an active endpoint" error.
Hi @tombokombo, I ran the commands as recommended; please refer to the output below:
Hi, I am having the same problem reported in this issue, and I noticed it only happens when the service name is too long. It was introduced in #8890 when migrating to EndpointSlices. This error didn't happen with Endpoints because the name of an Endpoints object is always the same as the service, but EndpointSlice names are truncated when the service name is too long, and the controller is trying to get the EndpointSlices by the service name, which doesn't match. Example:
# kubectl get endpoints -n my-awesome-service | grep sensorgroup
my-awesome-service-telemetry-online-processor-dlc-sensorgroup 10.0.0.21:8080
# kubectl get EndpointSlice -n my-awesome-service | grep sensorgr
my-awesome-service-telemetry-online-processor-dlc-sensorgrn4mvj IPv4 8080 10.0.0.21 35d
I think this issue is related and #9908 could be the fix.
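(For anyone else checking this on their own cluster: a quick way to see which EndpointSlices belong to a Service, regardless of how the slice names were truncated or suffixed, is to select on the standard kubernetes.io/service-name label instead of the object name. A minimal sketch, reusing the namespace and service from the example above:)

# EndpointSlices carry a label pointing back to their owning Service,
# so selecting on it sidesteps any name truncation/hashing.
kubectl get endpointslices -n my-awesome-service \
  -l kubernetes.io/service-name=my-awesome-service-telemetry-online-processor-dlc-sensorgroup

# For comparison, the legacy Endpoints object always shares the Service's exact name:
kubectl get endpoints -n my-awesome-service my-awesome-service-telemetry-online-processor-dlc-sensorgroup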
If it's really about long names, then:
This would then indicate a fix has already been implemented? Also, if it relates to long service names, why would this be happening to the "rancher" service, which does not seem to be a long name?
Looks like the issue with long service names was fixed in this release: https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.5.1. Thanks @longwuyuan
Thank you for the information, but if it was fixed, then why are these issues still occurring? Do you have any idea why this is the case? Any feedback would be much appreciated!
@longwuyuan is this issue due to long service names? Is that why the services are being reported as not having an active endpoint?
Please see below the error logged for the rancher service, along with the endpointslice + prefix.
Are all services ignored due to the prefix being added to the EndpointSlice name? Or are the services being ignored for some other reason? Does anyone have any thoughts on this? I forgot to mention that the services' Endpoints/EndpointSlices are periodically recognized and function as expected.
Hi, the data posted in this issue does not look like something that a developer can use to reproduce the problem. Any help on reproducing the problem is welcome, as would any data that completely covers the bad state, like logs combined with the output of the relevant kubectl commands.
@longwuyuan please see the requested output; the only logging found for this issue is "Service cattle-system/rancher does not have any active Endpoint".
When it errors out with the 404 page not found the following is logged in the "rke2-ingress-nginx-controller-" logs:
It logs the same error above for each service that periodically times out.
@rdb0101 your latest post above is one example of not having data to analyse or reproduce from. To be precise, it would help if someone could post the logs of the controller pod along with the output of the relevant kubectl commands describing the service and its endpoints. If you look at the logs of the rancher pod, you could see rancher events and check if any are related. In any case, I don't think any developer can reproduce this problem with the information currently posted in this issue.
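(For anyone wanting to attach that kind of data, a rough set of commands that captures the state being asked for; the namespace and resource names are placeholders and the controller deployment name depends on your install:)

# Controller logs around the time of the error
kubectl logs -n ingress-nginx deploy/ingress-nginx-controller --since=15m

# State of the backend Service, its endpoints, and its pods
kubectl describe svc rancher -n cattle-system
kubectl get endpoints,endpointslices -n cattle-system -o wide
kubectl get pods -n cattle-system -o wide

# The Ingress object that routes to that Service
kubectl describe ingress -n cattle-system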
@longwuyuan Thank you for clarifying what data is needed in order to make the problem reproducible. Please see below the errors that show what happens when the rancher service goes from having no active endpoint to restarting the ingress. Please note that this problem is reproducible by setting up rke2 with a helm install of rancher 2.7.3.
@rdb0101 I am sorry you are having this issue and I hope it is resolved soon. Here are my thoughts, and I hope you see the practical side of an issue being created here in this project.
Hi @longwuyuan, thanks very much for your feedback. I used rancher just as an example; this is not specific to rancher, and the issue impacts all of the services I have deployed. I chose rancher as the example because the rancher service name plus the prefix added to the EndpointSlice name is under the 63-character limit. I was trying to determine how or whether the nginx controller was filtering out even the rancher service name, despite it being well under the limit. I apologize again if my feedback was unclear. If this issue were specific to rancher, then it would likely only impact the rancher service, correct?
Correct. I am using v1.7.1 of the controller with TLS and I don't face this problem.
@longwuyuan Thanks very much for the feedback. I will go ahead and stand up minikube with the version and image as recommended. I will provide the output once I have reproduced the issue.
@longwuyuan Is your current environment multi-node as well?
No.
Wow, this may have been an issue as early as 2018: #3060 (comment)
The reporter of #6962 says this started happening when he added port names to his service. We're using port names, and all the manifest examples I see in this thread have port names. Does anyone have an example of this happening without port names?
Until we have some way to reproduce or some helpful data that is convincing, I am not sure what a developer would do to address this issue.
I agree. 6 years of bug reports isn't convincing. We need a few more years. 😂
It's OSS so your sentiment is ack'd. If you can help me reproduce, I'll appreciate it.
I'm just agreeing that 6 years of bug reports is not nearly enough time to be "convincing." I think people are coming here to report the same problem over and over for fun. And honestly, who can blame them? It really is great fun! 😂 I posted the helm chart I'm using with params above. Seems like a pretty basic setup. If I really wanted to reproduce this, I'd just deploy some kind of hello world app and slam it with requests until the problem occurred. I'd also pay close attention to what happens when I add/remove other hello world apps in the same cluster (all of which are being proxied by the same ingress-nginx instance of course). I just don't have the time to do that right now, and I'm guessing neither does anyone else. In the meantime, the best clue I have is that port name thing. When I have some time, I'll try removing the port names from the helm chart in my own app and see if that makes a difference. But before I take the time to do that, hopefully someone else will chime in and let us know if they're seeing this problem without port names. I don't know a lot about Kubernetes internals, so this is a total shot in the dark based on dealing with DNS issues for more years than I have fingers and toes, but the more I dig into this, the more this smells like yet another DNS issue to me...
https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#srv-records
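(If the DNS hunch were right, one thing worth checking would be the SRV records that Kubernetes publishes for named ports; the service, namespace, and port name below are placeholders, and you need a pod that has dig available:)

# SRV records for named ports have the form
#   _<port-name>._<port-protocol>.<service>.<namespace>.svc.cluster.local
kubectl exec -it some-pod-with-dig -- \
  dig SRV _http._tcp.my-service.my-namespace.svc.cluster.local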
@mconigliaro I'd be interested to see the testing without named ports.
I'm sad to report that the problem still occurs when using port numbers instead of names, but I'm happy to report that it's easily reproducible. I can also say that I made a script to run all the commands in #9932 (comment), but it takes way too long to run (20+ secs), and that's longer than the window in which the problem occurs, so I doubt most of the data will be valid. What are the most important commands I should run?
In which resource's spec did you use port numbers instead of names for the ports?
I had names in my service (as described in #6962), deployment, and ingress. I just tried to remove the names everywhere I could find them.
I'm now able to reproduce this pretty easily with a simple bash while loop:
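(The loop itself didn't survive the formatting of this thread; it was something along these lines, with the hostname and path being placeholders:)

# Poll the app through the ingress once a second and print any non-200 responses
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' https://my-app.example.com/)
  [ "$code" != "200" ] && echo "$(date +%T) got HTTP $code"
  sleep 1
done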
Everything looks fine until suddenly...
curl fails a second later...
Where did my endpoint go?
But then it magically comes back a second or two later?
Let me know what other info might be helpful, but note that I only have a second or two to catch it.
OK, it turns out even a second or two is not small enough of a window to catch this most of the time. I now have commands running in two separate terminals:
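(Again, the exact commands were lost in formatting; the idea is roughly one terminal polling the ingress and one watching the backend's EndpointSlice, with names as placeholders:)

# Terminal 1: poll the app through the ingress
while true; do
  curl -s -o /dev/null -w '%{http_code}\n' https://my-app.example.com/
  sleep 1
done

# Terminal 2: watch the EndpointSlice(s) backing the Service as they change
kubectl get endpointslices -n my-namespace \
  -l kubernetes.io/service-name=my-service --watch -o wide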
When I do this, I definitely see the Endpoint
Does this still happen on 1.9.X and 1.10.0?
I just upgraded to helm chart 4.10.0 and it's still happening.
But what I'm not sure of is whether nginx is causing the problem or just revealing it. What would cause nginx to remove endpoints from services like that? Seems unlikely, but this is also the only place we're seeing this problem (we only use nginx to proxy to our ephemeral dev environments, and we use AWS load balancers in production). And it's interesting that other people seem to be reporting similar behavior.
I'm back, and I'm now 99% sure the root cause was that we were running out of IP addresses in our EKS cluster. I killed a bunch of unnecessary pods and the random 503s and the "active Endpoint" message went away. I never found any error messages about this in our EKS logs, and I never saw anything else complaining. I only figured it out when I saw a suspicious-looking message about IP addresses on one of our services while poking around the cluster with Lens. Somehow, the only clue that something was wrong at the cluster level was this error message in the nginx controller logs. I'll bet there are a whole bunch of things that might trigger this message (which would explain the six years of bug reports). Apologies for defaming DNS, and thanks to nginx for this error message!
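(For anyone who suspects the same root cause, a rough way to spot pod-IP pressure, independent of the CNI in use; this is a coarse check, not proper ENI accounting on EKS:)

# Allocatable pod capacity per node
kubectl get nodes -o custom-columns=NODE:.metadata.name,PODS:.status.allocatable.pods

# Count running pods per node to compare against the capacity above
kubectl get pods -A --field-selector=status.phase=Running \
  -o custom-columns=NODE:.spec.nodeName --no-headers | sort | uniq -c

# Warning events often hint at address exhaustion (exact wording varies by CNI)
kubectl get events -A --field-selector=type=Warning | grep -iE 'FailedCreatePodSandBox|assign an IP'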
/assign
In that case, maybe a very small subnet configured on minikube or kind, and manually exhausting the IP addresses, could potentially reproduce the error message.
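(A rough sketch of that idea with kind; the subnet size is an arbitrary assumption and this is untested, so treat it as a starting point rather than a known-good reproduction:)

# A kind cluster whose pod CIDR leaves only a handful of usable addresses
cat <<'EOF' > kind-small-subnet.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  podSubnet: "10.244.0.0/28"
nodes:
  - role: control-plane
  - role: worker
EOF
kind create cluster --name ip-exhaustion --config kind-small-subnet.yaml

# Then install ingress-nginx plus a test Service/Ingress, scale a filler
# deployment until pods stop getting IPs, and watch the controller logs for
# "does not have any active Endpoint".
kubectl create deployment filler --image=nginx:alpine --replicas=20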
I am having the same issue. Is there any progress on this?
Each service can be reached from inside each container, and the services have never restarted.
Further still, each container maintains a persistent peer-to-peer websocket connection between the nodes. All the services are up and working between the containers. So the services are working just fine, but for some reason the ingress controller thinks they are down?
I have an odd update: if I remove the domain name from the ingress files, then the ingress starts working. I am guessing this has something to do with DNS. Unrelated to this issue, I am also having issues with the nginx container ignoring the TLS cert; no idea why, it just ignores the secret. (Yes, I know this is the wrong place to mention this.)
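(For anyone trying the same experiment, the change being described is roughly whether the Ingress rule carries a host; a minimal sketch with placeholder names:)

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  ingressClassName: nginx
  rules:
    - host: my-app.example.com   # the commenter reports things only work once this host line is removed
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80
EOF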
This is the exact behaviour we're seeing right now (chart 4.10, app 1.10). We've been at it for hours.
I don't know what valuable information I can add after reading the whole thread.
@debdutdeb wishful thinking is having a step-by-step guide to reproduce the "does not have any active Endpoint" error when the service actually does have endpoints.
I'll try today. This was on a customer's environment yesterday on AKS. To be perfectly honest, nginx is my default testing controller every time, and I have never seen this happen; I do a new installation at least once a week. So I haven't crossed paths with it myself yet.
Hi, this has been reported in multiple issues, and after several occasions of seeing the data on this, one fact has come to light: this endpoint-related error message is not easy to reproduce at will. The reason it is hard to reproduce is that the state of endpoints not being available is transient at best and not in itself a bug. Regardless of the volume of resources like compute, memory, and networking (and to a minor extent storage), every single state transition of the EndpointSlice object takes some time to propagate. Developers can explore options to increase the timers around this, but that is exactly what they will be: options. There will not be a standard for what timers are best for every single user, every single use case, and every single situation in the practical world of K8S clusters. This problem is hard to reproduce at will while simulating a real use case precisely because different clusters will have different situations at different times when updating the EndpointSlice. Hence there is no action item on the controller currently, though that may change in the future. Currently all resources are occupied with security & Gateway-API, so there is no developer time to allocate to this problem beyond triaging and research, and this issue is adding to the tally of open issues not tracking any action item. Because there is no action item being tracked here, I will close this issue. The creator of the issue can re-open it with a step-by-step guide to reproduce at will on a kind cluster, if required, using a recent release of the controller. /close
@longwuyuan: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Posted a temporary workaround on another related issue: #6135 (comment)
What happened:
The ingress controller reported that the "Service does not have any active Endpoint" when in fact the service did have active endpoints.
I was able to verify the service was active by execing into the nginx pod and curling the health check endpoint of the service.
The only way I was able to recover was to reinstall the helm chart.
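(For reference, the check described above looks roughly like this; the pod, namespace, service, and port are placeholders:)

# Exec into the controller pod and hit the backend Service directly
kubectl exec -it -n ingress-nginx ingress-nginx-controller-xxxxx -- \
  curl -sv http://my-service.my-namespace.svc.cluster.local:8080/healthz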
What you expected to happen:
The service to be added to the ingress controller.
NGINX Ingress controller version:
Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.6-eks-48e63af", GitCommit:"9f22d4ae876173884749c0701f01340879ab3f95", GitTreeState:"clean", BuildDate:"2023-01-24T19:19:02Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
Environment:
AWS EKS
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.6-eks-48e63af", GitCommit:"9f22d4ae876173884749c0701f01340879ab3f95", GitTreeState:"clean", BuildDate:"2023-01-24T19:19:02Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}
How was the ingress-nginx-controller installed:
NAME   NAMESPACE  REVISION  UPDATED                                  STATUS    CHART                APP VERSION
nginx  nginx      1         2023-05-06 16:52:09.643618809 +0000 UTC  deployed  ingress-nginx-4.5.2  1.6.4
Values:
How to reproduce this issue:
Unknown. There was a single replica of the pod, and it was deployed for 42 days before exhibiting this problem.
However, others have recently reported this issue in #6135.
Anything else we need to know:
The problem was previously reported in #6135, but the defect was closed.