
Reoccurrence of Service does not have any active Endpoint [when it actually does] #9932

Closed
scott-kausler opened this issue May 6, 2023 · 106 comments
Labels: kind/support (Categorizes issue or PR as a support question), needs-priority, needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.)

Comments

@scott-kausler

scott-kausler commented May 6, 2023

What happened:
The ingress controller reported that the "Service does not have any active Endpoint" when in fact the service did have active endpoints.

I was able to verify the service was active by exec'ing into the nginx pod and curling the service's health-check endpoint.

The only way I was able to recover was to reinstall the helm chart.
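For reference, the in-pod check described above can be sketched roughly like this (the namespace, pod label, service name, port, and health path are placeholders, not values taken from this cluster):

```shell
# Placeholders throughout: adjust namespace, label selector, service, port, and path.
NS=ingress-nginx
POD=$(kubectl -n "$NS" get pods \
  -l app.kubernetes.io/name=ingress-nginx \
  -o jsonpath='{.items[0].metadata.name}')

# From inside the controller pod, curl the backend service's health endpoint.
# A 200 here, while the controller still logs "Service does not have any
# active Endpoint", is the symptom described in this issue.
kubectl -n "$NS" exec "$POD" -- \
  curl -s -o /dev/null -w '%{http_code}\n' \
  http://my-service.my-namespace.svc.cluster.local:8080/healthz
```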

What you expected to happen:

The service to be added to the ingress controller.

NGINX Ingress controller version:

-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.6.4
  Build:         69e8833858fb6bda12a44990f1d5eaa7b13f4b75
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.21.6

-------------------------------------------------------------------------------

Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"25+", GitVersion:"v1.25.6-eks-48e63af", GitCommit:"9f22d4ae876173884749c0701f01340879ab3f95", GitTreeState:"clean", BuildDate:"2023-01-24T19:19:02Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Environment:
AWS EKS


How was the ingress-nginx-controller installed:
nginx nginx 1 2023-05-06 16:52:09.643618809 +0000 UTC deployed ingress-nginx-4.5.2 1.6.4

Values:

  ingressClassResource:
    default: true
  service:
    annotations:
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb

How to reproduce this issue:
Unknown. There was a single replica of the pod, and it was deployed for 42 days before exhibiting this problem.

However, others have recently reported this issue in #6135.

Anything else we need to know:

The problem was previously reported in #6135, but the defect was closed.

@scott-kausler scott-kausler added the kind/bug Categorizes issue or PR as related to a bug. label May 6, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 6, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and providing further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

Hi, this has been reported twice, and it is related to the change where the endpointslice is now used.

This issue, in its current state, does not contain enough data to point to an action item. It would help a lot if you could write step-by-step instructions that can be copy/pasted to reproduce the problem on a minikube cluster or a kind cluster.

It is also possible that there is a reason, so far unknown, why the endpointslice does not get populated. Even then, knowing a way to reproduce and debug the problem becomes all the more important (just creating a workload with an image like nginx:alpine does not trigger it). Thanks
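For anyone attempting the requested reproduction, a skeleton on a kind cluster might look like this. The manifest URL follows the project's published deploy layout for v1.6.4; the workload and host names are placeholders, and, as noted above, a plain nginx:alpine workload is not expected to trigger the bug, so this is only a starting point:

```shell
# Starting point only: this baseline alone is not known to trigger the bug.
kind create cluster --name repro
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.6.4/deploy/static/provider/kind/deploy.yaml
kubectl -n ingress-nginx wait --for=condition=Ready pod \
  -l app.kubernetes.io/component=controller --timeout=180s

# Trivial workload plus ingress; then watch the controller logs for the error.
kubectl create deployment demo --image=nginx:alpine
kubectl expose deployment demo --port=80
kubectl create ingress demo --rule="demo.localdev.me/*=demo:80"
kubectl -n ingress-nginx logs -l app.kubernetes.io/component=controller -f \
  | grep "does not have any active Endpoint"
```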

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 7, 2023
@tombokombo
Contributor

tombokombo commented May 16, 2023

@scott-kausler please provide the output of `kubectl -n $ns get svc,ing,ep,endpointslice` and `kubectl -n $ns get svc,ing,ep,endpointslice -o yaml`

@rdb0101

rdb0101 commented May 19, 2023

Hi, I am having the same issue as reported in this ticket. I initially created a ticket under Rancher (issue 41584), as I wasn't sure whether it was a Rancher issue or isolated to kubernetes/ingress-nginx. Is it possible to provide some insight into why this might be happening?

@rdb0101

rdb0101 commented May 19, 2023

Every 25 to 45 minutes the service is available, but then during the next interval the Rancher GUI becomes unavailable with "404 page not found", along with the error that the rancher Service does not have any active Endpoint.

@rdb0101

rdb0101 commented May 19, 2023

Hi @tombokombo I ran the commands as recommended; please refer to the output below:

apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-coredns
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:29:46Z"
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
    name: rke2-coredns-rke2-coredns
    namespace: kube-system
    resourceVersion: "668"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: udp-53
      port: 53
      protocol: UDP
      targetPort: 53
    - name: tcp-53
      port: 53
      protocol: TCP
      targetPort: 53
    selector:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/name: rke2-coredns
      k8s-app: kube-dns
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-ingress-nginx
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:20Z"
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      helm.sh/chart: rke2-ingress-nginx-4.5.201
    name: rke2-ingress-nginx-controller-admission
    namespace: kube-system
    resourceVersion: "1183"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - appProtocol: https
      name: https-webhook
      port: 443
      protocol: TCP
      targetPort: webhook
    selector:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/name: rke2-ingress-nginx
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-metrics-server
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:09Z"
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      heritage: Helm
      release: rke2-metrics-server
    name: rke2-metrics-server
    namespace: kube-system
    resourceVersion: "5197581"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
    - name: metrics
      port: 10250
      protocol: TCP
      targetPort: 10250
    selector:
      app: rke2-metrics-server
      release: rke2-metrics-server
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-snapshot-validation-webhook
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:10Z"
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
    name: rke2-snapshot-validation-webhook
    namespace: kube-system
    resourceVersion: "980"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
    selector:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:33Z"
    creationTimestamp: "2023-05-02T17:29:46Z"
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
    name: rke2-coredns-rke2-coredns
    namespace: kube-system
    resourceVersion: "5534372"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-coredns-rke2-coredns-6b9548f79f-fg2th
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-coredns-rke2-coredns-6b9548f79f-n4p5l
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: tcp-53
      port: 53
      protocol: TCP
    - name: udp-53
      port: 53
      protocol: UDP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:23Z"
    creationTimestamp: "2023-05-02T17:30:20Z"
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      helm.sh/chart: rke2-ingress-nginx-4.5.201
    name: rke2-ingress-nginx-controller-admission
    namespace: kube-system
    resourceVersion: "5534140"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-2h95m
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-8hvtl
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-c8x24
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-df4lk
        namespace: kube-system
        uid: REDACTED
    ports:
    - appProtocol: https
      name: https-webhook
      port: 8443
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:09Z"
    creationTimestamp: "2023-05-02T17:30:09Z"
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      heritage: Helm
      release: rke2-metrics-server
    name: rke2-metrics-server
    namespace: kube-system
    resourceVersion: "5533133"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-metrics-server-7d58bbc9c6-xvgg8
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: metrics
      port: 10250
      protocol: TCP
    - name: https
      port: 10250
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:10Z"
    creationTimestamp: "2023-05-02T17:30:10Z"
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
    name: rke2-snapshot-validation-webhook
    namespace: kube-system
    resourceVersion: "5533131"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-snapshot-validation-webhook-7748dbf6ff-xdtm2
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: https
      port: 8443
      protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-coredns-rke2-coredns-6b9548f79f-fg2th
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-coredns-rke2-coredns-6b9548f79f-n4p5l
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:33Z"
    creationTimestamp: "2023-05-02T17:29:46Z"
    generateName: rke2-coredns-rke2-coredns-
    generation: 78
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
      kubernetes.io/service-name: rke2-coredns-rke2-coredns
    name: rke2-coredns-rke2-coredns-d7srf
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-coredns-rke2-coredns
      uid: REDACTED
    resourceVersion: "5534370"
    uid: REDACTED
  ports:
  - name: tcp-53
    port: 53
    protocol: TCP
  - name: udp-53
    port: 53
    protocol: UDP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-2h95m
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-c8x24
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-df4lk
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-8hvtl
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:23Z"
    creationTimestamp: "2023-05-02T17:30:20Z"
    generateName: rke2-ingress-nginx-controller-admission-
    generation: 265
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-ingress-nginx-4.5.201
      kubernetes.io/service-name: rke2-ingress-nginx-controller-admission
    name: rke2-ingress-nginx-controller-admission-g25cm
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-ingress-nginx-controller-admission
      uid: REDACTED
    resourceVersion: "5534139"
    uid: REDACTED
  ports:
  - appProtocol: https
    name: https-webhook
    port: 8443
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-metrics-server-7d58bbc9c6-xvgg8
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:09Z"
    creationTimestamp: "2023-05-02T17:30:09Z"
    generateName: rke2-metrics-server-
    generation: 27
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      heritage: Helm
      kubernetes.io/service-name: rke2-metrics-server
      release: rke2-metrics-server
    name: rke2-metrics-server-wmz2b
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-metrics-server
      uid: REDACTED
    resourceVersion: "5533128"
    uid: REDACTED
  ports:
  - name: metrics
    port: 10250
    protocol: TCP
  - name: https
    port: 10250
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-snapshot-validation-webhook-7748dbf6ff-xdtm2
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:10Z"
    creationTimestamp: "2023-05-02T17:30:10Z"
    generateName: rke2-snapshot-validation-webhook-
    generation: 16
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
      kubernetes.io/service-name: rke2-snapshot-validation-webhook
    name: rke2-snapshot-validation-webhook-mzc9v
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-snapshot-validation-webhook
      uid: REDACTED
    resourceVersion: "5533125"
    uid: REDACTED
  ports:
  - name: https
    port: 8443
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""
---
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-coredns
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:29:46Z"
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
    name: rke2-coredns-rke2-coredns
    namespace: kube-system
    resourceVersion: "668"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: udp-53
      port: 53
      protocol: UDP
      targetPort: 53
    - name: tcp-53
      port: 53
      protocol: TCP
      targetPort: 53
    selector:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/name: rke2-coredns
      k8s-app: kube-dns
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-ingress-nginx
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:20Z"
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      helm.sh/chart: rke2-ingress-nginx-4.5.201
    name: rke2-ingress-nginx-controller-admission
    namespace: kube-system
    resourceVersion: "1183"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - appProtocol: https
      name: https-webhook
      port: 443
      protocol: TCP
      targetPort: webhook
    selector:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/name: rke2-ingress-nginx
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-metrics-server
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:09Z"
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      heritage: Helm
      release: rke2-metrics-server
    name: rke2-metrics-server
    namespace: kube-system
    resourceVersion: "5197581"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
    - name: metrics
      port: 10250
      protocol: TCP
      targetPort: 10250
    selector:
      app: rke2-metrics-server
      release: rke2-metrics-server
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rke2-snapshot-validation-webhook
      meta.helm.sh/release-namespace: kube-system
    creationTimestamp: "2023-05-02T17:30:10Z"
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
    name: rke2-snapshot-validation-webhook
    namespace: kube-system
    resourceVersion: "980"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: https
    selector:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:33Z"
    creationTimestamp: "2023-05-02T17:29:46Z"
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
    name: rke2-coredns-rke2-coredns
    namespace: kube-system
    resourceVersion: "5534372"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-coredns-rke2-coredns-6b9548f79f-fg2th
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-coredns-rke2-coredns-6b9548f79f-n4p5l
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: tcp-53
      port: 53
      protocol: TCP
    - name: udp-53
      port: 53
      protocol: UDP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:23Z"
    creationTimestamp: "2023-05-02T17:30:20Z"
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      helm.sh/chart: rke2-ingress-nginx-4.5.201
    name: rke2-ingress-nginx-controller-admission
    namespace: kube-system
    resourceVersion: "5534140"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-2h95m
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-8hvtl
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-c8x24
        namespace: kube-system
        uid: REDACTED
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-ingress-nginx-controller-df4lk
        namespace: kube-system
        uid: REDACTED
    ports:
    - appProtocol: https
      name: https-webhook
      port: 8443
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:09Z"
    creationTimestamp: "2023-05-02T17:30:09Z"
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      heritage: Helm
      release: rke2-metrics-server
    name: rke2-metrics-server
    namespace: kube-system
    resourceVersion: "5533133"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-metrics-server-7d58bbc9c6-xvgg8
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: metrics
      port: 10250
      protocol: TCP
    - name: https
      port: 10250
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:10Z"
    creationTimestamp: "2023-05-02T17:30:10Z"
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
    name: rke2-snapshot-validation-webhook
    namespace: kube-system
    resourceVersion: "5533131"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: REDACTED
      nodeName: REDACTED
      targetRef:
        kind: Pod
        name: rke2-snapshot-validation-webhook-7748dbf6ff-xdtm2
        namespace: kube-system
        uid: REDACTED
    ports:
    - name: https
      port: 8443
      protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-coredns-rke2-coredns-6b9548f79f-fg2th
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-coredns-rke2-coredns-6b9548f79f-n4p5l
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:33Z"
    creationTimestamp: "2023-05-02T17:29:46Z"
    generateName: rke2-coredns-rke2-coredns-
    generation: 78
    labels:
      app.kubernetes.io/instance: rke2-coredns
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-coredns
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-coredns-1.19.402
      k8s-app: kube-dns
      kubernetes.io/cluster-service: "true"
      kubernetes.io/name: CoreDNS
      kubernetes.io/service-name: rke2-coredns-rke2-coredns
    name: rke2-coredns-rke2-coredns-d7srf
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-coredns-rke2-coredns
      uid: REDACTED
    resourceVersion: "5534370"
    uid: REDACTED
  ports:
  - name: tcp-53
    port: 53
    protocol: TCP
  - name: udp-53
    port: 53
    protocol: UDP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-2h95m
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-c8x24
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-df4lk
      namespace: kube-system
      uid: REDACTED
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-ingress-nginx-controller-8hvtl
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-19T10:05:23Z"
    creationTimestamp: "2023-05-02T17:30:20Z"
    generateName: rke2-ingress-nginx-controller-admission-
    generation: 265
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: rke2-ingress-nginx
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-ingress-nginx
      app.kubernetes.io/part-of: rke2-ingress-nginx
      app.kubernetes.io/version: 1.6.4
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-ingress-nginx-4.5.201
      kubernetes.io/service-name: rke2-ingress-nginx-controller-admission
    name: rke2-ingress-nginx-controller-admission-g25cm
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-ingress-nginx-controller-admission
      uid: REDACTED
    resourceVersion: "5534139"
    uid: REDACTED
  ports:
  - appProtocol: https
    name: https-webhook
    port: 8443
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-metrics-server-7d58bbc9c6-xvgg8
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:09Z"
    creationTimestamp: "2023-05-02T17:30:09Z"
    generateName: rke2-metrics-server-
    generation: 27
    labels:
      app: rke2-metrics-server
      app.kubernetes.io/managed-by: Helm
      chart: rke2-metrics-server-2.11.100-build2022101107
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      heritage: Helm
      kubernetes.io/service-name: rke2-metrics-server
      release: rke2-metrics-server
    name: rke2-metrics-server-wmz2b
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-metrics-server
      uid: REDACTED
    resourceVersion: "5533128"
    uid: REDACTED
  ports:
  - name: metrics
    port: 10250
    protocol: TCP
  - name: https
    port: 10250
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - REDACTED
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: REDACTED
    targetRef:
      kind: Pod
      name: rke2-snapshot-validation-webhook-7748dbf6ff-xdtm2
      namespace: kube-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-02T17:30:10Z"
    creationTimestamp: "2023-05-02T17:30:10Z"
    generateName: rke2-snapshot-validation-webhook-
    generation: 16
    labels:
      app.kubernetes.io/instance: rke2-snapshot-validation-webhook
      app.kubernetes.io/managed-by: Helm
      app.kubernetes.io/name: rke2-snapshot-validation-webhook
      app.kubernetes.io/version: v6.2.1
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      helm.sh/chart: rke2-snapshot-validation-webhook-1.7.100
      kubernetes.io/service-name: rke2-snapshot-validation-webhook
    name: rke2-snapshot-validation-webhook-mzc9v
    namespace: kube-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rke2-snapshot-validation-webhook
      uid: REDACTED
    resourceVersion: "5533125"
    uid: REDACTED
  ports:
  - name: https
    port: 8443
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""

@mario-juarez
Copy link

mario-juarez commented May 19, 2023

Hi, I am having the same problem reported in this issue, and I noticed it only happens when the service name is too long. It was introduced in #8890, when the controller migrated to EndpointSlices.

This error didn't happen with Endpoints because an Endpoints object always has the same name as its Service, but EndpointSlice names are truncated when the service name is too long, and the controller is trying to look up the EndpointSlices by service name, which then doesn't match.

Example:

# kubectl get endpoints -n my-awesome-service | grep sensorgroup    
my-awesome-service-telemetry-online-processor-dlc-sensorgroup     10.0.0.21:8080 
# kubectl get EndpointSlice -n my-awesome-service | grep sensorgr   
my-awesome-service-telemetry-online-processor-dlc-sensorgrn4mvj   IPv4          8080      10.0.0.21                                         35d

I think this issue is related and could be the fix #9908
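As a minimal sketch of the truncation described above (the 63-character cap and 5-character random suffix mirror the API server's generate-name behavior; the service name is the one from the example), this shows why a long service name stops being a prefix of its EndpointSlice name:

```shell
# Sketch of generate-name truncation: the base (service name + "-") is cut so
# that base plus a 5-character random suffix stays within 63 characters.
svc="my-awesome-service-telemetry-online-processor-dlc-sensorgroup"
max_name=63
suffix_len=5
base="${svc}-"
max_base=$((max_name - suffix_len))   # 58
if [ "${#base}" -gt "$max_base" ]; then
  base="${base:0:$max_base}"
fi
echo "$base"
# Prints my-awesome-service-telemetry-online-processor-dlc-sensorgr: a
# generated name like "${base}n4mvj" no longer starts with "${svc}-", so a
# prefix lookup by service name fails; the kubernetes.io/service-name label
# is the reliable lookup key.
```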

@longwuyuan
Copy link
Contributor

If it's really about long names, then;

@rdb0101
Copy link

rdb0101 commented May 19, 2023

This would then indicate a fix has already been implemented? Also, if it relates to long service names, why is this happening to the "rancher" service, which is not a long name?

@mario-juarez
Copy link

If it's really about long names, then;

Looks like the issue with long service names was fixed in this release: https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.5.1

Thanks @longwuyuan

@rdb0101
Copy link

rdb0101 commented May 19, 2023

Thank you for the information, but if it was fixed then why are these issues still occurring? Do you have any idea why this is the case? Any feedback would be much appreciated!

@rdb0101
Copy link

rdb0101 commented May 22, 2023

@longwuyuan is this issue due to long service names? Is that why the services are being reported as not having an active endpoint?

@rdb0101
Copy link

rdb0101 commented May 23, 2023

Please see below the error logged for the rancher service, along with the EndpointSlice name (service name + generated suffix). Even with the suffix appended, the rancher EndpointSlice name is well under the 63-character limit:

Service "cattle-system/rancher" does not have any active Endpoint.
EndpointSlice name (service name + generated suffix):
rancher-hkpgr

Are services ignored whenever a suffix is appended to the EndpointSlice name? Or are services only ignored when the resulting EndpointSlice name exceeds 63 characters?

Does anyone have any thoughts on this?

Forgot to mention that the services' endpoints/EndpointSlices are periodically recognized and function as expected.
Then, at random, one service will start returning 404 errors, and the controller logs "service does not have an active endpoint" even though the active endpoint exists.

@longwuyuan
Copy link
Contributor

Hi,

The data posted in this issue does not look like something a developer can use to reproduce the problem. Any help reproducing it is welcome.

Any data that gives complete coverage of the bad state is also welcome: logs combined with the output of kubectl describe ... for all related objects (controller, application, ingress) and components (pod, svc, ingress, ep, endpointslices, etc.) while the problem is actively in play.

@rdb0101
Copy link

rdb0101 commented May 23, 2023

@longwuyuan please see the requested output below; the only log message found for this issue is "Service cattle-system/rancher does not have any active Endpoint"

# kubectl -n cattle-system get svc,ing,ep,endpointslice

NAME                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/rancher           ClusterIP   REDACTED     <none>        80/TCP,443/TCP   25h
service/rancher-webhook   ClusterIP   REDACTED    <none>        443/TCP          24h
service/webhook-service   ClusterIP   REDACTED   <none>        443/TCP          24h

NAME                                CLASS    HOSTS                          ADDRESS                                                     PORTS     AGE
ingress.networking.k8s.io/rancher   <none>   REDACTED                       REDACTED                                                    80, 443   25h

NAME                        ENDPOINTS                                          AGE
endpoints/rancher           HOST1:80,HOST2:80,HOST3:80 + 3 more...             25h
endpoints/rancher-webhook   HOST1:9443                                         24h
endpoints/webhook-service   HOST1:8777                                         24h

NAME                                                   ADDRESSTYPE   PORTS    ENDPOINTS                          AGE
endpointslice.discovery.k8s.io/rancher-hkpgr           IPv4          80,444   HOST2,HOST3,HOST1                  25h
endpointslice.discovery.k8s.io/rancher-webhook-sfgns   IPv4          9443     HOST1                              24h
endpointslice.discovery.k8s.io/webhook-service-b4s92   IPv4          8777     HOST1                              24h

--- 
 # kubectl -n cattle-system get svc,ing,ep,endpointslice -o yaml
 
apiVersion: v1
items:
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rancher
      meta.helm.sh/release-namespace: cattle-system
    creationTimestamp: "2023-05-22T15:11:46Z"
    labels:
      app: rancher
      app.kubernetes.io/managed-by: Helm
      chart: rancher-2.7.3
      heritage: Helm
      release: rancher
    name: rancher
    namespace: cattle-system
    resourceVersion: "5250"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: http
      port: 80
      protocol: TCP
      targetPort: 80
    - name: https-internal
      port: 443
      protocol: TCP
      targetPort: 444
    selector:
      app: rancher
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rancher-webhook
      meta.helm.sh/release-namespace: cattle-system
    creationTimestamp: "2023-05-22T15:17:43Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: rancher-webhook
    namespace: cattle-system
    resourceVersion: "9776"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: 9443
    selector:
      app: rancher-webhook
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: v1
  kind: Service
  metadata:
    annotations:
      meta.helm.sh/release-name: rancher-webhook
      meta.helm.sh/release-namespace: cattle-system
      need-a-cert.cattle.io/secret-name: rancher-webhook-tls
    creationTimestamp: "2023-05-22T15:17:43Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: webhook-service
    namespace: cattle-system
    resourceVersion: "9772"
    uid: REDACTED
  spec:
    clusterIP: REDACTED
    clusterIPs:
    - REDACTED
    internalTrafficPolicy: Cluster
    ipFamilies:
    - IPv4
    ipFamilyPolicy: SingleStack
    ports:
    - name: https
      port: 443
      protocol: TCP
      targetPort: 8777
    selector:
      app: rancher-webhook
    sessionAffinity: None
    type: ClusterIP
  status:
    loadBalancer: {}
- apiVersion: networking.k8s.io/v1
  kind: Ingress
  metadata:
    annotations:
      field.cattle.io/publicEndpoints: '[{"addresses":["WORKER1","CONTROLPLANE","WORKER3","WORKER2"],"port":443,"protocol":"HTTPS","serviceName":"cattle-system:rancher","ingressName":"cattle-system:rancher","hostname":"CONTROLPLANE-HOSTNAME","allNodes":false}]'
      meta.helm.sh/release-name: rancher
      meta.helm.sh/release-namespace: cattle-system
      nginx.ingress.kubernetes.io/proxy-connect-timeout: "30"
      nginx.ingress.kubernetes.io/proxy-read-timeout: "1800"
      nginx.ingress.kubernetes.io/proxy-send-timeout: "1800"
    creationTimestamp: "2023-05-22T15:11:46Z"
    generation: 1
    labels:
      app: rancher
      app.kubernetes.io/managed-by: Helm
      chart: rancher-2.7.3
      heritage: Helm
      release: rancher
    name: rancher
    namespace: cattle-system
    resourceVersion: "301991"
    uid: REDACTED
  spec:
    rules:
    - host: CONTROLPLANE-HOSTNAME
      http:
        paths:
        - backend:
            service:
              name: rancher
              port:
                number: 80
          pathType: ImplementationSpecific
    tls:
    - hosts:
      - CONTROLPLANE-HOSTNAME
      secretName: tls-rancher-ingress
  status:
    loadBalancer:
      ingress:
      - ip: WORKER1
      - ip: CONTROLPLANE
      - ip: WORKER3
      - ip: WORKER2
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-23T10:10:55Z"
    creationTimestamp: "2023-05-22T15:11:46Z"
    labels:
      app: rancher
      app.kubernetes.io/managed-by: Helm
      chart: rancher-2.7.3
      heritage: Helm
      release: rancher
    name: rancher
    namespace: cattle-system
    resourceVersion: "301212"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: HOST1
      nodeName: CONTROLPLANE
      targetRef:
        kind: Pod
        name: rancher-6b4977f897-jrzjx
        namespace: cattle-system
        uid: REDACTED
    - ip: HOST2
      nodeName: WORKER2
      targetRef:
        kind: Pod
        name: rancher-6b4977f897-6sf47
        namespace: cattle-system
        uid: REDACTED
    - ip: HOST3
      nodeName: WORKER3
      targetRef:
        kind: Pod
        name: rancher-6b4977f897-xx8gf
        namespace: cattle-system
        uid: REDACTED
    ports:
    - name: http
      port: 80
      protocol: TCP
    - name: https-internal
      port: 444
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-23T10:10:06Z"
    creationTimestamp: "2023-05-22T15:17:44Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: rancher-webhook
    namespace: cattle-system
    resourceVersion: "300547"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: HOST1
      nodeName: WORKER3
      targetRef:
        kind: Pod
        name: rancher-webhook-656cd8b9f-cbjbw
        namespace: cattle-system
        uid: REDACTED
    ports:
    - name: https
      port: 9443
      protocol: TCP
- apiVersion: v1
  kind: Endpoints
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-23T10:10:06Z"
    creationTimestamp: "2023-05-22T15:17:44Z"
    labels:
      app.kubernetes.io/managed-by: Helm
    name: webhook-service
    namespace: cattle-system
    resourceVersion: "300546"
    uid: REDACTED
  subsets:
  - addresses:
    - ip: HOST1
      nodeName: WORKER3
      targetRef:
        kind: Pod
        name: rancher-webhook-656cd8b9f-cbjbw
        namespace: cattle-system
        uid: REDACTED
    ports:
    - name: https
      port: 8777
      protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - HOST2
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: WORKER2
    targetRef:
      kind: Pod
      name: rancher-6b4977f897-6sf47
      namespace: cattle-system
      uid: REDACTED
  - addresses:
    - HOST3
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: WORKER3
    targetRef:
      kind: Pod
      name: rancher-6b4977f897-xx8gf
      namespace: cattle-system
      uid: REDACTED
  - addresses:
    - HOST1
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: CONTROLPLANE
    targetRef:
      kind: Pod
      name: rancher-6b4977f897-jrzjx
      namespace: cattle-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    annotations:
      endpoints.kubernetes.io/last-change-trigger-time: "2023-05-23T10:10:55Z"
    creationTimestamp: "2023-05-22T15:11:46Z"
    generateName: rancher-
    generation: 20
    labels:
      app: rancher
      app.kubernetes.io/managed-by: Helm
      chart: rancher-2.7.3
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      heritage: Helm
      kubernetes.io/service-name: rancher
      release: rancher
    name: rancher-hkpgr
    namespace: cattle-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rancher
      uid: REDACTED
    resourceVersion: "301213"
    uid: REDACTED
  ports:
  - name: http
    port: 80
    protocol: TCP
  - name: https-internal
    port: 444
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - HOST1
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: WORKER3
    targetRef:
      kind: Pod
      name: rancher-webhook-656cd8b9f-cbjbw
      namespace: cattle-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    creationTimestamp: "2023-05-22T15:17:44Z"
    generateName: rancher-webhook-
    generation: 6
    labels:
      app.kubernetes.io/managed-by: Helm
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      kubernetes.io/service-name: rancher-webhook
    name: rancher-webhook-sfgns
    namespace: cattle-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: rancher-webhook
      uid: REDACTED
    resourceVersion: "300903"
    uid: REDACTED
  ports:
  - name: https
    port: 9443
    protocol: TCP
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - HOST1
    conditions:
      ready: true
      serving: true
      terminating: false
    nodeName: WORKER3
    targetRef:
      kind: Pod
      name: rancher-webhook-656cd8b9f-cbjbw
      namespace: cattle-system
      uid: REDACTED
  kind: EndpointSlice
  metadata:
    creationTimestamp: "2023-05-22T15:17:44Z"
    generateName: webhook-service-
    generation: 6
    labels:
      app.kubernetes.io/managed-by: Helm
      endpointslice.kubernetes.io/managed-by: endpointslice-controller.k8s.io
      kubernetes.io/service-name: webhook-service
    name: webhook-service-b4s92
    namespace: cattle-system
    ownerReferences:
    - apiVersion: v1
      blockOwnerDeletion: true
      controller: true
      kind: Service
      name: webhook-service
      uid: REDACTED
    resourceVersion: "300904"
    uid: REDACTED
  ports:
  - name: https
    port: 8777
    protocol: TCP
kind: List
metadata:
  resourceVersion: ""

@rdb0101
Copy link

rdb0101 commented May 23, 2023

When it errors out with the 404 page not found the following is logged in the "rke2-ingress-nginx-controller-" logs:

I0523 10:03:33.790084       7 store.go:433] "Found valid IngressClass" ingress="cattle-system/rancher" ingressclass="_"
W0523 10:04:21.542696       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:24.876046       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:32.725971       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:36.060111       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:39.393421       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:42.726233       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:46.059925       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:49.392749       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:52.726522       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:56.059866       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:04:59.393704       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:05:02.726128       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:05:06.060042       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint

It logs the same error above for each service that periodically times out.

@longwuyuan
Copy link
Contributor

@rdb0101 your latest post above is one example of data that cannot be used to analyze or reproduce the problem.

To be precise, if someone can post the controller pod logs and the output of kubectl get endpointslices -n cattle-system while the problem is live, then the timestamps in the log messages can be correlated with the kubectl output. The output of kubectl -n cattle-system get events will also provide useful information.

If you post kubectl get po -n cattle-system, you can see the restarts, if any.

If you look at the rancher pod logs, you could check whether any rancher events are related.

In any case, I don't think any developer can reproduce this problem with the information currently posted in this issue.

@rdb0101
Copy link

rdb0101 commented May 23, 2023

@longwuyuan Thank you for clarifying what data is needed to make the problem reproducible. Please see below the errors logged as the rancher service goes from having no active endpoint to an ingress resync:

controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
W0523 10:05:19.393530       7 controller.go:1163] Service "cattle-system/rancher" does not have any active Endpoint.
I0523 10:10:08.404944       7 event.go:285] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"cattle-system", Name:"rancher", UID:"REDACTED", APIVersion:"networking.k8s.io/v1", ResourceVersion:"300576", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
I0523 10:12:09.170841       7 status.go:300] "updating Ingress status" namespace="cattle-system" ingress="rancher" currentValue=[{IP:CONTROLPLANE Hostname: Ports:[]} {IP:WORKER3 Hostname: Ports:[]} {IP:WORKER2 Hostname: Ports:[]}] newValue=[{IP:WORKER1 Hostname: Ports:[]} {IP:CONTROLPLANE Hostname: Ports:[]} {IP:WORKER3 Hostname: Ports:[]} {IP:WORKER2 Hostname: Ports:[]}]

Please note that this problem is reproducible by setting up rke2 with a Helm install of Rancher 2.7.3.
This exact issue occurs even in a minimal install.

@longwuyuan
Copy link
Contributor

@rdb0101 I am sorry you are having this issue and I hope it resolves sooner. Here are my thoughts and I hope you see the practical side of an issue being created here in this project.

  • If there is a bug/problem in the controller code, then it will occur even on non-Rancher deployments, like a deployment created using --image nginx:alpine. So if the problem is specific to Rancher, you should talk to the Rancher forum. They have Slack as well as a GitHub project

  • Currently, I guess that if you collect the info below during a live outage, you can help others know where to look for the cause;

    • rancher pod logs when problem is live
    • controller pod logs when problem is live
    • kubectl get events -A when problem is live
    • kubectl describe of pod,svc,ingress,endpointslices for rancher when problem is live
    • kubectl describe of ingress when problem is live

@rdb0101
Copy link

rdb0101 commented May 23, 2023

Hi @longwuyuan, thanks very much for your feedback. I used rancher just as an example; the issue is not specific to rancher and impacts all of the services I have deployed. I chose rancher as the example because its EndpointSlice name (service name + suffix) is well under the 63-character limit, and I was trying to determine how or whether the nginx controller could be filtering out even the rancher service despite that. I apologize again if my feedback was unclear. If this issue were specific to rancher, it would only impact the rancher service, correct?

@longwuyuan
Copy link
Contributor

Correct.

I am using v1.7.1 of the controller with TLS and I don't face this problem.
Can you try to reproduce the problem in minikube, using the image nginx:alpine to create a deployment and exposing it with the ingress-nginx controller and MetalLB?

@rdb0101
Copy link

rdb0101 commented May 24, 2023

@longwuyuan Thanks very much for the feedback. I will go ahead and stand up minikube with the version and image as recommended. I will provide the output once I have reproduced the issue.

@rdb0101
Copy link

rdb0101 commented May 24, 2023

@longwuyuan Is your current environment multi-node as well?

@longwuyuan
Copy link
Contributor

no

@mconigliaro
Copy link

Wow, this may have been an issue as early as 2018: #3060 (comment)

@mconigliaro
Copy link

The reporter of #6962 says this started happening when he added port names to his service. We're using port names, and all the manifest examples I see in this thread have port names. Does anyone have an example of this happening without port names?

@longwuyuan
Copy link
Contributor

Until we have some way to reproduce or some helpful data that is convincing, I am not sure what a developer would do to address this issue

@mconigliaro
Copy link

I agree. 6 years of bug reports isn't convincing. We need a few more years. 😂

@longwuyuan
Copy link
Contributor

Its OSS so your sentiment is ack'd.

If you can help me reproduce, I'll appreciate it

@mconigliaro
Copy link

I'm just agreeing that 6 years of bug reports is not nearly enough time to be "convincing." I think people are coming here to report the same problem over and over for fun. And honestly, who can blame them? It really is great fun! 😂

I posted the helm chart I'm using with params above. Seems like a pretty basic setup. If I really wanted to reproduce this, I'd just deploy some kind of hello world app and slam it with requests until the problem occurred. I'd also pay close attention to what happens when I add/remove other hello world apps in the same cluster (all of which are being proxied by the same ingress-nginx instance of course). I just don't have the time to do that right now, and I'm guessing neither does anyone else.

In the meantime, the best clue I have is that port name thing. When I have some time, I'll try removing the port names from the helm chart in my own app and see if that makes a difference. But before I take the time to do that, hopefully someone else will chime in and let us know if they're seeing this problem without port names.

I don't know a lot about Kubernetes internals, so this is a total shot in the dark based on dealing with DNS issues for more years than I have fingers and toes, but the more I dig into this, the more this smells like yet another DNS issue to me...

SRV Records are created for named ports that are part of normal or headless services. For each named port, the SRV record has the form _port-name._port-protocol.my-svc.my-namespace.svc.cluster-domain.example. For a regular Service, this resolves to the port number and the domain name: my-svc.my-namespace.svc.cluster-domain.example. ...

https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#srv-records
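For illustration, a small sketch of the SRV name shape those docs describe (the port, protocol, service, and namespace values here are assumed examples, and cluster.local is the default cluster domain):

```shell
# Assemble the DNS SRV record name for a hypothetical named port, following
# the _port-name._port-protocol.svc-name.namespace.svc.cluster-domain pattern:
port_name="http"
protocol="tcp"
svc="rancher"
ns="cattle-system"
srv="_${port_name}._${protocol}.${svc}.${ns}.svc.cluster.local"
echo "$srv"   # _http._tcp.rancher.cattle-system.svc.cluster.local
```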

@strongjz
Copy link
Member

@mconigliaro I'd be interested to see the testing without named ports.

@mconigliaro
Copy link

I'm sad to report that the problem still occurs when using port numbers instead of names, but I'm happy to report that it's easily reproducible. I can also say the Service "<name>" does not have any active Endpoint message definitely seems correlated.

I made a script to run all the commands in #9932 (comment), but it takes way too long to run (20+ secs), and that's longer than the window in which the problem occurs, so I doubt most of the data will be valid. What are the most important commands I should run?

@longwuyuan
Copy link
Contributor

In which resource's spec did you use port numbers instead of names for the ports?

@mconigliaro
Copy link

I had names in my service (as described in #6962), deployment, and ingress. I just tried to remove the names everywhere I could find them.

@mconigliaro
Copy link

I'm now able to reproduce this pretty easily with a simple bash while loop:

# Loop until curl fails, then immediately inspect the service:
while curl -v --fail "$curlurl"; do echo; done
kubectl --context "$context" -n "$appnamespace" describe svc "$appsvcname"

Everything looks fine until suddenly...

10.4.150.142 - - [29/Feb/2024:18:51:06 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.069 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.072 200 09e0e90823fb7bd53b9982d97cc10d3d
10.4.139.35 - - [29/Feb/2024:18:51:06 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.074 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.072 200 eca8deb8bd9b54cf9a85c997a03f145f
10.4.153.22 - - [29/Feb/2024:18:51:07 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.106 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.108 200 2ff199638f5f56aa418186c93ddd2481
10.4.157.153 - - [29/Feb/2024:18:51:07 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.116 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.116 200 e8cbbd94df388ee5579e0c658e3fffa0
10.4.150.142 - - [29/Feb/2024:18:51:08 +0000] "GET /healthcheck HTTP/1.1" 200 75 "-" "curl/8.4.0" 256 0.254 [cmd-eph-mb-503-heartbreat-webapp-3000] [] 10.4.146.28:3000 75 0.252 200 4ffc2e0b50ca5cd3c797295ef66a407f
W0229 18:51:08.885353       8 controller.go:1112] Service "cmd-eph-mb-503-heartbreat/webapp" does not have any active Endpoint.

curl fails a second later...

* Trying 10.4.138.132:443...
* Connected to cmd-eph-mb-503-heartbreat-app.dev.redacted.io (10.4.138.132) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=*.dev.redacted.io
*  start date: Jul  5 00:00:00 2023 GMT
*  expire date: Aug  2 23:59:59 2024 GMT
*  subjectAltName: host "cmd-eph-mb-503-heartbreat-app.dev.redacted.io" matched cert's "*.dev.redacted.io"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://cmd-eph-mb-503-heartbreat-app.dev.redacted.io/healthcheck
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: cmd-eph-mb-503-heartbreat-app.dev.redacted.io]
* [HTTP/2] [1] [:path: /healthcheck]
* [HTTP/2] [1] [user-agent: curl/8.4.0]
* [HTTP/2] [1] [accept: */*]
> GET /healthcheck HTTP/2
> Host: cmd-eph-mb-503-heartbreat-app.dev.redacted.io
> User-Agent: curl/8.4.0
> Accept: */*
>
< HTTP/2 503
< date: Thu, 29 Feb 2024 18:51:09 GMT
< content-type: text/html
< content-length: 190
< strict-transport-security: max-age=15724800; includeSubDomains
* The requested URL returned error: 503
* Connection #0 to host cmd-eph-mb-503-heartbreat-app.dev.redacted.io left intact
curl: (22) The requested URL returned error: 503

Where did my endpoint go?

Name:                     webapp
Namespace:                cmd-eph-mb-503-heartbreat
Labels:                   app.kubernetes.io/component=webapp
                          app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=cmd-webapp
                          argocd.argoproj.io/instance=cmd-eph-mb-503-heartbreat
                          helm.sh/chart=cmd-webapp-0.1.0
Annotations:              <none>
Selector:                 app.kubernetes.io/component=webapp,app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat,app.kubernetes.io/name=cmd-webapp
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.186.24
IPs:                      172.20.186.24
Port:                     <unset>  3000/TCP
TargetPort:               3000/TCP
NodePort:                 <unset>  30837/TCP
Endpoints:
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

But then it magically comes back a second or two later?

Name:                     webapp
Namespace:                cmd-eph-mb-503-heartbreat
Labels:                   app.kubernetes.io/component=webapp
                          app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=cmd-webapp
                          argocd.argoproj.io/instance=cmd-eph-mb-503-heartbreat
                          helm.sh/chart=cmd-webapp-0.1.0
Annotations:              <none>
Selector:                 app.kubernetes.io/component=webapp,app.kubernetes.io/instance=cmd-eph-mb-503-heartbreat,app.kubernetes.io/name=cmd-webapp
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.186.24
IPs:                      172.20.186.24
Port:                     <unset>  3000/TCP
TargetPort:               3000/TCP
NodePort:                 <unset>  30837/TCP
Endpoints:                10.4.146.28:3000
Session Affinity:         None
External Traffic Policy:  Cluster
Events:                   <none>

Let me know what other info might be helpful, but note that I only have a second or two to catch it.

@mconigliaro

OK, it turns out even a second or two is not small enough of a window to catch this most of the time. I now have commands running in two separate terminals:

wrk https://cmd-eph-mb-503-heartbreat-app.dev.redacted.io/healthcheck -c 20 -d 60

while kubectl --context $context -n $appnamespace describe svc $appsvcname; do echo; done

When I do this, I definitely see the Endpoint 10.4.146.28:3000 disappearing and reappearing randomly. I now believe this is load related, since it happens much more frequently if I increase the number of wrk connections (e.g. -c 200).
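The flapping can be logged automatically instead of eyeballed in two terminals. Below is a minimal sketch (the `$appnamespace`/`$appsvcname` variables come from the commands above; the watch loop itself is hypothetical and needs kubectl) that classifies each `kubectl describe svc` poll as `active` or `empty`:

```shell
#!/bin/sh
# endpoints_state reads `kubectl describe svc` output on stdin and prints
# "active" if the Endpoints: line lists at least one address, "empty" if it
# lists none, and "unknown" if no Endpoints: line is present.
endpoints_state() {
  awk '/^Endpoints:/ { print (NF > 1 ? "active" : "empty"); found = 1 }
       END { if (!found) print "unknown" }'
}

# Hypothetical watch loop, timestamping every transition candidate:
#   while :; do
#     printf '%s %s\n' "$(date +%T)" \
#       "$(kubectl -n "$appnamespace" describe svc "$appsvcname" | endpoints_state)"
#     sleep 1
#   done
```

Piping the loop's output through `uniq -f1` would then leave only the moments where the state actually flipped.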

@strongjz
Member

Does this still happen on 1.9.X and 1.10.0?

@mconigliaro

I just upgraded to helm chart 4.10.0 and it's still happening.

NGINX Ingress controller
  Release:       v1.10.0
  Build:         71f78d49f0a496c31d4c19f095469f3f23900f8a
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.25.3

But what I'm not sure of is whether nginx is causing the problem or just revealing it. What would cause nginx to remove endpoints from services like that? Seems unlikely, but this is also the only place we're seeing this problem (we only use nginx to proxy to our ephemeral dev environments, and we use AWS load balancers in production). And it's interesting that other people seem to be reporting similar behavior.

@mconigliaro

I'm back, and I'm now 99% sure the root cause was that we were running out of IP addresses in our EKS cluster. I killed a bunch of unnecessary pods and the random 503s and the "active Endpoint" message went away. I never found any error messages about this in our EKS logs, and I never saw anything else complaining. I only figured it out when I saw a suspicious-looking message about IP addresses on one of our services while poking around the cluster with Lens. Somehow, the only clue that something was wrong at the cluster level was this error message in the nginx controller logs. I'll bet there are a whole bunch of things that might trigger this message (which would explain the six years of bug reports). Apologies for defaming DNS, and thanks to nginx for this error message!
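If subnet IP exhaustion is the suspect on EKS, the remaining capacity of the pod subnets can be checked directly. The AWS CLI call below is real but needs credentials, so the helper only filters its `--output text` form (one `SubnetId<TAB>AvailableIpAddressCount` pair per line); the threshold of 10 free IPs is an arbitrary assumption:

```shell
#!/bin/sh
# Prints subnets whose free-IP count is below a threshold. Feed it the
# output of:
#   aws ec2 describe-subnets \
#     --query 'Subnets[].[SubnetId,AvailableIpAddressCount]' --output text
low_ip_subnets() {  # usage: low_ip_subnets THRESHOLD < subnet-list
  awk -v min="$1" '($2 + 0) < (min + 0) { print $1, $2 }'
}
```

Running this periodically would have surfaced the "suspicious-looking message about IP addresses" long before poking around with Lens did.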

@longwuyuan
Contributor

/assign

@longwuyuan
Contributor

In that case, configuring a very small subnet on minikube or kind and manually exhausting the IP addresses could potentially reproduce the error message.

@akalinux

akalinux commented Mar 26, 2024

I am having the same issue. Is there any progress on this?

W0326 19:32:27.115911       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"err":"secrets \"ingress-nginx-admission\" not found","level":"info","msg":"no secret found","source":"k8s/k8s.go:229","time":"2024-03-26T19:32:27Z"}
{"level":"info","msg":"creating new secret","source":"cmd/create.go:28","time":"2024-03-26T19:32:27Z"}
W0326 19:32:40.821000       1 client_config.go:618] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
{"level":"info","msg":"patching webhook configurations 'ingress-nginx-admission' mutating=false, validating=true, failurePolicy=Fail","source":"k8s/k8s.go:118","time":"2024-03-26T19:32:40Z"}
{"level":"info","msg":"Patched hook(s)","source":"k8s/k8s.go:138","time":"2024-03-26T19:32:40Z"}
I0326 19:34:00.876850       7 event.go:298] Event(v1.ObjectReference{Kind:"Ingress", Namespace:"namespace-socmon-common", Name:"ingress-socmon-webapps-3-dev", UID:"f4ebe45d-da26-4259-8cdf-d996b8cf3e41", APIVersion:"networking.k8s.io/v1", ResourceVersion:"773", FieldPath:""}): type: 'Normal' reason: 'Sync' Scheduled for sync
W0326 19:34:04.206812       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:04.206855       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:04.206868       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
W0326 19:34:07.543761       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:07.543796       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:07.543809       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
W0326 19:34:40.233379       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-3" does not have any active Endpoint.
W0326 19:34:40.233405       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-2" does not have any active Endpoint.
W0326 19:34:40.233415       7 controller.go:1214] Service "namespace-socmon-common/service-socmon-webapps-1" does not have any active Endpoint.
NAME                       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service-socmon-webapps-1   NodePort   10.97.236.80    <none>        80:32765/TCP   39m
service-socmon-webapps-2   NodePort   10.99.7.70      <none>        80:32011/TCP   39m
service-socmon-webapps-3   NodePort   10.102.15.166   <none>        80:31318/TCP   39m

Each service can be reached from inside each container, and the services have never restarted.

*   Trying 10.97.236.80:80...
* Connected to service-socmon-webapps-1 (10.97.236.80) port 80 (#0)
> GET /rest/healthcheck HTTP/1.1
> Host: service-socmon-webapps-1
> User-Agent: curl/7.74.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 26 Mar 2024 20:16:24 GMT
< Content-Type: text/plain; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< Set-Cookie: dancer.session=ZgMtFwAAAIts-zMs87oVwpS0U_0uNFjM; Path=/; SameSite=Lax; HttpOnly; Secure; Expires=Mon, 25-Mar-2024 20:16:24 GMT; Domain=service-socmon-webapps-1
< Cache-Control: private, no-cache, no-store, must-revalidate
< X-Frame-Options: sameorigin
< X-XSS-Protection: 1; mode=block
< 
* Connection #0 to host service-socmon-webapps-1 left intact
OK

Further still, each container maintains a persistent peer-to-peer websocket connection with the other nodes. All the services are up and working between the containers. So the services are working just fine, but for some reason the ingress controller thinks they are down?

netstat -an|grep tcp|grep ESTA|grep ':80'
tcp        0      0 10.244.0.9:36172        10.97.236.80:80         ESTABLISHED
tcp        0      0 10.244.0.9:57410        10.99.7.70:80           ESTABLISHED
tcp        0      0 10.244.0.9:80           10.244.0.8:40938        ESTABLISHED
tcp        0      0 10.244.0.9:80           10.244.0.7:33442        ESTABLISHED

@akalinux

I have an odd update: if I remove the domain name from the ingress files, the ingress controller starts working. I am guessing this has something to do with DNS.

example.txt
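For context, "removing the domain name" here means dropping the `host` field from the ingress rule. A minimal sketch of the two variants (hypothetical names, not taken from the attached example.txt):

```yaml
# Variant 1: host-based rule (the failing case described above).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: socmon-webapps        # hypothetical name
spec:
  ingressClassName: nginx
  rules:
    - host: webapps.example.com   # hypothetical domain; deleting this line
      http:                       # gives Variant 2, which matches any
        paths:                    # hostname and is the change that made
          - path: /               # routing start working
            pathType: Prefix
            backend:
              service:
                name: service-socmon-webapps-1
                port:
                  number: 80
```

A host-based rule only matches requests whose Host/SNI header equals that domain, which is why a DNS or hostname mismatch can look like "no active Endpoint" from the outside.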

Unrelated to this issue: I am also having trouble with the nginx container ignoring the TLS cert. No idea why; it just ignores the secret. (Yeah, I know this is the wrong place to mention this.)

@debdutdeb

I have an odd update: if I remove the domain name from the ingress files, the ingress controller starts working. I am guessing this has something to do with DNS.

This is the exact behaviour we're seeing right now, chart 4.10, app 1.10

We've been at it for hours

@debdutdeb

I'm not sure what of value I can add after reading the whole thread.

@longwuyuan
Contributor

@debdutdeb Wishful thinking is having a step-by-step guide to reproduce the "does not have any active Endpoint" message when the service does have endpoints.

@debdutdeb

I'll try today.

This was on a customer's environment yesterday on AKS.

To be perfectly honest, nginx is my default testing controller every time, and I have never seen this happen, with a new installation at least once a week. So I haven't crossed paths with it myself yet.

@longwuyuan
Contributor

Hi,

This has been reported in multiple issues, and after several occasions of seeing the data on this, one fact has come to light: this problem of the endpoint-related error message is not easy to reproduce at will.

The reason it is hard to reproduce at will is that this state of the endpoints not being available is transient at best and never a bug. Regardless of the volume of resources like compute, memory, and networking (and to a minor extent storage), for every single state transition of a K8S object of kind: Pod there will be a related update to the endpointSlice. The controller relies on the endpointSlice returned to determine the destination of a routed request. If the controller gets the endpoint info after a successful update, then there is no problem. If the controller gets endpoint info that is stale, then it is bound to return this error message.

Developers can explore options to increase the timers around this, but that is exactly what they will be: options. There is no standard to determine which timers are best for every single user, every single use-case, and every single situation in the practical world of K8S clusters. This problem is hard to reproduce at will while simulating a real use case precisely because different clusters will have different situations at different times for updating the endpointSlice.

Hence there is currently no action item on the controller for this, but that may change in the future. At the moment all resources are occupied on security & Gateway-API, so there is no developer time to allocate to this problem beyond triaging and research.

This issue is also adding to the tally of open issues that are not tracking any action item. Because no action item is being tracked here, I will close this issue. The creator of the issue can re-open it with a step-by-step guide to reproduce the problem at will on a kind cluster, if required, using a recent release of the controller.
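One way to observe this staleness window is to compare what the API server reports against what the controller logs at the moment of the warning. A rough helper (assuming `kubectl`'s pretty-printed `-o json` output, where each key sits on its own line; the `kubernetes.io/service-name` label is the standard EndpointSlice selector):

```shell
#!/bin/sh
# ready_addresses counts `"ready": true` endpoint conditions in EndpointSlice
# JSON read from stdin, e.g.:
#   kubectl get endpointslice -n NS \
#     -l kubernetes.io/service-name=SVC -o json | ready_addresses
# It counts matching lines, so it relies on kubectl's one-key-per-line
# pretty-printing rather than parsing the JSON properly.
ready_addresses() {
  grep -c '"ready": true' || true
}
```

If this prints a nonzero count at the same instant the controller logs "does not have any active Endpoint", the controller's cached view was behind the API server's.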

/close

@k8s-ci-robot
Contributor

@longwuyuan: Closing this issue.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@gespi1

gespi1 commented Jan 28, 2025

posted a temporary workaround on another related issue. #6135 (comment)
