"oc exec" is wrongly caching pod-to-node name/certificate/? mappings #11025
Comments
There's no cache expectation… that code runs during the connection handshake, which is uncached by definition. The cert the node is using to serve doesn't have the right hosts in it. Is it possible the wrong certs are being distributed to the second node?
The node has the right cert, and the master is connecting to the right node (verified with strace and tcpdump). The master is just expecting the node to have the wrong cert. (Yeah, I know there's no expectation of caching. But... it's happening somewhere...)
Also, if you restart the openshift-master after doing the steps above and then try the "oc exec" again, it will work
The only expectation is that the cert contain the hostname that was connected to. As long as the master is trying to connect to the right node, the issue is elsewhere (with the serving certs given to the node, the /etc/hosts config mapping node names to IPs, or something else… not sure what)
So, to clarify what happens here: oc connects to the master; the master opens a connection to 172.17.0.3 (the correct IP address for openshift-node-1, the node the pod is running on); openshift-node-1 sends a correct copy of its certificate; and then the master closes the connection, complaining that it didn't get a TLS certificate for openshift-node-2.
that error means the master is opening a connection to |
The TCP connection that gets made is to openshift-node-1:
although it looks like it's passing "openshift-node-2" as the TLS Server Name Indication:
so it's connecting to openshift-node-1, but it thinks it's connecting to openshift-node-2. /etc/hosts and "oc get nodes" both have the right IPs for both nodes
Verified that |
The tlsConfig held by the master's kubelet client is being mutated in pkg/util/proxy/dial.go (https://github.com/kubernetes/kubernetes/blob/master/pkg/util/proxy/dial.go#L72). That means the first exec to any node locks the server name to that node, and all subsequent execs to any other node will fail x509 validation. Opened kubernetes/kubernetes#33140
fixed in #11027 and kubernetes/kubernetes#33141
fixed by #11027
The networking extended tests are currently disabled due to some tests failing with errors like:
This appears to be due to the master incorrectly caching the mapping from pod names to their nodes. If a test launches a pod, runs "oc exec" against it, destroys the pod, and recreates it, and the pod lands on a different node the second time, then "oc exec" to the new pod fails: the master connects to the correct node, but expects it to present the TLS certificate of the old node.
Version
git master
Steps To Reproduce
(the hello-openshift image doesn't have any runnable commands in it, so that error is expected)
(different node from before; if it reuses the same node as it did the first time, delete and recreate the pod until it ends up on the other node)
(It connected to the correct node, but expected it to have the old cert.)
Blocks #10972 (re-enabling the networking extended tests)