
"oc exec" is wrongly caching pod-to-node name/certificate/? mappings #11025

Closed
danwinship opened this issue Sep 20, 2016 · 11 comments
Labels
component/cli kind/question kind/test-flake priority/P1

Comments

@danwinship
Contributor

The networking extended tests are currently disabled due to some tests failing with errors like:

Error from server: x509: certificate is valid for nettest-node-2, 172.17.0.4, not nettest-node-1

This appears to be due to something in the master incorrectly caching the mapping from pod names to their nodes: if a test launches a pod, "oc exec"s to it, destroys the pod, and recreates it, and the recreated pod lands on a different node, then a subsequent "oc exec" to it fails; the master connects to the correct node but expects it to present the old node's TLS certificate.

Version

git master

Steps To Reproduce
danw@w541:origin (master)> ./hack/dind-cluster.sh start
...
danw@w541:origin (master)> . dind-openshift.rc
danw@w541:origin (master)> oc create -f examples/hello-openshift/hello-pod.json
pod "hello-openshift" created
danw@w541:origin (master)> oc get pod hello-openshift -o yaml | grep nodeName
  nodeName: openshift-node-2
danw@w541:origin (master)> oc exec hello-openshift foo
exec: "foo": executable file not found in $PATH
error: error stream protocol error: invalid exit code value "-1"

(the hello-openshift image doesn't have any runnable commands in it, so that error is expected)

danw@w541:origin (master)> oc delete pod hello-openshift
pod "hello-openshift" deleted
danw@w541:origin (master)> oc create -f examples/hello-openshift/hello-pod.json 
pod "hello-openshift" created
danw@w541:origin (master)> oc get pod hello-openshift -o yaml | grep nodeName
  nodeName: openshift-node-1

(different node from before; if it reuses the same node as it did the first time, delete and recreate the pod until it ends up on the other node)

danw@w541:origin (master)> oc exec hello-openshift foo
Error from server: x509: certificate is valid for openshift-node-1, 172.17.0.3, not openshift-node-2

(It connected to the correct node, but expected it to have the old cert.)
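For reference, that error string is Go's standard x509 hostname-verification failure. A minimal sketch (names taken from the output above; purely illustrative, not part of the repro) of a certificate that is valid for openshift-node-1 failing verification against the stale name openshift-node-2:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net"
	"time"
)

func main() {
	// Throwaway self-signed cert shaped like node-1's serving cert.
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: "openshift-node-1"},
		DNSNames:     []string{"openshift-node-1"},
		IPAddresses:  []net.IP{net.ParseIP("172.17.0.3")},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	cert, _ := x509.ParseCertificate(der)

	// Checking against the node that was actually dialed succeeds...
	fmt.Println(cert.VerifyHostname("openshift-node-1"))
	// ...but checking against a stale name fails with the same error shape:
	// x509: certificate is valid for openshift-node-1, not openshift-node-2
	fmt.Println(cert.VerifyHostname("openshift-node-2"))
}
```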

Blocks #10972 (re-enabling the networking extended tests).

@danwinship danwinship added the kind/test-flake label Sep 20, 2016
@liggitt
Contributor

liggitt commented Sep 20, 2016

There's no cache expectation… that code runs during the connection handshake, which is uncached by definition. The cert the node is using to serve doesn't have the right hosts in it. Is it possible the wrong certs are being distributed to the second node?

@danwinship
Contributor Author

The node has the right cert, and the master is connecting to the right node (verified with strace and tcpdump). The master is just expecting the node to have the wrong cert.

(Yeah, I know there's no expectation of caching. But... it's happening somewhere...)

@danwinship
Contributor Author

Also, if you restart the openshift-master after doing the steps above and then try the "oc exec" again, it will work.

@liggitt
Contributor

liggitt commented Sep 20, 2016

It connected to the correct node, but expected it to have the old cert

The only expectation is that the cert contain the hostname that was connected to. As long as the master is trying to connect to the right node, the issue is elsewhere (with the serving certs given to the node, the /etc/hosts config mapping node names to IPs, or something else… not sure what).

@danwinship
Contributor Author

danwinship commented Sep 20, 2016

danw@w541:origin (master)> oc exec hello-openshift foo
Error from server: x509: certificate is valid for openshift-node-1, 172.17.0.3, not openshift-node-2

So just to clarify what happens here: oc connects to the master; the master opens a connection to 172.17.0.3 (the correct IP address for openshift-node-1, the node the pod is running on); openshift-node-1 sends a correct copy of its certificate; and the master then closes the connection, complaining that it didn't get a TLS certificate for openshift-node-2.

@liggitt
Contributor

liggitt commented Sep 20, 2016

that error means the master is opening a connection to openshift-node-2. it's not clear from the info in the issue whether openshift-node-2 is resolving to the IP of openshift-node-1, or if openshift-node-2 is serving using node-1's certs, or if something else is going on. Can you check the serial numbers of the serving certs held by each node?
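One way to compare them, as a rough sketch (the cert path is a placeholder; point it at each node's serving certificate):

```go
package main

import (
	"crypto/x509"
	"encoding/pem"
	"fmt"
	"log"
	"os"
)

func main() {
	// Placeholder path: substitute wherever each node keeps its serving cert.
	data, err := os.ReadFile("server.crt")
	if err != nil {
		log.Fatal(err)
	}
	block, _ := pem.Decode(data)
	if block == nil {
		log.Fatal("no PEM block found")
	}
	cert, err := x509.ParseCertificate(block.Bytes)
	if err != nil {
		log.Fatal(err)
	}
	// Serial number plus the names/IPs the cert is actually valid for.
	fmt.Println("serial:  ", cert.SerialNumber)
	fmt.Println("subject: ", cert.Subject.CommonName)
	fmt.Println("dnsnames:", cert.DNSNames)
	fmt.Println("ips:     ", cert.IPAddresses)
}
```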

@danwinship
Contributor Author

The TCP connection that gets made is to openshift-node-1:

18:32:01.321722 IP 172.17.0.2.58758 > 172.17.0.3.10250: Flags [S], seq 374301894, win 29200, options [mss 1460,sackOK,TS val 1167589998 ecr 0,nop,wscale 7], length 0

although it looks like it's passing "openshift-node-2" as the TLS Server Name Indication:

    0x0080:  004f 0000 0015 0013 0000 106f 7065 6e73  .O.........opens
    0x0090:  6869 6674 2d6e 6f64 652d 3200 0500 0501  hift-node-2.....

so it's connecting to openshift-node-1 but it thinks it's connecting to openshift-node-2.

/etc/hosts and "oc get nodes" both have the right IPs for both nodes.
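That dump is consistent with how Go TLS clients behave: the SNI in the ClientHello comes from the client's tls.Config.ServerName, not from the address being dialed, and the same name is what the presented certificate is verified against. A self-contained sketch (a local listener stands in for node-1's kubelet; names are illustrative):

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned builds a throwaway serving cert for the given DNS name and a
// pool that trusts it.
func selfSigned(name string) (tls.Certificate, *x509.CertPool) {
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: name},
		DNSNames:     []string{name},
		NotBefore:    time.Now(),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, _ := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	parsed, _ := x509.ParseCertificate(der)
	pool := x509.NewCertPool()
	pool.AddCert(parsed)
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, pool
}

func main() {
	// Stand-in for openshift-node-1's kubelet: serve node-1's cert and log
	// whatever SNI the client sends.
	cert, pool := selfSigned("openshift-node-1")
	srv := &tls.Config{
		Certificates: []tls.Certificate{cert},
		GetConfigForClient: func(hello *tls.ClientHelloInfo) (*tls.Config, error) {
			fmt.Println("server saw SNI:", hello.ServerName)
			return nil, nil
		},
	}
	ln, _ := tls.Listen("tcp", "127.0.0.1:0", srv)
	go func() {
		for {
			conn, err := ln.Accept()
			if err != nil {
				return
			}
			conn.(*tls.Conn).Handshake()
			conn.Close()
		}
	}()

	// The client dials node-1's address, but its config still says
	// ServerName: "openshift-node-2". That name goes out as the SNI and is
	// what node-1's certificate gets verified against, so the handshake
	// fails even though node-1 served its own, correct certificate.
	_, err := tls.Dial("tcp", ln.Addr().String(), &tls.Config{
		ServerName: "openshift-node-2",
		RootCAs:    pool,
	})
	fmt.Println("client error:", err)
}
```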

@liggitt
Contributor

liggitt commented Sep 20, 2016

Verified that ExecREST.Connect returns the correct location for the new node the pod is on

@liggitt
Contributor

liggitt commented Sep 20, 2016

The tlsConfig held by the master kubelet client is being mutated in pkg/util/proxy/dial.go (https://github.com/kubernetes/kubernetes/blob/master/pkg/util/proxy/dial.go#L72)

That means the first exec to any node locks the server name to that node, and all subsequent execs to any node will fail x509 validation.

Opened kubernetes/kubernetes#33140
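A sketch of the shape of the bug and of a copy-before-mutate fix (illustrative only, not the actual upstream diff; tls.Config.Clone is used for brevity):

```go
package main

import (
	"crypto/tls"
	"net"
)

// Buggy shape: the dialer writes the per-request host into the long-lived
// tls.Config shared by the master's kubelet client. The first exec "locks"
// ServerName to that node, so later execs to other nodes fail x509
// hostname verification.
func dialBuggy(shared *tls.Config, addr string) (net.Conn, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	shared.ServerName = host // mutation of shared state: the bug
	return tls.Dial("tcp", addr, shared)
}

// Fixed shape: copy the config first, then set the per-connection ServerName.
func dialFixed(shared *tls.Config, addr string) (net.Conn, error) {
	host, _, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	cfg := shared.Clone()
	cfg.ServerName = host
	return tls.Dial("tcp", addr, cfg)
}

func main() {}
```

With the copy, each exec dials with a ServerName matching the node it is actually connecting to, so the first exec no longer pins the name for every later one.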

@liggitt
Contributor

liggitt commented Sep 21, 2016

fixed in #11027 and kubernetes/kubernetes#33141

@danwinship
Contributor Author

Fixed by #11027.
