k8s: Make node-name obtention more robust #2122

Closed
wants to merge 1 commit into from

Contributor

@2opremio 2opremio commented Jan 11, 2017

  • Align with Kubernetes on how the system UUID is obtained
  • Base the node-name lookup on the hostname instead of the system UUID (which can be hard to find and may not be unique per machine)
  • Fall back to reporting all pods (printing a warning) if the node name cannot be obtained

Fixes #2049

@2opremio
Contributor Author

@errordeveloper Mind testing if the fix works in katacoda?
@paulbellamy Mind reviewing?

@2opremio 2opremio force-pushed the 2049-improve-k8s-uuid-obtention branch from 396f276 to ffe7381 Compare January 11, 2017 18:29
@rade
Member

rade commented Jan 11, 2017

We know from weaveworks/weave#2427 that the product_uuid, which is the first thing GetSystemUUID() returns if present, is the same across many machines on some popular hosting providers. How do k8s and scope cope with that?

@2opremio
Contributor Author

2opremio commented Jan 11, 2017

> We know from weaveworks/weave#2427 that the product_uuid, which is the first thing GetSystemUUID() returns if present, is the same across many machines on some popular hosting providers.

Great point

> How do k8s and scope cope with that?

TL;DR: In the case of Scope, badly.

We use the uuid to obtain the current node name and, in turn, to filter out pods not scheduled on the current machine.

If the uuids are not unique, the node name we obtain will likely be wrong and/or duplicated, which may cause us to miss pods.

e.g.:

probe1 runs in a machine with uuid "foo" and node-name "A" 
probe2 runs in a machine with uuid "foo" and node-name "B"

When obtaining the node-name both probe1 and probe2 obtain node-name "A" (because of the searching order)

Result: both probe1 and probe2 report the pods scheduled in node "A" and the pods from node "B" are never reported.
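
The failure mode in this example can be reproduced with a small sketch. It assumes the lookup walks the node list in order and returns the first node whose system UUID matches; the `find_node_name_by_uuid` helper and the data layout are hypothetical and are not taken from Scope's code.

```python
from typing import Optional

# Hypothetical node list: both machines report the same system UUID.
NODES = [
    {"name": "A", "system_uuid": "foo"},
    {"name": "B", "system_uuid": "foo"},
]


def find_node_name_by_uuid(nodes: list, system_uuid: str) -> Optional[str]:
    """Return the name of the first node whose UUID matches (first match wins)."""
    for node in nodes:
        if node["system_uuid"] == system_uuid:
            return node["name"]
    return None


# probe1 runs on node "A" and probe2 on node "B", but both see UUID "foo".
probe1_node = find_node_name_by_uuid(NODES, "foo")
probe2_node = find_node_name_by_uuid(NODES, "foo")
print(probe1_node, probe2_node)  # both resolve to "A"; node "B" is never selected
```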

I need to think about how to solve this. In the worst-case scenario we can report all pods from all probes (which is what we do on ECS) if the performance impact is not too high.

@2opremio
Contributor Author

It seems the nodes have a kubernetes.io/hostname label, which I can match against the hostname and obtain the node name from there.

http://stackoverflow.com/a/35022237/1914440

It seems it was called hostname in the past, so I will have to query that too.
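
A minimal sketch of that label lookup, assuming node objects shaped like the Kubernetes API JSON (`metadata.name`, `metadata.labels`). The two label keys are the ones named in this comment; the helper name and the sample data are hypothetical.

```python
from typing import Optional

# Label keys mentioned in the comment above: the current one and the older one.
HOSTNAME_LABEL_KEYS = ("kubernetes.io/hostname", "hostname")


def node_name_for_hostname(nodes: list, hostname: str) -> Optional[str]:
    """Return the name of the first node labelled with this hostname, else None."""
    for node in nodes:
        metadata = node.get("metadata", {})
        labels = metadata.get("labels") or {}
        if any(labels.get(key) == hostname for key in HOSTNAME_LABEL_KEYS):
            return metadata.get("name")
    return None


# Hypothetical node objects, trimmed to the fields used here.
nodes = [
    {"metadata": {"name": "node-a", "labels": {"kubernetes.io/hostname": "web1"}}},
    {"metadata": {"name": "node-b", "labels": {"hostname": "web2"}}},
]
print(node_name_for_hostname(nodes, "web2"))  # node-b
print(node_name_for_hostname(nodes, "db1"))   # None
```

On a real machine the hostname argument would come from something like `socket.gethostname()`. A `None` result means the node name could not be determined from the labels.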

* Base the node-name lookup on the hostname instead of the system UUID (which can be hard to find and may not be unique per machine).
* Fall back to reporting all pods (printing a warning) if the node name cannot be obtained.
@2opremio 2opremio force-pushed the 2049-improve-k8s-uuid-obtention branch from ffe7381 to de224e5 Compare January 13, 2017 15:54
@2opremio
Contributor Author

Done. @paulbellamy PTAL

@paulbellamy
Contributor

paulbellamy commented Jan 16, 2017

I don't have any concrete concerns. A hostname just seems less unique than a uuid, but apparently some hosting services scupper that anyway. For example, vagrant, web1, web2 and db1 are fairly non-unique hostnames.

As a side note, I thought checkpoint wrote some random data into a file on first launch to uniquely identify a host, but maybe that was a previous version.

But checkpoint has to uniquely identify a host across the whole planet, whereas Scope only needs the name to be unique within Scope's view, so I don't think this is an issue.

@2opremio
Contributor Author

2opremio commented Jan 16, 2017

After @paulbellamy's comment I am considering reverting de224e5.

@rade thoughts?

@rade
Member

rade commented Jan 16, 2017

I don't understand what problem this is actually trying to solve. Obtaining a unique id for every node in a cluster? That is precisely the problem we had to tackle in Weave Net.

@rade
Member

rade commented Jan 16, 2017

It seems there is more to it than just unique IDs though. You write

> We use the uuid to obtain the current node name and, in turn, to filter out pods not scheduled in the current machine.

which suggests that the ID information is somehow correlated with info provided by k8s.

@2opremio
Contributor Author

2opremio commented Jan 16, 2017

> Obtaining a unique id for every node in a cluster?

No, we are trying to identify the pods scheduled on the current node (to avoid reporting all pods from all nodes). For that we are using the pods' nodename.

The problem is obtaining the nodename. For that, we are obtaining the metadata from all the k8s nodes and narrowing it down by checking against known metadata unique to the current node. So far we have tried the system uuid and the hostname but they don't seem to really be unique.
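
Once the node name is known, the filtering step itself is simple. The sketch below assumes pod objects shaped like the Kubernetes API JSON, where the scheduled node is stored in `spec.nodeName`; the helper name and the sample pods are hypothetical.

```python
def pods_on_node(pods: list, node_name: str) -> list:
    """Keep only the pods whose spec.nodeName equals node_name."""
    return [p for p in pods if p.get("spec", {}).get("nodeName") == node_name]


# Hypothetical pod objects, trimmed to the fields used here.
pods = [
    {"metadata": {"name": "web-1"}, "spec": {"nodeName": "A"}},
    {"metadata": {"name": "web-2"}, "spec": {"nodeName": "B"}},
    {"metadata": {"name": "pending"}, "spec": {}},  # not scheduled yet
]
local = pods_on_node(pods, "A")
print([p["metadata"]["name"] for p in local])  # ['web-1']
```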

@rade
Member

rade commented Jan 16, 2017

The question we are trying to answer here is "what pods are running on this node?" Correct? So the challenge is to construct some form of id from information we can obtain from the host itself, which can in turn be matched to an id obtained from the metadata associated with a pod. Yes?

@2opremio
Contributor Author

Yes, see above.

@rade
Member

rade commented Jan 16, 2017

How does k8s come up with the nodename? Surely we should just do the same?

@2opremio
Contributor Author

> Surely we should just do the same?

Let me try to figure this out.

@2opremio
Contributor Author

@rade
Member

rade commented Jan 16, 2017

Could we ask the kubelet on the node?

@2opremio
Contributor Author

Here's an idea. Instead of assuming the uuids are unique, we verify it in the node list. If they are, we use the uuid. If not, we report all the pods.
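
This idea could look roughly like the sketch below. It assumes node objects shaped like the Kubernetes API JSON, with the UUID under `status.nodeInfo.systemUUID`; the function name is hypothetical, and a `None` result is meant to signal "report all pods".

```python
from collections import Counter
from typing import Optional


def node_name_if_uuid_unique(nodes: list, local_uuid: str) -> Optional[str]:
    """Return the local node name only if no UUID is shared between nodes.

    Returns None when any UUID is duplicated or the local UUID is not found,
    which the caller should treat as "report all pods".
    """
    counts = Counter(n["status"]["nodeInfo"]["systemUUID"] for n in nodes)
    if any(count > 1 for count in counts.values()):
        return None
    for node in nodes:
        if node["status"]["nodeInfo"]["systemUUID"] == local_uuid:
            return node["metadata"]["name"]
    return None


def make_node(name: str, uuid: str) -> dict:
    """Build a hypothetical node object with only the fields used here."""
    return {"metadata": {"name": name}, "status": {"nodeInfo": {"systemUUID": uuid}}}


unique_nodes = [make_node("A", "uuid-1"), make_node("B", "uuid-2")]
cloned_nodes = [make_node("A", "foo"), make_node("B", "foo")]
print(node_name_if_uuid_unique(unique_nodes, "uuid-2"))  # B
print(node_name_if_uuid_unique(cloned_nodes, "foo"))     # None
```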

> Could we ask the kubelet on the node?

Unfortunately it doesn't provide the NodeName, see #2049 (comment)

@rade
Member

rade commented Jan 16, 2017

So what we are trying to do here cannot be done?

@2opremio
Contributor Author

But ... kubelet seems to expose http://localhost:10255/pods/, so we can obtain the pods from there.
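
A rough sketch of querying that endpoint, using only the Python standard library. It assumes the kubelet answers with a JSON pod list whose pods sit under an `items` key, and that the port is reachable without authentication; both are assumptions to verify against the cluster in question. The function names are hypothetical.

```python
import json
import urllib.request

KUBELET_PODS_URL = "http://localhost:10255/pods/"  # URL from the comment above


def parse_pod_names(body: bytes) -> list:
    """Extract "namespace/name" strings from a pod-list JSON document."""
    pod_list = json.loads(body)
    names = []
    for pod in pod_list.get("items") or []:
        metadata = pod["metadata"]
        names.append(metadata.get("namespace", "default") + "/" + metadata["name"])
    return names


def fetch_local_pod_names(url: str = KUBELET_PODS_URL, timeout: float = 5.0) -> list:
    """Ask the local kubelet for its pods and return their names."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_pod_names(response.read())
```

Because the kubelet only knows about the pods on its own node, this approach would not need the node name at all. Calling `fetch_local_pod_names()` requires a kubelet listening on that port; `parse_pod_names` can be exercised offline with a captured response.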

@2opremio
Contributor Author

Overridden by #2132

@2opremio 2opremio closed this Jan 16, 2017
@2opremio 2opremio deleted the 2049-improve-k8s-uuid-obtention branch January 16, 2017 18:52