[WIP] Shipping alternate architectures in Kubernetes release artifacts #5014
Signed-off-by: Davanum Srinivas <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dims

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
cc @estesp @bradtopol @cwsolee (s390x)
@mkumatag (ppc64le) you have been through the fire on this :) can you please help review?
In general a very helpful document. I would love some more links at the commented lines to make finding instructions easier. Good work!
# Step 1 : crawling (Build)
- bazel based build infrastructure should support this architecture
How does somebody completely unfamiliar with bazel find documentation on how to do this? I guess there is a doc on how kubernetes is built with bazel somewhere, but I can't find it in the org. A link here helps a lot.
I am looking for:
- Commands to run in order
- List of prerequisites the machine that runs the test needs installed
- Link to bazel's developer docs so I can start fixing build failures.
Documentation for building kubernetes exists in the kubernetes repo, independent of this document. It is in the build directory: https://github.com/kubernetes/kubernetes/tree/master/build#building-kubernetes
https://github.com/kubernetes/community/blob/9438c6f193721100f2ccf02cc9462794a0fb1be5/contributors/devel/sig-testing/bazel.md for bazel. I swear this used to be cross-referenced.
In general, developer docs are mostly migrated to github.com/kubernetes/community. There's a long-tail project to have these on the website ... not sure what happened there ...
These docs do fall out of date as well, though.
Though I don't think it needs to block this documentation, the first new architecture that goes through the process defined here is a great opportunity to document the steps required and how they can be accomplished with tangible examples (links to issues / PRs / etc.)
# Step 1 : crawling (Build)
- bazel based build infrastructure should support this architecture
- docker based build infrastructure should support this architecture
Same as for bazel. I assume this is called by the makefiles, but there are a lot of makefile targets which do different things. A link will help a lot by giving confidence that I found the correct doc.
This will ensure that community members can rely on these architectures on a consistent basis. This will give folks who are making changes a signal when they break things in a specific architecture.

This implies a set of folks who stand up and maintain both post-submit and periodic tests, watch them closely, and raise the flag when things break. They will also have to help debug and fix any architecture/platform-specific issues.
Will there be a way to sign up for this for a specific arch?
# TODO
- Initially this document is primarily focused on hardware architecture. We have interest from RISC-V and Arm64 in adding more formality around "support" for their hardware. Beyond amd64, the other (ie: s390x, ppc64le, and 32bit arm and amd) variations' health may not meet all aspects of this proposed policy, and any deltas need to be documented and reconciled.
- Operating systems: this is an added dimension to the "support" question. Similar to hardware architecture, beyond linux, other (ie: MacOS, Windows, Illumos, ...?) variations need to be considered in this or another similar document, and any existing Windows deltas documented and reconciled.
Lowercase i at the beginning of illumos is the official spelling.
If we're going to nit that, it's macOS. I'm not convinced that's important here, though.
Very nice document.
tbh, I am too new here to say anything relevant, so I commented mostly on the form.
Good luck on that effort.
Kubernetes by default has been amd64 oriented arch-wise. All our build and release systems originally supported just that. A while ago we started an [effort to support multiple architectures](https://github.com/kubernetes/kubernetes/issues/38067). As part of this effort, we added support in our build/release pipelines for arm, arm64, ppc64le and s390x.

The main focus was to have binaries, container images available for these architectures and for folks that are interested to be able able to take these artifacts and set up CI jobs to adequately test these platforms. Specifically to call out the ability to run conformance tests on these platforms.
nit: "to be able able" -> "to be able".
# Shipping alternate architectures in Kubernetes release artifacts
nit: Am I the only one to be bothered by not having CPU instruction sets written here? I have the impression that the "architecture" word might lead to a different meaning for different people. So being clear in the title would help, but maybe it's me...
Yeah, it's an internal term AFAIK. It expresses both operating system and CPU instruction set, as illumos/amd64 and linux/amd64 are also referred to as architectures, even though illumos is a different OS on the same CPU ISA.
Thanks for clarifying! Is that understood by the average audience? :)
We've been using the term "architecture" the same way on multiple projects that are being upstreamed for RISC-V and ARM. It's broadly understood to my knowledge.
Kubernetes is a Go project; in Go terminology, arch/architecture refers to e.g. amd64:
https://golang.org/pkg/runtime/#pkg-constants
https://golang.org/pkg/go/build/#Context
EDIT: So I think the audience should understand this, FWIW.
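To make that terminology concrete, here's a minimal sketch (purely illustrative, not part of the document under review) showing what Go itself exposes: the OS/arch pair a binary was compiled for.

```go
package main

import (
	"fmt"
	"go/build"
	"runtime"
)

func main() {
	// The "platform" a Go binary targets is the GOOS/GOARCH pair,
	// e.g. linux/amd64, linux/arm64, linux/ppc64le, linux/s390x.
	fmt.Printf("compiled for:  %s/%s\n", runtime.GOOS, runtime.GOARCH)

	// go/build's default context exposes the same pair to build tooling.
	fmt.Printf("build context: %s/%s\n", build.Default.GOOS, build.Default.GOARCH)
}
```

So "architecture" in this conversation usually means the GOARCH half of that pair, with the GOOS half (linux, windows, illumos, ...) being the separate OS dimension discussed in the TODO section.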
The main focus was to have binaries, container images available for these architectures and for folks that are interested to be able able to take these artifacts and set up CI jobs to adequately test these platforms. Specifically to call out the ability to run conformance tests on these platforms.

So in this document, let's explore with our sig-release/sig-testing hat on, what we are looking for when we talk about a new architecture
nit: punctuation at the end of the sentence?
This will ensure that community members can rely on these architectures on a consistent basis. This will give folks who are making changes a signal when they break things in a specific architecture.

This implies a set of folks who stand up and maintain both post-submit and periodic tests, watch them closely, and raise the flag when things break. They will also have to help debug and fix any architecture/platform-specific issues.
Assuming they have raised the flag, what will happen then? Will they fix things, revert things? I suppose this will be handled on a case-by-case basis. Should that be the case, I am imagining the possible tensions close to release time over whether to revert an important patch because it breaks a different architecture; how will that be accepted? How will the architecture team feel if they are ignored, assuming they never got the chance to enter step 3?
Specifically, we are talking about a set of CI jobs in the release-informing and release-blocking tabs of our testgrid. The Kubernetes release team has a "CI signal" team that relies on the status(es) of these jobs to either ship or hold a release. Essentially, if things are mostly red with occasional green, it would be prudent to not even bother making this architecture part of the release. CI jobs get added to release-informing first, and when these get to a point where they work really well, they get promoted to release-blocking.

The problem here is that once we start shipping something, folks will rely on it, whether we like it or not. So it becomes a trust issue on the team that is taking care of a platform/architecture. Do we really trust this team, not just for this release but on an ongoing basis? Do they show up consistently when things break? Do they proactively work with testing/release on ongoing efforts and try to apply them to their architectures? It's very easy to set up a CI job as a one-time thing, tick a box and advocate to get something added. It's a totally different ball game to be there consistently over time and show that you mean it. There has to be a consistent body of people working on this over time (life happens!).
Nice to write this, it's a very important point.
Many of these things are largely TODO for existing architectures; have we decided to give them a free pass?
@BenTheElder not really ... example kubernetes/kubernetes#93621
/assign @justaugustus @saschagrunert @hasheddan
/sig release
@justaugustus: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
# Step 2 : walking (Test)

It is not enough for builds to work as it gets bit-rotted quickly when we vendor in new changes, update versions of things we use etc. So we need a good set of tests that exercise a wide battery of tests in this new architecture
nitpick:
- It is not enough for builds to work as it gets bit-rotted quickly when we vendor in new changes, update versions of things we use etc. So we need a good set of tests that exercise a wide battery of tests in this new architecture
+ It is not enough for builds to work as it gets bit-rotted quickly when we vendor in new changes, update versions of things we use etc. So we need a good set of jobs that exercise a wide battery of tests in this new architecture
The above 2 implicitly means the following

- golang should support the architecture out-of-the-box
- All our dependencies whether vendored or run separately should support this architecture out-of-the-box
Do we have a plan to demonstrate / verify that this is the case? Typically I don't think this will be an issue with most dependencies due to go's ease of cross-platform builds, but if we enforce this as a strict requirement we should have some way to measure it.
We know this is not the case initially for new platforms, so it can't be a strict blocking requirement in "Step 1 : crawl (Build)". For example, there were problems with how Windows manages users and groups compared to Linux. "Runs on platform" doesn't necessarily mean "runs equivalently", and some of this won't come out until e2e's run or maybe conformance fails. Hopefully that's more of an issue for extending the platform by adding OS's than when adding hardware architectures under existing OS's. But some of this definitely shows up at compile/link time, in my experience porting various software across x86/arm/ppc/s390. In other words, "Step 1 : crawl (Build)" is iteratively improving to a repeatably buildable state, and that doesn't mean functional.
It's ok at the crawl level to observe in docs that the arch isn't explicitly precluded (better: is explicitly included) and then see whether at compile/link time an object is successfully generated. This won't ensure success, but... "Step 2 : walking (Test)" follows right behind.
> Do we have a plan to demonstrate / verify that this is the case?

building?

> Typically I don't think this will be an issue with most dependencies due to go's ease of cross-platform builds, [...]

You say that, but I recently had to fix something like this in containerd; it's easy to have use of a package like syscall (Go standard library, deprecated) with portability issues (in this case arm64 stopped compiling), and in a project with low-level OS integrations like Kubernetes it's pretty easy to run into this class of issue IMHO.

Checking that it compiles ought to go a long way. I think the only way to automate this further is to run tests.
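To make the "checking that it compiles" idea concrete, here's a minimal sketch, purely illustrative and not something this PR adds; the platform list is an assumption based on the architectures discussed above, and it would be run from a module root:

```go
// crosscheck is a hypothetical smoke test: it cross-compiles the current
// module for each target platform and reports which ones fail, which is
// enough to catch syscall-style portability breakage at compile/link time.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

func main() {
	platforms := []struct{ goos, goarch string }{
		{"linux", "amd64"},
		{"linux", "arm"},
		{"linux", "arm64"},
		{"linux", "ppc64le"},
		{"linux", "s390x"},
	}
	failed := false
	for _, p := range platforms {
		// "go build ./..." with GOOS/GOARCH set cross-compiles without
		// running anything; CGO_ENABLED=0 keeps the check toolchain-only.
		cmd := exec.Command("go", "build", "./...")
		cmd.Env = append(os.Environ(),
			"GOOS="+p.goos, "GOARCH="+p.goarch, "CGO_ENABLED=0")
		if out, err := cmd.CombinedOutput(); err != nil {
			failed = true
			fmt.Printf("FAIL %s/%s: %v\n%s", p.goos, p.goarch, err, out)
		} else {
			fmt.Printf("ok   %s/%s\n", p.goos, p.goarch)
		}
	}
	if failed {
		os.Exit(1)
	}
}
```

A failure on any GOOS/GOARCH pair here is exactly the class of issue mentioned above (e.g. the syscall/arm64 breakage in containerd) that would otherwise only surface later in CI.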
I generally like the idea and the overall approach. I think we have to separate between architecture, operating system, and toolchain. Maybe we can add a list of which combinations we would like to provide (and which not)?
closing in favor of #5300
Co-Authored-By: Tim Pepper <[email protected]>
Signed-off-by: Davanum Srinivas <[email protected]>
Which issue(s) this PR fixes:
Fixes kubernetes/kubernetes#93620