Are s390x/ppc jobs still valuable? #344
Comments
+1 to removing s390x/ppc64le jobs |
Sorta related: I'll bring this up with the TOC, to even consider dropping s390x/ppc support in our releases. I don't think our releases work on those architectures anyway, given that the kourier/istio envoy images don't support them (https://explore.ggcr.dev/?image=envoyproxy%2Fenvoy%3Av1.29.0). I'm sourcing some data from the mailing lists |
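The platform check mentioned above can be sketched in a few lines. This is a minimal illustration, assuming the image is published as an OCI image index whose entries carry a `platform` object with `os` and `architecture` fields. The sample index, its placeholder digests, and the helper names `supported_platforms` and `missing_platforms` are hypothetical and are not taken from the real envoyproxy/envoy image.

```python
import json

# Hypothetical, trimmed OCI image index used only for illustration.
# The digests are placeholders, not real image digests.
SAMPLE_INDEX = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"digest": "sha256:placeholder-amd64",
     "platform": {"os": "linux", "architecture": "amd64"}},
    {"digest": "sha256:placeholder-arm64",
     "platform": {"os": "linux", "architecture": "arm64"}}
  ]
}
""")


def supported_platforms(index: dict) -> set:
    """Return the set of "os/architecture" strings listed in an image index."""
    platforms = set()
    for manifest in index.get("manifests", []):
        platform = manifest.get("platform") or {}
        os_name = platform.get("os")
        arch = platform.get("architecture")
        # Skip entries without a usable platform (for example attestations).
        if os_name and arch and os_name != "unknown":
            platforms.add(f"{os_name}/{arch}")
    return platforms


def missing_platforms(index: dict, wanted: list) -> list:
    """Return the wanted platforms that the index does not provide, sorted."""
    return sorted(set(wanted) - supported_platforms(index))


if __name__ == "__main__":
    wanted = ["linux/amd64", "linux/s390x", "linux/ppc64le"]
    print(missing_platforms(SAMPLE_INDEX, wanted))
```

With an index like the sample, the s390x and ppc64le entries would be reported as missing, which is the situation described in the comment.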
As part of enabling OpenShift Serverless for the s390x and ppc64le architectures, we are actively working on the Knative upstream releases to keep them updated. The members actively working on this are @dilipgb (for s390x) and @valen-mascarenhas14 (for ppc64le). With respect to the Istio Envoy images, we use the maistra/envoy packages (midstream of the Istio Envoy packages) for testing Knative functionality. There is active work on maintaining maistra/envoy for the s390x and ppc64le architectures. |
@rishikakedia Were you able to upstream the changes required to support the s390x/ppc64le architectures to Envoy and Istio? I believe IBM/RH are key maintainers of Istio (not sure about Envoy) |
@upodroid we pick the maistra/envoy images that are needed for Knative upstream and patch the code through our CI scripts before we run the tests (see https://github.com/knative/infra/blob/main/prow/jobs_config/knative/serving.yaml#L186). Some tests are failing intermittently for Contour and Kourier, and we are trying to debug those issues. For example, on s390x the Kourier job passed today but failed yesterday; similarly, the Contour run succeeded on Monday. We need some more time to fix these issues. Also, when the cron schedules for latest and main conflict (when a release happens), we see a lot of failures in our CI because the jobs compete for the same resources to run tests. We will adjust the cron schedules to fix this. |
@upodroid |
I agree with @upodroid. This effort would be better spent on Istio/Envoy directly, by adding proper support for the P/Z architectures there. Doing it at the Knative level is always going to be chasing a moving target... |
So, discussions have recently started about having the P/Z teams enable upstream CI to publish images. |
It seems this OpenShift CI (https://github.com/openshift/release/tree/master/ci-operator/step-registry/servicemesh) is being used to run e2e tests. |
We are enabling istio/envoy under the hood of maistra/envoy for the s390x and ppc64le architectures. There is a roadmap discussion about enabling an OpenSSL-based Envoy for these architectures so that they are compatible with upstream. |
I agree with @upodroid. From my perspective, none of our releases work on ppc/s390x without these patches. So I don't really see the utility of these jobs being in our CI from an OSS perspective. There's no benefit to end-users of Knative who consume the releases we produce.
Would it make more sense to add these tests to the RH/IBM midstream repos rather than here? |
Here is the associated PR for enabling envoyproxy/envoy to be openssl based for s390x: envoyproxy/envoy-openssl#128 |
Fyi, what you need to do is get s390x/ppc64le binaries added to https://github.com/envoyproxy/envoy/releases/tag/v1.29.0 |
FYI @upodroid I'd love to, but Google dropped us from their CI platform, so we can't get boring-ssl support back -- hence @rishikakedia's mention of the openssl roadmap. (She's on the s390x side of the IBM house. I'm on the ppc64le side.) For reference: envoyproxy/envoy#28363 Given @valen-mascarenhas14's comment -- these issues seem to be getting worked on the ppc64le side. Does that mean there are no issues on Power? I'm trying both to understand the situation and to line up all the folks with who they are and who they're referring to when they say "we." :D |
I read the envoy PR and the solution is to fix it properly in BoringSSL. It seems patches do exist but you need to upstream them and give maintainer/vendor X real IBM hardware to test against those architectures. https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6435 All of this stuff needs to be upstreamed |
Fyi, I don't have anything against the s390x/ppc64le platforms, but I have to repeat these important best practices (which might already be followed but are not visible to me or the public). I wonder how much hassle we'll go through with RISC-V when it becomes a thing in the future. |
@upodroid: we did an internal assessment and we believe that https://github.com/envoyproxy/envoy-openssl should be enabled for the s390x and ppc64le architectures by the first half of 2024. So I suggest we discuss this issue of the Knative upstream CI once that is available? |
An idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official Envoy ports for those architectures. We can set a date, let's say 2024-08-01, and if there is no P/Z port for Envoy by then, we can remove the jobs completely. |
@upodroid FYI: we use prow to trigger the jobs, but the infra for testing is provided by the P/Z teams, who provision capacity on IBM Cloud. |
Yeah this sounds good |
@upodroid -- Google is the maintainer of boringssl, and they removed support for power explicitly (see google/boringssl@7d2338d). I asked one of the maintainers about adding our hardware back. It's not just a matter of upstreaming, or giving them hardware. It's complicated, but they let people know not to rely on it in their README:
So we're stuck between a rock and a hard place here because third parties are depending on it. All that said, thanks to everyone for the consideration and flexibility. |
If I'm understanding correctly, what prompted this issue is that it wasn't clear if these tests were being maintained. To my understanding they are being maintained, failures/flakes are being fixed, etc. although that maintenance may not have been communicated especially well. So given that, I don't think we need to drop them as long as they're being actively maintained. |
Not really. For example, I just checked and the s390x eventing tests aren't actually being run, but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script. |
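One way to guard against a green run in which no tests executed is to check the test report as well as the exit code. The sketch below is an assumption-laden illustration, not the project's actual tooling: it assumes the job writes a JUnit-style XML report with `testcase` elements, and the function names `count_testcases` and `effective_exit_code` are hypothetical.

```python
import xml.etree.ElementTree as ET


def count_testcases(junit_xml: str) -> int:
    """Count <testcase> elements anywhere in a JUnit-style XML report."""
    root = ET.fromstring(junit_xml)
    return sum(1 for _ in root.iter("testcase"))


def effective_exit_code(script_exit_code: int, junit_xml: str) -> int:
    """Combine the script's exit code with a check that tests really ran.

    A non-zero script exit code is passed through unchanged. A zero exit
    code is only trusted if the report contains at least one test case.
    """
    if script_exit_code != 0:
        return script_exit_code
    if count_testcases(junit_xml) == 0:
        # The script claimed success but nothing ran: treat it as a failure.
        return 1
    return 0


if __name__ == "__main__":
    empty_report = "<testsuites></testsuites>"
    print(effective_exit_code(0, empty_report))
```

A check like this does not explain why the exit value was overridden, but it would stop such a run from being reported as passing.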
@dprotaso I can see all the tests are running & passing for ppc64le eventing tests (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#ppc64le-e2e-tests&width=20) |
@dprotaso we are actively debugging the eventing job failures on s390x. When we pass two flags (--platfom=linux/s390x --insecure_registry) in KO_FLAGS, the platform is not recognised (I hope you can recreate the issue and check on your end if needed). If we pass only --platfom=linux/s390x, the tests run fine for all branches. We have a similar setup in release-1.11, where we pass both flags in KO_FLAGS and it works fine. Hence it is taking time to troubleshoot and understand why the issue happens in releases later than 1.11. Since we have a self-signed certificate on our registry, we need the flag to be present. As @psschwei rightly pointed out, there is certainly a communication gap, and we will address it in future. |
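One thing that may be worth ruling out when a multi-flag variable misbehaves is whether the value reaches the tool as one argument or as several. The sketch below only illustrates that general difference and is not a diagnosis of the failure described above. The command name `build-tool` and both helper functions are hypothetical, and the flag string is copied verbatim from the comment, spelling included.

```python
import shlex

# Flag string copied verbatim from the comment above (spelling unchanged).
KO_FLAGS = "--platfom=linux/s390x --insecure_registry"


def argv_unsplit(flags: str) -> list:
    """Build an argv where the whole flag string is passed as ONE argument.

    This mirrors what happens when a shell variable is expanded inside
    double quotes. "build-tool" is a hypothetical command name.
    """
    return ["build-tool", flags]


def argv_split(flags: str) -> list:
    """Build an argv where the flag string is split into separate arguments."""
    return ["build-tool", *shlex.split(flags)]


if __name__ == "__main__":
    # One combined argument: a parser would see a platform value that
    # contains a space and the second flag's text.
    print(argv_unsplit(KO_FLAGS))
    # Two separate arguments: each flag is parsed on its own.
    print(argv_split(KO_FLAGS))
```

Comparing the argument lists between release-1.11 and later branches in this way could show whether the quoting of the variable changed.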
My understanding of the current status is:
Did I miss anything? A reasonable path forward here is to keep testing on the two additional HW architectures and allow the teams to close these gaps, so that the community can use Knative as is on the additional architectures. We can reevaluate this in six months' time to see the progress made. |
This issue is stale because it has been open for 90 days with no |
Do we still need to monitor this? |
/remove-lifecycle stale |
Yes we should monitor this. I haven't really seen IBM follow through on their commitments from that TOC meeting back in Feb.
I figured it's worth providing you the full 6 months as was discussed. There's still a month left or so. |
+1 for revisiting in Aug as discussed |
Hi all, there are ongoing efforts to enable envoy-openssl on the IBM Z and IBM P platforms. You can see that most of the work needed for the boringssl and openssl compat library is completed. A few more packages still need work, and after that we will have envoy-openssl available on our platforms as well. Here are some PRs for this. |
We also have efforts under way to publish OpenSSL-based Envoy images for IBM P/Z alongside the x86 images. envoyproxy/envoy-openssl#221 |
installation document for IBM s390x/ IBM Power |
Following up here - I'm inclined to remove the jobs. The main reason is that the jobs were failing for over two weeks and it was only noticed a day ago. Clearly it's not a priority, and I don't think the Knative OSS project should be running these tests on behalf of a vendor. |
@dprotaso there were intermittent infra issues in IBM Cloud. Because of the noise those issues created, we noticed late that CI kept failing even after the infra was fixed. We are actively monitoring the jobs and continue to focus on keeping CI healthy. At the moment there are users running Knative on Kubernetes on s390x, and removing the support would impact those users too. One of the asks from these users was for documentation, and we have updated the docs as well. We are actively working to get envoy-openssl support for P/Z, and we are almost at the point where we just have to publish images. |
We're not removing support - we're removing the CI jobs |
@dprotaso I don't really understand how the binaries/images will be delivered in future if CI is removed. Are we going to deliver binaries without CI? Could you propose a call, so we can have a better discussion and understand each other's point of view? What do you think? |
Please don't remove the CI jobs. We have discussed this in the past and we can discuss it again. However, please do not remove the CI jobs without involving the relevant teams and reaching an agreement. |
Our release jobs produce the images and binaries. The s390x/ppc jobs this issue is about are not those jobs. |
We've already discussed this and set some expectations, which were documented in this comment: #344 (comment)
Going over the expectations:
1. The ask for respective maintainers/teams to stabilize the runs. Status: as of this morning the tests are still unstable.
2. Regularly monitor the results and proactively fix or raise issues with Knative bits or infra. Status: tests were broken for 2+ weeks before the issue was surfaced.
3. Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks. Status: didn't see further engagement in the Productivity working group unrelated to these jobs.
4. In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z. Goal is to have community releases of Knative usable on respective architectures with 3rd party dependencies available as well. Status: the goal hasn't been met. Patching the Knative installation files with a Red Hat image that requires a Red Hat account doesn't really meet the bar here.
Again, we are not dropping support for s390x/ppc - we're dropping these random CI jobs |
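The "regularly monitor the results" expectation could be backed by a simple staleness check on job history. The following is a minimal sketch under stated assumptions: results are available as a list of (date, passed) pairs sorted oldest first, the three-day threshold is arbitrary, and the names `failing_since` and `is_stale_failure` and the sample dates are hypothetical.

```python
from datetime import date


def failing_since(results):
    """Return the date the current failing streak started, or None.

    `results` is a list of (day, passed) tuples sorted oldest first.
    Only the trailing run of failures counts; any pass resets the streak.
    """
    start = None
    for day, passed in results:
        if passed:
            start = None
        elif start is None:
            start = day
    return start


def is_stale_failure(results, today, max_days=3):
    """True if the job has been failing continuously for max_days or more."""
    start = failing_since(results)
    return start is not None and (today - start).days >= max_days


if __name__ == "__main__":
    # Hypothetical history: one pass followed by continuous failures.
    history = [
        (date(2024, 7, 1), True),
        (date(2024, 7, 2), False),
        (date(2024, 7, 3), False),
    ]
    print(failing_since(history), is_stale_failure(history, date(2024, 7, 16)))
```

Run on a schedule, a check like this would flag a job long before a two-week failing streak goes unnoticed.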
|
I concur with @dprotaso's assessment that s390x/ppc jobs should be removed.
https://testgrid.k8s.io/r/knative-own-testgrid/client#ppc64le-e2e-tests broken for 2 months till recently |
@upodroid As previously mentioned by Dilip in the comments above, the failures were primarily due to the changes in Chainguard and infra-related problems. That was also why it took us some time to figure out the cause of these failures. These issues were communicated and discussed in the Slack group. Although the jobs experienced intermittent flakiness, we diligently debugged and resolved these issues as they arose. It's important to clarify that we were actively working on these issues and did not wait 2 months, until the removal of the jobs was suggested, to address them. Our efforts ensured that any disruptions were minimized, and fixes were implemented in a timely manner. |
I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.
I thought this was for the benefit of the IBM CodeEngine folks. Confirming with @psschwei, CodeEngine doesn't use these architectures (anymore?).
The other bit is that we don't have anyone really looking at the tests and fixing them:
https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests
Furthermore - it's not clear if users can even run Knative on s390x with OSS - eg. kourier & istio envoy images are only arm and amd64.
I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.