Are s390x/ppc jobs still valuable? #344

Closed
dprotaso opened this issue Feb 6, 2024 · 58 comments · Fixed by #497

@dprotaso
Member

dprotaso commented Feb 6, 2024

I believe originally the ppc/s390x jobs were added to test knative on different architectures with hardware supplied by IBM.

I thought this was for the benefit of the IBM CodeEngine folks. Confirming with @psschwei: CodeEngine doesn't use these architectures (anymore?).

The other issue is that we don't have anyone really looking at the tests and fixing them:
https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-contour-tests

Furthermore - it's not clear if users can even run Knative on s390x with OSS - eg. kourier & istio envoy images are only arm and amd64.
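
To make the architecture gap concrete, here is a minimal Python sketch that lists the platforms advertised by an OCI image index and reports which required ones are missing. The `sample_index` data is illustrative only (it mirrors the arm/amd64-only situation described above and is not a real registry response), and `available_platforms` and `missing_platforms` are hypothetical helper names.

```python
import json

# Illustrative OCI image index; NOT fetched from a real registry.
sample_index = json.loads("""
{
  "schemaVersion": 2,
  "manifests": [
    {"platform": {"os": "linux", "architecture": "amd64"}},
    {"platform": {"os": "linux", "architecture": "arm64"}}
  ]
}
""")


def available_platforms(index):
    """Return the set of "os/arch" strings listed in an image index."""
    return {
        f'{m["platform"]["os"]}/{m["platform"]["architecture"]}'
        for m in index.get("manifests", [])
        if "platform" in m
    }


def missing_platforms(index, required):
    """Return the required platforms the index does not provide, sorted."""
    return sorted(set(required) - available_platforms(index))


if __name__ == "__main__":
    wanted = ["linux/amd64", "linux/s390x", "linux/ppc64le"]
    print(missing_platforms(sample_index, wanted))
```

A check like this only says whether an image is published for a platform; it says nothing about whether the image actually works there.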

I'm thinking we should just drop testing these architectures, remove the prow jobs and inform IBM that we no longer need those prow clusters.

@upodroid
Member

upodroid commented Feb 6, 2024

+1 to removing s390x/ppc64le jobs

@dprotaso
Member Author

dprotaso commented Feb 6, 2024

Sorta related I'll bring this up with TOC - to even consider dropping s390x/ppc support in our releases.

I don't think our releases work on those arch's anyway - given kourier/istio envoy images don't support it (https://explore.ggcr.dev/?image=envoyproxy%2Fenvoy%3Av1.29.0)

I'm sourcing some data from the mailing lists
https://groups.google.com/g/knative-users/c/ORwp3KlFbds
https://groups.google.com/g/knative-dev/c/D-UkD3xPtFA

@rishikakedia

rishikakedia commented Feb 7, 2024

We, as part of enabling OpenShift Serverless for s390x and ppc64le architectures, are actively working on Knative upstream releases to keep them updated. The members actively working here are @dilipgb (for s390x) and @valen-mascarenhas14 (for ppc64le).

With respect to istio envoy images - we leverage maistra/envoy packages (midstream of istio envoy packages) for testing knative functionalities. There is active work happening on maintaining maistra/envoy for s390x and ppc64le architectures.

@upodroid
Member

upodroid commented Feb 7, 2024

@rishikakedia Were you able to upstream the changes required to support s390x/ppc64le architectures to Envoy and Istio?

I believe IBM/RH are key maintainers of Istio (not sure about Envoy).

@dilipgb
Contributor

dilipgb commented Feb 7, 2024

@upodroid we pick the maistra/envoy images that are needed for knative upstream and patch the code through our ci scripts before we run the tests (refer here: https://github.com/knative/infra/blob/main/prow/jobs_config/knative/serving.yaml#L186).

Some tests are failing arbitrarily for Contour and Kourier, and we are trying to debug those issues as well. It will take some more time for us to figure this out. For example, on s390x the Kourier job passed today but failed yesterday; similarly, Contour ran successfully on Monday. We need some more time to fix these issues.

Also, when the cron schedules for latest and main conflict (when a release happens), we see a lot of failures in our CI because the jobs compete for the same resources to run tests. We may adjust the cron schedules to fix this.
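
As a rough illustration of the schedule conflict described above, the following Python sketch expands the minute and hour fields of two cron expressions and reports start times they share. The schedules shown are hypothetical, not the actual Prow job schedules, and the parser handles only `*`, single values, lists, ranges, and `*/n` steps. It also ignores job duration, so two jobs can still overlap on a shared cluster even when their start times differ.

```python
def expand_field(field, lo, hi):
    """Expand one cron field ('*', 'a', 'a,b', 'a-b', '*/n') into a set of ints."""
    values = set()
    for part in field.split(","):
        step = 1
        if "/" in part:
            part, step_text = part.split("/")
            step = int(step_text)
        if part == "*":
            start, end = lo, hi
        elif "-" in part:
            start_text, end_text = part.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        values.update(range(start, end + 1, step))
    return values


def start_times(schedule):
    """Return the (hour, minute) pairs at which a schedule fires each day.

    Only the first two cron fields (minute, hour) are considered.
    """
    minute, hour = schedule.split()[:2]
    return {
        (h, m)
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    }


def colliding_starts(schedule_a, schedule_b):
    """Return the sorted start times shared by two schedules."""
    return sorted(start_times(schedule_a) & start_times(schedule_b))


if __name__ == "__main__":
    main_job = "0 */6 * * *"      # hypothetical schedule for the main-branch job
    release_job = "0 12 * * *"    # hypothetical schedule for the release-branch job
    print(colliding_starts(main_job, release_job))
    print(colliding_starts(main_job, "30 12 * * *"))  # staggered by 30 minutes
```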

@valen-mascarenhas14
Contributor

@upodroid
Recently, we've implemented significant changes to our testing infrastructure on the ppc64le side. This included migrating all Knative workloads to a different workspace within IBM Cloud. As a result, modifications were necessary, such as updating secrets, adjusting cronjob timings, and refining ppc64le-specific scripts. These changes led to a few failures during the transition period.
However we have successfully addressed these issues, and the system is now functioning smoothly.
Although we encountered some intermittent failures during the transition, we have diligently resolved them, ensuring that the platform is now performing as expected.

@cardil
Contributor

cardil commented Feb 7, 2024

I agree with @upodroid. This work would be better utilized if done on Istio/Envoy directly, by adding proper support for the P/Z architectures there.

Doing it on Knative level is always going to be chasing a moving target...

@rishikakedia

Discussions have recently started on having the P/Z teams enable upstream CI to publish images.

@ghatwala

ghatwala commented Feb 7, 2024

Seems like this openshift CI - https://github.com/openshift/release/tree/master/ci-operator/step-registry/servicemesh is being used to run e2e tests.

@rishikakedia

We are enabling istio/envoy under the hood of maistra/envoy for the s390x and ppc64le architectures. There is a roadmap discussion about enabling an openssl-based envoy for these architectures so that it is compatible with upstream.

@dprotaso
Member Author

dprotaso commented Feb 7, 2024

I agree with @upodroid

From my perspective none of our releases work on ppc/s390x without these patches. So I don't really see the utility of these jobs being in our CI from an OSS perspective. There's no benefit to end-users of Knative who consume the releases we produce.

We as part of enabling OpenShift Serverless for s390x and ppc64le architectures

Would it make more sense to add these tests to the RH/IBM midstream repos rather than here?

@rishikakedia

Here is the associated PR for enabling envoyproxy/envoy to be openssl based for s390x: envoyproxy/envoy-openssl#128

@upodroid
Member

upodroid commented Feb 7, 2024

Fyi, what you need to do is get s390x/ppc64le binaries added to https://github.com/envoyproxy/envoy/releases/tag/v1.29.0

@clnperez

clnperez commented Feb 7, 2024

FYI @upodroid I'd love to, but, Google dropped us from their CI platform, so we can't get boring-ssl support back -- hence @rishikakedia's mention of the openssl roadmap. (She's on the s390x side of the IBM house. I'm on the ppc64le side.)

For reference: envoyproxy/envoy#28363

Given @valen-mascarenhas14's comment -- these issues seem to be worked on the ppc64le side. Does that mean there are no issues on Power? I'm trying both to understand the situation and to match up all the folks with who they are and who they're referring to when they say "we." :D

@upodroid
Member

upodroid commented Feb 7, 2024

I read the envoy PR and the solution is to fix it properly in BoringSSL.

It seems patches do exist but you need to upstream them and give maintainer/vendor X real IBM hardware to test against those architectures.

https://gitlab.com/gitlab-org/omnibus-gitlab/-/issues/6435
https://github.com/linux-on-ibm-z/docs/wiki/Building-TensorFlow

All of this stuff needs to be upstreamed

@upodroid
Member

upodroid commented Feb 7, 2024

Fyi, I don't have anything against the s390x/ppc64le platforms but I have to repeat these important best practices (which might be done but not visible to me/public).

I wonder how much hassle we'll go through with RISC-V when it becomes a thing in the future.

@rishikakedia

@upodroid : we did an internal assessment and we believe that https://github.com/envoyproxy/envoy-openssl should be enabled for the s390x and ppc64le architectures by the first half of 2024. So I suggest we revisit this Knative upstream CI issue once that is available?

@rhuss
Contributor

rhuss commented Feb 8, 2024

One idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official envoy ports for those archs. We can set a date, let's say 2024-08-01, and if there is no P/Z port for envoy by then, we can remove the jobs completely.

@rishikakedia

@upodroid FYI: we use prow to trigger jobs but infra for testing is provided by P/Z teams by provisioning capacity on ibm cloud.

@dprotaso
Member Author

dprotaso commented Feb 8, 2024

One idea: instead of removing the jobs, we could disable them (and also not do any upstream release for those platforms) and reconsider enabling them when there are official envoy ports for those archs. We can set a date, let's say 2024-08-01, and if there is no P/Z port for envoy by then, we can remove the jobs completely.

Yeah this sounds good

@clnperez

clnperez commented Feb 8, 2024

@upodroid -- Google is the maintainer of boringssl, and they removed support for power explicitly (see google/boringssl@7d2338d). I asked one of the maintainers about adding our hardware back. It's not just a matter of upstreaming, or giving them hardware. It's complicated, but they let people know not to rely on it in their README:

BoringSSL is a fork of OpenSSL that is designed to meet Google's needs.

Although BoringSSL is an open source project, it is not intended for general use, as OpenSSL is. We don't recommend that third parties depend upon it. 

So we're stuck between a rock and a hard place here because third parties are depending on it.

All that said, thanks to everyone for the consideration and flexibility.

@psschwei
Contributor

If I'm understanding correctly, what prompted this issue is that it wasn't clear if these tests were being maintained. To my understanding they are being maintained, failures/flakes are being fixed, etc. although that maintenance may not have been communicated especially well. So given that, I don't think we need to drop them as long as they're being actively maintained.

@dprotaso
Member Author

dprotaso commented Feb 12, 2024

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.
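
For readers unfamiliar with how a job can show green while its tests fail, here is a minimal Python sketch of the general mechanism: a wrapper that runs the tests, then reports the status of a later step instead of the test status. This is only an illustration; the thread does not identify what actually overrides the exit value in the s390x scripts, and `run_masked` and `run_propagated` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for a failing test run: a child process that exits with status 1.
FAILING_TEST = [sys.executable, "-c", "import sys; sys.exit(1)"]


def run_masked(cmd):
    """Run cmd, then a cleanup step; report only the cleanup status."""
    subprocess.run(cmd)      # the test's exit status is discarded here
    cleanup_status = 0       # pretend a later cleanup step succeeded
    return cleanup_status    # the job reports success regardless of the tests


def run_propagated(cmd):
    """Run cmd, then a cleanup step; report the test's own exit status."""
    test_status = subprocess.run(cmd).returncode
    # cleanup would happen here without touching test_status
    return test_status


if __name__ == "__main__":
    print("masked:", run_masked(FAILING_TEST))
    print("propagated:", run_propagated(FAILING_TEST))
```

The CI system only sees the wrapper's final status, so the first variant is reported as passing even though the tests failed.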

@valen-mascarenhas14
Contributor

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso I can see all the tests are running & passing for ppc64le eventing tests (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#ppc64le-e2e-tests&width=20)

@dilipgb
Contributor

dilipgb commented Feb 13, 2024

To my understanding they are being maintained, failures/flakes are being fixed, etc.

Not really. For example, I just checked and s390x eventing tests aren't being actually run - but CI is showing green (https://testgrid.k8s.io/r/knative-own-testgrid/eventing#s390x-e2e-tests&width=20) because something is overriding the exit value of the script.

@dprotaso we are actively debugging the eventing job failures on s390x. When we send two flags (--platform=linux/s390x --insecure_registry) in KO_FLAGS, the platform is not recognised (I hope you can recreate the issue and check on your end if needed). If we send only --platform=linux/s390x, the tests run fine for all branches. We have a similar setup in release-1.11, where we send both flags in KO_FLAGS and it works fine. Hence it is taking time to troubleshoot and understand why this happens in releases later than 1.11.

Since we have a self-signed certificate on our registry, we need the flag to exist. As @psschwei rightly pointed out, there is certainly a communication gap and we will address it in future.
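
One general way a multi-flag value can end up unrecognised is word splitting: if the whole string reaches the tool as a single argument, the second flag becomes part of the first flag's value. The thread does not establish that this is what happened here; the Python sketch below only illustrates the mechanism with `shlex`. The flag text is taken from the comment above, and `platform_value` is a hypothetical, deliberately naive parser.

```python
import shlex

# Flag string as quoted in the comment above.
ko_flags = "--platform=linux/s390x --insecure_registry"


def as_single_argument(value):
    """Model passing the whole string as one argv entry (e.g. a quoted variable)."""
    return [value]


def as_split_arguments(value):
    """Model passing the string after shell-style word splitting."""
    return shlex.split(value)


def platform_value(argv):
    """Return what a naive parser would read as the --platform value, or None."""
    prefix = "--platform="
    for arg in argv:
        if arg.startswith(prefix):
            return arg[len(prefix):]
    return None


if __name__ == "__main__":
    print(repr(platform_value(as_single_argument(ko_flags))))
    print(repr(platform_value(as_split_arguments(ko_flags))))
```

With a single flag the two paths behave identically, which would be consistent with the report that sending only the platform flag works.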

@dilipgb
Contributor

dilipgb commented Feb 13, 2024

@dprotaso I have moved KO_DOCKER_REPO from our self-hosted Artifactory instance to IBM Cloud. Please approve PR #351; it resolves the eventing issues we were facing.

@davidhadas
Contributor

@dprotaso,

My understanding of the current status is:

  1. There are teams working on the 2 additional architectures that clearly ask for the tests to continue.
  2. They are committed to supporting these tests and ensuring Knative works on the additional architectures
  3. The cost of the additional hardware needed for testing is covered by IBM
  4. Significant parts of Knative can be used as is by the community on these additional HW architectures, but there are some identified gaps (envoy) that are presently being worked on by the teams.

Did I miss anything?

A reasonable path forward here is to keep testing on the two additional HW architectures and allow the teams to remove such gaps and ensure that the community can use Knative as is on the additional architectures.

We can reevaluate this in six months' time to see the progress made.

@davidhadas davidhadas changed the title Are s390x/pcc jobs still valuable? Are s390x/ppc jobs still valuable? Feb 21, 2024
@dilipgb
Contributor

dilipgb commented Mar 25, 2024


This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2024
@davidhadas
Contributor

Do we still need to monitor this?
Or can we close it?

@davidhadas
Contributor

/remove-lifecycle stale

@knative-prow knative-prow bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 24, 2024
@dprotaso
Member Author

Do we still need to monitor this?

Yes we should monitor this. I haven't really seen IBM follow through on their commitments from that TOC meeting back in Feb.

Or can we close it?

I figured it's worth providing you the full 6 months as was discussed. There's still a month left or so.

@upodroid
Member

+1 for revisiting in Aug as discussed

@dilipgb
Contributor

dilipgb commented Jun 26, 2024

Hi all, efforts are under way to enable envoy-openssl for the IBM Z and IBM P platforms. Most of the work needed for the boringssl and openssl compact library is complete. A few more packages still need work, and after that we will have envoy-openssl available on our platforms as well. Here are some PRs for this:
[1] envoyproxy/envoy-openssl#166.
[2] envoyproxy/envoy-openssl#219
[3] envoyproxy/envoy#34483

@rishikakedia

We are also working on publishing Envoy (based on OpenSSL) images for IBM P/Z alongside the x86 images: envoyproxy/envoy-openssl#221

@dilipgb
Contributor

dilipgb commented Jul 8, 2024

installation document for IBM s390x/ IBM Power
knative/docs#6043

@dprotaso
Member Author

dprotaso commented Jul 25, 2024

Following up here - I'm inclined to remove the jobs.

The main reason is that the jobs were failing for over two weeks and it was only noticed a day ago. Clearly it's not a priority, and I don't think the Knative OSS project should be running these tests on behalf of a vendor.

@dilipgb
Contributor

dilipgb commented Jul 25, 2024

@dprotaso there were intermittent infra issues in IBM Cloud. Because of the noise those infra issues created, the failure was noticed late: the infra issues were fixed but CI kept failing. We are actively monitoring the jobs and continue to focus on keeping CI healthy.

At the moment there are users running knative on kubernetes on s390x and if we remove the support it impacts those users too. One of the asks from these users was for documentation, and we have updated the docs as well. We are actively working to get envoy-openssl support for P/Z, and we are almost at the point where we just have to publish images.

@dprotaso
Member Author

At the moment there are users running knative on kubernetes on s390x and if we remove the support it impacts those users too.

We're not removing support - we're removing the CI jobs

@dilipgb
Contributor

dilipgb commented Jul 25, 2024

We're not removing support - we're removing the CI jobs

@dprotaso I'm not really getting how the binaries/images are delivered in future if CI is removed. Are we going to deliver binaries without CI? Can you propose a call, so we can have a better discussion and understand each other's point of view? What do you think?

@dilipgb
Contributor

dilipgb commented Jul 25, 2024

#484

@davidhadas
Contributor

davidhadas commented Jul 25, 2024

We're not removing support - we're removing the CI jobs

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. Please, however, do not remove CI jobs without involving the relevant teams and getting to an agreement.

@dprotaso
Member Author

@dprotaso I'm not really getting how the binaries/images are delivered in future if CI is removed.

Our release jobs produce the images and binaries. The s390x/ppc jobs this issue is about are not those release jobs.

@dprotaso
Member Author

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. Please, however, do not remove CI jobs without involving the relevant teams and getting to an agreement.

We've already discussed this and have set some expectations that were documented in this comment: #344 (comment)

Going over the expectations:

1. The ask for respective maintainers/teams to stabilize the runs

As of this morning the tests are still unstable

2. Regularly Monitor the results and proactively fix or raise issues with Knative bits or infra

Tests were broken for 2+ weeks before the issue was surfaced

3. Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks

Didn't see further engagement in the productivity working group unrelated to these jobs

4. In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z. Goal is to have community releases of Knative usable on respective architectures with 3rd party dependencies available as well.

The goal hasn't been met. Patching the knative installation files with a redhat image that requires a RedHat account doesn't really meet the bar here.

Again we are not dropping support for s390x/ppc - we're dropping these random CI jobs

@dilipgb
Contributor

dilipgb commented Aug 1, 2024

Please don't remove CI jobs. We have discussed this in the past and we can discuss it again. Please, however, do not remove CI jobs without involving the relevant teams and getting to an agreement.

We've already discussed this and have set some expectations that were documented in this comment: #344 (comment)

Going over the expectations:

1. The ask for respective maintainers/teams to stabilize the runs

[DP]: As of this morning the tests are still unstable
[DB]: Tests are failing because the IBM Z / IBM P serving images have not been released for release-1.15. We are unable to reduce the noise because of that.

2. Regularly Monitor the results and proactively fix or raise issues with Knative bits or infra

[DP]: Tests were broken for 2+ weeks before the issue was surfaced
[DB]: The tests broke because of changes in Chainguard on July 16. We reported the issue on July 24. It took us approximately a week because there was some flakiness in the infra scripts, which I mentioned in Slack as well.

3. Knative community asks for the P/Z teams to aim to contribute more in the Productivity WG tasks

[DP]: Didn't see further engagement in the productivity working group unrelated to these jobs
[DB]: We are expanding the functionality of Knative on s390x/ppc64le by enabling required packages from different communities, such as buildpacks, which is needed for Knative Functions (buildpacks/lifecycle#1142). Trust manager is enabled for the Knative eventing tests (cert-manager/trust-manager#315). At the moment we are working with paketo-buildpacks to get multi-arch support. We are also doing some analysis to enable KEDA officially. All of this tooling is necessary for running Knative on s390x.

4. In a 6 months period revisit the topic to check progress on Envoy changes required for Istio on P and Z. Goal is to have community releases of Knative usable on respective architectures with 3rd party dependencies available as well.

[DP]: The goal hasn't been met. Patching the knative installation files with a redhat image that requires a RedHat account doesn't really meet the bar here.
[DB]: This is a stopgap solution; we are waiting for the envoy-openssl release in envoyproxy/envoy-openssl#221. Once that is done we can support envoy without these images.

Again we are not dropping support for s390x/ppc - we're dropping these random CI jobs

@upodroid
Member

upodroid commented Aug 2, 2024

I concur with @dprotaso's assessment that s390x/ppc jobs should be removed.

  1. The jobs are too flaky and aren't passing consistently. For a while, they were fully broken for 2 months, which wasn't resolved until Dave mentioned that we are planning on removing the jobs
  2. The ecosystem changes we asked for haven't been shipped yet.
  3. I don't see any contributions to non s390x/ppc Knative Productivity Issues.
  4. The project's CI cost has exceeded its budget and I'm adjusting job frequencies. See "reduce job frequencies for release branches and serving e2e's" (#494) and "Remove geo replication for GCR" (hack#389)

https://testgrid.k8s.io/r/knative-own-testgrid/client#ppc64le-e2e-tests broken for 2 months till recently
https://testgrid.k8s.io/r/knative-own-testgrid/serving#ppc64le-kourier-tests flakes too frequently
https://testgrid.k8s.io/r/knative-own-testgrid/serving#s390x-kourier-tests
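
Terms like "flaky" and "broken for 2 months" can be made measurable. The Python sketch below computes a pass rate and the longest run of consecutive failures from a job's run history. The history shown is hypothetical sample data, not taken from the testgrid dashboards linked above, and the function names are illustrative.

```python
def pass_rate(results):
    """Fraction of passing runs; results is a list of booleans (True = pass)."""
    return sum(results) / len(results) if results else 0.0


def longest_failure_streak(results):
    """Length of the longest run of consecutive failures."""
    longest = current = 0
    for passed in results:
        current = 0 if passed else current + 1
        longest = max(longest, current)
    return longest


if __name__ == "__main__":
    # Hypothetical run history, oldest first.
    history = [True, False, False, False, True, False, True, True]
    print(f"pass rate: {pass_rate(history):.0%}")
    print("longest failure streak:", longest_failure_streak(history))
```

A low pass rate with short streaks suggests flakiness, while a long streak suggests a job that is simply broken and unattended.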

@valen-mascarenhas14
Contributor

  1. The jobs are too flaky and aren't passing consistently. For a while, they were fully broken for 2 months, which wasn't resolved until Dave mentioned that we are planning on removing the jobs

@upodroid As previously mentioned by Dilip in the comments above, the failures were primarily due to the changes in Chainguard and infra-related problems. That is also why it took us some time to figure out the cause of these failures. These issues were communicated and discussed in the Slack group. Although the jobs experienced intermittent flakiness, we diligently debugged and resolved these issues as they arose. To be clear, we were actively working on these issues and did not wait two months, until removal of the jobs was suggested, to address them. Our efforts ensured that disruptions were minimized and fixes were implemented in a timely manner.

@dilipgb
Contributor

dilipgb commented Aug 6, 2024

#495
