Infrastructure for Orka (2024 and beyond) #3686

Open
11 of 12 tasks
UlisesGascon opened this issue Apr 19, 2024 · 42 comments

Comments

@UlisesGascon
Member

UlisesGascon commented Apr 19, 2024

I plan to work on it during the weekend, so I can provide a good overview at the next build meeting on Tuesday.

Current tasks on MacOS infra

Blocked until ARM nodes are provided

  • Confirm org decision regarding new ARM nodes (discussion ongoing in the mailing list)
  • Add new VMs for MacOS 13 ARM
  • Add new VMs for MacOS 11 ARM
@UlisesGascon
Member Author

Current Orka state

updated on April 19, 2024

| SSH port | Node: macpro-4 | Node: macpro-5 | Node: macpro-6 |
|---|---|---|---|
| 8822 | release-macos11-x64-1 | empty | test-macos11-x64-1 |
| 8823 | empty | empty | test-macos11-x64-2 |
| 8824 | empty | test-macos1015-x64-2 | test-macos1015-x64-1 |
| 8825 | empty | empty | empty |

@UlisesGascon
Member Author

UlisesGascon commented Apr 19, 2024

Next Orka state

updated on April 22, 2024

Intel Nodes

| SSH port | Node: macpro-4 | Node: macpro-5 | Node: macpro-6 |
|---|---|---|---|
| 8822 | release-macos11-x64-1 | test-macos13-x64-2 | test-macos11-x64-1 |
| 8823 | test-macos13-x64-1 | release-macos13-x64-1 | test-macos11-x64-2 |
| 8824 | empty | test-macos1015-x64-2 | test-macos1015-x64-1 |
| 8825 | empty | empty | empty |

ARM Nodes

We assume that ARM Nodes can handle only 2 VMs, and not 4+ as Intel did in the past, due to license limitations. This needs to be confirmed with support AFAIK?

| SSH port | Node: arm-1 | Node: arm-2 | Node: arm-3 |
|---|---|---|---|
| 8822 | test-macos11-arm64-1 | release-macos13-arm64-1 | empty |
| 8823 | release-macos11-arm64-1 | test-macos13-arm64-1 | test-macos13-arm64-2 |
How are the Nearform machines "relocated"?

  • release-nearform-macos11.0-arm64-1 -> release-orka-macos11-arm64-1
  • test-nearform-macos11.0-arm64-1 -> test-orka-macos11-arm64-1

@targos
Member

targos commented Apr 22, 2024

> release-macos13-x64-2
> release-macos13-arm64-2

I don't think it's necessary to have two identical release machines.

@targos
Member

targos commented Apr 22, 2024

> test-nearform-macos11.0-arm64-1

Are these typos?

@UlisesGascon
Member Author

UlisesGascon commented Apr 22, 2024

Great feedback @targos! I updated the tables

> I don't think it's necessary to have two identical release machines.

We have space for redundancy, but let's remove them for now.

> Are these typos?

I made a better reference for the "relocated" machines

@targos targos pinned this issue May 2, 2024
@targos
Member

targos commented May 2, 2024

> release-macos13-x64-2
> release-macos13-arm64-2
>
> I don't think it's necessary to have two identical release machines.

Actually, I think we should have one x64 and two arm64 machines, because there are two jobs that run on macos-arm64 during a release (osx11-release-pkg and osx11-arm64-release-tar).

@ryanaslett
Contributor

Some questions/thoughts/suggestions:

  1. Requirements question: do we still need to support 10.15 and/or 11? From https://github.com/nodejs/node/blob/main/BUILDING.md#supported-platforms I see:

> Node.js does not support a platform version if a vendor has expired support for it. In other words, Node.js does not support running on End-of-Life (EoL) platforms. This is true regardless of entries in the table below.

And the table lists MacOS >= 11.

And that table may be outdated, as it seems MacOS 11 reached EOL in November 2023?

  2. ARM support in Orka:

> We assume that ARM Nodes can handle only 2 VMs, and not 4+ as Intel did in the past, due to license limitations. This needs to be confirmed with support AFAIK?

https://orkadocs.macstadium.com/docs/apple-arm-based-support confirms this:

> IMPORTANT
>
> You can deploy up to 2 VMs per Apple silicon-based node.

  3. From what I can gather, macOS infra seems to be brittle, with nodes often running into disk issues/maintenance issues.

#3592
#3685
(https://github.com/nodejs/build/issues?q=is%3Aissue+macos+is%3Aclosed+disk) etc.

My suggestion to avoid Jenkins worker decay is to lean into an ephemeral node strategy so that each build has a fresh Orka instance to run on.

We can do that with the following Jenkins plugin for Orka:
https://plugins.jenkins.io/macstadium-orka/#plugin-content-ephemeral-agents

We would first need to set up a packer build process to create our VM images so that Orka would have a baseline image to create:
https://orkadocs.macstadium.com/docs/packer

The packer process can leverage our existing ansible playbooks:
https://developer.hashicorp.com/packer/integrations/hashicorp/ansible/latest/components/provisioner/ansible.

This strategy would require that we have an Orka3.0 cluster. Rather than trying to do an upgrade of the existing cluster, I propose that we ask macstadium to allow us to provision a new cluster with the resources we need in it (enough arm/intel backing nodes for our macos11/13 testing and release), get it built/provisioned and working, and then decommission/return all the existing macstadium/orka machines.

I believe this would end up with us using roughly the same amount of resources, so should be palatable for macstadium to support this transition.

@mhdawson
Member

> This strategy would require that we have an Orka3.0 cluster. Rather than trying to do an upgrade of the existing cluster, I propose that we ask macstadium to allow us to provision a new cluster with the resources we need in it (enough arm/intel backing nodes for our macos11/13 testing and release), get it built/provisioned and working, and then decommission/return all the existing macstadium/orka machines.

+1 from me if Macstadium will support that


targos added a commit to targos/node that referenced this issue Dec 18, 2024
We are in the process of updating macOS to version 13 in the
Jenkins CI, but unfortunately this is taking longer than expected.
Add it to the GitHub actions test matrix so that we have some coverage.

Refs: nodejs/build#3686
nodejs-github-bot pushed a commit to nodejs/node that referenced this issue Dec 20, 2024
We are in the process of updating macOS to version 13 in the
Jenkins CI, but unfortunately this is taking longer than expected.
Add it to the GitHub actions test matrix so that we have some coverage.

Refs: nodejs/build#3686
PR-URL: #56307
Reviewed-By: Yagiz Nizipli <[email protected]>
Reviewed-By: Richard Lau <[email protected]>
Reviewed-By: Chengzhong Wu <[email protected]>
Reviewed-By: Joyee Cheung <[email protected]>
Reviewed-By: Luigi Pinca <[email protected]>
@targos
Member

targos commented Dec 21, 2024

Some interesting news, coming from nodejs/node-v8#295 and a Slack chat with @joyeecheung:

That said, I suggest:

@targos
Member

targos commented Dec 21, 2024

Note that officially (according to https://developer.apple.com/download/applications/), Xcode 16.1 requires at least macOS 14.5 to run, and according to Wikipedia, Xcode 16.0 did too. So I don't know how the osx13-x64-release-tar job is able to run, but it may be risky not to upgrade macOS to a supported version.

@joyeecheung
Member

joyeecheung commented Dec 21, 2024

I'm away from my machine that has macOS 13 + Apple Clang 14 now, so I can't provide more details until after the holidays, but FWIW: when I tried to install the latest system update for 13, the only available update was upgrading to Sequoia, and nothing else showed up when I tried to look for the last compatible update of XCode or the command line tools with the App Store or Software Update/softwareupdate --list. If somehow it is possible to run macOS 13 with XCode 16 we would likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (or if it just doesn't work then we need to tell contributors to upgrade to Sequoia).

@targos
Member

targos commented Dec 21, 2024

This is how we manually install Xcode on the build machines: https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#full-xcode

@joyeecheung
Member

joyeecheung commented Dec 21, 2024

Also my 2 cents: V8 uses (almost) tip-of-tree clang, so that's currently clang 20, and they have been doing a lot of C++ modernization that lower versions of clang aren't very good at parsing. I did quite a bit of patching to make V8 build on macOS 13 and Clang 14 in https://github.com/joyeecheung/node/tree/fix-macos-13 and many of the fixes don't look very acceptable upstream because they basically just revert the modernization. If we are upgrading the build system, the least-friction route would probably be to just require Sequoia and XCode 16 to build, though we can keep targeting 11. The lower the macOS version we need to support, the harder it is to install higher versions of Apple Clang on it, and the C++ feature gap will keep widening as V8 uses ToT Clang.

@UlisesGascon
Member Author

> If somehow it is possible to run macOS 13 with XCode 16 we would likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (or if it just doesn't work then we need to tell contributors to upgrade to Sequoia)

For the new Orka machines, we are using Packer, and the instructions include some manual steps on how to install it that are replicable for local machines as well:
https://github.com/nodejs/build/tree/main/orka/templates#manual-steps-for-the-release-images.

We probably want to update the commands and ensure that we are using the correct version 👍

aduh95 pushed a commit to nodejs/node that referenced this issue Jan 2, 2025
@anonrig
Member

anonrig commented Jan 12, 2025

Is there any update/progress on this issue?

@UlisesGascon
Member Author

Let me ping @ryanaslett! AFAIK we were testing the new ephemeral instances and waiting for a HW upgrade in the new cluster so we can decommission the old VMs and move all the workloads for both CI environments, but not sure if this was completed or not.

@anonrig
Member

anonrig commented Jan 25, 2025

This issue is currently the only blocker for adding URLPattern to Node.js - nodejs/node#56452

@richardlau
Member

Linking this to openjs-foundation/infrastructure#17.

@ryanaslett
Contributor

Update:

The images that back the test instances on our cluster are in a state where I believe we can unblock our blocked PRs.

> If somehow it is possible to run macOS 13 with XCode 16 we would likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (or if it just doesn't work then we need to tell contributors to upgrade to Sequoia).

The instances have osx13.0, with Xcode 16 installed on them (both on the Arm and Intel images). Xcode 16 was installed via the command line (despite Apple's compatibility matrix) and it appears to be working fine for our purposes.

I agree that's likely not ideal from a contributor perspective: tricking a Ventura machine into running Xcode 16 works fine for compiling, but it likely doesn't run at all in the GUI.

> In the release CI, we have two different Xcode versions

That was an oversight where Xcode 16 didn't fully get installed on the Arm image before it was deployed. I have an updated release image that has Xcode 16 prepared, but am hesitant to deploy that change since today I also updated those release images to renew the expired signing certificates, and I wouldn't want to impact the current release schedule any further than it already has been. I can deploy that when the releases have actually happened.

The testing images can be enabled as soon as I get enough consensus that it's acceptable to run tests on macos13 with Xcode 16.

If we should decide those ought to be sequoia and xcode16, I can make images for those, but that'll likely require more time because we'll likely want all four images (test/release X arm/intel) to be updated.

@ryanaslett
Contributor

I noticed several attempts at re-running some PRs on osx. I hadn't yet turned on the osx13 labels on the Jenkins jobs. I went ahead and did that so we can see the results.

@targos
Member

targos commented Jan 30, 2025

I see osx13-arm64 and osx13-x64 labels in both node-test-commit-osx and node-test-commit-osx-arm jobs.
This seems redundant. Should we remove node-test-commit-osx-arm from node-test-pull-request?

@ryanaslett
Contributor

> I see osx13-arm64 and osx13-x64 labels in both node-test-commit-osx and node-test-commit-osx-arm jobs. This seems redundant. Should we remove node-test-commit-osx-arm from node-test-pull-request?

Agreed. I've configured node-test-commit-osx to run both osx13-arm64 and osx13-x64 labels, and have disabled the node-test-commit-osx-arm on node-test-pull-request. We can remove it entirely once all the dust settles.

@ryanaslett
Contributor

Separate but related question about the OSX instances: should these jobs be using ccache? We have shared storage available that we can use so that when an ephemeral instance is launched, it can have an existing warmed ccache, but it doesn't look like the osx11/osx10.15 machines were using ccache either, so I'm unsure whether it's applicable to builds on osx.

@richardlau
Member

They're supposed to be.

e.g. from today's https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-v18.x-staging/453/
https://ci.nodejs.org/job/node-test-commit-osx/63373/nodes=osx1015/consoleFull

10:01:27 + export CCACHE_BASEDIR=/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015
10:01:27 + CCACHE_BASEDIR=/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015
10:01:27 + export 'CC=/usr/local/bin/ccache cc'
10:01:27 + CC='/usr/local/bin/ccache cc'
10:01:27 + export 'CXX=/usr/local/bin/ccache c++'
10:01:27 + CXX='/usr/local/bin/ccache c++'
10:01:27 ++ getconf _NPROCESSORS_ONLN
10:01:27 + export JOBS=4
10:01:28 + JOBS=4
10:01:32 + NODE_TEST_DIR=/Users/iojs/node-tmp
10:01:32 + FLAKY_TESTS=dontcare
10:01:32 + make run-ci -j 4
10:01:32 python3 ./configure --verbose 
10:01:32 Node.js configure: Found Python 3.7.7...
10:01:32 Detected clang C++ compiler (CXX=/usr/local/bin/ccache c++) version: 11.0.3
10:01:32 Detected clang C compiler (CC=/usr/local/bin/ccache cc) version: 11.0.3
...

Admittedly the build time would suggest that either build times on macOS are much worse than other platforms or the ccache set up on macOS is broken.
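
For reference, the environment set up in the log above could be reproduced outside Jenkins with something like the following. It is a minimal sketch, not the actual job configuration: `ccache_env` is a hypothetical helper, and the ccache path and workspace are simply the values visible in the log.

```python
import os


def ccache_env(workspace, ccache="/usr/local/bin/ccache", base=None):
    """Return a copy of the environment with the compilers wrapped in ccache."""
    env = dict(os.environ if base is None else base)
    env["CCACHE_BASEDIR"] = workspace  # lets ccache rewrite absolute paths under the workspace
    env["CC"] = f"{ccache} cc"
    env["CXX"] = f"{ccache} c++"
    return env


# Workspace path taken from the log excerpt above.
env = ccache_env("/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015")
print(env["CC"], "|", env["CXX"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when invoking the build, which makes it easy to compare a smoke-test setup with what the real jobs export.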

@ryanaslett
Contributor

> They're supposed to be.

Ah, yes. I see that. They were not using that on the tests I ran, so not sure what's different between the 'real jobs' and my smoke test setup.

I'm looking into why it's not respecting the ccache env vars. I think I may need to use a .ccache config instead of env vars.

Also, we're only getting half the instances for our arm builds, as they were set to 6 CPUs and those nodes only had 10 available. I've reset those to 5, but the queue will have to clear out before it shuts down the 6-CPU VMs and re-creates two 5-CPU VMs.
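
The arithmetic behind that, as a small sketch: `vms_per_node` is a hypothetical helper, and the default cap of 2 comes from the Orka documentation quoted earlier in this thread.

```python
def vms_per_node(node_cpus, vm_cpus, license_limit=2):
    """How many VMs of a given size fit on one node, capped by the per-node limit."""
    if vm_cpus <= 0:
        raise ValueError("vm_cpus must be positive")
    return min(node_cpus // vm_cpus, license_limit)


print(vms_per_node(10, 6))  # 1: a second 6-CPU VM does not fit in the remaining 4 CPUs
print(vms_per_node(10, 5))  # 2: two 5-CPU VMs use all 10 CPUs
```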

@mhdawson
Member

mhdawson commented Jan 30, 2025

> If we should decide those ought to be sequoia and xcode16, I can make images for those, but that'll likely require more time because we'll likely want all four images (test/release X arm/intel) to be updated.

I think we should target moving to Sequoia. Since we have successfully depended on the flag to set the lowest target OS to run on, rather than building on the lowest target OS for macOS, over the years, I don't think we expect any surprises when we make the move.

In the interests of unblocking the CIs, I'm also ok running on osx13 as you have it set up until new images can be built and deployed.

@targos
Member

targos commented Jan 31, 2025

The builds generally look good, but yeah they definitely need ccache. At least for x64, which takes around 3 hours (vs 20 minutes for arm64!)

@ryanaslett
Contributor

I had set up the x64 machines to use the ccache earlier today, and when I checked the shared cache directory, it had created all of the hash prefix subdirs, so I assumed it was working.

But I just dug deeper with the ccache debug logs and turns out it created those dirs without group perms, but the processes can only write as group members.

I just did a chmod g+w on the whole cache dir and now all the files are starting to cache. Hopefully this speeds things up a bit.
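
For anyone who wants to script the same fix, here is a rough Python equivalent of a recursive `chmod g+w`. It is a sketch under assumptions: `add_group_write` is a hypothetical helper, not something from the build repo. Separately, ccache has a `umask` setting (`CCACHE_UMASK`) intended for shared caches, which may avoid the problem for newly created entries.

```python
import os
import stat


def add_group_write(root):
    """Add the group-write bit to root and everything below it; return how many entries changed."""
    changed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # os.walk visits every directory as dirpath, so directories are covered here too.
        for path in [dirpath] + [os.path.join(dirpath, name) for name in filenames]:
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode) or mode & stat.S_IWGRP:
                continue  # skip symlinks and entries that are already group-writable
            os.chmod(path, stat.S_IMODE(mode) | stat.S_IWGRP)
            changed += 1
    return changed
```

Calling `add_group_write` on the shared cache directory returns the number of entries it had to fix, so a second run returning 0 is a quick check that nothing new was created without group permissions.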

There does still seem to be something awry with the x64 machines connecting to the shared drive, as doing simple things like an ls takes 22 seconds (whereas on the arm machines it acts like a normal HD).

aduh95 pushed a commit to nodejs/node that referenced this issue Jan 31, 2025

7 participants