Infrastructure for Orka (2024 and beyond) #3686
Comments
Current Orka state (updated on April 19, 2024)

Next Orka state (updated on April 22, 2024)

Intel Nodes

ARM Nodes: we assume that ARM nodes can handle only 2 VMs, and not 4+ like the Intel nodes in the past, due to license limitations. AFAIK this still needs to be confirmed with support.

How are the NearForm machines "relocated"?
|
I don't think it's necessary to have two identical release machines. |
Are these typos? |
Great feedback @targos! I updated the tables
We have space for redundancy, but let's remove them for now.
I made a better reference for the "relocated" machines |
Actually, I think we should have one x64 and two arm64 machines, because there are two jobs that run on macos-arm64 during a release (osx11-release-pkg and osx11-arm64-release-tar). |
Some questions/thoughts/suggestions:
And the table lists …, and that table may be outdated, as it seems macOS 11 was EOL as of November 2023?
https://orkadocs.macstadium.com/docs/apple-arm-based-support confirms this:
#3592 My suggestion to avoid Jenkins worker decay is to lean into an ephemeral-node strategy, so that each build gets a fresh Orka instance to run on. We can do that with the following Jenkins plugin for Orka: We would first need to set up a Packer build process to create our VM images, so that Orka would have a baseline image to create instances from: The Packer process can leverage our existing Ansible playbooks: This strategy would require an Orka 3.0 cluster. Rather than trying to upgrade the existing cluster, I propose that we ask MacStadium to let us provision a new cluster with the resources we need (enough ARM/Intel backing nodes for our macOS 11/13 testing and release), get it built, provisioned, and working, and then decommission/return all the existing MacStadium/Orka machines. I believe we would end up using roughly the same amount of resources, so this transition should be palatable for MacStadium to support. |
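For illustration, the Packer part of that strategy might look roughly like the sketch below. All file, variable, and image names here are assumptions for illustration only (it presumes a template named orka-image.pkr.hcl that uses a MacStadium Orka builder plugin plus an Ansible provisioner pointing at the existing playbooks), not the project's actual configuration.

```bash
# Minimal sketch (assumed names/paths, not the real project config): build a
# baseline macOS image with Packer, provisioned via the existing Ansible
# playbooks, so Orka can spin up ephemeral Jenkins agents from it.
packer init .    # installs the plugins the template declares

# ORKA_ENDPOINT/ORKA_TOKEN and the image names below are placeholders.
packer build \
  -var "orka_endpoint=${ORKA_ENDPOINT}" \
  -var "orka_auth_token=${ORKA_TOKEN}" \
  -var "source_image=macos-13-base.img" \
  -var "image_name=jenkins-macos13-baseline" \
  orka-image.pkr.hcl
```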
+1 from me if Macstadium will support that |
We are in the process of updating macOS to version 13 in the Jenkins CI, but unfortunately this is taking longer than expected. Add it to the GitHub actions test matrix so that we have some coverage. Refs: nodejs/build#3686 PR-URL: #56307 Reviewed-By: Yagiz Nizipli <[email protected]> Reviewed-By: Richard Lau <[email protected]> Reviewed-By: Chengzhong Wu <[email protected]> Reviewed-By: Joyee Cheung <[email protected]> Reviewed-By: Luigi Pinca <[email protected]>
Some interesting news, coming from nodejs/node-v8#295 and a Slack chat with @joyeecheung:
That said, I suggest:
|
Note that officially (according to https://developer.apple.com/download/applications/), Xcode 16.1 requires at least macOS 14.5 to run, and according to Wikipedia, Xcode 16.0 did too. So I don't know how the |
I've left the machine that has macOS 13 + Apple Clang 14 for now, so I can't provide more details until after the holidays, but FWIW: when I tried to install the latest system update for 13, the only available update was an upgrade to Sequoia, and nothing else showed up when I looked for the last compatible update of Xcode or the command line tools via the App Store or Software Update/softwareupdate --list. If it is somehow possible to run macOS 13 with Xcode 16, we likely need to document how to install it, or contributors on macOS 13 may have a hard time getting it to build (and if it just doesn't work, then we need to tell contributors to upgrade to Sequoia). |
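For reference, the kind of check described above can be reproduced with standard macOS tooling; the commands below are generic and not specific to the Node.js CI setup, and their output will differ per machine.

```bash
# Generic macOS commands for checking available updates and toolchain versions;
# nothing here is specific to the Node.js build machines.
softwareupdate --list          # updates still offered for this macOS release
sw_vers -productVersion        # installed macOS version
xcode-select --print-path      # active developer directory
xcodebuild -version            # installed Xcode version (requires a full Xcode install)
```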
This is how we manually install Xcode on the build machines: https://github.com/nodejs/build/blob/main/ansible/MANUAL_STEPS.md#full-xcode |
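As a rough outline (the Xcode version and paths below are assumptions for illustration; the linked MANUAL_STEPS.md is the authoritative procedure), a command-line Xcode install generally looks like:

```bash
# Rough outline of a command-line Xcode install; the version number and paths
# are illustrative, and MANUAL_STEPS.md documents the actual procedure.
xip --expand Xcode_16.1.xip
sudo mv Xcode.app /Applications/Xcode.app
sudo xcode-select --switch /Applications/Xcode.app/Contents/Developer
sudo xcodebuild -license accept
sudo xcodebuild -runFirstLaunch
```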
Also my 2 cents: V8 uses (almost) tip-of-tree Clang, so that's currently Clang 20, and they have been doing a lot of C++ modernization that lower versions of Clang aren't very good at parsing. I did quite a bit of patching to make V8 build on macOS 13 and Clang 14 in https://github.com/joyeecheung/node/tree/fix-macos-13, and many of the fixes don't look very acceptable upstream because they basically just revert the modernization. If we are upgrading the build system, the least-friction route would probably be to just require Sequoia and Xcode 16 to build, though we can keep targeting 11. The lower the macOS version we need to support, the harder it is to install higher versions of Apple Clang on it, and the C++ feature gap will keep widening as V8 uses ToT Clang. |
For the new Orka machines, we are using Packer, and the instructions include some manual steps on how to install it that are replicable for local machines as well: We probably want to update the commands and ensure that we are using the correct version 👍 |
Is there any update/progress on this issue? |
Let me ping @ryanaslett! AFAIK we were testing the new ephemeral instances and waiting for a hardware upgrade in the new cluster so we can decommission the old VMs and move all the workloads for both CI environments, but I'm not sure whether this was completed or not. |
This issue is currently the only blocker for adding URLPattern to Node.js - nodejs/node#56452 |
Linking this to openjs-foundation/infrastructure#17. |
Update: The images that back the test instances on our cluster are in a state where I believe we can unblock our blocked PRs.
The instances have macOS 13.0 with Xcode 16 installed (on both the ARM and Intel images). Xcode 16 was installed via the command line (despite Apple's compatibility matrix) and it appears to be working fine for our purposes.
I agree that's likely not ideal from a contributor perspective: tricking a Ventura machine into running Xcode 16 works fine for compiling, but the GUI likely doesn't run at all.
That was an oversight: Xcode 16 didn't fully get installed on the ARM image before it was deployed. I have an updated release image with Xcode 16 prepared, but I'm hesitant to deploy that change, since today I also updated those release images to renew the expired signing certificates, and I wouldn't want to impact the current release schedule any further than it already has been. I can deploy it once the releases have actually happened. The testing images can be enabled as soon as I get enough consensus that it's acceptable to run tests on macOS 13 with Xcode 16. If we decide those ought to be Sequoia and Xcode 16, I can make images for those, but that will likely require more time because we'll likely want all four images (test/release × ARM/Intel) to be updated. |
I noticed several attempts at re-running some PRs on osx… I hadn't yet turned on the osx13 labels on the Jenkins jobs. I went ahead and did that so we can see the results. |
I see |
Agreed. I've configured |
Separate related question about the OSX instances: should these jobs be using ccache? We have shared storage available that we can use so that when an ephemeral instance is launched, it can have an existing warmed ccache, but it doesn't look like the osx11/osx10.15 machines were using ccache either, so I'm unsure whether ccache is applicable to builds on OSX or not. |
They're supposed to be, e.g. from today's https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-v18.x-staging/453/:
10:01:27 + export CCACHE_BASEDIR=/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015
10:01:27 + CCACHE_BASEDIR=/Users/iojs/build/workspace/node-test-commit-osx/nodes/osx1015
10:01:27 + export 'CC=/usr/local/bin/ccache cc'
10:01:27 + CC='/usr/local/bin/ccache cc'
10:01:27 + export 'CXX=/usr/local/bin/ccache c++'
10:01:27 + CXX='/usr/local/bin/ccache c++'
10:01:27 ++ getconf _NPROCESSORS_ONLN
10:01:27 + export JOBS=4
10:01:28 + JOBS=4
10:01:32 + NODE_TEST_DIR=/Users/iojs/node-tmp
10:01:32 + FLAKY_TESTS=dontcare
10:01:32 + make run-ci -j 4
10:01:32 python3 ./configure --verbose
10:01:32 Node.js configure: Found Python 3.7.7...
10:01:32 Detected clang C++ compiler (CXX=/usr/local/bin/ccache c++) version: 11.0.3
10:01:32 Detected clang C compiler (CC=/usr/local/bin/ccache cc) version: 11.0.3
...
Admittedly, the build time would suggest that either build times on macOS are much worse than on other platforms, or the ccache set-up on macOS is broken. |
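As a side note, one low-effort way to confirm whether ccache is actually being hit on a node is to reset and then inspect its statistics around a build. These are standard ccache commands; the binary path is taken from the console log above and may differ on other machines.

```bash
# Standard ccache statistics commands; the /usr/local/bin path matches the
# console log above and may differ on other machines.
/usr/local/bin/ccache -z     # zero the statistics before a build
make run-ci -j 4             # run the same build step the job uses
/usr/local/bin/ccache -s     # inspect hit/miss counts afterwards
```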
Ah, yes. I see that. They were not using it on the tests I ran, so I'm not sure what's different between the 'real jobs' and my smoke-test setup. I'm looking into why it's not respecting the ccache env vars; I think I may need to use a ccache config file instead of env vars. Also, we're only getting half the instances for our ARM builds, as they were set to 6 CPUs and those nodes only had 10 available. I've reset them to 5, but the queue will have to clear out before it shuts down the 6-CPU VMs and re-creates two 5-CPU VMs. |
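For reference, a ccache config file roughly equivalent to the CCACHE_* env vars in the log above might look like the sketch below; the shared cache path and size are placeholders, not the values actually used on the Orka nodes.

```bash
# Hypothetical ccache.conf mirroring the env-var setup from the console log;
# cache_dir and max_size are placeholders, not the real node configuration.
mkdir -p ~/.config/ccache
cat > ~/.config/ccache/ccache.conf <<'EOF'
cache_dir = /Volumes/shared-ccache
base_dir = /Users/iojs/build/workspace
max_size = 20G
EOF
ccache -p   # print the resolved configuration to verify it was picked up
```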
I think we should target moving to Sequoia, and since we have successfully depended on the flag that sets the lowest target OS to run on (rather than building on the lowest target OS) for macOS over the years, I don't think we should expect any surprises when we make the move. In the interests of unblocking the CIs, I'm also OK with running on osx13 as you have it set up until new images can be built and deployed. |
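To illustrate the "build on a newer macOS, target an older one" approach mentioned above: the usual mechanism is the deployment-target flag. The value 11.0 and the standalone compile below are generic examples, not the actual flags wired into the Node.js build scripts.

```bash
# Generic example of targeting an older minimum macOS version while building
# on a newer one; not the actual Node.js build configuration.
printf 'int main() { return 0; }\n' > hello.cc
export MACOSX_DEPLOYMENT_TARGET=11.0
clang++ -mmacosx-version-min=11.0 -o hello hello.cc
otool -l hello | grep -A4 LC_BUILD_VERSION   # 'minos' shows the targeted minimum
```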
The builds generally look good, but yeah they definitely need ccache. At least for x64, which takes around 3 hours (vs 20 minutes for arm64!) |
I had set up the x64 machines to use the ccache earlier today, and when I checked the shared cache directory, it had created all of the hash-prefix subdirs, so I assumed it was working. But I just dug deeper with the ccache debug logs, and it turns out it created those dirs without group permissions, while the processes can only write as group members. I just did a chmod g+w on the whole cache dir and now all the files are starting to cache. Hopefully this speeds things up a bit. There still seems to be something awry with the x64 machines connecting to the shared drive, as simple things like an ls take 22 seconds (whereas on the ARM machines it behaves like a normal HD). |
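The permissions fix described above could also be baked into the image/provisioning step, together with a umask so that new cache entries stay group-writable. The cache path below is a placeholder for the actual shared mount point.

```bash
# Make the shared ccache directory group-writable and keep new entries that way;
# /Volumes/shared-ccache is a placeholder for the actual shared mount point.
chmod -R g+w /Volumes/shared-ccache
printf 'umask = 002\n' >> ~/.config/ccache/ccache.conf
# If problems persist, running a build with CCACHE_DEBUG=1 writes per-object
# debug logs that show why a result was not cached.
```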
I plan to work on it during the weekend, so I can provide a good overview at the next build meeting on Tuesday.
Current tasks on macOS infra
Blocked until ARM nodes are provided