Add more AIX machines #1623
Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic, it's pretty painful ATM.
I think AIX becomes a bottleneck when someone runs CITGM because that ties up one of (I think) only two hosts for hours. Tying up one Windows host for hours for a CITGM run is no big deal because we have so many Windows hosts. And we don't run CITGM on Raspberry Pi at all, so that's not an issue either. So yeah, under ordinary conditions, Windows and Raspberry Pi tend to be the bottlenecks, but not too bad. But, I think, once a CITGM job or two get kicked off, AIX ends up taking literally hours.
@BridgeAR I am currently in the process of donating a much more powerful AIX machine to the community, which will allow us to have many more machines and help with the CI backlog issues.
@gdams Is this worth putting on the Build WG agenda, just so that if this stalls out or obstacles come up, it's there to be discussed next week?
I don't know what's changed in the last month or whatever, but AIX is definitely the big bottleneck, much more so than Windows and Raspberry Pi. In the past, IIRC, this would only be during CITGM runs. But it's a constant now. Part of it might be that AIX is so susceptible to whatever is causing the rash of failing tests lately that it needs to be re-run a lot, causing it to fall further behind in the queue than other platforms that don't need to be re-run during a Resume Build. Nothing unusual is going on in terms of building right now--it's a typical quiet-ish Friday--but the CI queue is totally backed up and it's entirely due to waiting for AIX hosts to be available for work.
I'll also add that the tests used to run really fast on the AIX hosts. The build/compile took a long time, but once the tests were going, it was impressive. Not so much in CI anymore. Now everything is slow on AIX. I don't know if we swapped in hosts with less memory/CPU or something, but it sure seems like something significant changed.
All that complaining...er, I mean providing information...above aside, I do believe one or two additional hosts would resolve the issue entirely.
@gdams in parallel with getting the new machine to OSU, can you also talk to David to see if anything with respect to the configuration changed? We might also want to double check that the ramdisk is still in place and working.
While I have access to one of the CI machines, I see that it has 16 CPUs, so I am wondering why we are running gmake single-threaded?
@gireeshpunathil where are you seeing it run single-threaded?
Say, for example, if you take the current run -
It's supplied to the parent make call, and it's supposed to coordinate across all child invocations as well to make sure the total isn't more than that. I was watching a job live and noted that ...
Ok, without knowing the full details of the CI script, I can state that the slowness on AIX ... I did some studies on CI runtimes of different platforms and my inference is that AIX runs are ... The system state does not show anything significant:

Translating to: looks good to me, but the same -j5, if percolated to the child make invocations, might help.
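As an illustration of the idea of matching the job count to the host's CPU count, here is a minimal hedged sketch (not the actual CI script, whose details aren't shown in this thread) that spawns gmake with an explicit `-j` derived from `os.cpus().length`. It assumes `gmake` is on the PATH and that it is run from the top of the source tree:

```js
// Hypothetical helper, not the CI script: launch gmake with a job count
// matching the number of CPUs reported by the OS (e.g. 16 on the AIX host).
const os = require('os')
const { spawnSync } = require('child_process')

const jobs = os.cpus().length
const result = spawnSync('gmake', ['-j', String(jobs)], { stdio: 'inherit' })
process.exit(result.status === null ? 1 : result.status)
```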
https://ci.nodejs.org/job/node-test-commit-aix/19890/nodes=aix61-ppc64/ made the -j explicitly flow down into the child process and it hasn't made a difference (in fact it's running longer than some recent builds). Without measuring precisely, the test executions look like they are about as slow as on Raspberry Pi 2s or 3s running off SD cards or via NFS. The compiles take ~20 minutes and I confirmed that ccache is engaging, so that's got to be slow disk, surely. Tests take the remainder of the time, 40+ minutes.
@rvagg - I see your point, and agree - parallelizing compilation did not seem to have much effect.
@rvagg - I ran the program below on three boxes, at the root of the node source tree, and it confirmed your disk latency theory conclusively:

```js
var fs = require('fs')

// Walk the tree; for every non-hidden entry, recurse into directories and
// write a small '.foo' sibling file next to each file, so both directory
// reads and small writes hit the disk.
function run(entry) {
  const ret = fs.readdirSync(entry, {withFileTypes: true})
  ret.forEach((item) => {
    if (item.name[0] !== '.') {
      try {
        const path = fs.realpathSync(entry) + '/' + item.name
        if (item.isDirectory())
          run(path)
        else
          fs.writeFileSync(path + '.foo', 'deadbeef')
      } catch (e) { console.log(e) }
    }
  })
}

run('.')
```

local Linux (top 5 consumers from strace):

local AIX (top 5 consumers from truss):

AIX CI (top 5 consumers from truss):

So evidently the disk is slow.
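As a hedged follow-up to the script above (not part of the original measurements, which used strace/truss output that isn't reproduced here), the same kind of traversal can be wrapped with a wall-clock timer so the three boxes can be compared with a single number:

```js
// Hypothetical timing wrapper: measure how long a directory walk plus
// small synchronous writes takes, to compare disk latency across machines.
const fs = require('fs')

function walk(entry) {
  for (const item of fs.readdirSync(entry, { withFileTypes: true })) {
    if (item.name[0] === '.') continue
    const path = entry + '/' + item.name
    if (item.isDirectory()) walk(path)
    else fs.writeFileSync(path + '.foo', 'deadbeef')
  }
}

const start = process.hrtime.bigint()
walk('.')
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
console.log(`walk + writes took ${elapsedMs.toFixed(1)} ms`)
```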
We have known that disk I/O was sub-optimal, but I wonder what might have changed recently. @gdams is working on adding a 3rd machine and then getting a whole new box, which has a lot more resources, to OSU. This is the current list of things @gdams is working through. If there are other things you think he should check, please suggest them:
install gdb
@gireeshpunathil thanks for the suggestion of adding gdb; added to the list above.
test-osuosl-aix61-ppc64_be-3 added to the farm
I think this can be closed now.
@BridgeAR @Trott @mhdawson @nodejs/testing @nodejs/build We are trying to get resources to add more modern AIX to the CI, but are hitting blockers because we have so many AIX machines already. The release machines are mostly unused, but are wanted for privilege separation, so I'll not mention them again. https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ however, which was added here, has been disabled since October 11th, and no one has complained, so here I am asking: has anyone noticed problems with AIX build queues? If I could recycle -3 as a 7.2 host, it would give us the two 7.2 test machines. We'd only lack a release machine.
I haven't noticed AIX being a bottleneck anymore.
It became the bottleneck again. I guess it's reproducible by starting many CIs all at the same time. That way the queue builds up and all other machines finish significantly faster than AIX. New CIs therefore have to wait for AIX to finish. Example: https://ci.nodejs.org/job/node-test-commit-aix/26551/
I don't think it's simply lots of CI runs. Looking at the list below, builds seem to finish quickly, and then there is a recent one that just doesn't finish:
The test -1 machine is almost unresponsive when logged in over ssh; there is something wrong with it.
Blocked on disk I/O, but not sure why; we are supposed to be building in a ramdisk. Or maybe it's ccache, I'll check.
I'm not sure, but the fact that we build on a ramdisk because the fs is so slow, yet keep the .ccache on the regular disk outside the ramdisk, seems suspicious to me.
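One quick way to test that suspicion might be to compare small synchronous write latency between the two locations from Node. The directory paths below are placeholders, not the actual mount points on the CI hosts:

```js
// Hypothetical latency probe: time a burst of small synchronous writes
// in two directories, e.g. one on the ramdisk and one on the regular disk.
const fs = require('fs')
const path = require('path')

function probe(dir, iterations = 200) {
  const file = path.join(dir, 'latency-probe.tmp')
  const start = process.hrtime.bigint()
  for (let i = 0; i < iterations; i++) {
    fs.writeFileSync(file, 'deadbeef')
  }
  fs.unlinkSync(file)
  const ms = Number(process.hrtime.bigint() - start) / 1e6
  console.log(`${dir}: ${iterations} writes in ${ms.toFixed(1)} ms`)
}

// Placeholder paths: substitute the real ramdisk and ccache locations.
probe('/path/to/ramdisk')
probe('/path/to/regular/disk')
```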
Can the problematic machine be rebooted or temporarily taken out of the rotation? https://ci.nodejs.org/job/node-test-pull-request/ is extremely backed up at the moment, as are other jobs.
I brought https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ back online, and it's already got a job. I am hesitant to take the other 2 offline, because both of them are having trouble. That they should go slow at the same time is one reason I'm staring at the ccache setup, and wondering if the 12.x release builds are different enough to push ccache over the edge. Now that -3 is building, it takes me 3-4 seconds to ssh into it... it used to be instantaneous. I'd like to move the ccache onto the RAMFS, but doing that is slow and fights for I/O with the backed-up jobs, and if I don't move it and just symlink it, I'll invalidate the entire cache, which won't help. I might have to make it worse to see if moving ccache makes it better.
https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3 has ~/.ccache symlinked to /home/iojs/build/.ccache; it's been offline for a month, so it doesn't have much cache anyhow. I just zeroed the stats, and everything is a cache miss. On -1, I'm rsyncing /home/iojs/.ccache to /home/iojs/build/.ccache. Once it's done, I'll symlink the original to the new one. I wonder if the 5G default cache size is too small; my local cache is 100G/120G.
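For reference, a hedged sketch of inspecting ccache statistics and raising the size limit; `-s` (show stats) and `-M` (set max size) are standard ccache options, and the 20G figure is only an illustrative value, not a recommendation from this thread. It is driven from Node just to stay consistent with the other snippets here:

```js
// Hypothetical sketch: show ccache stats, raise the max cache size,
// then show stats again to confirm the new limit took effect.
const { execSync } = require('child_process')

console.log(execSync('ccache -s', { encoding: 'utf8' }))  // current stats
execSync('ccache -M 20G')                                 // illustrative limit
console.log(execSync('ccache -s', { encoding: 'utf8' }))  // confirm new limit
```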
My grasp of Jenkins is shakier than I thought. https://ci.nodejs.org/job/node-test-commit-aix/26570/ and https://ci.nodejs.org/job/node-test-commit-aix/26569/ are two builds, but not the actual builds on the executors; for that I go down to Configurations, click "default", and both lead to the same build. Does that make sense?
The links to ...

20:55:56 Configuration node-test-commit-aix » aix61-ppc64 is still in the queue: Waiting for next available executor on 'aix61-ppc64'

The faded link links to the most recent ... If it's any easier to visualize:
-3 seems to be building well. I'm not seeing why disk is at 100% on -1:

```
topas
...
Disk    Busy%  KBPS  TPS  KB-Read  KB-Writ
hdisk0  100.0   6.0  1.0      0.0      6.0
hdisk1   59.0   2.0  0.0      0.0      2.0
```

I don't understand why builds are hitting disk. Both -1 and -3 now have their ccache symlinked as described above. -3 is building well, which is good (better if I knew why). We'll see if -1 starts building well; if so, it might have been the ccache, if not, then more research is required. I didn't touch -2, it's the control group.
Well, the backlog is gone, or rather moved; now it's over on arm: https://ci.nodejs.org/job/node-test-commit-arm It's too early, no one is building anything ATM, so I kicked off 3 builds. They spread across the three test machines, and I'll watch to see their build times.
e.g. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/26672/console

14:19:29 Caused by: java.io.IOException: error=12, There is not enough memory available now.
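error=12 here is errno 12 (ENOMEM), i.e. the Jenkins agent's JVM could not spawn a child process because the host ran short of memory (or paging space). A hedged sketch of a quick headroom check one could run on the host from Node, before reaching for AIX-specific tools:

```js
// Hypothetical memory headroom check: print total/free memory and load
// average as seen by Node, to correlate with the ENOMEM spawn failures above.
const os = require('os')

const gib = (bytes) => (bytes / 1024 ** 3).toFixed(2)
console.log(`total memory: ${gib(os.totalmem())} GiB`)
console.log(`free memory:  ${gib(os.freemem())} GiB`)
console.log(`load average: ${os.loadavg().map((n) => n.toFixed(2)).join(', ')}`)
```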
This issue is getting unwieldy; please open new incident issues if AIX problems arise.
Is it possible to add a new AIX server for the CI? Currently we have a queue of CI runs waiting to finish that are all only waiting on AIX.