
Add more aix machines #1623

Closed
BridgeAR opened this issue Dec 5, 2018 · 37 comments

@BridgeAR
Member

BridgeAR commented Dec 5, 2018

Is it possible to add a new AIX server for the CI? Currently we have a queue of CI runs waiting to finish that are all only waiting on AIX.

@Trott
Member

Trott commented Dec 5, 2018

Co-sign.

[screenshot of the backed-up CI queue, 2018-12-05]

@sam-github
Contributor

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

@Trott
Member

Trott commented Dec 5, 2018

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

I think AIX becomes a bottleneck when someone runs CITGM because that ties up one of (I think) only two hosts for hours. Tying up one Windows host for hours for a CITGM run is no big deal because we have so many Windows hosts. And we don't run CITGM on Raspberry Pi at all, so that's not an issue either.

So yeah, under ordinary conditions, Windows and Raspberry Pi tend to be the bottlenecks, but not too bad. But, I think, once a CITGM job or two get kicked off, AIX ends up taking literally hours.

@gdams
Member

gdams commented Dec 6, 2018

@BridgeAR I am currently in the process of donating a much more powerful AIX machine to the community, which will allow us to run many more build machines on it and help with the CI backlog issues.

@Trott
Member

Trott commented Dec 12, 2018

@gdams Is this worth putting on the Build WG agenda, just so that if this stalls out or there are obstacles or problems to discuss, it's already there for next week's meeting?

@Trott
Member

Trott commented Dec 14, 2018

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

I think AIX becomes a bottleneck when someone runs CITGM because that ties up one of (I think) only two hosts for hours

I don't know what's changed in the last month or so, but AIX is definitely the big bottleneck now, much more so than Windows and Raspberry Pi. In the past, IIRC, this only happened during CITGM runs, but it's constant now. Part of it might be that AIX is so susceptible to whatever is causing the recent rash of failing tests that it needs to be re-run a lot, causing it to fall further behind in the queue than platforms that don't need to be re-run during a Resume Build.

Nothing unusual is going on in terms of building right now (it's a typical quiet-ish Friday), but the CI queue is totally backed up, and it's entirely due to waiting for AIX hosts to become available for work.

@Trott
Member

Trott commented Dec 14, 2018

I'll also add that the tests used to run really fast on the AIX hosts. The build/compile took a long time, but once the tests were going, it was impressive. Not so much in CI anymore. Now everything is slow on AIX. I don't know if we swapped in hosts with less memory/CPU or something, but it sure seems like something significant changed.

@Trott
Member

Trott commented Dec 14, 2018

All that complaining... er, I mean, providing information... above aside, I do believe one or two additional hosts would resolve the issue entirely.

@mhdawson
Member

@gdams in parallel with getting the new machine to OSU can you also talk to David to see if anything with respect to the configuration changed? We might also want to double check that the ramdisk is still in place and working.
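
For reference, a quick way to check whether the build directory is still ramdisk-backed would be something like the following (the paths are assumptions; the CI layout may differ):

mount | grep -i ramdisk        # is a /dev/ramdiskN device mounted at all?
ls -l /dev | grep ramdisk      # do the ramdisk devices still exist?
df -g /home/iojs/build         # which device/filesystem backs the build directory?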

@gireeshpunathil
Member

While I have access to one of the CI machines, I see that it has 16 CPUs, so I'm wondering why we are running gmake single-threaded?

@rvagg
Member

rvagg commented Dec 19, 2018

@gireeshpunathil where are you seeing it run single-threaded?

@gireeshpunathil
Member

Say, for example, you take the current run:
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/19870/consoleFull
and grep for "01:30:13 gmake -C out BUILDTYPE=Release V=1" - I don't see a -j flag on it?

@rvagg
Member

rvagg commented Dec 19, 2018

It's supplied to the parent make call, and it's supposed to be coordinated across all child invocations as well, to make sure the total isn't more than N processes for -j N.
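
As a rough illustration of how that is supposed to work (this is not the actual Node.js Makefile, just a sketch of GNU make's jobserver behaviour):

gmake run-ci -j 5 JOBS=5    # the single top-level -j (see the ps output further down the thread)
# Inside the Makefile, a recipe that recurses via $(MAKE), e.g.
#     $(MAKE) -C out BUILDTYPE=Release V=1
# inherits the parent's jobserver, so the total stays at 5 jobs even without
# an explicit -j on the child. A recipe that hard-codes 'gmake' instead of
# $(MAKE) would not share the jobserver and would effectively compile serially.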

I was watching a job live and noted that git clean -fdx took an unexpectedly long time; could there be disk problems with these machines?

@gireeshpunathil
Member

OK, without knowing the full details of the CI script, I can state that gmake -C out BUILDTYPE=Release V=1 is slower than gmake -C out BUILDTYPE=Release V=1 -jN. But I see your point: if there are other tasks that need to run in parallel to gmake, the N covers those as well (although I don't know what those tasks are).

Slowness on git clean -fdx was noticed earlier, but that was during a CITGM run which had installed a large number of modules and was taking a lot of time.

I did some studies of CI runtimes on different platforms, and my inference is that AIX runs are SLOW (50 minutes on average), with the slowness distributed across the whole run; there's no throttling at any particular point.

system state does not show anything significant:

Topas Monitor for host:    power8-nodejs2       EVENTS/QUEUES    FILE/TTY                                
Wed Dec 19 01:40:08 2018   Interval:  2         Cswitch     906  Readch   277.5K                         
                                                Syscall   21286  Writech   19627                         
CPU  User%  Kern%  Wait%  Idle%  Physc   Entc   Reads       218  Rawin         0                         
ALL   13.9   44.1    0.0   42.0   1.32  132.4   Writes       58  Ttyout     1067                         
                                                Forks         3  Igets         4                         
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs         4  Namei      1360                         
Total    11.6     33.7    18.1     4.1     7.5  Runqueue    4.1  Dirblk        0                         
                                                Waitqueue   0.0                                          
Disk    Busy%     KBPS     TPS KB-Read KB-Writ                   MEMORY                                  
Total     1.3     26.1     6.0     0.0    26.1  PAGING           Real,MB   32768                         
                                                Faults    22513  % Comp     38                           
FileSystem        KBPS     TPS KB-Read KB-Writ  Steals        0  % Noncomp  38                           
Total            264.6   139.0  264.3    0.2    PgspIn        0  % Client   38                           
                                                PgspOut       0                                          
Name            PID  CPU%  PgSp Owner           PageIn        0  PAGING SPACE                            
python2.   45809776   2.3   0.3 iojs            PageOut       0  Size,MB     512                         
python2.   40042666   0.6   8.4 iojs            Sios          0  % Used      7                           
python2.   34996348   0.3   0.3 iojs                             % Free     93                           
sshd       39977210   0.1   0.5 root            NFS (calls/sec)                                          
java       38994050   0.1  51.6 iojs            SerV2         0  WPAR Activ     0                        
topas      12320868   0.0   2.1 iojs            CliV2         0  WPAR Total     0                        
sched        196614   0.0   0.4 root            SerV3         0  Press: "h"-help                         
syncd        720986   0.0   0.6 root            CliV3         0         "q"-quit                         
j2pg        2818138   0.0   8.1 root            SerV4         0                                          
getty       6029500   0.0   0.7 root            CliV4         0                       

iojs 30343316 58785818 0 00:55:52 - 0:00 gmake run-ci -j 5 JOBS=5

which translates to:
iojs 40042666 21823510 2 01:16:39 - 0:09 /usr/bin/python tools/test.py -j 5 -p tap --logfil

Looks good to me, but if the same -j5 were percolated down to gmake as well, I think it would reduce the overall build time.

@rvagg
Member

rvagg commented Dec 19, 2018

https://ci.nodejs.org/job/node-test-commit-aix/19890/nodes=aix61-ppc64/ made the -j explicitly flow down into the child process and it hasn't made a difference (in fact it's running longer than some recent builds).

Without measuring precisely, the test executions look about as slow as on Raspberry Pi 2s or 3s running off SD cards or NFS. The compiles take ~20 minutes, and I confirmed that ccache is engaging, so it's got to be slow disk, surely. Tests take the remainder of the time, 40+ minutes.
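
For reference, one way to confirm that ccache is engaging, assuming shell access to the host:

ccache -z          # zero the hit/miss statistics
gmake -j 5         # run a build (illustrative invocation)
ccache -s          # afterwards the "cache hit" counters should be non-zero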

@gireeshpunathil
Member

@rvagg - I see your point, and agree - parallelizing compilation did not seem to have much effect.
I will see if I can make some concrete observations on disk access latency.

@gireeshpunathil
Member

@rvagg - I ran the program below on three boxes, at the root of the node source tree, and it confirmed your disk latency theory conclusively:

const fs = require('fs')

// Walk the tree from `entry`: read every directory and write a small
// ".foo" file next to every regular file, to exercise the kind of
// metadata reads (readdir/realpath) and small writes a build generates.
function run(entry) {
  const ret = fs.readdirSync(entry, { withFileTypes: true })
  ret.forEach((item) => {
    if (item.name[0] !== '.') {       // skip dot entries such as .git
      try {
        const path = fs.realpathSync(entry) + '/' + item.name
        if (item.isDirectory())
          run(path)
        else
          fs.writeFileSync(path + '.foo', 'deadbeef')
      } catch (e) { console.log(e) }
    }
  })
}

run('.')
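
The numbers below were presumably gathered along these lines (the script name walk.js is hypothetical; the exact invocation isn't shown):

time node walk.js             # real / user / sys comparison
strace -c node walk.js        # per-syscall summary on Linux
truss -c node walk.js         # the AIX equivalent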

local Linux:

real	0m42.020s
user	0m2.712s
sys	0m6.431s

top 5 consumers from strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.77    5.209627           5    964115           lstat
 31.70    3.190098          43     74858        26 open
  8.07    0.812078          10     78687           close
  5.57    0.560190           7     74820           pwrite64
  1.43    0.143754          19      7718           getdents

local AIX:

real    1m18.075s
user    0m1.648s
sys     0m5.521s

top 5 consumers from truss:

syscall       seconds   calls  errors
kopen           3.83   75368
statx           2.43  917074      1
close           2.23   75377
thread_setmy     .15   75546
kfcntl           .11   79285      6

AIX CI:

real    10m52.996s
user    0m2.290s
sys     0m4.096s

top 5 consumers from truss:

syscall        seconds   calls  errors
kopen           8.26   78515
kpwrite          .02   74661
statx            .15  857901
close            .00   78525
kfcntl           .00   82429      8

So evidently disk is slow.

@mhdawson
Member

mhdawson commented Dec 20, 2018

We have known that disk I/O was sub-optimal, but I wonder what might have changed recently. @gdams is working on adding a 3rd machine and then on getting a whole new box, which has a lot more resources, to OSU. This is the current list of things @gdams is working through. If there are other things you think he should check, please suggest them:

  • Increase monitoring
  • Validate that we are running with 5 threads (for all jobs)
  • Check that ccache is working (sounds like Rod may have already confirmed that)
  • Add a 3rd CI machine
  • Check with David at OSU to see if there have been any config changes recently
  • Enable ram disks? (a rough sketch of the setup follows this list)
  • Investigate aix consistently failing / stalling with CITGM (#1625) - the regular failure of CITGM on AIX
  • New machine (hopefully in place by Feb)
  • Install gdb
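
For the "enable ram disks?" item, the usual AIX recipe looks roughly like the sketch below. The size, device number and mount point are placeholders, and the exact mkfs/mount options should be verified against the AIX documentation for the level the hosts run:

mkramdisk 16G                                        # prints the device created, e.g. /dev/rramdisk0
mkfs -V jfs2 /dev/ramdisk0                           # put a jfs2 filesystem on it
mkdir -p /ramdisk0
mount -V jfs2 -o log=NULL /dev/ramdisk0 /ramdisk0    # log=NULL avoids needing a jfs2 log device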

@gireeshpunathil
Member

Install gdb. Folks are more comfortable with it than the built-in dbx, plus gdb is fully ported to AIX.

@mhdawson
Member

mhdawson commented Jan 2, 2019

@gireeshpunathil thanks for the suggestion of adding gdb, added to the list above.

@gdams
Member

gdams commented Feb 5, 2019

test-osuosl-aix61-ppc64_be-3 added to the farm

@refack
Contributor

refack commented Feb 5, 2019

I think this can be closed now.
@gdams thanks!

@refack closed this as completed Feb 5, 2019

@sam-github
Contributor

@BridgeAR @Trott @mhdawson @nodejs/testing @nodejs/build

We are trying to get resources to add more modern AIX to the CI, but are hitting blockers because we have so many AIX machines already. The release machines are mostly unused, but are wanted for privilege separation, so I'll not mention them again.

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ however, which was added here, has been disabled since October 11th, and no one has complained, so here I am asking: has anyone noticed problems with AIX build queues?

If I could recycle -3 as a 7.2 host, it would give us the two 7.2 test machines. We'd only lack a release machine.

@Trott
Member

Trott commented Oct 30, 2019

I haven't noticed AIX being a bottleneck anymore.

@BridgeAR
Member Author

It became the bottleneck again. I guess it's reproducible by starting many CIs all at the same time. That way a queue builds up, and all the other machines finish significantly faster than AIX.

New CIs therefore have to wait for AIX to finish.

Example: https://ci.nodejs.org/job/node-test-commit-aix/26551/
Started 4 hr 47 min ago

@sam-github
Contributor

I don't think it's simply lots of CI. Looking at the recent build history, builds seem to finish quickly, and then there is a recent one that just doesn't finish.

The test -1 machine is almost unresponsive when logged in over ssh; there is something wrong with it.

@sam-github
Contributor

Blocked on disk I/O, but I'm not sure why; we are supposed to be building in a ramdisk. Or maybe it's ccache - I'll check.

Disk    Busy%     KBPS     TPS KB-Read KB-Writ
hdisk1   99.0     4.0      1.0    0.0     4.0
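
A hypothetical next step to narrow down where the I/O is going (device name taken from the topas output above):

lspv -l hdisk1            # which logical volumes / filesystems live on that disk
iostat -d hdisk1 5        # per-disk throughput every 5 seconds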

@sam-github
Contributor

I'm not sure, but it seems suspicious to me that we are building in a ramdisk because the filesystem is so slow, yet the .ccache lives on the regular disk, outside the ramdisk.
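
A quick way to check that suspicion (paths assumed from the comments above):

ls -ld /home/iojs/.ccache      # plain directory, or a symlink into the ramdisk?
df -g /home/iojs/.ccache       # which filesystem backs the cache...
df -g /home/iojs/build         # ...versus the (ramdisk-backed) build directory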

@cjihrig
Contributor

cjihrig commented Nov 20, 2019

Can the problematic machine be rebooted or temporarily taken out of the rotation? https://ci.nodejs.org/job/node-test-pull-request/ is extremely backed up at the moment, as are other jobs.

@sam-github
Contributor

I brought https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ back online; it's already got a job.

I am hesitant to take the other 2 offline, because both of them are having trouble. That they should go slow at the same time is one reason I'm staring at the ccache setup and wondering if the 12.x release builds are different enough to push ccache over the edge.

Now that -3 is building, it takes me 3-4 seconds to ssh into it... it used to be instantaneous.

I'd like to move the ccache onto the RAMFS, but doing that is slow and fights for I/O with the backed-up jobs, and if I don't move it and just symlink it, I'll invalidate the entire cache, which won't help. I might have to make it worse to see if moving ccache makes it better.

@sam-github
Contributor

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3 has ~/.ccache symlinked to /home/iojs/build/.ccache. It's been offline for a month, so it doesn't have much cache anyhow. I just zeroed the stats, and everything is currently a cache miss.

On -1, I'm rsyncing /home/iojs/.ccache to /home/iojs/build/.ccache. Once it's done, I'll symlink the original to the new one.

I wonder if the 5G default cache size is too small; my local cache is at 100G of a 120G limit.
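
A sketch of that move (the commands actually run may have differed; the 20G size is just an example):

rsync -a /home/iojs/.ccache/ /home/iojs/build/.ccache/   # copy the existing cache onto the ramdisk-backed path
mv /home/iojs/.ccache /home/iojs/.ccache.old             # keep the original until the new setup is proven
ln -s /home/iojs/build/.ccache /home/iojs/.ccache        # the default lookup path keeps working
ccache -M 20G                                            # raise the 5G default maximum cache size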

@sam-github
Contributor

My grasp of Jenkins is shakier than I thought.

https://ci.nodejs.org/job/node-test-commit-aix/26570/

https://ci.nodejs.org/job/node-test-commit-aix/26569/

Two builds, but those aren't the actual builds on the executors; for those I go down to Configurations and click "default", and both lead to the same build.

Does that make sense?

@richardlau
Member

The links to default in https://ci.nodejs.org/job/node-test-commit-aix/26570/ and https://ci.nodejs.org/job/node-test-commit-aix/26569/ are faded out because they haven't actually been scheduled to run yet. For example, if you look at https://ci.nodejs.org/job/node-test-commit-aix/26570/console it says

20:55:56 Configuration node-test-commit-aix » aix61-ppc64 is still in the queue: Waiting for next available executor on ‘aix61-ppc64’

The faded link points to the most recent default that is running, i.e. the one from https://ci.nodejs.org/job/node-test-commit-aix/26567/. I agree that it's not easy to tell a faded link from a non-faded one, and this is kind of confusing.

If it's any easier to visualize:
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/
[screenshot of the aix61-ppc64 build history]

@sam-github
Contributor

sam-github commented Nov 20, 2019

-3 seems to be building well. I'm not seeing why disk is at 100% on -1:

topas
...
Disk    Busy%     KBPS     TPS KB-Read KB-Writ
hdisk0  100.0     6.0      1.0    0.0     6.0 
hdisk1   59.0     2.0      0.0    0.0     2.0

I don't understand why builds are hitting disk.

Both -1 and -3 now have ~iojs/.ccache symlinked to ~iojs/build/.ccache.

-3 is building well, which is good (it would be better if I knew why).

We'll see if -1 starts building well; if so, it might have been the ccache, and if not, more research is required.

I didn't touch -2; it's the control group.

@sam-github
Contributor

Well, the backlog is gone, or rather moved; now it's over on Arm: https://ci.nodejs.org/job/node-test-commit-arm

It's too early to tell much - no one is building anything ATM - so I kicked off 3 builds; they spread across the three test machines, and I'll watch their build times.

@richardlau
Member

test-osuosl-aix61-ppc64_be-2 is now complaining about not enough memory:

e.g. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/26672/console

14:19:29 Caused by: java.io.IOException: error=12, There is not enough memory available now.
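
Standard AIX commands to triage that, assuming shell access to test-osuosl-aix61-ppc64_be-2:

lsps -a        # paging space size and percentage used
svmon -G       # system-wide real and virtual memory usage (in 4 KB frames)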

@sam-github
Contributor

sam-github commented Nov 25, 2019

This issue is getting unwieldily long; please open new incident issues if AIX problems arise.
