
Add more aix machines #1623

Closed
BridgeAR opened this issue Dec 5, 2018 · 37 comments

@BridgeAR
Member

BridgeAR commented Dec 5, 2018

Is it possible to add a new AIX server for the CI? Currently we have a queue of CI runs waiting to finish that are all only waiting on AIX.

@Trott
Member

Trott commented Dec 5, 2018

Co-sign.

[screenshot of the backed-up CI queue, 2018-12-05]

@sam-github
Contributor

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

@Trott
Member

Trott commented Dec 5, 2018

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

I think AIX becomes a bottleneck when someone runs CITGM because that ties up one of (I think) only two hosts for hours. Tying up one Windows host for hours for a CITGM run is no big deal because we have so many Windows hosts. And we don't run CITGM on Raspberry Pi at all, so that's not an issue either.

So yeah, under ordinary conditions, Windows and Raspberry Pi tend to be the bottlenecks, but not too bad. But, I think, once a CITGM job or two get kicked off, AIX ends up taking literally hours.

@gdams
Member

gdams commented Dec 6, 2018

@BridgeAR I am currently in the process of donating a much more powerful AIX machine to the community, which will allow us to run many more build machines on it and help with the CI backlog issues.

@Trott
Member

Trott commented Dec 12, 2018

@gdams Is this worth putting on the Build WG agenda, just so that if this stalls out or there are obstacles or problems to discuss, it's already there for next week's meeting?

@Trott
Member

Trott commented Dec 14, 2018

Mine always seem to be blocked on Arm and Windows. Did something change recently? Not that getting more AIX machines wouldn't be great if that's the new bottleneck. Anything that speeds up the CI is fantastic; it's pretty painful ATM.

I think AIX becomes a bottleneck when someone runs CITGM because that ties up one of (I think) only two hosts for hours

I don't know what's changed in the last month or so, but AIX is definitely the big bottleneck now, much more so than Windows and Raspberry Pi. In the past, IIRC, this only happened during CITGM runs, but it's constant now. Part of it might be that AIX is so susceptible to whatever is causing the recent rash of failing tests that it needs to be re-run a lot, causing it to fall further behind in the queue than platforms that don't need to be re-run during a Resume Build.

Nothing unusual is going on in terms of building right now (it's a typical quiet-ish Friday), but the CI queue is totally backed up, and it's entirely due to waiting for AIX hosts to become available for work.

@Trott
Member

Trott commented Dec 14, 2018

I'll also add that the tests used to run really fast on the AIX hosts. The build/compile took a long time, but once the tests were going, it was impressive. Not so much in CI anymore. Now everything is slow on AIX. I don't know if we swapped in hosts with less memory/CPU or something, but it sure seems like something significant changed.

@Trott
Member

Trott commented Dec 14, 2018

All that complaining... er, I mean, providing information... above aside, I do believe one or two additional hosts would resolve the issue entirely.

@mhdawson
Member

@gdams in parallel with getting the new machine to OSU can you also talk to David to see if anything with respect to the configuration changed? We might also want to double check that the ramdisk is still in place and working.
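
For reference, a quick way to check whether the build directory is still ramdisk-backed would be something like the following (the paths are assumptions; the CI layout may differ):

mount | grep -i ramdisk        # is a /dev/ramdiskN device mounted at all?
ls -l /dev | grep ramdisk      # do the ramdisk devices still exist?
df -g /home/iojs/build         # which device/filesystem backs the build directory?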

@gireeshpunathil
Member

While I have access to one of the CI machines, I see that it has 16 CPUs, so I'm wondering why we are running gmake single-threaded?

@rvagg
Member

rvagg commented Dec 19, 2018

@gireeshpunathil where are you seeing it run single-threaded?

@gireeshpunathil
Member

Say, for example, you take the current run:
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/19870/consoleFull
and grep for "01:30:13 gmake -C out BUILDTYPE=Release V=1" - I don't see a -j flag on it?

@rvagg
Member

rvagg commented Dec 19, 2018

It's supplied to the parent make call, and it's supposed to be coordinated across all child invocations as well, to make sure the total isn't more than N processes for -j N.
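
As a rough illustration of how that is supposed to work (this is not the actual Node.js Makefile, just a sketch of GNU make's jobserver behaviour):

gmake run-ci -j 5 JOBS=5    # the single top-level -j (see the ps output further down the thread)
# Inside the Makefile, a recipe that recurses via $(MAKE), e.g.
#     $(MAKE) -C out BUILDTYPE=Release V=1
# inherits the parent's jobserver, so the total stays at 5 jobs even without
# an explicit -j on the child. A recipe that hard-codes 'gmake' instead of
# $(MAKE) would not share the jobserver and would effectively compile serially.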

I was watching a job live and noted that git clean -fdx took an unexpectedly long time; could there be disk problems with these machines?

@gireeshpunathil
Member

OK, without knowing the full details of the CI script, I can state that gmake -C out BUILDTYPE=Release V=1 is slower than gmake -C out BUILDTYPE=Release V=1 -jN. But I see your point: if there are other tasks that need to run in parallel to gmake, the N covers those as well (although I don't know what those tasks are).

Slowness on git clean -fdx was noticed earlier, but that was during a CITGM run which had installed a large number of modules and was taking a lot of time.

I did some studies of CI runtimes on different platforms, and my inference is that AIX runs are SLOW (50 minutes on average), with the slowness distributed across the whole run; there's no throttling at any particular point.

system state does not show anything significant:

Topas Monitor for host:    power8-nodejs2       EVENTS/QUEUES    FILE/TTY                                
Wed Dec 19 01:40:08 2018   Interval:  2         Cswitch     906  Readch   277.5K                         
                                                Syscall   21286  Writech   19627                         
CPU  User%  Kern%  Wait%  Idle%  Physc   Entc   Reads       218  Rawin         0                         
ALL   13.9   44.1    0.0   42.0   1.32  132.4   Writes       58  Ttyout     1067                         
                                                Forks         3  Igets         4                         
Network  KBPS   I-Pack  O-Pack   KB-In  KB-Out  Execs         4  Namei      1360                         
Total    11.6     33.7    18.1     4.1     7.5  Runqueue    4.1  Dirblk        0                         
                                                Waitqueue   0.0                                          
Disk    Busy%     KBPS     TPS KB-Read KB-Writ                   MEMORY                                  
Total     1.3     26.1     6.0     0.0    26.1  PAGING           Real,MB   32768                         
                                                Faults    22513  % Comp     38                           
FileSystem        KBPS     TPS KB-Read KB-Writ  Steals        0  % Noncomp  38                           
Total            264.6   139.0  264.3    0.2    PgspIn        0  % Client   38                           
                                                PgspOut       0                                          
Name            PID  CPU%  PgSp Owner           PageIn        0  PAGING SPACE                            
python2.   45809776   2.3   0.3 iojs            PageOut       0  Size,MB     512                         
python2.   40042666   0.6   8.4 iojs            Sios          0  % Used      7                           
python2.   34996348   0.3   0.3 iojs                             % Free     93                           
sshd       39977210   0.1   0.5 root            NFS (calls/sec)                                          
java       38994050   0.1  51.6 iojs            SerV2         0  WPAR Activ     0                        
topas      12320868   0.0   2.1 iojs            CliV2         0  WPAR Total     0                        
sched        196614   0.0   0.4 root            SerV3         0  Press: "h"-help                         
syncd        720986   0.0   0.6 root            CliV3         0         "q"-quit                         
j2pg        2818138   0.0   8.1 root            SerV4         0                                          
getty       6029500   0.0   0.7 root            CliV4         0                       

iojs 30343316 58785818 0 00:55:52 - 0:00 gmake run-ci -j 5 JOBS=5

which translates to:
iojs 40042666 21823510 2 01:16:39 - 0:09 /usr/bin/python tools/test.py -j 5 -p tap --logfil

Looks good to me, but if the same -j5 were percolated down to gmake as well, I think it would reduce the overall build time.

@rvagg
Member

rvagg commented Dec 19, 2018

https://ci.nodejs.org/job/node-test-commit-aix/19890/nodes=aix61-ppc64/ made the -j explicitly flow down into the child process and it hasn't made a difference (in fact it's running longer than some recent builds).

Without measuring precisely, the test executions look about as slow as on Raspberry Pi 2s or 3s running off SD cards or NFS. The compiles take ~20 minutes, and I confirmed that ccache is engaging, so it's got to be slow disk, surely. Tests take the remainder of the time, 40+ minutes.
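
For reference, one way to confirm that ccache is engaging, assuming shell access to the host:

ccache -z          # zero the hit/miss statistics
gmake -j 5         # run a build (illustrative invocation)
ccache -s          # afterwards the "cache hit" counters should be non-zero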

@gireeshpunathil
Member

@rvagg - I see your point, and agree - parallelizing compilation did not seem to have much effect.
I will see if I can make some concrete observations on disk access latency.

@gireeshpunathil
Member

@rvagg - I ran the program below on three boxes, at the root of the node source tree, and it confirmed your disk latency theory conclusively:

const fs = require('fs')

// Walk the tree from `entry`: read every directory and write a small
// ".foo" file next to every regular file, to exercise the kind of
// metadata reads (readdir/realpath) and small writes a build generates.
function run(entry) {
  const ret = fs.readdirSync(entry, { withFileTypes: true })
  ret.forEach((item) => {
    if (item.name[0] !== '.') {       // skip dot entries such as .git
      try {
        const path = fs.realpathSync(entry) + '/' + item.name
        if (item.isDirectory())
          run(path)
        else
          fs.writeFileSync(path + '.foo', 'deadbeef')
      } catch (e) { console.log(e) }
    }
  })
}

run('.')
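
The numbers below were presumably gathered along these lines (the script name walk.js is hypothetical; the exact invocation isn't shown):

time node walk.js             # real / user / sys comparison
strace -c node walk.js        # per-syscall summary on Linux
truss -c node walk.js         # the AIX equivalent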

local Linux:

real	0m42.020s
user	0m2.712s
sys	0m6.431s

top 5 consumers from strace

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.77    5.209627           5    964115           lstat
 31.70    3.190098          43     74858        26 open
  8.07    0.812078          10     78687           close
  5.57    0.560190           7     74820           pwrite64
  1.43    0.143754          19      7718           getdents

local AIX:

real    1m18.075s
user    0m1.648s
sys     0m5.521s

top 5 consumers from truss:

syscall       seconds   calls  errors
kopen           3.83   75368
statx           2.43  917074      1
close           2.23   75377
thread_setmy     .15   75546
kfcntl           .11   79285      6

AIX CI:

real    10m52.996s
user    0m2.290s
sys     0m4.096s

top 5 consumers from truss:

syscall        seconds   calls  errors
kopen           8.26   78515
kpwrite          .02   74661
statx            .15  857901
close            .00   78525
kfcntl           .00   82429      8

So evidently disk is slow.

@mhdawson
Member

mhdawson commented Dec 20, 2018

We have known that disk I/O was sub-optimal, but I wonder what might have changed recently. @gdams is working on adding a 3rd machine and then on getting a whole new box, which has a lot more resources, to OSU. This is the current list of things @gdams is working through. If there are other things you think he should check, please suggest them:

  • Increase monitoring
  • Validate that we are running with 5 threads (for all jobs)
  • Check that ccache is working (sounds like Rod may have already confirmed that)
  • Add a 3rd CI machine
  • Check with David at OSU to see if there have been any config changes recently
  • Enable ram disks? (a rough sketch of the setup follows this list)
  • Investigate aix consistently failing / stalling with CITGM (#1625) - the regular failure of CITGM on AIX
  • New machine (hopefully in place by Feb)
  • Install gdb
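
For the "enable ram disks?" item, the usual AIX recipe looks roughly like the sketch below. The size, device number and mount point are placeholders, and the exact mkfs/mount options should be verified against the AIX documentation for the level the hosts run:

mkramdisk 16G                                        # prints the device created, e.g. /dev/rramdisk0
mkfs -V jfs2 /dev/ramdisk0                           # put a jfs2 filesystem on it
mkdir -p /ramdisk0
mount -V jfs2 -o log=NULL /dev/ramdisk0 /ramdisk0    # log=NULL avoids needing a jfs2 log device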

@gireeshpunathil
Member

Install gdb. Folks are more comfortable with it than the built-in dbx, plus gdb is fully ported to AIX.

@mhdawson
Member

mhdawson commented Jan 2, 2019

@gireeshpunathil thanks for the suggestion of adding gdb, added to the list above.

@gdams
Member

gdams commented Feb 5, 2019

test-osuosl-aix61-ppc64_be-3 added to the farm

@refack
Contributor

refack commented Feb 5, 2019

I think this can be closed now.
@gdams thanks!

@refack closed this as completed Feb 5, 2019

@sam-github
Contributor

@BridgeAR @Trott @mhdawson @nodejs/testing @nodejs/build

We are trying to get resources to add more modern AIX to the CI, but are hitting blockers because we have so many AIX machines already. The release machines are mostly unused, but are wanted for privilege separation, so I'll not mention them again.

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ however, which was added here, has been disabled since October 11th, and no one has complained, so here I am asking: has anyone noticed problems with AIX build queues?

If I could recycle -3 as a 7.2 host, it would give us the two 7.2 test machines. We'd only lack a release machine.

@Trott
Member

Trott commented Oct 30, 2019

I haven't noticed AIX being a bottleneck anymore.

@BridgeAR
Member Author

It became the bottleneck again. I guess it's reproducible by starting many CIs all at the same time. That way a queue builds up, and all the other machines finish significantly faster than AIX.

New CIs therefore have to wait for AIX to finish.

Example: https://ci.nodejs.org/job/node-test-commit-aix/26551/
Started 4 hr 47 min ago

@sam-github
Contributor

I don't think it's simply lots of CI. Looking at the recent build history, builds seem to finish quickly, and then there is a recent one that just doesn't finish.

The test -1 machine is almost unresponsive when logged in over ssh; there is something wrong with it.

@sam-github
Contributor

Blocked on disk I/O, but I'm not sure why; we are supposed to be building in a ramdisk. Or maybe it's ccache - I'll check.

Disk    Busy%     KBPS     TPS KB-Read KB-Writ
hdisk1   99.0     4.0      1.0    0.0     4.0
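
A hypothetical next step to narrow down where the I/O is going (device name taken from the topas output above):

lspv -l hdisk1            # which logical volumes / filesystems live on that disk
iostat -d hdisk1 5        # per-disk throughput every 5 seconds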

@sam-github
Contributor

I'm not sure, but it seems suspicious to me that we are building in a ramdisk because the filesystem is so slow, yet the .ccache lives on the regular disk, outside the ramdisk.
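
A quick way to check that suspicion (paths assumed from the comments above):

ls -ld /home/iojs/.ccache      # plain directory, or a symlink into the ramdisk?
df -g /home/iojs/.ccache       # which filesystem backs the cache...
df -g /home/iojs/build         # ...versus the (ramdisk-backed) build directory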

@cjihrig
Contributor

cjihrig commented Nov 20, 2019

Can the problematic machine be rebooted or temporarily taken out of the rotation? https://ci.nodejs.org/job/node-test-pull-request/ is extremely backed up at the moment, as are other jobs.

@sam-github
Contributor

I brought https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3/ back online; it's already got a job.

I am hesitant to take the other 2 offline, because both of them are having trouble. That they should go slow at the same time is one reason I'm staring at the ccache setup and wondering if the 12.x release builds are different enough to push ccache over the edge.

Now that -3 is building, it takes me 3-4 seconds to ssh into it... it used to be instantaneous.

I'd like to move the ccache onto the RAMFS, but doing that is slow and fights for I/O with the backed-up jobs, and if I don't move it and just symlink it, I'll invalidate the entire cache, which won't help. I might have to make it worse to see if moving ccache makes it better.

@sam-github
Contributor

https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-3 has ~/.ccache symlinked to /home/iojs/build/.ccache. It's been offline for a month, so it doesn't have much cache anyhow. I just zeroed the stats, and everything is currently a cache miss.

On -1, I'm rsyncing /home/iojs/.ccache to /home/iojs/build/.ccache. Once it's done, I'll symlink the original to the new one.

I wonder if the 5G default cache size is too small; my local cache is at 100G of a 120G limit.
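
A sketch of that move (the commands actually run may have differed; the 20G size is just an example):

rsync -a /home/iojs/.ccache/ /home/iojs/build/.ccache/   # copy the existing cache onto the ramdisk-backed path
mv /home/iojs/.ccache /home/iojs/.ccache.old             # keep the original until the new setup is proven
ln -s /home/iojs/build/.ccache /home/iojs/.ccache        # the default lookup path keeps working
ccache -M 20G                                            # raise the 5G default maximum cache size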

@sam-github
Contributor

My grasp of Jenkins is shakier than I thought.

https://ci.nodejs.org/job/node-test-commit-aix/26570/

https://ci.nodejs.org/job/node-test-commit-aix/26569/

Two builds, but those aren't the actual builds on the executors; for those I go down to Configurations and click "default", and both lead to the same build.

Does that make sense?

@richardlau
Member

The links to default in https://ci.nodejs.org/job/node-test-commit-aix/26570/ and https://ci.nodejs.org/job/node-test-commit-aix/26569/ are faded out because they haven't actually been scheduled to run yet. For example, if you look at https://ci.nodejs.org/job/node-test-commit-aix/26570/console it says

20:55:56 Configuration node-test-commit-aix » aix61-ppc64 is still in the queue: Waiting for next available executor on ‘aix61-ppc64’

The faded link points to the most recent default that is running, i.e. the one from https://ci.nodejs.org/job/node-test-commit-aix/26567/. I agree that it's not easy to tell a faded link from a non-faded one, and this is kind of confusing.

If it's any easier to visualize:
https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/
[screenshot of the aix61-ppc64 build history]

@sam-github
Contributor

sam-github commented Nov 20, 2019

-3 seems to be building well. I'm not seeing why disk is at 100% on -1:

topas
...
Disk    Busy%     KBPS     TPS KB-Read KB-Writ
hdisk0  100.0     6.0      1.0    0.0     6.0 
hdisk1   59.0     2.0      0.0    0.0     2.0

I don't understand why builds are hitting disk.

Both -1 and -3 now have ~iojs/.ccache symlinked to ~iojs/build/.ccache.

-3 is building well, which is good (it would be better if I knew why).

We'll see if -1 starts building well; if so, it might have been the ccache, and if not, more research is required.

I didn't touch -2; it's the control group.

@sam-github
Contributor

Well, the backlog is gone, or rather moved; now it's over on Arm: https://ci.nodejs.org/job/node-test-commit-arm

It's too early to tell much - no one is building anything ATM - so I kicked off 3 builds; they spread across the three test machines, and I'll watch their build times.

@richardlau
Member

test-osuosl-aix61-ppc64_be-2 is now complaining about not enough memory:

e.g. https://ci.nodejs.org/job/node-test-commit-aix/nodes=aix61-ppc64/26672/console

14:19:29 Caused by: java.io.IOException: error=12, There is not enough memory available now.
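
Standard AIX commands to triage that, assuming shell access to test-osuosl-aix61-ppc64_be-2:

lsps -a        # paging space size and percentage used
svmon -G       # system-wide real and virtual memory usage (in 4 KB frames)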

@sam-github
Contributor

sam-github commented Nov 25, 2019

This issue is getting unwieldily long; please open new incident issues if AIX problems arise.
