Determine NFS Disk strategy #2815

Closed
yuvipanda opened this issue Sep 30, 2021 · 5 comments
Labels
enhancement Issues around improving existing functionality

Comments

@yuvipanda
Contributor

yuvipanda commented Sep 30, 2021

We currently have four NFS disks attached to the NFS server at nfs-server-01.

| Disk | Size | Used % | Type | Hubs |
| --- | --- | --- | --- | --- |
| datahubhomes-2020-07-29 | 20T | 54% | pd-balanced | datahub, r hub, julia hub |
| data100homes-2020-08-04 | 10T | 70% | pd-balanced | data100 hub |
| ischool-2021-07-01 | 500G | 42% | pd-standard | ischool hub |
| homedirs-other-2020-07-29 | 7.5T | 69% | pd-standard | everything else |

We had a partial outage earlier because homedirs-other ran out of storage space
and needed to be sized up. This is frustrating, because there was unused space
on the other disks that we are paying for! If we come up with a good strategy
for which disks to use for which hubs, we can save money and reduce such
outages.

What are the advantages of datahub and data100 having their own disks?

  1. Insulation against outages caused by a disk filling up in another hub. The
    outage was probably caused by the eecs hub using up a lot of disk, but it
    affected all the other hubs - except datahub and data100 hub!
  2. Disk IOPS and throughput are provisioned per disk, so hubs with their own
    disks get consistent performance regardless of how much disk activity other
    hubs generate. This might let us get away with much cheaper standard or
    balanced disks instead of more expensive SSD disks.

The primary disadvantage is the lack of pooling - we have unused space that we
pay for in datahub and data100 disks, and that can't really be used by the
other hubs.

Finding a solution to this will really help us provide more stable service while
also minimizing cost.

Approaches to try

  1. Use one big volume, formatted with XFS. Use XFS project quotas to isolate
    hubs from each other (sketched below). This also lets us overprovision, so
    we can utilize space more effectively.
  2. Evaluate ZFS and its subvolume support (also sketched below). This would
    give us the other benefits of ZFS as well.
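
A rough sketch of what each approach might look like, purely for illustration: the mount point `/export`, project names/IDs, pool name, device names, and quota sizes here are all made up.

```bash
# --- Approach 1: one big XFS volume with project quotas ---

# The filesystem has to be mounted with project quotas enabled, e.g. in /etc/fstab:
#   /dev/sdX  /export  xfs  defaults,prjquota  0 0

# Map a directory tree to a project (ID and name are invented):
echo "42:/export/homedirs/datahub" >> /etc/projects
echo "datahub:42"                  >> /etc/projid

# Initialize the project and give it a hard block limit:
xfs_quota -x -c 'project -s datahub' /export
xfs_quota -x -c 'limit -p bhard=20t datahub' /export

# Per-project usage report:
xfs_quota -x -c 'report -p -h' /export

# --- Approach 2: ZFS with one dataset per hub ---

# One dataset per hub, each with its own quota; the quotas can sum to more
# than the pool size, which is the overprovisioning we want:
zpool create homes /dev/sdX
zfs create -o quota=20T homes/datahub
zfs create -o quota=10T homes/data100
```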

For both of these, we need to evaluate how real advantage (2) above actually is.
We capture IOPS metrics on our nfs-server, and we need to work out how much
performance we really need. IOPS metrics are a little hard for me to interpret,
so I don't have a clear idea of what we would lose if we just moved to one big,
maxed-out disk.
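
One way to get a feel for the numbers (a sketch, assuming the GCP disks show up as ordinary block devices on nfs-server-01 and sysstat is installed; the device names are placeholders):

```bash
# Map block devices to the attached GCP disks first:
lsblk -o NAME,SIZE,MOUNTPOINT

# Sample extended per-device stats every 60 seconds:
#   r/s + w/s     ~ IOPS the disk is actually doing
#   rMB/s + wMB/s ~ throughput
iostat -dxm sdb sdc sdd sde 60

# Compare the observed peaks against the per-disk limits GCP publishes for
# pd-standard / pd-balanced / pd-ssd at the sizes we would consider buying.
```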

We also need to understand whether a single user can hammer our NFS server so
hard that it disrupts everyone - NFS is our true single point of failure.

@yuvipanda
Contributor Author

Particularly for the EECS and genomics hubs, which are using up a lot of space right now - we could try setting an xfs_quota on them and see how it goes. I think the filesystem is already mounted with project quotas enabled, which is what we would need to use.
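
Something like this might be enough to try it out (a sketch; `/export` stands in for wherever homedirs-other is mounted, and the `eecs`/`genomics` project names and the 2 TiB limits are invented):

```bash
# Confirm project quotas are actually enabled on the mount
# (the mount options should include prjquota):
grep prjquota /proc/mounts

# Current usage per project, human readable:
xfs_quota -x -c 'report -p -h' /export

# Cap the two heavy hubs, assuming their directories are already mapped
# to projects in /etc/projects and /etc/projid:
xfs_quota -x -c 'limit -p bhard=2t eecs' /export
xfs_quota -x -c 'limit -p bhard=2t genomics' /export
```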

@balajialg
Contributor

@yuvipanda Thanks for this detailed write-up. Super useful!

  1. Is there any reason for not having separate NFS disks for Data 8 and EECS, considering that both are computation-intensive hubs with a lot of application data?
  2. Also, is it possible to set up alerts when the xfs_quota limit is reached on specific NFS disks?

@balajialg added the enhancement label on Sep 30, 2021
@yuvipanda
Contributor Author

@balajialg for (1), the EECS hub started out pretty small :) Whether they should get their own disk now, or whether we should consolidate everything into one, should be determined by looking at IOPS metrics as mentioned in the issue.

We can definitely have alerts for this once we have alerting mechanisms in place...
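
Until then, even a cron job on the NFS server could act as a crude alert (a sketch; the mount point, the 90% threshold, and how the output gets delivered are all placeholders, and the column positions assume the usual `xfs_quota report` layout of project / used / soft / hard in 1k blocks):

```bash
#!/usr/bin/env bash
# Print a warning line for every XFS project above 90% of its hard limit.
set -euo pipefail

THRESHOLD=90
MOUNT=/export   # placeholder

# report -p: per-project, -N: omit the header
xfs_quota -x -c 'report -p -N' "$MOUNT" | awk -v t="$THRESHOLD" '
  NF >= 4 && $4 > 0 {
    pct = $2 / $4 * 100
    if (pct >= t)
      printf "project %s is at %.0f%% of its quota\n", $1, pct
  }'
# Pipe this into mail / a Slack webhook / whatever notification channel we have.
```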

@felder
Contributor

felder commented Oct 20, 2021

One thing to note here is that Google persistent disks have a maximum capacity of 64TB.

Given that I think it's a bad idea to RAID the persistent disks together, and that we cannot shrink disks, a disk-per-hub approach with quotas imposed as necessary makes the most sense to me.

@yuvipanda removed their assignment on Nov 29, 2022
@ryanlovett
Collaborator

We are moving everything to Google Filestore for Spring '23!
