Determine NFS Disk strategy #2815

Closed
yuvipanda opened this issue Sep 30, 2021 · 5 comments
Labels
enhancement Issues around improving existing functionality

Comments

@yuvipanda
Contributor

yuvipanda commented Sep 30, 2021

We currently have four NFS disks attached to the NFS server at nfs-server-01.

| Disk | Size | Used % | Type | Hubs |
| --- | --- | --- | --- | --- |
| datahubhomes-2020-07-29 | 20T | 54% | pd-balanced | datahub, r hub, julia hub |
| data100homes-2020-08-04 | 10T | 70% | pd-balanced | data100 hub |
| ischool-2021-07-01 | 500G | 42% | pd-standard | ischool hub |
| homedirs-other-2020-07-29 | 7.5T | 69% | pd-standard | everything else |

We had a partial outage earlier because homedirs-other ran out of storage space
and needed to be sized up. This is frustrating, because there was unused space
on the other disks that we are paying for! If we come up with a good strategy
for which disks to use for which hubs, we can save money and reduce such
outages.

What are the advantages of datahub and data100 having their own disks?

  1. Insulation against outages caused by a disk filling up in another hub. The
    outage was probably caused by the eecs hub using up a lot of disk, but it
    affected all the other hubs - except datahub and data100 hub!
  2. Disk IOPS and throughput are provisioned per disk, so hubs with their own
    disks get consistent performance regardless of how much disk activity other
    hubs generate. This might let us get away with much cheaper standard or
    balanced disks instead of more expensive SSD disks.

The primary disadvantage is the lack of pooling - we have unused space that we
pay for in datahub and data100 disks, and that can't really be used by the
other hubs.

Finding a solution to this will really help us provide more stable service while
also minimizing cost.

Approaches to try

  1. Use one big volume, formatted with XFS. Use XFS project quotas to isolate
    hubs from each other (sketched below). This also lets us overprovision, so
    we can utilize space more effectively.
  2. Evaluate ZFS and its subvolume support (also sketched below). This would
    give us the other benefits of ZFS as well.
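
A rough sketch of what each approach might look like, purely for illustration: the mount point `/export`, project names/IDs, pool name, device names, and quota sizes here are all made up.

```bash
# --- Approach 1: one big XFS volume with project quotas ---

# The filesystem has to be mounted with project quotas enabled, e.g. in /etc/fstab:
#   /dev/sdX  /export  xfs  defaults,prjquota  0 0

# Map a directory tree to a project (ID and name are invented):
echo "42:/export/homedirs/datahub" >> /etc/projects
echo "datahub:42"                  >> /etc/projid

# Initialize the project and give it a hard block limit:
xfs_quota -x -c 'project -s datahub' /export
xfs_quota -x -c 'limit -p bhard=20t datahub' /export

# Per-project usage report:
xfs_quota -x -c 'report -p -h' /export

# --- Approach 2: ZFS with one dataset per hub ---

# One dataset per hub, each with its own quota; the quotas can sum to more
# than the pool size, which is the overprovisioning we want:
zpool create homes /dev/sdX
zfs create -o quota=20T homes/datahub
zfs create -o quota=10T homes/data100
```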

For both of these, we need to evaluate how real advantage (2) above actually is.
We capture IOPS metrics on our nfs-server, and we need to work out how much
performance we really need. IOPS metrics are a little hard for me to interpret,
so I don't have a clear idea of what we would lose if we just moved to one big,
maxed-out disk.
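
One way to get a feel for the numbers (a sketch, assuming the GCP disks show up as ordinary block devices on nfs-server-01 and sysstat is installed; the device names are placeholders):

```bash
# Map block devices to the attached GCP disks first:
lsblk -o NAME,SIZE,MOUNTPOINT

# Sample extended per-device stats every 60 seconds:
#   r/s + w/s     ~ IOPS the disk is actually doing
#   rMB/s + wMB/s ~ throughput
iostat -dxm sdb sdc sdd sde 60

# Compare the observed peaks against the per-disk limits GCP publishes for
# pd-standard / pd-balanced / pd-ssd at the sizes we would consider buying.
```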

We also need to understand whether a single user can hammer our NFS server so
hard that it disrupts everyone - NFS is our true single point of failure.

@yuvipanda
Contributor Author

Particularly for the EECS and genomics hubs, which are using up a lot of space right now - we could try setting an xfs_quota on them and see how it goes. I think the filesystem is already mounted with project quotas enabled, which is what we would need to use.
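
Something like this might be enough to try it out (a sketch; `/export` stands in for wherever homedirs-other is mounted, and the `eecs`/`genomics` project names and the 2 TiB limits are invented):

```bash
# Confirm project quotas are actually enabled on the mount
# (the mount options should include prjquota):
grep prjquota /proc/mounts

# Current usage per project, human readable:
xfs_quota -x -c 'report -p -h' /export

# Cap the two heavy hubs, assuming their directories are already mapped
# to projects in /etc/projects and /etc/projid:
xfs_quota -x -c 'limit -p bhard=2t eecs' /export
xfs_quota -x -c 'limit -p bhard=2t genomics' /export
```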

@balajialg
Contributor

@yuvipanda Thanks for this detailed write-up. Super useful!

  1. Is there any reason for not having separate NFS disks for Data 8 and EECS, considering that both are computation-intensive hubs with a lot of application data?
  2. Also, is it possible to set up alerts when the xfs_quota limit is reached on specific NFS disks?

@balajialg added the enhancement label on Sep 30, 2021
@yuvipanda
Contributor Author

@balajialg for (1), the EECS hub started out pretty small :) Whether they should get their own disk now, or whether we should consolidate everything into one, should be determined by looking at IOPS metrics as mentioned in the issue.

We can definitely have alerts for this once we have alerting mechanisms in place...
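
Until then, even a cron job on the NFS server could act as a crude alert (a sketch; the mount point, the 90% threshold, and how the output gets delivered are all placeholders, and the column positions assume the usual `xfs_quota report` layout of project / used / soft / hard in 1k blocks):

```bash
#!/usr/bin/env bash
# Print a warning line for every XFS project above 90% of its hard limit.
set -euo pipefail

THRESHOLD=90
MOUNT=/export   # placeholder

# report -p: per-project, -N: omit the header
xfs_quota -x -c 'report -p -N' "$MOUNT" | awk -v t="$THRESHOLD" '
  NF >= 4 && $4 > 0 {
    pct = $2 / $4 * 100
    if (pct >= t)
      printf "project %s is at %.0f%% of its quota\n", $1, pct
  }'
# Pipe this into mail / a Slack webhook / whatever notification channel we have.
```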

@felder
Contributor

felder commented Oct 20, 2021

One thing to note here is that Google persistent disks have a maximum capacity of 64TB.

Given that I think it's a bad idea to RAID the persistent disks together, and that we cannot shrink disks, a disk-per-hub approach with quotas imposed as necessary makes the most sense to me.

@yuvipanda removed their assignment on Nov 29, 2022
@ryanlovett
Collaborator

We are moving everything to Google Filestore for Spring '23!
