
Update archival policy for Datahub users #3377

Merged: 2 commits merged into berkeley-dsep-infra:staging on May 12, 2022
Conversation

@balajialg (Contributor) commented May 7, 2022

I would like to add a policy proposal: do not archive the home directories of users who exceed a certain storage threshold while having been inactive for more than 6 months. You can read more about the rationale in issue #3376.

This would be an additional change to our existing archival policy document. I also updated the retention timeline for archived data to roughly 6 months, after which the data gets deleted (based on our sprint planning discussion on 5/5).

I am open to all your comments on clarifying the language and making this a robust policy that communicates clearly to our users while also bringing cloud costs down.

@yuvipanda (Contributor) commented:

So this means there are two processes:

  1. Run a script, find users who have home dirs >100 GB, and send them an email. What should this email say? Will we delete their homes, or just not archive them until they go under 100 GB?
  2. Run our archival script, as usual.

Is that right?
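
In rough terms, step 1 might look something like the sketch below; the `/export/home` path, the `du`-based size check, and the 100 GB threshold are assumptions for illustration, not the actual script, and the email sending is left out:

```python
#!/usr/bin/env python3
"""Sketch: list users whose home directories exceed a size threshold."""
import subprocess
from pathlib import Path

HOME_ROOT = Path("/export/home")  # hypothetical location of user home dirs
THRESHOLD_GB = 100

def dir_size_gb(path: Path) -> float:
    # `du -sk` prints total usage in KiB; convert to GiB
    out = subprocess.run(["du", "-sk", str(path)],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0]) / (1024 * 1024)

oversized = []
for home in sorted(HOME_ROOT.iterdir()):
    if home.is_dir():
        size = dir_size_gb(home)
        if size > THRESHOLD_GB:
            oversized.append((home.name, size))

for user, size in oversized:
    print(f"{user}: {size:.1f} GB -> send notice email")
```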

@balajialg (Contributor, Author) commented May 9, 2022

Hey @yuvipanda, my suggestion for the email would be to stress that we will delete their data if they don't back it up within 30 days. @felder @ryanlovett any suggestions here?

If we don't archive these users' home directories, when do their files get deleted? If the files would otherwise stay around for a long time (more than 6 months), it makes sense to delete them after giving users a generous window to download them. If it is a short duration, not archiving makes sense to me. Open to all your thoughts on the best way forward.

This is the email template I used in my recent communication with users storing more than 100 GB. I specifically wanted to understand why they chose to store that much, so that I can include some pointers for staff in the onboarding email for Fall 2022!

_I am reaching out to you with a couple of requests related to the amount of storage you have in your Datahub instance. We found that you have stored more than 100 GB worth of data in your home directory, which is far higher than the storage required by most of our users.

  1. Can you tell us if there are specific reasons for storing such a large amount of data?
  2. Can you also back up your files locally in the next 30 days? We plan to clean up your home directory at the end of that 30-day window, as higher storage correlates with higher costs on our end. Note that you may not be able to retrieve those files after that window._

@yuvipanda (Contributor) commented:

Ok, so I think this is a different policy than the user storage archival policy. It sounds like the proposed policy would be:

  1. If your home directory gets over 100GB, we will email you.
  2. You have 30 days to bring it under 100GB, or we will delete all of it.

Does this sound right? I'm happy with it as a policy (with tweaks for amount of notice, and how to get exceptions), but I think it's a 'large home directories' policy distinct from our archival policy and we should write it as such.
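
For reference, a minimal sketch of how that 30-day grace window could be tracked; the `decide_actions` helper and the way the notice timestamps are persisted are hypothetical, not anything we run today:

```python
"""Sketch: apply a 30-day grace window to users already flagged as over the limit."""
import time

GRACE_DAYS = 30

def decide_actions(oversized, notices, now=None):
    """oversized: {user: size_gb} of users currently over the limit.
    notices: {user: unix timestamp of the first warning email}, persisted between runs.
    Returns (users to email for the first time, users to delete after the grace window)."""
    now = now if now is not None else time.time()
    to_email, to_delete = [], []
    for user in oversized:
        first_notice = notices.get(user)
        if first_notice is None:
            notices[user] = now          # first offence: start the 30-day clock
            to_email.append(user)
        elif now - first_notice > GRACE_DAYS * 86400:
            to_delete.append(user)       # still over the limit after 30 days
    for user in list(notices):
        if user not in oversized:
            del notices[user]            # dropped back under the limit: forget the notice
    return to_email, to_delete

# example: one new offender, one still over the limit past the grace window
notices = {"old_user": time.time() - 45 * 86400}
print(decide_actions({"old_user": 130.0, "new_user": 110.0}, notices))
```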

Can you make this into a new file and title it as a 'home directory size policy' or similar?

THANK YOU SO MUCH FOR WORKING ON THIS!

@felder (Contributor) commented May 10, 2022

@balajialg I'd like to use # of days instead of # of months as a general case. So instead of 6 months, let's specify 180 days. Instead of 12 months, let's specify 365 days.

Also there may be race conditions between discovering directories that are too large and archival. Probably not a big deal, really, but it's possible we may archive larger directories if the directories happen to grow between the last size check and the archival process.
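
One possible guard, sketched below under the assumption that the archival script can afford a `du` call per directory, is to re-check the size immediately before archiving:

```python
import subprocess
from pathlib import Path

THRESHOLD_GB = 100  # same hypothetical limit as the size scan

def size_gb(path: Path) -> float:
    out = subprocess.run(["du", "-sk", str(path)],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0]) / (1024 * 1024)

def maybe_archive(home: Path) -> None:
    # Re-check immediately before archiving: a directory that grew past the
    # limit since the last scan is skipped instead of being archived.
    if size_gb(home) > THRESHOLD_GB:
        print(f"skipping {home}: grew past {THRESHOLD_GB} GB since the last scan")
        return
    ...  # hand off to the normal archival step here
```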

@yuvipanda One thing I've noticed, which is probably not surprising, is that all the standard tier persistent disks take a really long time to do size calculations on. Like it may take over a day to run those calculations on all of the existing storage. Not a huge deal since I can probably deprioritize the process (which will make it take longer), but something to consider.
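
One common way to deprioritize such a scan (a sketch, not necessarily how the current process is run) is to wrap `du` in `nice` and `ionice` so it yields CPU and I/O to everything else:

```python
import subprocess
from pathlib import Path

def low_priority_size_kib(path: Path) -> int:
    """Size of `path` in KiB, computed at the lowest CPU and I/O priority so a
    full scan of the persistent disks does not compete with user workloads."""
    cmd = ["ionice", "-c", "3",   # idle I/O scheduling class
           "nice", "-n", "19",    # lowest CPU priority
           "du", "-sk", str(path)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return int(out.stdout.split()[0])
```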

@balajialg (Contributor, Author) commented May 11, 2022

@yuvipanda I like the idea of developing a separate policy for storage limits, and @felder, I will keep in mind your point about defining limits in terms of the number of days instead of months.

One thing I realized during my discussion with @felder is the perverse incentive created in the long run by defining the storage limit as 100 GB for every user. We may be inviting scope creep by communicating to instructors that anything less than 100 GB is a reasonable amount to store (yes, we can choose not to communicate this policy, which is a reasonable pathway).

What are our goals with regard to storage? Allow our users to run computationally intensive workflows appropriate to their needs while spending as little as possible on unused storage. Obviously, the primary action point is to remove the outliers who store enormous amounts of coursework- or research-related data without accessing it regularly. This action alone would free almost 2 TB of unused storage, which could save us roughly $2000 - $2500 per year. (We can debate whether these savings are worthwhile if the cost of the staff hours invested exceeds them.) So this seems, in theory, like a reasonable short-term action with good returns.

However, thinking about storage from a long-term perspective, 100 GB may not be a reasonable limit, since storage needs vary with the nature of the course, the datasets used, the type of use cases, etc., which makes it hard to generalize across hubs. Hypothetically, for genomics a 100 GB limit could be reasonable, while for a political science course 5 GB could be the reasonable limit. Another important problem is the edge cases. E.g., if a user's home directory is at 97 GB, does our policy apply? Do we still consider their storage over the limit and include them in the storage limit policy? Should we reduce the limit further? This seems like a Nash equilibrium problem.

One suggestion is to define a storage policy that is slightly dynamic, for both per-course hubs (Data 8, Data 100, potentially the I School hub, etc.) and generic hubs (Datahub, R hub, Julia hub, etc.). I am open to debate on whether the arbitrary multipliers defined below make sense.

  1. At the per-course hub level, which may include computationally complex use cases, we can compute the median size of all user home directories on that hub and check whether each user's storage is below 3x that median. (The multiplier could also be learned from interaction with faculty.)
  2. At the generic hub level, which is expected to consist of foundational use cases, we can require that users' home directories be less than 2x the median size of all home directories on that hub.
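
To make the comparison concrete, here is a small sketch of the median-based check; the multipliers and example sizes are placeholders:

```python
import statistics

def over_dynamic_limit(home_sizes_gb, multiplier):
    """home_sizes_gb: {user: size in GB} for one hub.
    multiplier: e.g. 3 for per-course hubs, 2 for generic hubs (per the proposal above).
    Returns the users whose home directory exceeds multiplier x the hub median."""
    limit = multiplier * statistics.median(home_sizes_gb.values())
    return {user: size for user, size in home_sizes_gb.items() if size > limit}

# example: median is 2.5 GB, so the generic-hub limit is 2 x 2.5 = 5 GB; only "e" is flagged
print(over_dynamic_limit({"a": 1.0, "b": 2.0, "c": 2.5, "d": 3.0, "e": 40.0}, multiplier=2))
```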

Let me know what you all think about the above policy suggestions. Obviously, we can debate whether we need to optimize at this level for this policy: is the effort involved in defining it worth the cost savings?

@yuvipanda (Contributor) commented:

@balajialg What do you think of just starting with a 100GB limit that we enforce, and seeing how that goes? I understand that what we want to express is not 'each student can use up to 100GB', but I worry about the added complexity of defining limits for one user based on the usage patterns of other users. This becomes messy in big hubs like datahub, as you mention, since there isn't a clear way to delineate who the 'comparable users' should be. I think simplifying policies so they are the same across hubs is a good way to start, and we can make things more complex if needed.

@balajialg (Contributor, Author) commented May 12, 2022

@yuvipanda Sounds reasonable. I had a conversation with @felder about the archival process, based on a few data points he was able to retrieve about the home directories. I will incorporate that into the storage policy PR; it might open up a new discussion on the archival process. For now, I will remove the entries related to the storage policy from this policy doc and make a new commit.

@yuvipanda (Contributor) left a review comment:

Looks good to me!

@balajialg merged commit 7ec0b26 into berkeley-dsep-infra:staging on May 12, 2022