Update archival policy for Datahub users #3377
Conversation
So this means there are two processes:
Is that right?
Hey @yuvipanda, my suggestion for the email would be to stress the point that we will delete their data if they don't back it up within the 30 days. @felder @ryanlovett, any suggestions here?

If we don't archive any users' home directories, by when do their files get deleted? If they stay around for a longer duration (greater than 6 months), it makes sense to delete the files after giving users a generous window to download them. If it is a short duration, not archiving makes sense to me. Open to all your thoughts on the best way forward.

This is the email template I used in my recent communication with users who stored files totaling more than 100 GB. I specifically wanted to understand why they chose to store more than 100 GB so that I can include some pointers for staff in the onboarding email for Fall 2022:

> I am reaching out to you with a couple of requests related to the amount of storage you have in your Datahub instance. We found that you have stored more than 100 GB worth of data in your home directory, which is far higher than the storage required by most of our users.
Ok, so I think this is a different policy than the user storage archival policy. It sounds like the proposed policy would be:
Does this sound right? I'm happy with it as a policy (with tweaks for the amount of notice, and how to get exceptions), but I think it's a 'large home directories' policy distinct from our archival policy and we should write it as such. Can you make this into a new file and title it as a 'home directory size policy' or similar? THANK YOU SO MUCH FOR WORKING ON THIS!
@balajialg I'd like to use # of days instead of # of months as a general case. So instead of 6 months, let's specify 180 days; instead of 12 months, let's specify 365 days.

Also, there may be race conditions between discovering directories that are too large and archival. Probably not a big deal, really, but it's possible we may archive larger directories if the directories happen to grow between the last size check and the archival process.

@yuvipanda One thing I've noticed, which is probably not surprising, is that all the standard-tier persistent disks take a really long time to do size calculations on. It may take over a day to run those calculations on all of the existing storage. Not a huge deal since I can probably deprioritize the process (which will make it take longer), but something to consider.
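A minimal sketch of the kind of re-check described above, assuming a plain `du`-based size measurement; `archive_home_dir`, the threshold names, and the 10% growth cutoff are all hypothetical and do not come from the actual Datahub archiver:

```python
# Hypothetical sketch: re-check a home directory's size immediately before
# archiving, so a directory that grew between the original scan and the
# archival run is noticed rather than silently archived at a larger size.
import subprocess
from pathlib import Path

SIZE_LIMIT_GB = 100    # assumed limit from the discussion above
INACTIVE_DAYS = 180    # "6 months" expressed in days, per the comment above


def dir_size_gb(path: Path) -> float:
    """Return the disk usage of `path` in GiB using `du`.

    This can be slow on standard-tier persistent disks; consider running the
    whole scan under `ionice -c3` / `nice` to deprioritize it.
    """
    out = subprocess.check_output(["du", "-s", "--block-size=1", str(path)])
    return int(out.split()[0]) / 1024 ** 3


def archive_home_dir(path: Path) -> None:
    # Placeholder for the real archival step (e.g. tar + upload to object storage).
    print(f"archiving {path}")


def maybe_archive(path: Path, size_at_scan_gb: float) -> None:
    current = dir_size_gb(path)
    if current > size_at_scan_gb * 1.1:
        # The directory grew noticeably since the last scan; log it so the
        # size report stays honest, then proceed (or skip, per policy).
        print(f"{path}: grew from {size_at_scan_gb:.1f} GB to {current:.1f} GB")
    archive_home_dir(path)
```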
@yuvipanda I liked the idea of developing a separate policy for storage limits, and @felder, I will keep in mind your point about defining limits in terms of the number of days instead of months.

One thing I realized during my discussion with @felder is the perverse incentives created in the long run by defining the storage limit as 100 GB for every user. We may be allowing scope creep by communicating to instructors that anything less than 100 GB is a reasonable amount (yes, we can choose not to communicate this policy, which is a reasonable pathway).

What are our goals with regard to storage? Allow our users to run computationally intensive workflows appropriate to their needs while minimizing spending on unused storage. Obviously, the primary action point is to remove the outliers who are storing enormous amounts of coursework-related or research-related data without accessing it regularly. Just by this action, we would reduce almost 2 TB of unused storage, which can save us approximately $2,000 - $2,500 per year (see the rough sanity check after this comment). We can debate whether these are worthwhile cost savings for the project if the cost of the staff hours invested in optimizing is greater than this. So this, in theory, seems like a reasonable short-term action with good returns.

However, thinking about storage from a long-term perspective, 100 GB may not be a reasonable limit considering that storage needs vary depending on the nature of the course, the datasets used, the types of use cases, etc., which makes it hard to generalize across hubs. Hypothetically, for genomics a storage limit of 100 GB could be reasonable, but for a political science course 5 GB could be the reasonable limit. Another important problem it creates is edge cases. E.g., if a user's home directory holds 97 GB, will our policy apply in that scenario? Do we still consider their storage over the limit and include them in the storage limit policy? Should we reduce the storage limit further? This seems like a Nash equilibrium problem.

One suggestion is to define a storage policy that is slightly dynamic for both per-course hubs (Data 8, Data 100, potentially the I School hub, etc.) and generic hubs (Datahub, R hub, Julia hub, etc.). Open to debate whether the arbitrary number defined below makes sense.
Let me know what you all think about the above policy suggestions. Obviously, we can debate whether we need to optimize at this level for this policy, and whether the effort involved in defining it is worth the cost savings.
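As a rough sanity check on the savings figure above (back-of-the-envelope arithmetic only; the ~$0.10 per GB-month price is an assumed blended rate, not a quoted figure, and should be replaced with the actual disk tier's pricing):

```python
# Back-of-the-envelope check; the price per GB-month is an assumption,
# not an actual quoted cloud-provider rate.
unused_storage_gb = 2 * 1024             # ~2 TB of rarely accessed home directories
assumed_price_per_gb_month = 0.10        # assumption: substitute the real disk tier rate
annual_cost = unused_storage_gb * assumed_price_per_gb_month * 12
print(f"~${annual_cost:,.0f} per year")  # ~= $2,458, consistent with the $2,000 - $2,500 estimate
```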
@balajialg What do you think of just starting with a 100 GB limit that we enforce, and seeing how that goes? I understand that what we want to express is not 'each student can use up to 100 GB', but I worry about the added complexity of defining limits for one user based on the usage patterns of other users. This becomes messy in big hubs like Datahub, as you mention, since there isn't a clear way to delineate who the 'comparable users' should be. I think simplifying policies so they are the same across hubs is a good way to start, and we can make things more complex if needed.
@yuvipanda Sounds reasonable. I had a conversation with @felder about the archival process based on a few data points he was able to retrieve about the home directories. I will incorporate that as part of the storage policy PR; it might open up a new discussion on the archival process. For now, I will remove the entries related to the storage policy in this policy doc and make a new commit.
Looks good to me!
I would like to add an additional policy proposal: not archiving storage for users who exceed a certain storage threshold while being inactive beyond 6 months. You can read more about the rationale in issue #3376.

This would be an additional change to our existing archival policy document. I also updated our archived data storage timeline to be around 6 months, after which the data gets deleted (based on our sprint planning discussion on 5/5).

Open to all your comments to clarify the language and make this a robust policy that communicates clearly to our users while also bringing cloud costs down.