Policy Decision regarding merging commits into Prod during Data 8 and 100 assignment deadlines #2770
Comments
I'm generally very much not a fan of 'deploy windows' or 'do not deploy at these times' rules. The core of the problem for data100 in #2677 is that the hub pod restarted due to a deploy. However, there are a lot of reasons a pod can restart! The hub pod can restart at different times for many reasons - memory limits, 5 concurrent spawn failures, node failure, etc. These should all be considered fairly 'normal', and we should engineer our services to match. This is generally true already - the data100 hub did recover by itself.

However, it is still important for deploys to be as risk-free as possible. I opened #2781 to figure out a way to avoid requiring hub pod restarts when doing deploys that change the image. That would've prevented this specific issue. Each red arrow at the bottom is also a deployment - I think we've done many of these without them causing any such issues. If we continue to see disruptions due to deployments, then we can reconsider. What I would like to do, though, is write a proper incident report.
There is a lot of literature in the SRE world about deploy windows like this. https://dev.to/quii/why-you-should-deploy-on-friday-afternoon-285h is one example.
The other issue is one of coordination. We serve a lot of users across many, many spaces - I'm sure there is always an assignment due for someone at any given time. So these policies usually end up as 'you are allowed to deploy during these times' rather than 'you are not allowed to deploy during these times'. That adds another dimension of complex coordination, and I guarantee that not every class will know what those times are.
@yuvipanda, I am completely aligned in principle with what you are arguing for. Adding too many processes will complicate the software delivery process and throttle the team's ability to innovate.
Considering #2781 and your inputs, I am closing this issue. If you have input on an easy way to cherry-pick commits, please do let us know.
@balajialg In theory, this is what the staging environment is for. The problem, of course, is how do we automate testing of the wide variety of items courses require? The staging environment generally guarantees that image builds and deployments will succeed, but we can't tell whether a particular notebook that used to work will continue to do so. Item 2 there is an issue that we're currently aware of and want to change.
@felder Agreed. The current setup is fantastic from the standpoint that we don't deploy anything with known bugs. However, unless we have automated testing of the varied components users rely on during staging, it is hard to catch outage-causing PRs without user feedback. We need to prioritize these automated testing use cases as soon as possible.
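To make the "automated testing of varied components" idea concrete, here is a minimal sketch of a staging smoke test that executes every code cell in each notebook and fails loudly on the first error. Everything here is an assumption, not the current setup: the script fabricates a one-cell notebook so it is self-contained, and the stdlib-only runner is a stand-in for real tooling such as `jupyter nbconvert --execute` or nbval pointed at actual course notebooks.

```shell
# Hypothetical staging smoke test: run all code cells in each notebook,
# flag any notebook whose cells raise. Paths and notebook contents are
# fabricated for this demo.
set -e
dir=$(mktemp -d)

# Fabricate a trivial notebook; in practice, glob real course notebooks.
cat > "$dir/demo.ipynb" <<'NB'
{"cells": [{"cell_type": "code", "source": ["x = 1 + 1\n", "assert x == 2\n"]}],
 "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
NB

status=0
for nb in "$dir"/*.ipynb; do
    # Stdlib-only cell runner; swap in `jupyter nbconvert --to notebook
    # --execute` when jupyter is available in the test environment.
    python3 - "$nb" <<'PY' || { echo "FAILED: $nb"; status=1; }
import json, sys
nb = json.load(open(sys.argv[1]))
ns = {}
for cell in nb["cells"]:
    if cell.get("cell_type") == "code":
        exec("".join(cell["source"]), ns)
PY
done
[ "$status" -eq 0 ] && echo "all notebooks executed cleanly"
```

Run as a CI step against the staging image, this would catch the "notebook that used to work no longer does" class of regression before a prod deploy, without needing per-course manual checks.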
We learned the hard way that making upgrades to the system (even unintentionally) during assignment deadlines can sometimes be challenging for our users and for us. To avoid such challenges in the future, and in discussion with @felder, I am suggesting the following policy for merging commits:
@yuvipanda Is this a fair request? I know introducing such a policy can be bureaucratic, but I am thinking of this policy considering the irate Data 100 users yesterday. Thoughts?
Assuming we are aligned on the above point, what would be the easiest way to cherry-pick a commit to merge as part of a PR? It would definitely help us in the future when a PR contains multiple commits, of which only a few involve package requests or other repetitive requests.
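Since the cherry-picking question comes up here, the following is a self-contained sketch of the basic workflow: build a throwaway repo whose feature branch mixes a package-request commit with unrelated work, then apply only that one commit onto a clean prod branch. All branch names, file names, and commit messages are fabricated for the demo, not this repo's actual conventions.

```shell
# Hypothetical demo: cherry-pick a single package-request commit out of
# a branch that also contains unrelated changes.
set -e
dir=$(mktemp -d)
git init -q "$dir/demo"
cd "$dir/demo"
git config user.email "demo@example.com"
git config user.name "Demo"

echo base > README.md && git add README.md && git commit -qm "base"
base=$(git rev-parse HEAD)

# A feature branch mixing one wanted commit with one unwanted commit.
git checkout -qb feature
echo "numpy==1.24" > requirements.txt
git add requirements.txt && git commit -qm "add package request"
echo wip > experiment.txt && git add experiment.txt && git commit -qm "unrelated work"

# Find the one commit we want, then apply it onto a clean prod branch.
pick=$(git log --format=%H --grep="add package request" feature)
git checkout -qb prod "$base"
git cherry-pick "$pick" >/dev/null
ls
```

After the cherry-pick, `prod` contains `requirements.txt` but not `experiment.txt`, so the package change can go out without dragging the unrelated work along; opening a PR from that branch then works as usual.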