
Policy Decision regarding merging commits into Prod during Data 8 and 100 assignment deadlines #2770

Closed
balajialg opened this issue Sep 17, 2021 · 6 comments

@balajialg
Contributor

balajialg commented Sep 17, 2021

We learned the hard way that making upgrades to the system (even unintentionally) during assignment deadlines can be challenging for our users and for us. To avoid such challenges in the future, and after discussion with @felder, I am suggesting the following policy for merging commits:

  • During assignment deadline dates for Data 8 and Data 100, cherry-pick and merge into prod only routine commits such as package/admin requests.

@yuvipanda Is this a fair request? I know introducing such a policy can be bureaucratic, but I am proposing it with yesterday's irate Data 100 users in mind. Thoughts?

Suppose we are aligned on the above point: what would be the easiest way to cherry-pick a commit to merge as part of a PR? That would definitely help us in the future when a PR has multiple commits, only a few of which involve package requests or other routine changes.
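For the cherry-pick question, here is a minimal self-contained sketch in a scratch repo (all branch, file, and commit names are hypothetical, not this repo's actual history): note the SHA of the routine commit on the feature branch, then apply only that commit onto the prod branch.

```shell
# Demo: pick one routine commit out of a branch, leaving the risky one behind.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name demo
git checkout -qb main
echo base > base.txt && git add base.txt && git commit -qm "base"

# A feature branch with two commits: a routine package bump plus a risky change.
git checkout -qb feature
echo "pkg==1.2" > requirements.txt && git add requirements.txt && git commit -qm "package bump"
pkg_sha=$(git rev-parse HEAD)          # record the sha of the commit we want
echo risky > risky.py && git add risky.py && git commit -qm "risky refactor"

# Back on main (standing in for prod), apply only the package bump.
git checkout -q main
git cherry-pick "$pkg_sha"
```

After the cherry-pick, main has the package bump but not the risky refactor; the same `git rev-parse` + `git cherry-pick <sha>` pattern works against a real PR branch.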

@balajialg balajialg changed the title [Advice] Cherry Picking a commit to merge from PR Deciding on a Policy to make hub improvements without affecting hubs during crucial dates Sep 18, 2021
@balajialg balajialg changed the title Deciding on a Policy to make hub improvements without affecting hubs during crucial dates Policy Decision regarding merging commits during Data 8 and 100 assignment deadlines Sep 18, 2021
@balajialg balajialg changed the title Policy Decision regarding merging commits during Data 8 and 100 assignment deadlines Policy Decision regarding merging commits into Prod during Data 8 and 100 assignment deadlines Sep 18, 2021
@yuvipanda
Contributor

I'm generally very much not a fan of 'deploy windows' or 'do not deploy at these times' rules. The core of the problem for data100 #2677 is that the hub pod restarted due to a deploy. However, a pod can restart for many reasons - memory limits, 5 concurrent spawn failures, node failure, etc. These should all be considered fairly 'normal', and we should engineer our services to match. This is largely true already - the data100 hub did recover by itself.

However, it is still important for deploys to be as risk-free as possible. I opened #2781 to figure out a way to avoid hub pod restarts on deploys that change the image. That would've prevented this specific issue.

[Image: uptime chart in which each red arrow along the bottom marks a deployment]

Each red arrow at the bottom is a deployment. We've done many of these without them causing any such issues. If we continue to see disruptions due to deployments, we can reconsider.

What I would like to do, though, is write a proper incident report.

@yuvipanda
Contributor

There is a lot of literature in the SRE world about deploy windows like this. https://dev.to/quii/why-you-should-deploy-on-friday-afternoon-285h is one example.

@yuvipanda
Contributor

The other issue is one of co-ordination. We serve a lot of users across many, many spaces - I'm sure there is always an assignment due for someone at any given time. So these policies usually end up as 'you are allowed to deploy during these times' rather than 'you are not allowed to deploy during these times'. That adds another dimension of complex co-ordination, and I guarantee that not every class will know what those times are.

@balajialg
Contributor Author

balajialg commented Sep 20, 2021

@yuvipanda, I am completely aligned in principle with what you are arguing for. Adding too many processes will complicate the software delivery process and throttle the team's innovation.
However, I personally would love us to get to point #2 below (highlighted by the blog's author) soon.

  1. *Write an end to end test. These are expensive to write and run and should only be reserved for your most important journeys.*
  2. *Have monitoring with threshold alerts for when things go wrong.*
  3. *Set up your pipeline so that when your code is pushed all the automated tests are run; if they pass, go to production.*
  4. *Have some kind of green/blue release mechanism. Run your automated tests on the deployed release candidate and if they don't pass, don't ship it.*
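Steps 3 and 4 from that list can be sketched as a gated deploy script. This is only an illustration with stub commands; every function name here is hypothetical and would be replaced by our real test and deploy tooling:

```shell
# Hedged sketch of a gated pipeline: test, deploy to staging, smoke-test,
# and promote to prod only if everything passed. All names are placeholders.
set -e
run_tests()   { echo "running automated tests"; }   # stub: real test suite goes here
deploy()      { echo "deploying to $1"; }           # stub: real deploy command goes here
smoke_test()  { echo "smoke-testing $1"; }          # stub: exits non-zero on failure

run_tests                       # on push: run all automated tests
deploy staging                  # ship the release candidate to staging first
smoke_test staging              # verify the candidate where users can't see it
deploy prod                     # reached only if every earlier step succeeded
```

Because of `set -e`, any failing step aborts the script before `deploy prod` runs, which is the "if they don't pass, don't ship it" behavior the blog describes.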

Considering #2781 and your inputs, I am closing this issue. If you have input on an easy way to cherry-pick commits, please do let us know.

@felder
Contributor

felder commented Sep 20, 2021

@balajialg in theory this is what the staging environment is for. The problem, of course, is how to automate testing of the wide variety of things courses require. The staging environment largely guarantees that image builds and deployments will succeed, but we can't tell whether a particular notebook that used to work will continue to do so.
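One low-effort way to approach the notebook problem would be executing a set of representative course notebooks top-to-bottom against the staging image. A rough sketch, assuming `jupyter nbconvert` is available in the image and that the notebook directory path is a placeholder:

```shell
# Hedged sketch: run each notebook in a directory and report pass/fail.
# The directory name and this whole harness are hypothetical, not existing tooling.
set -e
check_notebooks() {
  local dir=$1 failed=0
  for nb in "$dir"/*.ipynb; do
    # Unmatched glob: nothing to check, treat as success.
    [ -e "$nb" ] || { echo "no notebooks in $dir"; return 0; }
    # --execute re-runs every cell; a raised exception makes nbconvert exit non-zero.
    if jupyter nbconvert --to notebook --execute --stdout "$nb" > /dev/null; then
      echo "PASS $nb"
    else
      echo "FAIL $nb"; failed=1
    fi
  done
  return $failed
}
check_notebooks notebooks
```

This wouldn't catch everything courses depend on, but wiring it into the staging deploy would at least flag notebooks that stop executing cleanly before users do.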

Item 2 there is an issue that we're currently aware of and want to change.

@balajialg
Contributor Author

@felder Agreed. The current setup is fantastic in that we don't deploy anything that fails to build or deploy. However, unless we automate testing of the varied components users rely on during staging, it is hard to find outage-causing PRs without user feedback. We need to prioritize these automated testing use cases as soon as possible.
