
Policy Decision regarding merging commits into Prod during Data 8 and 100 assignment deadlines #2770

Closed
balajialg opened this issue Sep 17, 2021 · 6 comments

@balajialg
Contributor

balajialg commented Sep 17, 2021

We learned the hard way that making upgrades to the system (even unintentionally) during assignment deadlines can be challenging for our users and for us. To avoid such challenges in the future, and after discussion with @felder, I am suggesting the following policy for merging commits:

  • During assignment deadline dates for Data 8 and Data 100, cherry-pick and merge into prod only routine commits such as package/admin requests.

@yuvipanda Is this a fair request? I know introducing such a policy can be bureaucratic, but I am proposing it with yesterday's irate Data 100 users in mind. Thoughts?

Suppose we are aligned on the above point: what would be the easiest way to cherry-pick a commit to merge as part of a PR? That would definitely help us in the future when a PR has multiple commits, only a few of which involve package requests or other routine changes.
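For the cherry-pick question, here is a minimal self-contained sketch in a scratch repo (all branch, file, and commit names are hypothetical, not this repo's actual history): note the SHA of the routine commit on the feature branch, then apply only that commit onto the prod branch.

```shell
# Demo: pick one routine commit out of a branch, leaving the risky one behind.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com && git config user.name demo
git checkout -qb main
echo base > base.txt && git add base.txt && git commit -qm "base"

# A feature branch with two commits: a routine package bump plus a risky change.
git checkout -qb feature
echo "pkg==1.2" > requirements.txt && git add requirements.txt && git commit -qm "package bump"
pkg_sha=$(git rev-parse HEAD)          # record the sha of the commit we want
echo risky > risky.py && git add risky.py && git commit -qm "risky refactor"

# Back on main (standing in for prod), apply only the package bump.
git checkout -q main
git cherry-pick "$pkg_sha"
```

After the cherry-pick, main has the package bump but not the risky refactor; the same `git rev-parse` + `git cherry-pick <sha>` pattern works against a real PR branch.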

@balajialg balajialg changed the title [Advice] Cherry Picking a commit to merge from PR Deciding on a Policy to make hub improvements without affecting hubs during crucial dates Sep 18, 2021
@balajialg balajialg changed the title Deciding on a Policy to make hub improvements without affecting hubs during crucial dates Policy Decision regarding merging commits during Data 8 and 100 assignment deadlines Sep 18, 2021
@balajialg balajialg changed the title Policy Decision regarding merging commits during Data 8 and 100 assignment deadlines Policy Decision regarding merging commits into Prod during Data 8 and 100 assignment deadlines Sep 18, 2021
@yuvipanda
Contributor

I'm generally very much not a fan of 'deploy windows' or 'do not deploy at these times' rules. The core of the problem for data100 #2677 is that the hub pod restarted due to a deploy. However, a pod can restart for many reasons - memory limits, 5 concurrent spawn failures, node failure, etc. These should all be considered fairly 'normal', and we should engineer our services to match. This is largely true already - the data100 hub did recover by itself.

However, it is still important for deploys to be as risk-free as possible. I opened #2781 to figure out a way to avoid hub pod restarts on deploys that change the image. That would've prevented this specific issue.

[Image: uptime chart in which each red arrow along the bottom marks a deployment]

Each red arrow at the bottom is a deployment. We've done many of these without them causing any such issues. If we continue to see disruptions due to deployments, we can reconsider.

What I would like to do, though, is write a proper incident report.

@yuvipanda
Contributor

There is a lot of literature in the SRE world about deploy windows like this. https://dev.to/quii/why-you-should-deploy-on-friday-afternoon-285h is one example.

@yuvipanda
Contributor

The other issue is one of co-ordination. We serve a lot of users across many, many spaces - I'm sure there is always an assignment due for someone at any given time. So these policies usually end up as 'you are allowed to deploy during these times' rather than 'you are not allowed to deploy during these times'. That adds another dimension of complex co-ordination, and I guarantee that not every class will know what those times are.

@balajialg
Contributor Author

balajialg commented Sep 20, 2021

@yuvipanda, I am completely aligned in principle with what you are arguing for. Adding too many processes will complicate the software delivery process and throttle the team's innovation.
However, I personally would love us to get to point #2 below (highlighted by the blog's author) soon.

  1. *Write an end to end test. These are expensive to write and run and should only be reserved for your most important journeys.*
  2. *Have monitoring with threshold alerts for when things go wrong.*
  3. *Set up your pipeline so that when your code is pushed all the automated tests are run; if they pass, go to production.*
  4. *Have some kind of green/blue release mechanism. Run your automated tests on the deployed release candidate and if they don't pass, don't ship it.*
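Steps 3 and 4 from that list can be sketched as a gated deploy script. This is only an illustration with stub commands; every function name here is hypothetical and would be replaced by our real test and deploy tooling:

```shell
# Hedged sketch of a gated pipeline: test, deploy to staging, smoke-test,
# and promote to prod only if everything passed. All names are placeholders.
set -e
run_tests()   { echo "running automated tests"; }   # stub: real test suite goes here
deploy()      { echo "deploying to $1"; }           # stub: real deploy command goes here
smoke_test()  { echo "smoke-testing $1"; }          # stub: exits non-zero on failure

run_tests                       # on push: run all automated tests
deploy staging                  # ship the release candidate to staging first
smoke_test staging              # verify the candidate where users can't see it
deploy prod                     # reached only if every earlier step succeeded
```

Because of `set -e`, any failing step aborts the script before `deploy prod` runs, which is the "if they don't pass, don't ship it" behavior the blog describes.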

Considering #2781 and your inputs, I am closing this issue. If you have input on an easy way to cherry-pick commits, please do let us know.

@felder
Contributor

felder commented Sep 20, 2021

@balajialg in theory this is what the staging environment is for. The problem, of course, is how to automate testing of the wide variety of things courses require. The staging environment largely guarantees that image builds and deployments will succeed, but we can't tell whether a particular notebook that used to work will continue to do so.
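One low-effort way to approach the notebook problem would be executing a set of representative course notebooks top-to-bottom against the staging image. A rough sketch, assuming `jupyter nbconvert` is available in the image and that the notebook directory path is a placeholder:

```shell
# Hedged sketch: run each notebook in a directory and report pass/fail.
# The directory name and this whole harness are hypothetical, not existing tooling.
set -e
check_notebooks() {
  local dir=$1 failed=0
  for nb in "$dir"/*.ipynb; do
    # Unmatched glob: nothing to check, treat as success.
    [ -e "$nb" ] || { echo "no notebooks in $dir"; return 0; }
    # --execute re-runs every cell; a raised exception makes nbconvert exit non-zero.
    if jupyter nbconvert --to notebook --execute --stdout "$nb" > /dev/null; then
      echo "PASS $nb"
    else
      echo "FAIL $nb"; failed=1
    fi
  done
  return $failed
}
check_notebooks notebooks
```

This wouldn't catch everything courses depend on, but wiring it into the staging deploy would at least flag notebooks that stop executing cleanly before users do.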

Item 2 there is an issue that we're currently aware of and want to change.

@balajialg
Contributor Author

@felder Agreed. The current setup is fantastic in that we don't deploy anything that fails to build or deploy. However, unless we automate testing of the varied components users rely on during staging, it is hard to find outage-causing PRs without user feedback. We need to prioritize these automated testing use cases as soon as possible.
