Stat 20 server start time #3836

Closed · ryanlovett opened this issue Oct 13, 2022 · 15 comments

@ryanlovett (Collaborator)

Bug description

On the evening of Wed. Oct. 12, stat20 students reported that their servers were slow to start. NFS activity showed no heavy test_stateid ops, but steady getattr and one read spike at 8:07p:

Screen Shot 2022-10-12 at 10 40 30 PM
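(For cross-checking the dashboards, the per-op NFS counters can also be read directly on the NFS server; a sketch, assuming nfsstat is installed there:)

# NFSv4 server-side operation counters (getattr, read, test_stateid, ...)
nfsstat -s -4
# or watch them change over time
watch -n 5 'nfsstat -s -4'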

It looks like server startup time increased beginning at about 8:45p, with servers taking up to 2 minutes to start:
Screen Shot 2022-10-12 at 10 46 22 PM

I'll collect more info in this issue.

Environment & setup

  • Hub:
  • Language:

How to reproduce

@ryanlovett (Collaborator, Author)

It looks like the hub pod was restarted 86m ago (~9:38p), which would have affected students starting up their servers, but not those who were already running.

hub-5d497fffb-wtx8f             1/1     Running   3 (86m ago)   2d7h
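A quick way to confirm the restart and see why the previous hub container exited (a sketch; assumes the standard z2jh component=hub label and the stat20-prod namespace that shows up in the scheduling events below):

# restart count and age of the hub pod
kubectl -n stat20-prod get pods -l component=hub
# exit reason and logs of the previous hub container
kubectl -n stat20-prod describe pod hub-5d497fffb-wtx8f | grep -A 5 'Last State'
kubectl -n stat20-prod logs hub-5d497fffb-wtx8f --previous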

@ryanlovett (Collaborator, Author)

Here's a 24h snapshot:
Screen Shot 2022-10-12 at 11 05 24 PM

@ryanlovett (Collaborator, Author)

@andrewpbray shared this student's screenshot:
[student screenshot]

@ryanlovett (Collaborator, Author) commented Oct 13, 2022

Here's an instance from an hour ago: a user pod failed to schedule for about a minute before landing on a node.

60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/42 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/42 nodes are available: 16 Insufficient memory, 26 Preemption is not helpful for scheduling.
60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 29 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Normal    Scheduled          pod/jupyter-[snip]                  Successfully assigned stat20-prod/jupyter-[snip] to gke-fall-2019-user-beta-2021-11-06-d520c2e1-6m77
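Roughly how these warnings can be surfaced on demand (a sketch, using the stat20-prod namespace shown above):

# recent scheduling failures, newest last
kubectl -n stat20-prod get events \
  --field-selector reason=FailedScheduling \
  --sort-by=.lastTimestamp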

@ryanlovett (Collaborator, Author)

Non-running pods: it seems that none failed, but many were pending and eventually started.

Screen Shot 2022-10-12 at 11 26 27 PM

@yuvipanda (Contributor)

In #3839 I'm giving stat20 its own node pool, with a 5-node minimum.
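For context, the GKE-side change has roughly this shape; a sketch only, with illustrative pool name, machine type, sizes, labels, and taints (the real values are whatever #3839 specifies):

# dedicated autoscaled node pool for stat20 with a 5-node floor (illustrative values)
gcloud container node-pools create user-epsilon \
  --cluster=fall-2019 \
  --machine-type=n1-highmem-8 \
  --enable-autoscaling --min-nodes=5 --max-nodes=20 \
  --node-labels=hub.jupyter.org/pool-name=epsilon-pool \
  --node-taints=hub.jupyter.org_dedicated=user:NoSchedule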

@ryanlovett (Collaborator, Author)

Screen Shot 2022-10-13 at 9 46 38 AM

Screen Shot 2022-10-13 at 9 46 56 AM

Screen Shot 2022-10-13 at 9 47 35 AM

Screen Shot 2022-10-13 at 9 49 28 AM

@ryanlovett (Collaborator, Author)

The beta pool went down to 14 nodes at 8:42p, then scaled up and down repeatedly until 10:18p. Non-running pods spiked over that same period, probably as a result.

@yuvipanda created the epsilon pool in #3839 and moved the stat20 hub into it.

@yuvipanda @balajialg Should there be some heuristic for how many different hub images a node pool should support, if this issue was caused by image-pulling delays?

stat20's image is based on rocker, and we can probably consolidate all rocker-based images into one. However, I don't know whether there are incompatibilities between libraries in the different rocker images.

Before tonight, the hubs with rocker-based images using the beta pool were biology, ischool, publichealth, and stat20.
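One way to gauge the image-pull angle is to list how many distinct user images a single beta-pool node is carrying (a sketch; assumes jq and reuses the node name from the scheduling events above):

# container images currently cached on one of the beta-pool nodes
kubectl get node gke-fall-2019-user-beta-2021-11-06-d520c2e1-6m77 -o json \
  | jq -r '.status.images[].names[]' | grep -v '@sha256:' | sort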

@balajialg (Contributor) commented Oct 13, 2022

@ryanlovett Thanks for the comprehensive report! It will be helpful if and when we choose to write an after-action report. Maybe @shaneknapp and @felder, in discussion with you, can review the existing images across the different node pools and make the final call on heuristics? I will also add this as an agenda item for our upcoming team meeting.

I see that we have been moving other images across node pools in response to issues (see this and this). We might as well take a proactive stance going forward!

@shaneknapp (Contributor) commented Oct 13, 2022 via email

@ryanlovett (Collaborator, Author)

stat20 has gone from ~10 users to ~190 over the past couple of hours, and it looks like the epsilon node count is going up rather than fluctuating.
Screen Shot 2022-10-14 at 10 19 47 AM
Screen Shot 2022-10-14 at 10 20 01 AM
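A simple way to track that count alongside the Grafana panels (a sketch; the node-pool label value here is a guess at what #3839 named it):

# nodes currently in the epsilon pool, with age
kubectl get nodes -l cloud.google.com/gke-nodepool=user-epsilon-2022-10-13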

@ryanlovett (Collaborator, Author)

There have been a few occurrences where servers took as long as ~1m to start, though:
Screen Shot 2022-10-14 at 10 21 15 AM

@yuvipanda (Contributor)

Yeah, 1 min is fine; 10 min is when the hub starts counting them as dead. I think this would've caused an outage similar to last time if we hadn't moved them to their own pool.

@ryanlovett (Collaborator, Author) commented Oct 14, 2022

When I display the data for a lot of -prod hubs, there are many other server start times of just under 1m (not stat20-specific), and they're clustered together, roughly 15m apart (11:01a, 10:46a, 10:31a, 10:16a, ...).
Screen Shot 2022-10-14 at 11 15 22 AM

For comparison, I picked a day two weeks back: start times were mostly around 10s and not clustered together.
Screen Shot 2022-10-14 at 11 32 48 AM

So perhaps the NFS restarts are pushing some set of server start times back to ~1m.
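If it helps to dig further, the raw spawn-duration numbers behind these charts are also exposed by the hub itself; a sketch, assuming the standard z2jh hub service on port 8081 and JupyterHub's built-in Prometheus metrics (the endpoint may require an API token if metrics authentication is enabled):

# forward the hub API port locally, then dump the spawn-duration histogram
kubectl -n stat20-prod port-forward service/hub 8081:8081 &
curl -s http://localhost:8081/hub/metrics | grep jupyterhub_server_spawn_duration_seconds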

@ryanlovett (Collaborator, Author)

stat20 has been moved to a Filestore instance, and will also retain its own node pool going forward.
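For the record, a standalone Filestore instance of this kind is created along these lines; a sketch where every name, zone, tier, and size is illustrative rather than taken from the actual change:

# illustrative only: a dedicated Filestore share for stat20 home directories
gcloud filestore instances create stat20-homes \
  --zone=us-central1-b \
  --tier=BASIC_HDD \
  --file-share=name=homes,capacity=1TB \
  --network=name=default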
