Stat 20 server start time #3836

Closed · ryanlovett opened this issue Oct 13, 2022 · 15 comments

@ryanlovett (Collaborator)

Bug description

On the evening of Wed. Oct. 12, stat20 students reported that their servers were slow to start. NFS activity showed no heavy test_stateid ops, but steady getattr and one read spike at 8:07p:

Screen Shot 2022-10-12 at 10 40 30 PM
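(For cross-checking the dashboards, the per-op NFS counters can also be read directly on the NFS server; a sketch, assuming nfsstat is installed there:)

# NFSv4 server-side operation counters (getattr, read, test_stateid, ...)
nfsstat -s -4
# or watch them change over time
watch -n 5 'nfsstat -s -4'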

It looks like server startup time increased beginning at about 8:45p, with servers taking up to 2 minutes to start:
Screen Shot 2022-10-12 at 10 46 22 PM

I'll collect more info in this issue.

Environment & setup

  • Hub:
  • Language:

How to reproduce

@ryanlovett (Collaborator, Author)

It looks like the hub pod was restarted 86m ago (~9:38p), which would have affected students starting up their servers, but not those who were already running.

hub-5d497fffb-wtx8f             1/1     Running   3 (86m ago)   2d7h
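A quick way to confirm the restart and see why the previous hub container exited (a sketch; assumes the standard z2jh component=hub label and the stat20-prod namespace that shows up in the scheduling events below):

# restart count and age of the hub pod
kubectl -n stat20-prod get pods -l component=hub
# exit reason and logs of the previous hub container
kubectl -n stat20-prod describe pod hub-5d497fffb-wtx8f | grep -A 5 'Last State'
kubectl -n stat20-prod logs hub-5d497fffb-wtx8f --previous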

@ryanlovett (Collaborator, Author)

Here's a 24h snapshot:
Screen Shot 2022-10-12 at 11 05 24 PM

@ryanlovett (Collaborator, Author)

@andrewpbray shared this student's screenshot:
[student screenshot]

@ryanlovett (Collaborator, Author) commented Oct 13, 2022

Here's an instance from an hour ago: a user pod failed to schedule for about a minute before landing on a node.

60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/42 nodes are available: 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/42 nodes are available: 16 Insufficient memory, 26 Preemption is not helpful for scheduling.
60m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 28 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Warning   FailedScheduling   pod/jupyter-[snip]                  0/43 nodes are available: 1 node(s) had untolerated taint {node.kubernetes.io/not-ready: }, 2 node(s) had untolerated taint {node.kubernetes.io/unschedulable: }, 2 node(s) were unschedulable, 26 node(s) didn't match Pod's node affinity/selector, 29 Insufficient memory. preemption: 0/43 nodes are available: 16 Insufficient memory, 27 Preemption is not helpful for scheduling.
59m         Normal    Scheduled          pod/jupyter-[snip]                  Successfully assigned stat20-prod/jupyter-[snip] to gke-fall-2019-user-beta-2021-11-06-d520c2e1-6m77
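Roughly how these warnings can be surfaced on demand (a sketch, using the stat20-prod namespace shown above):

# recent scheduling failures, newest last
kubectl -n stat20-prod get events \
  --field-selector reason=FailedScheduling \
  --sort-by=.lastTimestamp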

@ryanlovett (Collaborator, Author)

Non-running pods: it seems that none failed, but many were pending and eventually started.

Screen Shot 2022-10-12 at 11 26 27 PM

@yuvipanda (Contributor)

In #3839 I'm giving stat20 its own node pool, with a 5-node minimum.
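For context, the GKE-side change has roughly this shape; a sketch only, with illustrative pool name, machine type, sizes, labels, and taints (the real values are whatever #3839 specifies):

# dedicated autoscaled node pool for stat20 with a 5-node floor (illustrative values)
gcloud container node-pools create user-epsilon \
  --cluster=fall-2019 \
  --machine-type=n1-highmem-8 \
  --enable-autoscaling --min-nodes=5 --max-nodes=20 \
  --node-labels=hub.jupyter.org/pool-name=epsilon-pool \
  --node-taints=hub.jupyter.org_dedicated=user:NoSchedule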

@ryanlovett (Collaborator, Author)

Screen Shot 2022-10-13 at 9 46 38 AM

Screen Shot 2022-10-13 at 9 46 56 AM

Screen Shot 2022-10-13 at 9 47 35 AM

Screen Shot 2022-10-13 at 9 49 28 AM

@ryanlovett (Collaborator, Author)

The beta pool went down to 14 nodes at 8:42p, then scaled up and down repeatedly until 10:18p. Non-running pods spiked over that same period, probably as a result.

@yuvipanda created the epsilon pool in #3839 and moved the stat20 hub into it.

@yuvipanda @balajialg Should there be some heuristic for how many different hub images a node pool should support, if this issue was caused by image-pulling delays?

stat20's image is based on rocker, and we can probably consolidate all rocker-based images into one. However, I don't know whether there are incompatibilities between libraries in the different rocker images.

Before tonight, the hubs with rocker-based images using the beta pool were biology, ischool, publichealth, and stat20.
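One way to gauge the image-pull angle is to list how many distinct user images a single beta-pool node is carrying (a sketch; assumes jq and reuses the node name from the scheduling events above):

# container images currently cached on one of the beta-pool nodes
kubectl get node gke-fall-2019-user-beta-2021-11-06-d520c2e1-6m77 -o json \
  | jq -r '.status.images[].names[]' | grep -v '@sha256:' | sort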

@balajialg (Contributor) commented Oct 13, 2022

@ryanlovett Thanks for the comprehensive report! It will be helpful if and when we choose to write an after-action report. Maybe @shaneknapp and @felder, in discussion with you, can review the existing images across the different node pools and make the final call on heuristics? I will also add this as an agenda item for our upcoming team meeting.

I see that we have been moving other images across node pools in response to issues (see this and this). We might as well take a proactive stance going forward!

@shaneknapp (Contributor) commented Oct 13, 2022 via email

@ryanlovett (Collaborator, Author)

stat20 has gone from ~10 users to ~190 over the past couple of hours, and it looks like the epsilon node count is going up rather than fluctuating.
Screen Shot 2022-10-14 at 10 19 47 AM
Screen Shot 2022-10-14 at 10 20 01 AM
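A simple way to track that count alongside the Grafana panels (a sketch; the node-pool label value here is a guess at what #3839 named it):

# nodes currently in the epsilon pool, with age
kubectl get nodes -l cloud.google.com/gke-nodepool=user-epsilon-2022-10-13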

@ryanlovett (Collaborator, Author)

There have been a few occurrences where servers took as long as ~1m to start, though:
Screen Shot 2022-10-14 at 10 21 15 AM

@yuvipanda (Contributor)

Yeah, 1 min is fine; 10 min is when the hub starts counting them as dead. I think this would've caused an outage similar to last time if we hadn't moved them to their own pool.

@ryanlovett (Collaborator, Author) commented Oct 14, 2022

When I display the data for a lot of -prod hubs, there are many other server start times of just under 1m (not stat20-specific), and they're clustered together, roughly 15m apart (11:01a, 10:46a, 10:31a, 10:16a, ...).
Screen Shot 2022-10-14 at 11 15 22 AM

For comparison, I picked a day two weeks back: start times were mostly around 10s and not clustered together.
Screen Shot 2022-10-14 at 11 32 48 AM

So perhaps the NFS restarts are pushing some set of server start times back to ~1m.
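If it helps to dig further, the raw spawn-duration numbers behind these charts are also exposed by the hub itself; a sketch, assuming the standard z2jh hub service on port 8081 and JupyterHub's built-in Prometheus metrics (the endpoint may require an API token if metrics authentication is enabled):

# forward the hub API port locally, then dump the spawn-duration histogram
kubectl -n stat20-prod port-forward service/hub 8081:8081 &
curl -s http://localhost:8081/hub/metrics | grep jupyterhub_server_spawn_duration_seconds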

@ryanlovett (Collaborator, Author)

stat20 has been moved to a Filestore instance, and will also retain its own node pool going forward.
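For the record, a standalone Filestore instance of this kind is created along these lines; a sketch where every name, zone, tier, and size is illustrative rather than taken from the actual change:

# illustrative only: a dedicated Filestore share for stat20 home directories
gcloud filestore instances create stat20-homes \
  --zone=us-central1-b \
  --tier=BASIC_HDD \
  --file-share=name=homes,capacity=1TB \
  --network=name=default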
