Stat 20 server start time #3836
Comments
It looks like the hub pod was restarted 86m ago (~9:38p), which would have affected students starting up their servers, but not those who were already running.
@andrewpbray shared this student's screenshot:
This happened one hour ago.
In #3839 I'm giving stat20 its own node pool, with a 5-node minimum.
The beta pool went down to 14 at 8:42p, then fluctuated (up, up, down, up, up, down, up) until 10:18p. Non-running pods spiked over that same time period, probably as a result. @yuvipanda created the epsilon pool in #3839 and moved the stat20 hub into it.
@yuvipanda @balajialg If this issue was caused by image-pulling delays, should there be some heuristic about how many different hub images a node pool should support? stat20's image is based on rocker, and we can probably consolidate all rocker-based images into one. However, I don't know whether there are incompatibilities between libraries in the different rocker images. Before tonight, the hubs with rocker images using the beta pool were biology, ischool, publichealth, and stat20.
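One way to make the proposed heuristic concrete is to count distinct images per node pool and flag pools above some limit. The following is only a sketch: the image names, the `HUBS` inventory layout, and the limit of 3 are assumptions for illustration, not values taken from the deployment.

```python
from collections import defaultdict

# Hypothetical inventory: hub name -> (node pool, image name).
# The four hubs and the "beta" pool come from the comment above;
# the image names are placeholders, not the real image tags.
HUBS = {
    "biology": ("beta", "biology-rocker"),
    "ischool": ("beta", "ischool-rocker"),
    "publichealth": ("beta", "publichealth-rocker"),
    "stat20": ("beta", "stat20-rocker"),
}


def images_per_pool(hubs):
    """Group the distinct image names by node pool."""
    pools = defaultdict(set)
    for pool, image in hubs.values():
        pools[pool].add(image)
    return dict(pools)


def pools_over_limit(hubs, max_images):
    """Return the sorted names of pools serving more than max_images images."""
    return sorted(
        pool
        for pool, images in images_per_pool(hubs).items()
        if len(images) > max_images
    )


# Same hubs after a hypothetical consolidation onto one shared rocker image.
CONSOLIDATED = {hub: (pool, "shared-rocker") for hub, (pool, _) in HUBS.items()}

print(pools_over_limit(HUBS, max_images=3))
print(pools_over_limit(CONSOLIDATED, max_images=3))
```

Under these assumptions, consolidating the rocker-based images brings the beta pool down to a single image, so it would no longer be flagged.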
@ryanlovett Thanks for the comprehensive report! I guess it will be helpful if and when we choose to write an After-Action report. Maybe @shaneknapp and @felder, in discussion with you, can review the existing images across the different node pools and make the final call with regard to heuristics? I will also add this as an agenda item for our upcoming team meeting. I see that we have been moving other images across node pools in response to issues (see #3705 and #3615). Might as well take a proactive stance going forward!
yep, all sounds good.
yeah, 1min is fine. 10min is when hub starts counting them as dead. I think this would've caused a similar outage as last time if we hadn't moved them to their own pool.
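The two thresholds mentioned here can be sketched as a small classifier. This assumes both figures refer to server startup duration; the constant names, the function name, and the "slow" label for the range in between are illustrative and not taken from any hub configuration.

```python
# Thresholds quoted in the comment above, read as startup durations (assumption).
FINE_UP_TO_S = 60          # "1min is fine"
HUB_DEAD_AFTER_S = 10 * 60  # "10min is when hub starts counting them as dead"


def classify_startup(seconds):
    """Label a startup duration as 'fine', 'slow', or 'dead'."""
    if seconds >= HUB_DEAD_AFTER_S:
        return "dead"
    if seconds <= FINE_UP_TO_S:
        return "fine"
    return "slow"


# The 2-minute startups reported in this issue fall between the two thresholds.
print(classify_startup(120))
```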
stat20 has been moved to a filestore, and will also retain its own node pool going forward.
Bug description
On the evening of Wed., Oct. 12, 2022, stat20 students reported that their servers were slow to start. NFS activity showed no heavy test_stateid ops, but steady getattr traffic and one read spike at 8:07p.
Server startup time appears to have increased beginning at about 8:45p, with servers taking up to 2 minutes to start.

I'll collect more info in this issue.