CRITICAL: D100 hub completely dead/unusable for a handful of users #2688
Comments
@fperez chenh's server was being affected by something that was being automatically launched by the workspace. The rightmost tab said
The same thing happened with the second student, except that they had two workspace files. I moved both aside as well and was then able to start their server. I don't know why the problematic tabs were hanging, or why they were impacting the Jupyter server in that way. If this happens again, the GSI should go directly to the student's /tree listing so that it doesn't start Lab and the problematic assignment tab. They can then move aside (or just delete) the workspace and then visit /lab.
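For reference, the "move the workspace aside" step could be scripted roughly as below. This is only a sketch: the function name and the backup directory are hypothetical, and it assumes the workspace files live in JupyterLab's default location (`~/.jupyter/lab/workspaces`), which may differ on a given image.

```python
import shutil
from pathlib import Path


def move_workspaces_aside(workspaces_dir: Path, backup_dir: Path) -> list[Path]:
    """Move every file in workspaces_dir into backup_dir and return the new paths.

    Nothing is deleted, so a workspace can be restored by moving it back.
    """
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in sorted(workspaces_dir.iterdir()):
        if path.is_file():
            target = backup_dir / path.name
            shutil.move(str(path), str(target))
            moved.append(target)
    return moved


# Example call (assumed default location; adjust for the actual home directory):
# home = Path("/home/jovyan")
# move_workspaces_aside(home / ".jupyter/lab/workspaces",
#                       home / ".jupyter/lab/workspaces.bak")
```

Moving rather than deleting keeps the files around in case someone wants to inspect which tab was hanging.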
Looks like Lab can reset the default workspace automatically via /lab/workspaces/lab?reset (see https://jupyterlab.readthedocs.io/en/stable/user/urls.html#resetting-a-workspace). Might be worth trying next time.
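A small helper for building the two URLs mentioned above (the /tree listing and the workspace reset) might look like this. It is a sketch under assumptions: the hub address and username are placeholders, the function name is made up, and it assumes the hub serves each user's server under JupyterHub's default `/user/<name>/` prefix.

```python
from urllib.parse import quote


def user_server_urls(hub_base: str, username: str) -> dict[str, str]:
    """Return the classic file listing URL and the Lab workspace-reset URL for a user."""
    # Assumes the default JupyterHub routing of /user/<name>/ for single-user servers.
    base = f"{hub_base.rstrip('/')}/user/{quote(username, safe='')}"
    return {
        "tree": f"{base}/tree",
        "lab_reset": f"{base}/lab/workspaces/lab?reset",
    }


urls = user_server_urls("https://hub.example.org", "student1")
print(urls["tree"])
print(urls["lab_reset"])
```

Visiting the `tree` URL first avoids loading Lab (and the saved workspace) at all; the `lab_reset` URL asks Lab to discard the saved default workspace on load.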
Thx a lot @ryanlovett! I think that spinner we saw was a Voila run - we don't need Voila on this hub and I'm not sure why it's even installed in the first place... Thanks for the debugging tips! Quick q - how did you manage their files with a stopped server? Were you working on their filesystem from another location?
@fperez No problem. Yeah, I used the Google cloud console to ssh into the NFS server containing everyone's files.
Ah, understood - I don't think that hook is available to us. We have JupyterHub admin accounts, but not GCP console access. And that's fine! I'm not asking for it, just understanding how you worked out the file cleanup with a deadlocked hub. Thanks!
I think you can start the user's server by visiting the user's
Those are great tips - and I wasn't aware of the lab reset URL api, big thanks for pointing that out! And thx for the reminder on Excellent mini-primer on debugging hosed Lab sessions!
I'd still like to know why Voilà was causing that blockage... Honestly we don't need it at all in our image for this course, so I'm totally OK if you want to remove it. I did some quick testing with the Voilà button and I don't get the same blockage, but who knows...
In any case, I'm going to close this issue as the main problem is now resolved. But thanks both for the quick response and the debugging tips!
It looks like data100 uses the same user image as the main hub and the r hub, and Voilà was recently added via #784.
Thx, no problem - we can monitor and if it causes problems again (for us or any other hub) we can see if anyone is actually using it this term. But if it doesn't happen again, it's no biggie. I suspect what happened was that the student accidentally clicked on the Voilà button or did so out of curiosity. It's not like we ask them to set up a Voilà server on Lab 1 of the semester :)
Hello! Thanks for helping handle this, @ryanlovett! My suspicion is that something happened in voila that locked up the main thread, and this stopped the server from being able to handle any requests (yay async programming?). I don't know if anyone is actively using voila, but let's turn it off for now.
We suspect it caused the main thread to block, preventing all other server actions (like file saves) from working. Ref berkeley-dsep-infra#2688 Undoes berkeley-dsep-infra#784
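To illustrate the suspected failure mode (this is not Voilà's or Jupyter's actual code, just a generic asyncio sketch with made-up function names): a single handler that blocks instead of awaiting stalls everything else running on the same event loop, which would look like a server that no longer answers any request.

```python
import asyncio
import time


async def heartbeat(ticks: list[float], interval: float = 0.05) -> None:
    """Record a timestamp every `interval` seconds, like a server answering requests."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_handler(seconds: float) -> None:
    # time.sleep never yields to the event loop, so no other task can run meanwhile.
    time.sleep(seconds)


async def cooperative_handler(seconds: float) -> None:
    # asyncio.sleep yields control, so other tasks keep running.
    await asyncio.sleep(seconds)


async def longest_gap(handler, seconds: float = 0.5) -> float:
    """Run `handler` next to a heartbeat task; return the longest pause between beats."""
    ticks: list[float] = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0.1)  # let the heartbeat get going
    await handler(seconds)
    await asyncio.sleep(0.1)  # let the heartbeat resume
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return max(later - earlier for earlier, later in zip(ticks, ticks[1:]))


if __name__ == "__main__":
    print("blocking handler, longest gap:", asyncio.run(longest_gap(blocking_handler)))
    print("cooperative handler, longest gap:", asyncio.run(longest_gap(cooperative_handler)))
```

With the blocking handler the heartbeat pauses for about as long as the handler runs; with the cooperative one it keeps its normal rhythm. If something similar happened inside the notebook server, file saves and kernel requests would hang in the same way.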
Please report again if you run into this? I'd also appreciate help in:
Thanks, @ryanlovett and @yuvipanda, for fixing this critical issue!
Thx @yuvipanda - I'll open an issue in Voilà, but for now unfortunately it will be super-vague, as I don't know exactly what triggered it, and my attempts to cause it with the same files the student had open didn't reproduce the problem. But it might tickle the brain of one of the Voilà devs, who knows...
We suspect it caused the main thread to block, preventing all other server actions (like file saves) from working. Ref berkeley-dsep-infra/datahub#2688 Undoes berkeley-dsep-infra/datahub#784
Bug description
Hi team,
We have a critical problem (fortunately affecting only a few students right now). As seen on this screenshot, the student's home directory appears empty, and no code can be run from the currently open notebooks:
We tested opening Classic too and it similarly spins forever. It seems as if access to the student's home directory was lost. This screenshot was taken by @abadrinath947, who was able to access the student's server, but now when I try to do the same, I can't even get the server to finish loading.
For this other student who had reported similar problems:
right now we can connect and I was able to create the Untitled notebook shown above, but after running the 1+1 cell, the next one never completes.
The students' names can be read in the URL in the images above.
We had unconfirmed reports of a couple of others with similar problems; we'll update here if we can confirm those as well. Hopefully, if the problem shows up in your logs, you can identify any others for whom the issue has occurred.
We'd appreciate a response on this - right now these students are completely stuck and can't do any work at all with the system. If private communication is required, you can use my Berkeley email.
Thanks!
Environment & setup
How to reproduce
See above.