CRITICAL: D100 hub completely dead/unusable for a handful of users #2688

Closed
fperez opened this issue Sep 3, 2021 · 14 comments
@fperez
Collaborator

fperez commented Sep 3, 2021

Bug description

Hi team,

We have a critical problem (fortunately affecting only a very few students right now). As seen in this screenshot, the student's home directory appears empty, and no code can be run from the currently open notebooks:

image

We tested opening Classic too, and it similarly spins forever. It seems as if access to the student's home directory was lost. This screenshot was taken by @abadrinath947, who was able to access the student's server, but now when I try to do the same, I can't even get the server to finish loading.

For this other student who had reported similar problems:

image

right now we can connect and I was able to create the Untitled notebook shown above, but after running the 1+1 cell, the next one never completes.

The students' names can be read in the URL in the images above.

We had unconfirmed reports of a couple of others with similar problems; we'll update here if we can confirm those as well. Hopefully, if the problem shows up in your logs, you can identify any others for whom the issue has occurred.

We'd appreciate a response on this - right now these students are completely stuck not being able to do any work at all with the system. If private communication is required, you can use my Berkeley email.

Thanks!

Environment & setup

  • Hub: Data 100
  • Language: Python

How to reproduce

See above.

@fperez fperez added the bug label Sep 3, 2021
@ryanlovett
Collaborator

@fperez chenh's server was being affected by something that was being automatically launched by the workspace. The rightmost tab said Executing 35 of 72, so I thought it might be DoS'ing itself. I stopped their server and moved their workspace file aside from ~/.jupyter/lab/workspaces/ to ~/. On restart, their workspace was vanilla, but the files were showing in the left-side listing. I'll leave the server running.

@ryanlovett
Collaborator

The same thing happened with the second student, however they had two workspace files. I moved them both aside as well and was able to start their server.

I don't know why the problematic tabs were hanging, or why they were impacting the jupyter server in that way.

If this happens again, the GSI should go directly to the student's /tree listing so that it doesn't start Lab and the problematic assignment tab. They can then move aside (or just delete) the workspace file and then visit /lab.
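For anyone who would rather script the "move aside" step, here is a minimal sketch. It assumes it is run as the affected user (for example from a terminal opened via /tree) and that every file under ~/.jupyter/lab/workspaces/ is a saved workspace. The function name `move_workspaces_aside` is hypothetical, not part of JupyterLab.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def move_workspaces_aside(home: Path) -> list[Path]:
    """Move saved JupyterLab workspace files out of the workspaces directory.

    Files are moved into `home` itself rather than deleted, so they can be
    inspected or restored later. Returns the new locations.
    """
    workspaces = home / ".jupyter" / "lab" / "workspaces"
    moved: list[Path] = []
    if not workspaces.is_dir():
        return moved  # nothing saved, nothing to do

    for src in sorted(workspaces.iterdir()):
        if not src.is_file():
            continue
        dest = home / src.name
        # Avoid clobbering an existing file of the same name in the home directory.
        counter = 1
        while dest.exists():
            dest = home / f"{src.name}.{counter}"
            counter += 1
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Calling `move_workspaces_aside(Path.home())` and then visiting /lab should give a fresh default workspace, which matches what was done by hand above.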

@ryanlovett
Collaborator

Looks like lab can reset the default workspace automatically via /lab/workspaces/lab?reset, https://jupyterlab.readthedocs.io/en/stable/user/urls.html#resetting-a-workspace. Might be worth trying next time.

@fperez
Collaborator Author

fperez commented Sep 3, 2021

Thx a lot @ryanlovett! I think that spinner we saw was a Voila run - we don't need Voila on this hub and I'm not sure why it's even installed in the first place...

Thanks for the debugging tips! Quick q - how did you manage their files with a stopped server? Were you working on their filesystem from another location?

@ryanlovett
Copy link
Collaborator

@fperez No problem. Yeah, I used the Google Cloud console to ssh into the NFS server containing everyone's files.

@fperez
Collaborator Author

fperez commented Sep 3, 2021

Ah, understood - I don't think that hook is available to us. We have JupyterHub admin accounts, but not GCP console access. And that's fine! I'm not asking for it, just understanding how you worked out the file cleanup with a deadlocked hub. Thanks!

@ryanlovett
Collaborator

I think you can start the user's server by visiting the user's /lab/workspaces/lab?reset to reset the workspace. (just based on the docs -- I haven't tried it) Alternatively you can start the user's server by visiting their /tree so that lab doesn't start, then do the file manipulation.
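As a rough sketch of what those two URLs look like for a given student: the helper below assumes the hub serves each user's server under the default JupyterHub `/user/<username>/` prefix, and both the `user_server_urls` name and the `https://hub.example.edu` base URL are placeholders rather than the real Data 100 address.

```python
from __future__ import annotations

from urllib.parse import quote


def user_server_urls(hub_base_url: str, username: str) -> dict[str, str]:
    """Build recovery URLs for one user's server on a JupyterHub deployment.

    Assumes the default JupyterHub routing of /user/<username>/...
    """
    base = f"{hub_base_url.rstrip('/')}/user/{quote(username, safe='')}"
    return {
        # Asks JupyterLab to reset the default workspace on load.
        "reset_workspace": f"{base}/lab/workspaces/lab?reset",
        # Classic file listing; does not start Lab or restore its open tabs.
        "classic_tree": f"{base}/tree",
        # Normal Lab entry point, to visit once the workspace is cleaned up.
        "lab": f"{base}/lab",
    }


if __name__ == "__main__":
    for name, url in user_server_urls("https://hub.example.edu", "someuser").items():
        print(f"{name}: {url}")
```

An admin would open one of the first two URLs for the affected user, then go to the plain /lab URL once things respond again.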

@fperez
Collaborator Author

fperez commented Sep 3, 2021

Those are great tips - and I wasn't aware of the lab reset URL api, big thanks for pointing that out!

And thx for the reminder on /tree - I used to always get classic like that, then got used to hopping to it from the Lab menu and forgot about that option! And today, with a wedged server (where the menu wasn't working), I didn't think of going to that URL manually :)

Excellent mini-primer on debugging hosed Lab sessions!

I'd still like to know why Voilà was causing that blockage... Honestly we don't need it at all in our image for this course, so I'm totally OK if you want to remove it. I did some quick testing with the Voilà button and I don't get the same blockage, but who knows...

In any case, I'm going to close this issue as the main problem is now resolved. But thanks both for the quick response and the debugging tips!

@fperez fperez closed this as completed Sep 3, 2021
@ryanlovett
Collaborator

It looks like data100 uses the same user image as the main hub and the r hub, and Voilà was recently added via #784.

@fperez
Collaborator Author

fperez commented Sep 3, 2021

Thx, no problem - we can monitor and if it causes problems again (for us or any other hub) we can see if anyone is actually using it this term. But if it doesn't happen again, it's no biggie. I suspect what happened was that the student accidentally clicked on the Voilà button or did so out of curiosity. It's not like we ask them to set up a Voilà server on Lab 1 of the semester :)

@yuvipanda
Contributor

Hello! Thanks for helping handle this, @ryanlovett!

My suspicion is that something happened in voila that locked up the main thread, and this stopped the server from being able to handle any requests (yay async programming?). I don't know if anyone is actively using voila, but let's turn it off for now.
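To illustrate the suspected failure mode in general terms: the sketch below is plain asyncio, not Voilà or Jupyter Server code, and all names in it are made up for the example. One coroutine stands in for the server's other work (a heartbeat), and a handler either blocks with `time.sleep()` or yields with `asyncio.sleep()`. When the handler blocks, nothing else on the event loop runs until it returns.

```python
import asyncio
import time


async def heartbeat(stamps, count, interval):
    """Stand-in for other server work: record a timestamp every `interval` seconds."""
    for _ in range(count):
        stamps.append(time.monotonic())
        await asyncio.sleep(interval)


async def blocking_handler(seconds):
    """Misbehaving handler: time.sleep() never yields control to the event loop."""
    time.sleep(seconds)


async def cooperative_handler(seconds):
    """Well-behaved handler: asyncio.sleep() lets other tasks run while waiting."""
    await asyncio.sleep(seconds)


async def longest_gap(handler, seconds=0.5):
    """Run `handler` next to the heartbeat and return the longest pause between beats."""
    stamps = []
    beat = asyncio.create_task(heartbeat(stamps, count=10, interval=0.05))
    await asyncio.sleep(0.1)  # let the heartbeat tick a few times first
    await handler(seconds)
    await beat
    return max(later - earlier for earlier, later in zip(stamps, stamps[1:]))


if __name__ == "__main__":
    blocked = asyncio.run(longest_gap(blocking_handler))
    cooperative = asyncio.run(longest_gap(cooperative_handler))
    print(f"longest heartbeat gap with blocking handler:    {blocked:.2f}s")
    print(f"longest heartbeat gap with cooperative handler: {cooperative:.2f}s")
```

With the blocking handler, the heartbeat stalls for at least as long as the handler sleeps. A single-threaded server whose loop is stuck like this would look the same as the spinning, unresponsive sessions reported above, though that is still only a hypothesis for this incident.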

@yuvipanda yuvipanda reopened this Sep 3, 2021
yuvipanda added a commit to yuvipanda/datahub that referenced this issue Sep 3, 2021
We suspect it caused the main thread to block, preventing
all other server actions (like file saves) from working.

Ref berkeley-dsep-infra#2688
Undoes berkeley-dsep-infra#784
This was referenced Sep 3, 2021
@yuvipanda
Contributor

Please report again if you run into this? I'd also appreciate help in:

  1. Reproducing this,
  2. Reporting this upstream to voila

@balajialg
Contributor

Thanks, @ryanlovett, and @yuvipanda for fixing this critical issue!

@fperez
Collaborator Author

fperez commented Sep 3, 2021

Thx @yuvipanda - I'll open an issue in Voilà, but for now unfortunately it will be super-vague, as I don't know exactly what triggered it, and my attempts to cause it with the same files the student had open didn't reproduce the problem.

But it might tickle the brain of one of the Voilà devs, who knows...

shaneknapp pushed a commit with the following message, referencing this issue, to each of these repositories:

  • berkeley-dsep-infra/biology-user-image (Sep 6, 2024)
  • berkeley-dsep-infra/publichealth-user-image (Sep 11, 2024)
  • berkeley-dsep-infra/eecs-user-image (Sep 11, 2024)
  • berkeley-dsep-infra/julia-user-image (Sep 12, 2024)
  • berkeley-dsep-infra/ischool-user-image (Sep 24, 2024)
  • berkeley-dsep-infra/datahub-user-image (Sep 25, 2024)

We suspect it caused the main thread to block, preventing
all other server actions (like file saves) from working.

Ref berkeley-dsep-infra/datahub#2688
Undoes berkeley-dsep-infra/datahub#784

4 participants