Avoid server restarting when disconnecting from browser #422

Open
sebastien-rettie opened this issue Jan 20, 2025 · 9 comments

Comments

@sebastien-rettie

To what extent do the coffea-casa images depend on staying “connected” via the web browser? E.g. if I start a process/command in the terminal from a web browser, how long can I stay away/disconnected from the web browser before the process is killed/shut down? I’ve seen both cases on the UChicago cluster: sometimes I close my browser, reopen https://coffea-dev.af.uchicago.edu/hub/login and the process is still running, but other times I log back in and it asks me to log in/restart the server. So I’m wondering at what point this actually happens?

From chatting with @oshadura, the duration should be 2 weeks, but a few reports seem to indicate this is not currently the case (at least on the UChicago cluster).

@marcus-vgr

Hi, I wanted to ping this again.
On Friday I launched the UChicago server and started some jobs, which ran fine while I was still using my computer. The next day, when I opened my computer again (I had not closed the cluster tab), I saw that the dashboard had been disconnected in the meantime, so of course the jobs did not run to the end. Is there anything we are missing on our side to ensure the jobs continue running?

Tagging @oshadura

@oshadura
Member

oshadura commented Feb 3, 2025

@fengpinghu could you please take a look at this issue?

Honestly, it should not be an issue. I checked with @jthiltges and @clundst, and restarting Flux should also not trigger a server shutdown.

The Zero to JupyterHub developers suggest checking the "culling" setup: https://z2jh.jupyter.org/en/latest/jupyterhub/customizing/user-management.html

@fengpinghu

Indeed, we have the default culling setup enabled for the Coffea-Casa instances running at UC AF. This means that if a notebook kernel remains idle for approximately one hour, the server will be stopped. While the in-memory session will be lost, notebooks are automatically saved at regular intervals (checkpoints). You should be able to recover your last saved state when you log in again.
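For readers unfamiliar with culling, the idea described above can be sketched roughly as follows. This is a simplified model, not the actual culler code: the function name `should_cull` and the constant `IDLE_TIMEOUT` are hypothetical, and the one-hour value is taken from the "approximately one hour" mentioned in this comment.

```python
from datetime import datetime, timedelta, timezone

# Assumed value, based on the "approximately one hour" quoted in the thread.
IDLE_TIMEOUT = timedelta(hours=1)


def should_cull(last_activity, now, timeout=IDLE_TIMEOUT):
    """Return True if the server has been idle for longer than `timeout`.

    Hypothetical helper: it only models the idle-time comparison, not how
    the hub decides what counts as activity.
    """
    return now - last_activity > timeout


now = datetime(2025, 2, 3, 12, 0, tzinfo=timezone.utc)

# Last activity 30 minutes ago: still within the timeout.
print(should_cull(now - timedelta(minutes=30), now))

# Last activity 2 hours ago: past the one-hour timeout.
print(should_cull(now - timedelta(hours=2), now))
```

The practical consequence is that what matters is the time since the last recorded activity, not whether the browser tab happens to be open.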

@sebastien-rettie
Author

Thanks for following up @fengpinghu! We also have several people using terminals instead of notebooks; would it be possible to have such a recovery mechanism for terminals too? (I'll let @marcus-vgr comment on whether this is the use case he has in mind.)

@fengpinghu

Thanks @sebastien-rettie for the clarification.
Checkpoints are only available for notebooks. However, we do have login nodes that you can access via SSH, which might be a good alternative for enabling session recovery.

@marcus-vgr

Hi @fengpinghu, sorry, I am a bit confused. We connect to UChicago via https://coffea-dev.af.uchicago.edu and run a "simple" Python script in the terminal. However, it can take many hours (perhaps ~1 day) to finish when we have to run over many datasets/variations. From what I understood, if the server disconnects after some idle time, the Python script is killed. How can we make sure the job continues running even though we are no longer connected? Would we need to set up the job via e.g. HTCondor? If so, would we need some extra configuration to ensure the Dask cluster is used properly?

@fengpinghu

Hi @marcus-vgr, from your description, it sounds like you'd like to keep the Dask cluster alive and run a Python script from the terminal that interacts with it. In that case, the best option would be to extend the notebook server timeout so that the server stays up even if it appears idle. While we don’t want it to run indefinitely, would extending it to 24 hours work for your use case?
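As an illustration of what a 24-hour timeout could look like on the hub side, here is a minimal sketch that builds a service entry for the `jupyterhub-idle-culler` package, whose `--timeout` option takes a value in seconds. The thread does not say how UC AF actually configures this; the helper name `idle_culler_service` is hypothetical, and in a Zero to JupyterHub (Helm) deployment the corresponding setting would be the `cull.timeout` value rather than Python code.

```python
import sys


def idle_culler_service(timeout_hours):
    """Build a JupyterHub service entry that runs the idle culler.

    Hypothetical helper for illustration; a real deployment also needs
    the role/scopes described in the jupyterhub-idle-culler documentation.
    """
    timeout_seconds = int(timeout_hours * 3600)
    return {
        "name": "jupyterhub-idle-culler-service",
        "command": [
            sys.executable,
            "-m",
            "jupyterhub_idle_culler",
            f"--timeout={timeout_seconds}",
        ],
    }


service = idle_culler_service(24)
print(service["command"][-1])  # --timeout=86400

# In a jupyterhub_config.py this dict would be registered with:
#   c.JupyterHub.services = [service]
```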

@marcus-vgr

Hi @fengpinghu, yes, I think this would work perfectly! Is this something configured globally by you folks, or can we users modify it for our personal use cases?

@fengpinghu

Hi @marcus-vgr , great. It's already configured globally. You don't need to do anything. Let us know if you encounter any issues. Thanks!


4 participants