wsagent abruptly stops requiring workspace restart #11932

Closed
mjshashank opened this issue Nov 14, 2018 · 9 comments
Labels
kind/question Questions that haven't been identified as being feature requests or bugs.

Comments

@mjshashank
Contributor

Description

I have Che running on Kubernetes with 100+ workspaces. Lately I have noticed that for some workspaces the ws-agent becomes unreachable, leading to a notification that prompts the user to restart the workspace.

This affects a significant number of workspaces (about 10% of running workspaces daily), and restarts are cumbersome for our use case.

We have ruled out resource constraints: peak memory, CPU, and disk usage are all below 50%.

Looking through the bootstrapper/catalina logs hasn't yielded anything so far.

The issue is sporadic and seems to occur randomly.

Are there any leads regarding a possible root cause for this issue? Any help is much appreciated.

Env:
Che: 6.9.0
Kubernetes: GKE (1.9)
Ingress controller: Traefik 1.6.6

@ghost

ghost commented Nov 15, 2018

@mjshashank I wonder if it's related to ingresses. The client runs periodic checks to see whether the ws-agent is alive; if it's unreachable, the server then tries to reach it. I'd take a look at the workspace ingresses when the issue occurs again.
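A minimal sketch of what to capture when it recurs, assuming the workspace objects live in the `che` namespace (adjust the namespace and ingress name to your install):

```
# List the ingresses for the workspace namespace
kubectl get ingress -n che

# Inspect the rules/backends of the ingress routing to the ws-agent
kubectl describe ingress <workspace-ingress-name> -n che

# Check that the backing service still has endpoints
kubectl get endpoints -n che
```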

@ghost added the kind/question label Nov 15, 2018
@mjshashank
Contributor Author

mjshashank commented Nov 16, 2018

@eivantsov Thank you for your reply. I looked into this further and realised that workspace pods are being rescheduled as part of Kubernetes autoscaling, and the new pod that comes up does not have the bootstrapper binary or the command from the Che server to execute it. Hence the agents are all down and the workspace is unusable.

Is this a known issue? If so, is there a recommended way to handle this at scale so that the necessary startup steps (like running the bootstrapper) are performed and the workspace stays usable?

Note: This seems to have become an issue after workspaces were converted to Deployments, because cluster autoscalers don't reschedule individual pods that aren't backed by a Deployment.
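One way to confirm this pattern when it happens (a generic sketch, again assuming the `che` namespace, not anything Che-specific):

```
# A pod AGE much younger than the workspace start time suggests a reschedule
kubectl get pods -n che -o wide

# Look for autoscaler eviction/scale-down events around the time the agent died
kubectl get events -n che --sort-by=.metadata.creationTimestamp | grep -iE 'evict|scale|schedul'
```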

@skabashnyuk
Contributor

Can you try a fresh Che version?

@mjshashank
Contributor Author

@skabashnyuk Will do and get back to you. Has this issue been addressed already?

@ghost

ghost commented Nov 19, 2018

@skabashnyuk I do not think it has been addressed, at least with Che 6.

If you run Che 6 without the ws-agent and other agents, the issue goes away, since provisioning happens in the entrypoint; thus a pod restart does not disrupt the workspace.

@mjshashank
Contributor Author

@eivantsov Got it. But the ws-agent is necessary for a usable IDE, right?

Is it just the bootstrapping process that needs to be performed on the restarted pod? If so, would a hacky hotfix that detects the restart and re-runs bootstrapping through something like kubectl exec help us keep the workspace usable?
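For illustration only, a rough sketch of what such a hotfix could look like. The namespace, the `che.workspace_id` label, and the bootstrapper command are all placeholders here; the real command is whatever the Che server injects at workspace start, and the check assumes `pgrep` exists in the container:

```
#!/bin/sh
# Placeholders: NAMESPACE, WS_ID, the pod label, and BOOTSTRAPPER_CMD.
NAMESPACE=che
WS_ID=workspace0000000000000000
BOOTSTRAPPER_CMD='/path/to/bootstrapper ...'   # whatever Che injected at start

POD=$(kubectl get pods -n "$NAMESPACE" -l che.workspace_id="$WS_ID" \
      -o jsonpath='{.items[0].metadata.name}')

# If no bootstrapper process is running in the pod, run it again.
if ! kubectl exec -n "$NAMESPACE" "$POD" -- pgrep -f bootstrapper >/dev/null 2>&1; then
  kubectl exec -n "$NAMESPACE" "$POD" -- sh -c "$BOOTSTRAPPER_CMD"
fi
```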

@ghost

ghost commented Nov 19, 2018

@mjshashank yes, it's a must for Che 6. In Che 7, all tooling is launched as sidecars (containers in one pod), and there are no execs or interference with the runtime. Everything happens in container entrypoints.
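To make the sidecar model concrete, a hand-written illustration of the shape of such a pod; the names and images below are invented for this example and are not Che's actual generated spec:

```
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: workspace-example
spec:
  containers:
  - name: dev                      # the user's dev container
    image: example/dev-image
    volumeMounts:
    - {name: projects, mountPath: /projects}
  - name: tooling                  # tooling sidecar; starts via its own entrypoint
    image: example/tooling-sidecar
    volumeMounts:
    - {name: projects, mountPath: /projects}
  volumes:
  - name: projects
    emptyDir: {}
EOF
```

The point for this thread: if the pod is rescheduled, each container's entrypoint brings its tooling back up on its own, with no exec from the server required.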

@gnoejuan

gnoejuan commented Nov 29, 2018

I'll update this comment around 8 a.m. Central Time tomorrow morning with more information. My local time is currently midnight, but I'm at work.

I experienced an unreachable ws-agent as well. Che then prompted for a restart and proceeded as expected.

Ubuntu 18.04.1 LTS
Che 6.14.2 (pretty sure; will update)
Docker version 18.09.0, build 4d60db4

It's my own server at home, nothing special.

Additional update:

My Docker installation is affected by docker/for-linux#476 (comment), and I applied the solution found later in that thread:

```
remove the -H fd:// from the ExecStart (if you don't have other -H options set)
change to -H unix:// instead
```
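Concretely, the resulting unit line looks something like this (the stock dockerd ExecStart varies by version, and the unit path shown is the common Ubuntu location, so treat both as assumptions):

```
# /lib/systemd/system/docker.service (or a systemd drop-in override)
# before: ExecStart=/usr/bin/dockerd -H fd://
ExecStart=/usr/bin/dockerd -H unix://
```

After editing the unit, reload and restart: `sudo systemctl daemon-reload && sudo systemctl restart docker`.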

@skabashnyuk
Contributor

@mjshashank @gnoejuan I can suggest setting up centralized log collection for workspaces. That way, we would have some data about what happened.
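Until a full log stack (e.g. a fluentd/EFK DaemonSet) is in place, even a crude snapshot helps. A sketch, assuming workspace pods carry a `che.workspace_id` label and live in the `che` namespace (both assumptions; adjust to your install):

```
# Snapshot logs from every workspace pod to files (crude but better than nothing)
mkdir -p /tmp/che-logs
kubectl get pods -n che -l che.workspace_id -o name |
  while read -r pod; do
    kubectl logs -n che "$pod" --all-containers=true > "/tmp/che-logs/${pod#pod/}.log"
  done
```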

@gorkem closed this as completed Aug 24, 2019