Investigate reasons for Request timeouts during AttachVisualizationHandler #7005

Closed
Tracked by #7001
hubertp opened this issue Jun 9, 2023 · 15 comments · Fixed by #7042

@hubertp
Collaborator

hubertp commented Jun 9, 2023

#6724 (comment) reports that Request timeouts happen frequently under heavy workload. Increasing the timeouts (as in #6972) is unlikely to fix the underlying problem.

A request for attaching a visualization should be handled rather swiftly (compared to sending the data for the visualization).

@hubertp
Collaborator Author

hubertp commented Jun 9, 2023

One thing which stands out when I'm reading the code is why we are requesting the writeCompilation lock for that job. This will contend with, for example, the compilation job, which can certainly take a lot of time.
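
To illustrate the failure mode, here is a minimal, hypothetical sketch using a plain `ReentrantReadWriteLock` (the names are illustrative, not Enso's actual `Locking` API): a long compilation holds the write lock while a visualization job blocks on the very same lock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockContention {
    static final ReentrantReadWriteLock compilationLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        Thread compiler = new Thread(() -> {
            compilationLock.writeLock().lock(); // long compilation takes the lock...
            try {
                TimeUnit.SECONDS.sleep(3);      // ...and holds it for a while
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                compilationLock.writeLock().unlock();
            }
        });
        compiler.start();
        TimeUnit.MILLISECONDS.sleep(100);       // let the compiler win the race

        long start = System.nanoTime();
        compilationLock.writeLock().lock();     // the attach-visualization job stalls here
        compilationLock.writeLock().unlock();
        System.out.printf("job waited %d ms for the write lock%n",
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
```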

@JaroslavTulach
Member

Having a job that, as its first operation, tries to obtain a compiler lock is a sign of poor design. If the lock isn't available, the job shall not be scheduled at all. Jobs that require some information from the compiler, or need the compiler to be idle, are so common that they need support in the infrastructure. It shouldn't be the job that acquires the lock - the infrastructure shall realize that compilation is over and invoke the job at that moment.

NetBeans editing infrastructure has an intricate system of "phases". Jobs register for individual phases and get invoked when such a (compiler) phase is finished. For example, identifying local variables happens sooner than resolving and binding imports.

Regardless of the infrastructure we select, the more lock-free the code of regular tasks is, the better. Locks belong to the infrastructure.
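
A minimal sketch of that idea (hypothetical, not NetBeans' or Enso's actual infrastructure): jobs never touch the compiler lock themselves; they are queued, and the infrastructure drains the queue once it knows compilation has finished.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified illustration: the flag/queue handoff is not fully atomic here,
// which a production implementation would have to address.
final class PostCompilationScheduler {
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private volatile boolean compiling = false;

    void compilationStarted() { compiling = true; }

    void compilationFinished() {
        compiling = false;
        drain();
    }

    /** Jobs are posted, never scheduled while the compiler is busy. */
    void postWhenIdle(Runnable job) {
        pending.add(job);
        if (!compiling) drain();
    }

    private void drain() {
        Runnable job;
        while (!compiling && (job = pending.poll()) != null) {
            job.run(); // invoked only when the compiler is known to be idle
        }
    }
}
```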

@hubertp hubertp self-assigned this Jun 12, 2023
@hubertp hubertp moved this from ❓New to 🔧 Implementation in Issues Board Jun 12, 2023
@enso-bot

enso-bot bot commented Jun 13, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-12):

Progress: Investigating timeouts for visualizations. Having trouble with reproduction as the problem does not happen reliably. Adding more logs to find out the source of the delay. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Continue investigation.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

I can confirm that the acquisition of locks in

is the real reason for the delays in handling those requests, which eventually end up as timeouts.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.
E.g.:
A request comes from the IDE for executionContext/attachVisualisation. It is handled by the handlers and eventually lands in AttachVisualisationCmd, which is executed by the CommandExecutionEngine and submits an UpsertVisualisationJob. But the job only starts 5 seconds(!) later, and it is this job that produces the response for the IDE. Hence the timeout in the handler.
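
That sequence is easy to reproduce with plain `java.util.concurrent` pools (a hypothetical sketch, not the actual `JobExecutionEngine`): an effectively unbounded command pool funnels work into a single-threaded job pool, so a late job sits in the queue far longer than any request timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobPoolStarvation {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService commandPool = Executors.newCachedThreadPool();   // ~unbounded
        ExecutorService jobPool = Executors.newSingleThreadExecutor();   // the bottleneck

        for (int i = 0; i < 20; i++) {
            final int id = i;
            final long submitted = System.nanoTime();
            // each "command" is handled quickly but submits a slow "job"
            commandPool.submit(() -> jobPool.submit(() -> {
                long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitted);
                System.out.printf("job %d started after %d ms in the queue%n", id, waitedMs);
                try {
                    TimeUnit.MILLISECONDS.sleep(500); // simulate the actual work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        commandPool.shutdown();
        commandPool.awaitTermination(1, TimeUnit.MINUTES); // commands finish fast
        jobPool.shutdown();
        jobPool.awaitTermination(1, TimeUnit.MINUTES);     // jobs drain one by one
    }
}
```

The last job starts roughly 10 seconds after it was submitted - the same shape as the 5-second delay observed above.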

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1 thread in the job pool:
(two screenshots from 2023-06-13 showing the thread pools)

That can certainly lead to a bottleneck.

@JaroslavTulach
Member

FYI: Mutex - my implementation of a read/write lock with "post-when-unlocked" methods. It is probably more flexible than our current Locking, but my preferred architectural solution is still to hide all locks.
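
A minimal sketch of the "post-when-unlocked" idea (modeled loosely on NetBeans' `org.openide.util.Mutex`, not a drop-in replacement for it): callers never block; if the lock is busy, the action is queued and executed once the holder releases.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Simplified: assumes all locking goes through postWriteRequest, so every
// unlock is followed by a drain of the deferred queue.
final class PostingMutex {
    private final ReentrantLock lock = new ReentrantLock();
    private final Queue<Runnable> deferred = new ConcurrentLinkedQueue<>();

    /** Run `action` under the lock now if it is free, otherwise defer it. */
    void postWriteRequest(Runnable action) {
        if (lock.tryLock()) {
            try {
                action.run();
            } finally {
                lock.unlock();
            }
        } else {
            deferred.add(action);
        }
        drain(); // pick up anything deferred while the lock was held
    }

    private void drain() {
        while (!deferred.isEmpty() && lock.tryLock()) {
            try {
                Runnable next = deferred.poll();
                if (next != null) next.run();
            } finally {
                lock.unlock();
            }
        }
    }
}
```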

@JaroslavTulach
Member

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1

Having 60 threads that exclude each other by synchronizing on a lock is a sign of poor design.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1

Having 60 threads that exclude each other by synchronizing on a lock is a sign of poor design.

The command pool is (almost) unlimited while the job pool apparently defaults to 1. So in the case of sudden bursts of visualization requests, we run a ton of command handlers, which can submit various jobs that are then queued until resources are available.

Won't comment on the design as I wasn't the one who did it, and there might have been different requirements 3 years ago.
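
Stated as code, the mismatch and the obvious fix direction might look like this (a hedged sketch with illustrative names and bounds; the actual pools are configured elsewhere in the Language Server):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class EnginePools {
    /** Before: bursty commands get a thread each, jobs share a single one. */
    static ExecutorService unboundedCommandPool() {
        return Executors.newCachedThreadPool();      // can balloon to 60+ threads
    }

    static ExecutorService singleThreadedJobPool() {
        return Executors.newSingleThreadExecutor();  // the observed bottleneck
    }

    /** After: bound the commands, give jobs real parallelism. */
    static ExecutorService boundedCommandPool() {
        return Executors.newFixedThreadPool(8);      // illustrative bound
    }

    static ExecutorService parallelJobPool() {
        int parallelism = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(parallelism);
    }
}
```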

@enso-bot

enso-bot bot commented Jun 14, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-13):

Progress: There are 2 problems leading to request timeouts: 1) lock contention between commands and jobs, 2) thread starvation for the job executor. We can have tens of commands submitted; most of them are handled relatively quickly, but they submit further jobs. Then we end up with a single executor for the jobs, which is the real bottleneck and the source of the problem. Will need to figure out how to improve the current status quo, hopefully without rewriting all that logic. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Figure out changes to improve the design.

@hubertp
Collaborator Author

hubertp commented Jun 14, 2023

Note that "job parallelism" is adapted during bootstrapping to read the number of available processors. That code should just work, except that when I actually log the value that is retrieved, I get 1(!). The availableProcessors() method seems to be rather sensitive to the current state of the OS, so probably a poor choice.

Edit: I was wrong, Runtime.getRuntime().availableProcessors() returns the desired value. It's the value that ends up in the context's options that appears to be wrong, and it is always 1. This is somehow a problem with the option descriptors.
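
A small probe of the behavior described above (assumes the GraalVM SDK on the classpath for `OptionKey`; the key below is a hypothetical stand-in for `enso-runtime-server.jobParallelism`): `availableProcessors()` itself is fine, while an option key whose set value never reaches the context simply hands back its default - which is exactly what an always-1 value looks like from the engine's side.

```java
import org.graalvm.options.OptionKey;

public class ParallelismProbe {
    // hypothetical stand-in for the jobParallelism option descriptor
    static final OptionKey<Integer> JOB_PARALLELISM = new OptionKey<>(1);

    public static void main(String[] args) {
        // the JVM reports the real processor count...
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
        // ...but a key that is never actually written yields its default of 1
        System.out.println("jobParallelism default = "
                + JOB_PARALLELISM.getDefaultValue());
    }
}
```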

@enso-bot

enso-bot bot commented Jun 15, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-14):

Progress: Been debugging our usage of Truffle's OptionDescriptors and came to the conclusion that our job processing never had more than 1 thread in the created thread pool. Once I fixed that, contention for locks became more visible. Considering sending replies earlier to avoid request timeouts. Needs more debugging. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Continue perf improvements and reducing lock contention.

@enso-bot

enso-bot bot commented Jun 16, 2023

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2023-06-15):

Summary: There is a 2-day delay in the implementation of the Investigate reasons for Request timeouts during AttachVisualizationHandler (#7005) task.
It will cause 0 days of delay for the delivery of this weekly plan.

Delay Cause: Discovered some serious flaws in the existing design of job execution that needed to be addressed.

@enso-bot

enso-bot bot commented Jun 16, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-15):

Progress: Continued my investigation into speeding up execution/preventing timeouts. Sending replies to the IDE from commands (rather than from the jobs triggered by them) both improved and deteriorated the situation: we could get a request to modify a visualization before it was fully attached. Had to figure out how to detect in-progress visualizations and add the additional checks. Created a draft PR in which no request timeouts happen for me. It should be finished by 2023-06-16.

Next Day: Next day I will be working on the #7005 task. Clean up the PR, address review, more testing.

@hubertp hubertp moved this from 🔧 Implementation to 👁️ Code review in Issues Board Jun 16, 2023
@mergify mergify bot closed this as completed in #7042 Jun 16, 2023
mergify bot pushed a commit that referenced this issue Jun 16, 2023 (#7042)

Request Timeouts started plaguing the IDE due to numerous `executionContext/***Visualization` requests. While caused by a bug, they revealed a bigger problem in the Language Server when serving large numbers of requests:
1) Long- and short-lived jobs fight for various locks. Lock contention leads to some jobs waiting longer than desired, resulting in unexpected request timeouts. Increasing the timeout value just delays the problem.
2) Requests coming from the IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve the request. We apparently have, and always had, a single-threaded thread pool for serving such jobs, leading to immediate thread starvation.

Both reasons increase the chances of Request Timeouts when dealing with a large number of requests. For 2) I noticed that while we used to set the `enso-runtime-server.jobParallelism` option descriptor key to some machine-dependent value (most likely > 1), the value set would **only** be available for instrumentation. `JobExecutionEngine`, where it is actually used, would always get the default, i.e. a single-threaded ThreadPool. This means that this option descriptor has been misused since its introduction. Moved that option to runtime options so that it can be set and retrieved during normal operation.

Adding parallelism intensified problem 1), because now we could execute multiple jobs and they would compete for resources. It also revealed yet another deadlock scenario, due to an invalid order of lock acquisition. See the `ExecuteJob` vs `UpsertVisualisationJob` order for details.

Still, a number of requests would continue to randomly time out due to lock contention. It became apparent that `Attach/Modify/Detach-VisualisationCmd` should not wait until a triggered `UpsertVisualisationJob` sends a response to the client; long- and short-lived jobs will always compete for resources and we cannot guarantee that they will not time out that way. That is why the response is now sent immediately from the command handler rather than from the job executed after it.

This brings another problematic scenario:
1. `AttachVisualisationCmd` is executed, the response is sent to the client, and `UpsertVisualisationJob` is scheduled.
2. In the meantime, `ModifyVisualisationCmd` comes in and fails; the command cannot find the visualization, which will only be added by `UpsertVisualisationJob` - and that job might not even have been scheduled to run yet.

Remedied that by checking for visualisation-related jobs that are still in progress. That also allowed cancelling jobs whose results wouldn't be used anyway (`ModifyVisualisationCmd` sends its own `UpsertVisualisationJob`). This is not a theoretical scenario; it happened frequently on IDE startup.

This change does not fully solve the rather problematic setup of numerous locks requested by short- and long-lived jobs. A better design should still be investigated, but this significantly reduces the chances of the Request Timeouts the IDE had to deal with.

With this change I haven't been able to trigger Request Timeouts for relatively modest projects anymore.

I also added the possibility of logging wait times for locks, to better investigate further problems.

Closes #7005
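
A hedged sketch of the in-progress bookkeeping the commit message describes (illustrative names, not the actual Enso classes): commands record the upsert job they scheduled, so a later modify can detect - or cancel - an attach that has not run yet, instead of failing with a missing visualization.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

final class InProgressUpserts {
    private final Map<UUID, Future<?>> running = new ConcurrentHashMap<>();

    /** Called by the command right after it schedules an UpsertVisualisationJob. */
    void register(UUID visualizationId, Future<?> upsertJob) {
        running.put(visualizationId, upsertJob);
    }

    /** Lets ModifyVisualisationCmd distinguish "not attached" from "still attaching". */
    boolean isInProgress(UUID visualizationId) {
        Future<?> job = running.get(visualizationId);
        return job != null && !job.isDone();
    }

    /** Cancel a pending upsert whose result would be superseded anyway. */
    void cancelObsolete(UUID visualizationId) {
        Future<?> job = running.remove(visualizationId);
        if (job != null) {
            job.cancel(false);
        }
    }
}
```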
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Jun 16, 2023
@enso-bot

enso-bot bot commented Jun 19, 2023

Hubert Plociniczak reports a new STANDUP for the provided date (2023-06-16):

Progress: Cleaned up the PR, did more testing and (eventually) merged it. Rebased #6774 (for #6875) but realized that it is probably not an issue. Looking at our logging service and considering whether we can simplify it with plain log4j loggers. Started looking at adding retries to the GraalVM updater. It should be finished by 2023-06-16.

Next Day: Next day I will be working on the #6880 task. Continue with adding retries to the updater.

Frizi pushed a commit that referenced this issue Jun 19, 2023 (#7042)