Investigate reasons for Request timeouts during AttachVisualizationHandler #7005

Closed
Tracked by #7001
hubertp opened this issue Jun 9, 2023 · 15 comments · Fixed by #7042

@hubertp
Collaborator

hubertp commented Jun 9, 2023

#6724 (comment) reports that Request timeouts happen frequently under heavy workload. Increasing the timeouts (as in #6972) is unlikely to fix the underlying problem.

A request for attaching a visualization should be handled rather swiftly (compared to sending the data for the visualization).

@hubertp
Collaborator Author

hubertp commented Jun 9, 2023

One thing which stands out when I'm reading the code is why we are requesting the writeCompilation lock for that job. This will contend with, for example, the compilation job, which can certainly take a lot of time.
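
To illustrate the failure mode, here is a minimal, hypothetical sketch using a plain `ReentrantReadWriteLock` (the names are illustrative, not Enso's actual `Locking` API): a long compilation holds the write lock while a visualization job blocks on the very same lock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class WriteLockContention {
    static final ReentrantReadWriteLock compilationLock = new ReentrantReadWriteLock();

    public static void main(String[] args) throws InterruptedException {
        Thread compiler = new Thread(() -> {
            compilationLock.writeLock().lock(); // long compilation takes the lock...
            try {
                TimeUnit.SECONDS.sleep(3);      // ...and holds it for a while
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                compilationLock.writeLock().unlock();
            }
        });
        compiler.start();
        TimeUnit.MILLISECONDS.sleep(100);       // let the compiler win the race

        long start = System.nanoTime();
        compilationLock.writeLock().lock();     // the attach-visualization job stalls here
        compilationLock.writeLock().unlock();
        System.out.printf("job waited %d ms for the write lock%n",
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
```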

@JaroslavTulach
Member

Having a job that, as its first operation, tries to obtain a compiler lock is a sign of poor design. If the lock isn't available, the job shall not be scheduled at all. Jobs that require some information from the compiler, or need the compiler to be idle, are so common that they need support in the infrastructure. It shouldn't be the job that acquires the lock - the infrastructure shall realize that compilation is over and invoke the job at that moment.

NetBeans editing infrastructure has an intricate system of "phases". Jobs register for individual phases and get invoked when such a (compiler) phase is finished. For example, identifying local variables happens sooner than resolving and binding imports.

Regardless of the infrastructure we select, the more lock-free the code of regular tasks is, the better. Locks belong to the infrastructure.
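
A minimal sketch of that idea (hypothetical, not NetBeans' or Enso's actual infrastructure): jobs never touch the compiler lock themselves; they are queued, and the infrastructure drains the queue once it knows compilation has finished.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified illustration: the flag/queue handoff is not fully atomic here,
// which a production implementation would have to address.
final class PostCompilationScheduler {
    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<>();
    private volatile boolean compiling = false;

    void compilationStarted() { compiling = true; }

    void compilationFinished() {
        compiling = false;
        drain();
    }

    /** Jobs are posted, never scheduled while the compiler is busy. */
    void postWhenIdle(Runnable job) {
        pending.add(job);
        if (!compiling) drain();
    }

    private void drain() {
        Runnable job;
        while (!compiling && (job = pending.poll()) != null) {
            job.run(); // invoked only when the compiler is known to be idle
        }
    }
}
```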

@hubertp hubertp self-assigned this Jun 12, 2023
@hubertp hubertp moved this from ❓New to 🔧 Implementation in Issues Board Jun 12, 2023
@enso-bot

enso-bot bot commented Jun 13, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-12):

Progress: Investigating timeouts for visualizations. Having trouble with reproduction as the problem does not happen reliably. Adding more logs to find out the source of the delay. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Continue investigation.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

I can confirm that the acquisition of locks in

is the real reason for the delays in handling those requests, which eventually end up as timeouts.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.
E.g.:
A request comes from the IDE for executionContext/attachVisualisation. It is handled by the handlers and eventually lands in AttachVisualisationCmd, which is executed by the CommandExecutionEngine and submits an UpsertVisualisationJob. But the job only starts 5 seconds(!) later, and it is this job that produces the response for the IDE. Hence the timeout in the handler.
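
That sequence is easy to reproduce with plain `java.util.concurrent` pools (a hypothetical sketch, not the actual `JobExecutionEngine`): an effectively unbounded command pool funnels work into a single-threaded job pool, so a late job sits in the queue far longer than any request timeout.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JobPoolStarvation {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService commandPool = Executors.newCachedThreadPool();   // ~unbounded
        ExecutorService jobPool = Executors.newSingleThreadExecutor();   // the bottleneck

        for (int i = 0; i < 20; i++) {
            final int id = i;
            final long submitted = System.nanoTime();
            // each "command" is handled quickly but submits a slow "job"
            commandPool.submit(() -> jobPool.submit(() -> {
                long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - submitted);
                System.out.printf("job %d started after %d ms in the queue%n", id, waitedMs);
                try {
                    TimeUnit.MILLISECONDS.sleep(500); // simulate the actual work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }));
        }
        commandPool.shutdown();
        commandPool.awaitTermination(1, TimeUnit.MINUTES); // commands finish fast
        jobPool.shutdown();
        jobPool.awaitTermination(1, TimeUnit.MINUTES);     // jobs drain one by one
    }
}
```

The last job starts roughly 10 seconds after it was submitted - the same shape as the 5-second delay observed above.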

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1 thread in the job pool:
(two screenshots from 2023-06-13 showing the thread pools)

That can certainly lead to a bottleneck.

@JaroslavTulach
Member

FYI: Mutex - my implementation of a read/write lock with "post-when-unlocked" methods. It is probably more flexible than our current Locking, but my preferred architectural solution is still to hide all locks.
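
A minimal sketch of the "post-when-unlocked" idea (modeled loosely on NetBeans' `org.openide.util.Mutex`, not a drop-in replacement for it): callers never block; if the lock is busy, the action is queued and executed once the holder releases.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.locks.ReentrantLock;

// Simplified: assumes all locking goes through postWriteRequest, so every
// unlock is followed by a drain of the deferred queue.
final class PostingMutex {
    private final ReentrantLock lock = new ReentrantLock();
    private final Queue<Runnable> deferred = new ConcurrentLinkedQueue<>();

    /** Run `action` under the lock now if it is free, otherwise defer it. */
    void postWriteRequest(Runnable action) {
        if (lock.tryLock()) {
            try {
                action.run();
            } finally {
                lock.unlock();
            }
        } else {
            deferred.add(action);
        }
        drain(); // pick up anything deferred while the lock was held
    }

    private void drain() {
        while (!deferred.isEmpty() && lock.tryLock()) {
            try {
                Runnable next = deferred.poll();
                if (next != null) next.run();
            } finally {
                lock.unlock();
            }
        }
    }
}
```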

@JaroslavTulach
Member

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1

Having 60 threads that exclude each other by synchronizing on a lock is a sign of poor design.

@hubertp
Collaborator Author

hubertp commented Jun 13, 2023

Another insight: it seems we can also easily experience thread starvation in our JobExecutionEngine.

And just to add more info: on startup, at some point we can have 60+ threads in the command pool and only 1

Having 60 threads that exclude each other by synchronizing on a lock is a sign of poor design.

The command pool is (almost) unlimited while the job pool apparently defaults to 1. So in the case of sudden bursts of visualization requests, we run a ton of command handlers, which can submit various jobs that are then queued until resources are available.

Won't comment on the design as I wasn't the one who did it, and there might have been different requirements 3 years ago.
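
Stated as code, the mismatch and the obvious fix direction might look like this (a hedged sketch with illustrative names and bounds; the actual pools are configured elsewhere in the Language Server):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class EnginePools {
    /** Before: bursty commands get a thread each, jobs share a single one. */
    static ExecutorService unboundedCommandPool() {
        return Executors.newCachedThreadPool();      // can balloon to 60+ threads
    }

    static ExecutorService singleThreadedJobPool() {
        return Executors.newSingleThreadExecutor();  // the observed bottleneck
    }

    /** After: bound the commands, give jobs real parallelism. */
    static ExecutorService boundedCommandPool() {
        return Executors.newFixedThreadPool(8);      // illustrative bound
    }

    static ExecutorService parallelJobPool() {
        int parallelism = Runtime.getRuntime().availableProcessors();
        return Executors.newFixedThreadPool(parallelism);
    }
}
```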

@enso-bot

enso-bot bot commented Jun 14, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-13):

Progress: There are 2 problems leading to request timeouts: 1) lock contention between commands and jobs, 2) thread starvation for the job executor. We can have tens of commands submitted; most of them are handled relatively quickly, but they submit further jobs. Then we end up with a single executor for the jobs, which is the real bottleneck and the source of the problem. Will need to figure out how to improve the current status quo, hopefully without rewriting all that logic. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Figure out changes to improve the design.

@hubertp
Collaborator Author

hubertp commented Jun 14, 2023

Note that "job parallelism" is adapted during bootstrapping to read the number of available processors. That code should just work, except that when I actually log the value that is retrieved, I get 1(!). The availableProcessors() method seems to be rather sensitive to the current state of the OS, so probably a poor choice.

Edit: I was wrong, Runtime.getRuntime().availableProcessors() returns the desired value. It's the value that ends up in the context's options that appears to be wrong, and it is always 1. This is somehow a problem with the option descriptors.
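
A small probe of the behavior described above (assumes the GraalVM SDK on the classpath for `OptionKey`; the key below is a hypothetical stand-in for `enso-runtime-server.jobParallelism`): `availableProcessors()` itself is fine, while an option key whose set value never reaches the context simply hands back its default - which is exactly what an always-1 value looks like from the engine's side.

```java
import org.graalvm.options.OptionKey;

public class ParallelismProbe {
    // hypothetical stand-in for the jobParallelism option descriptor
    static final OptionKey<Integer> JOB_PARALLELISM = new OptionKey<>(1);

    public static void main(String[] args) {
        // the JVM reports the real processor count...
        System.out.println("availableProcessors = "
                + Runtime.getRuntime().availableProcessors());
        // ...but a key that is never actually written yields its default of 1
        System.out.println("jobParallelism default = "
                + JOB_PARALLELISM.getDefaultValue());
    }
}
```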

@enso-bot

enso-bot bot commented Jun 15, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-14):

Progress: Been debugging our usage of Truffle's OptionDescriptors and came to the conclusion that our job processing never had more than 1 thread in the created thread pool. Once I fixed that, contention for locks became more visible. Considering sending replies earlier to avoid request timeouts. Needs more debugging. It should be finished by 2023-06-14.

Next Day: Next day I will be working on the #7005 task. Continue perf improvements and reducing lock contention.

@enso-bot

enso-bot bot commented Jun 16, 2023

Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2023-06-15):

Summary: There is a 2-day delay in the implementation of the Investigate reasons for Request timeouts during AttachVisualizationHandler (#7005) task.
It will cause 0 days of delay for the delivery of this weekly plan.

Delay Cause: Discovered some serious flaws in the existing design of job execution that needed to be addressed.

@enso-bot

enso-bot bot commented Jun 16, 2023

Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-15):

Progress: Continued my investigation into speeding up execution/preventing timeouts. Sending replies to the IDE from commands (rather than from the jobs triggered by them) both improved and deteriorated the situation: we could get a request to modify a visualization before it was fully attached. Had to figure out how to detect in-progress visualizations and add the additional checks. Created a draft PR in which no request timeouts happen for me. It should be finished by 2023-06-16.

Next Day: Next day I will be working on the #7005 task. Clean up the PR, address review, more testing.

@hubertp hubertp moved this from 🔧 Implementation to 👁️ Code review in Issues Board Jun 16, 2023
@mergify mergify bot closed this as completed in #7042 Jun 16, 2023
mergify bot pushed a commit that referenced this issue Jun 16, 2023 (#7042)

Request Timeouts started plaguing the IDE due to numerous `executionContext/***Visualization` requests. While caused by a bug, they revealed a bigger problem in the Language Server when serving large numbers of requests:
1) Long- and short-lived jobs fight for various locks. Lock contention leads to some jobs waiting longer than desired, resulting in unexpected request timeouts. Increasing the timeout value just delays the problem.
2) Requests coming from the IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve the request. We apparently have, and always had, a single-threaded thread pool for serving such jobs, leading to immediate thread starvation.

Both reasons increase the chances of Request Timeouts when dealing with a large number of requests. For 2) I noticed that while we used to set the `enso-runtime-server.jobParallelism` option descriptor key to some machine-dependent value (most likely > 1), the value set would **only** be available for instrumentation. `JobExecutionEngine`, where it is actually used, would always get the default, i.e. a single-threaded ThreadPool. This means that this option descriptor has been misused since its introduction. Moved that option to runtime options so that it can be set and retrieved during normal operation.

Adding parallelism intensified problem 1), because now we could execute multiple jobs and they would compete for resources. It also revealed yet another deadlock scenario, due to an invalid order of lock acquisition. See the `ExecuteJob` vs `UpsertVisualisationJob` order for details.

Still, a number of requests would continue to randomly time out due to lock contention. It became apparent that `Attach/Modify/Detach-VisualisationCmd` should not wait until a triggered `UpsertVisualisationJob` sends a response to the client; long- and short-lived jobs will always compete for resources and we cannot guarantee that they will not time out that way. That is why the response is now sent immediately from the command handler rather than from the job executed after it.

This brings another problematic scenario:
1. `AttachVisualisationCmd` is executed, the response is sent to the client, and `UpsertVisualisationJob` is scheduled.
2. In the meantime, `ModifyVisualisationCmd` comes in and fails; the command cannot find the visualization, which will only be added by `UpsertVisualisationJob` - and that job might not even have been scheduled to run yet.

Remedied that by checking for visualisation-related jobs that are still in progress. That also allowed cancelling jobs whose results wouldn't be used anyway (`ModifyVisualisationCmd` sends its own `UpsertVisualisationJob`). This is not a theoretical scenario; it happened frequently on IDE startup.

This change does not fully solve the rather problematic setup of numerous locks requested by short- and long-lived jobs. A better design should still be investigated, but this significantly reduces the chances of the Request Timeouts the IDE had to deal with.

With this change I haven't been able to trigger Request Timeouts for relatively modest projects anymore.

I also added the possibility of logging wait times for locks, to better investigate further problems.

Closes #7005
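
A hedged sketch of the in-progress bookkeeping the commit message describes (illustrative names, not the actual Enso classes): commands record the upsert job they scheduled, so a later modify can detect - or cancel - an attach that has not run yet, instead of failing with a missing visualization.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

final class InProgressUpserts {
    private final Map<UUID, Future<?>> running = new ConcurrentHashMap<>();

    /** Called by the command right after it schedules an UpsertVisualisationJob. */
    void register(UUID visualizationId, Future<?> upsertJob) {
        running.put(visualizationId, upsertJob);
    }

    /** Lets ModifyVisualisationCmd distinguish "not attached" from "still attaching". */
    boolean isInProgress(UUID visualizationId) {
        Future<?> job = running.get(visualizationId);
        return job != null && !job.isDone();
    }

    /** Cancel a pending upsert whose result would be superseded anyway. */
    void cancelObsolete(UUID visualizationId) {
        Future<?> job = running.remove(visualizationId);
        if (job != null) {
            job.cancel(false);
        }
    }
}
```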
@github-project-automation github-project-automation bot moved this from 👁️ Code review to 🟢 Accepted in Issues Board Jun 16, 2023
@enso-bot

enso-bot bot commented Jun 19, 2023

Hubert Plociniczak reports a new STANDUP for the provided date (2023-06-16):

Progress: Cleaned up the PR, did more testing and (eventually) merged it. Rebased #6774 (for #6875) but realized that it is probably not an issue. Looking at our logging service and considering whether we can simplify it with plain log4j loggers. Started looking at adding retries to the GraalVM updater. It should be finished by 2023-06-16.

Next Day: Next day I will be working on the #6880 task. Continue with adding retries to the updater.

Frizi pushed a commit that referenced this issue Jun 19, 2023 (#7042)