Investigate reasons for Request timeouts during AttachVisualizationHandler
#7005
One thing which stands out when I'm reading the code is why are we requesting
Having a job that, as its first operation, tries to obtain a compiler lock is a sign of poor design. If the lock isn't available, the job shall not be scheduled at all. Jobs that require some information from the compiler, or need the compiler to be idle, are so common that they need support in the infrastructure. It shouldn't be the job that requires the lock - the infrastructure shall realize that compilation is over and invoke the job at that moment. The NetBeans editing infrastructure has an intricate system of "phases". Jobs register for individual phases and get invoked when such a (compiler) phase is finished. For example, identifying local variables happens sooner than resolving and binding imports. Regardless of the infrastructure we select, the more lock-free the code of regular tasks is, the better. Locks belong to the infrastructure.
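For illustration, a minimal sketch of that phase-registration idea, assuming hypothetical `CompilerPhase` and `PhaseScheduler` names that do not exist in the codebase: jobs never take the compiler lock themselves; the infrastructure invokes them once the phase they registered for has finished.

```scala
// Hypothetical sketch: jobs register for a compiler phase and are invoked by
// the infrastructure when that phase completes, instead of grabbing locks.
object PhaseSchedulerSketch {

  sealed trait CompilerPhase
  case object Parsed        extends CompilerPhase
  case object ImportsBound  extends CompilerPhase
  case object FullyCompiled extends CompilerPhase

  final class PhaseScheduler {
    private var pending = Map.empty[CompilerPhase, List[() => Unit]]

    /** Register a job to run once `phase` completes; the job takes no compiler lock. */
    def onPhaseDone(phase: CompilerPhase)(job: => Unit): Unit = synchronized {
      val existing = pending.getOrElse(phase, Nil)
      pending = pending.updated(phase, (() => job) :: existing)
    }

    /** Called by the compiler infrastructure when it finishes a phase. */
    def phaseFinished(phase: CompilerPhase): Unit = {
      val jobs = synchronized {
        val scheduled = pending.getOrElse(phase, Nil)
        pending = pending - phase
        scheduled
      }
      jobs.reverse.foreach(_.apply())
    }
  }

  // Example: run a visualization-attach job only after full compilation.
  def example(): Unit = {
    val scheduler = new PhaseScheduler
    scheduler.onPhaseDone(FullyCompiled) { println("attach visualization now") }
    // ... compiler runs ...
    scheduler.phaseFinished(FullyCompiled) // infrastructure triggers the job
  }
}
```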
Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-12): Progress: Investigating timeouts for visualizations. Having trouble with reproduction as the problem does not happen reliably. Adding more logs to find out the source of the delay. It should be finished by 2023-06-14. Next Day: Next day I will be working on the #7005 task. Continue investigation.
I can confirm that acquisition of locks is the real reason for the delays in handling those requests, eventually ending up as timeouts.
Another insight: it seems we can also easily experience thread starvation in our
FYI: Mutex - my implementation of a read/write lock with "post-when-unlocked" methods. Probably more flexible than our current
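As a rough illustration of what a read/write lock with "post-when-unlocked" methods might look like (the `PostingRwLock` name and its API are assumptions for this sketch, not the referenced Mutex implementation): the caller is never blocked on a busy write lock; instead the action is handed to a worker that runs it under the lock once it frees up.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.locks.ReentrantReadWriteLock

// Sketch only: a read/write lock where write actions can be "posted" instead of
// blocking the calling thread; a single worker runs them when the lock is free.
final class PostingRwLock {
  private val rw     = new ReentrantReadWriteLock()
  private val worker = Executors.newSingleThreadExecutor()

  /** Run `body` under the read lock on the calling thread. */
  def withRead[A](body: => A): A = {
    rw.readLock().lock()
    try body finally rw.readLock().unlock()
  }

  /** Run `body` under the write lock on the calling thread. */
  def withWrite[A](body: => A): A = {
    rw.writeLock().lock()
    try body finally rw.writeLock().unlock()
  }

  /** Do not block the caller: run `action` under the write lock whenever it frees up. */
  def postWhenUnlocked(action: () => Unit): Unit =
    worker.execute(() => withWrite(action()))

  def shutdown(): Unit = worker.shutdown()
}
```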
Having 60 threads that exclude each other by synchronizing on a lock is a sign of poor design.
The command pool is (almost) unlimited while the job pool apparently defaults to 1. So in the case of sudden bursts of visualization requests, we run a ton of command handlers which can submit various jobs that are then queued until resources are available. Won't comment on the design as I wasn't the one who did it and there might have been different requirements 3 years ago.
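A toy reproduction of that imbalance, with illustrative pool sizes and names rather than the actual server configuration: the burst of commands is accepted instantly, but every job they submit queues behind a single worker.

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Illustrative only: an (almost) unbounded command pool feeding a
// single-threaded job pool, as described in the comment above.
object PoolImbalanceDemo {
  def main(args: Array[String]): Unit = {
    val commandPool = Executors.newCachedThreadPool() // effectively unbounded
    val jobPool     = Executors.newFixedThreadPool(1) // the actual bottleneck

    (1 to 60).foreach { i =>
      commandPool.execute { () =>
        // Command handler: cheap work, returns quickly...
        jobPool.execute { () =>
          // ...but the job it schedules is serialized behind every other job.
          Thread.sleep(100)
          println(s"job $i done on ${Thread.currentThread().getName}")
        }
      }
    }

    commandPool.shutdown()
    jobPool.shutdown()
    jobPool.awaitTermination(30, TimeUnit.SECONDS)
  }
}
```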
Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-13): Progress: There are 2 problems leading to request timeouts: 1) lock contention between commands and jobs, 2) thread starvation for the job executor. We can have tens of commands submitted, most of them handled relatively quickly, but they are submitting further jobs. Then we end up with a single executor for the jobs, which is the real bottleneck and the source of the problem. Will need to figure out how to improve the current status quo, hopefully without rewriting all that logic. It should be finished by 2023-06-14. Next Day: Next day I will be working on the #7005 task. Figure out changes to improve the design.
Note that "job parallelism" is adapted during bootstrapping to read the number of available processors. That code should just work except when I actually log the value that is retrieved I get 1(!). Edit: I was wrong, |
Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-14): Progress: Been debugging our usage of Truffle's OptionDescriptors and came to the conclusion that our job processing never had more than 1 thread in the created thread pool. Once I fixed that, contention for locks became more visible. Considering sending replies earlier to avoid request timeouts. Needs more debugging. It should be finished by 2023-06-14. Next Day: Next day I will be working on the #7005 task. Continue perf improvements and reducing lock contention.
Hubert Plociniczak reports a new 🔴 DELAY for yesterday (2023-06-15): Summary: There is a 2-day delay in the implementation of the "Investigate reasons for Request timeouts during AttachVisualizationHandler" task. Delay Cause: Discovered some serious flaws in the existing design of job execution that needed to be addressed.
Hubert Plociniczak reports a new STANDUP for yesterday (2023-06-15): Progress: Continued my investigation into speeding up execution/preventing timeouts. Sending replies to the IDE from commands (rather than from the jobs triggered by them) both improved and deteriorated the situation: we could get a request to modify a visualization before it was fully attached. Had to figure out how to detect in-progress visualizations and add the additional checks. Created a draft PR in which no request timeouts happen to me. It should be finished by 2023-06-16. Next Day: Next day I will be working on the #7005 task. Clean up the PR, address review, more testing.
#7042) Request Timeouts started plaguing the IDE due to numerous `executionContext/***Visualization` requests. While caused by a bug, they revealed a bigger problem in the Language Server when serving large amounts of requests:

1) Long- and short-lived jobs are fighting for various locks. Lock contention leads to some jobs waiting longer than desired, which results in unexpected request timeouts. Increasing the timeout value only delays the problem.
2) Requests coming from the IDE are served almost instantly and handled by various commands. Commands can issue further jobs that serve the request. We apparently have, and always had, a single-thread thread pool for serving such jobs, leading to immediate thread starvation.

Both reasons increase the chances of Request Timeouts when dealing with a large number of requests.

For 2) I noticed that while we used to set the `enso-runtime-server.jobParallelism` option descriptor key to some machine-dependent value (most likely > 1), the value set would **only** be available for instrumentation. `JobExecutionEngine`, where it is actually used, would always get the default, i.e. a single-threaded ThreadPool. This means that this option descriptor was simply misused since its introduction. Moved that option to runtime options so that it can be set and retrieved during normal operation.

Adding parallelism intensified problem 1), because now we could execute multiple jobs and they would compete for resources. It also revealed yet another deadlock scenario, due to an invalid order of lock acquisition. See the `ExecuteJob` vs `UpsertVisualisationJob` order for details. Still, a number of requests would continue to randomly time out due to lock contention. It became apparent that `Attach/Modify/Detach-VisualisationCmd` should not wait until a triggered `UpsertVisualisationJob` sends a response to the client; long- and short-lived jobs will always compete for resources and we cannot guarantee that they will not time out that way. That is why the response is sent immediately from the command handler and not from the job executed after it. This brings another problematic scenario:

1. `AttachVisualisationCmd` is executed, the response is sent to the client, and `UpsertVisualisationJob` is scheduled.
2. In the meantime `ModifyVisualisationCmd` comes and fails; the command cannot find the visualization that will only be added by `UpsertVisualisationJob`, which might not yet have been scheduled to run.

Remedied that by checking visualisation-related jobs that are still in progress. It also allowed for cancelling jobs whose results wouldn't be used anyway (`ModifyVisualisationCmd` sends its own `UpsertVisualisationJob`). This is not a theoretical scenario; it happened frequently on IDE startup.

This change does not fully solve the rather problematic setup of numerous locks requested by short- and long-lived jobs. A better design should still be investigated. But it significantly reduces the chances of the Request Timeouts the IDE had to deal with. With this change I haven't been able to experience Request Timeouts for relatively modest projects anymore. I added the possibility of logging wait times for locks to better investigate further problems.

Closes #7005
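A rough sketch of the lock wait-time logging mentioned at the end of the PR description; the `TimedLock` name, the threshold, and the logging style are assumptions for illustration, not the actual implementation.

```scala
import java.util.concurrent.locks.ReentrantLock
import java.util.logging.Logger

// Sketch: wrap lock acquisition, measure how long the caller was blocked,
// and log a warning when the wait exceeds a threshold.
final class TimedLock(name: String, warnAfterMillis: Long = 100) {
  private val lock   = new ReentrantLock()
  private val logger = Logger.getLogger("TimedLock")

  def withLock[A](body: => A): A = {
    val start = System.nanoTime()
    lock.lock()
    val waitedMillis = (System.nanoTime() - start) / 1000000
    if (waitedMillis > warnAfterMillis)
      logger.warning(s"waited ${waitedMillis}ms to acquire '$name'")
    try body finally lock.unlock()
  }
}
```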
Hubert Plociniczak reports a new STANDUP for the provided date (2023-06-16): Progress: Cleaned up the PR, more testing, and (eventually) merged it. Rebased #6774 (for #6875) but realized that it is probably not an issue. Looking at our logging service and thinking if we can simplify it with plain log4j loggers. Started looking at adding retries to the GraalVM updater. It should be finished by 2023-06-16. Next Day: Next day I will be working on the #6880 task. Continue with adding retries to the updater.
#6724 (comment) lists that Request timeouts happen frequently under heavy workload. Increasing timeouts (as in #6972) is unlikely to fix the underlying problem. A request for attaching a visualization should be handled rather swiftly (compared to sending data for the visualization).