-
Notifications
You must be signed in to change notification settings - Fork 237
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved parallel transport mechanism and improved cleanup #809
Merged
gdt
merged 9 commits into
bcpierce00:master
from
tleedjarv:connection-cleanup+new-transport
Nov 8, 2022
Merged
Improved parallel transport mechanism and improved cleanup #809
gdt
merged 9 commits into
bcpierce00:master
from
tleedjarv:connection-cleanup+new-transport
Nov 8, 2022
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Currently there is quite a bit of common update propagation code which is duplicated in each UI implementation with slight variations. Make this code shared by all UI implementations and move it to the Uicommon module.
Currently all update actions are propagated by up front creating a separate Lwt thread for each update. All threads are kept in a single list and the sync is considered complete once all threads in this list have completed their work. This approach, creating all threads up front and keeping them in a list, can keep a lot of memory allocated because threads and their data are only collected by GC once the entire list is collected. For large syncs, with hundreds of thousands or millions of updates, the memory wasted by inactive and uncollected completed threads can reach gigabytes. This patch includes two changes to considerably reduce memory requirement. First, reduce the number of Lwt threads which are not collectable by GC by not storing the threads in a list or some other data structure. This makes detecting sync completion a bit harder. In this patch it is solved by adding a separate thread which acts sort of like a ref counter and wakes up when the counter of running threads reaches zero (or when any thread fails with an exception). Second, don't create threads that won't be started due to the limit of 'maxthreads' (essentially the worker pool size). New threads are created on demand when workers become available in the pool.
Code for managing client connections in the Remote module is scattered around and duplicated in many locations. Move the code into dedicated functions and collect most of the functions in a new sub-module. While at it, make some smaller code changes (no change in functionality) to prepare for future improvements.
... and increase its correctness just a tiny bit. In previous code, looking up connections used a less unique key than the one used when registering opened the connections. This resulted in the previous code potentially overwriting distinct connections. This is still not fully fixed. This is not really a problem at the moment because several connections are never used in parallel.
The code before recent commits never closed sockets (or remote channel fds) and remote connections on the client. On the socket server, the connected socket was closed only once the exception of lost connection had reached the topmost loop of the server. Now, when detecting a lost connection at low-level IO, the socket (or remote channel fd) that lost the connection and other fds (if any) grouped into the same "connection" are closed immediately. After that, connection cleanup is triggered (running registered [at_conn_close] handlers). Finally, an exception for lost connection is raised (as before). This means that any code catching the lost connection exception is run after all the cleanup has been executed. This is expected not to be a problem, as cleanup code will typically close open resources and purge pending tasks, so is either a duplicate of or at least not interfering with other exception-catching code.
When a connection is closed in the middle of ongoing update propagation then the exception will stop the running propagation threads (of which there can be thousands, as these are cheap Lwt threads). Add cleanup code to close resources and purge pending tasks, primarily intended to be run when a connection is unexpectedly closed. The same cleanup will also be executed when connection is closed intentionally. This cleanup has not been needed before because any uncaught exception during transport was considered a fatal error and resulted in quitting the UI. Since 3a4e158, the GUI can survive such fatal errors and will benefit from the cleanup.
Previously, client never closed remote connections. Since recent commits, client closes the connection (close related sockets and run cleanup code) when detecting lost connection (as reported by the OS). Add a function for the client to close remote connections at will.
When an uncaught exception is raised during transport, trigger a cleanup attempt which will: cancel all pending transport threads, stop current running threads as much as possible, close the remote connection (which in turn will trigger most of the relevant cleanups) and try to drain any still-running threads by letting them fail with an (ignored) exception or run to completion.
Indentation changes were excluded from a few previous commits to reduce the diff and make reviewing them easier. This commit can easily be reviewed by ignoring indentation changes (the diff becomes empty).
I'd like to see test reports, but am inclined to just hit merge 2 weeks after 2.53.0 is released if not. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR includes two features that are logically related. See motivation for a single PR below.
The first commit (Move common update propagation code to Uicommon) is common to both features.
The second commit (Reduce memory usage during update propagation) is the first feature, inspired by #352 and #780. This covers the green path of update propagation.
The remaining commits are the second feature, which is a continuation of improvements introduced in #649 with fatal errors not stopping the GUI and #764. This covers the error path of update propagation.
The motivation and goals of these changes are: improved resource usage, fixing/prevention of resource leaks and making the UI more error-tolerant. See more detailed description in commit messages and in code comments.
For reviewers: Reviewing commit-by-commit is the simplest. Ignoring whitespace changes and detecting moves (also ignoring whitespace or indentation changes for move detection) may help.
Testers needed!
Volunteer testing is very welcome. Regarding fear of data loss/corruption, it is as safe to test as any other Unison version. In fact, there are improvements in error handling code that are expected to result in a cleaner and safer recovery from errors.
How to test? Just use Unison as normal. There are no user-visible changes as long as update propagation completes successfully. If you want to go further then you can simulate errors/interruptions during update propagation. For example, kill the server process, break the network connection, suspend the client, kill the ssh process; anything you can come up with, really. This should stop the propagation with a fatal error but should not result in uncaught exceptions or additional errors caused by the cleanup. If you're using the GUI then you should be able to continue using the GUI normally and either rescan or change the profile. Please report any deviations.
Text UI and the GTK GUI have received testing (but please test more!). The macOS native GUI has not had any testing yet.
With this merged, I think #352 can be closed. I see memory consumption reduced by > 1 kB and there have been other improvements before this. I think memory consumption of < 2 kB per update (in text UI) is reasonable to expect now. (But it depends a lot on the length of path names and synced properties.)
It will probably fix #780, too, but I'm not going to test it.
Why two features in a single PR? Given the resources available to reviewing and testing, having two (or more) separate PRs would result in less review and testing. The included features make changes in the same process (update propagation), one in the green path and the other in the error path. It makes sense to test these changes together and the risk of masking/hiding errors due to too much change at once is minimal.