Improved parallel transport mechanism and improved cleanup #809

tleedjarv · 2022-10-24T11:13:56Z

This PR includes two features that are logically related. See motivation for a single PR below.

The first commit (Move common update propagation code to Uicommon) is common to both features.

The second commit (Reduce memory usage during update propagation) is the first feature, inspired by #352 and #780. This covers the green path of update propagation.

The remaining commits are the second feature, which is a continuation of improvements introduced in #649 with fatal errors not stopping the GUI and #764. This covers the error path of update propagation.

The motivation and goals of these changes are: improved resource usage, fixing/prevention of resource leaks and making the UI more error-tolerant. See more detailed description in commit messages and in code comments.

For reviewers: Reviewing commit-by-commit is the simplest. Ignoring whitespace changes and detecting moves (also ignoring whitespace or indentation changes for move detection) may help.

Testers needed!

Volunteer testing is very welcome. Regarding fear of data loss/corruption, it is as safe to test as any other Unison version. In fact, there are improvements in error handling code that are expected to result in a cleaner and safer recovery from errors.

How to test? Just use Unison as normal. There are no user-visible changes as long as update propagation completes successfully. If you want to go further then you can simulate errors/interruptions during update propagation. For example, kill the server process, break the network connection, suspend the client, kill the ssh process; anything you can come up with, really. This should stop the propagation with a fatal error but should not result in uncaught exceptions or additional errors caused by the cleanup. If you're using the GUI then you should be able to continue using the GUI normally and either rescan or change the profile. Please report any deviations.

Text UI and the GTK GUI have received testing (but please test more!). The macOS native GUI has not had any testing yet.

With this merged, I think #352 can be closed. I see memory consumption reduced by > 1 kB and there have been other improvements before this. I think memory consumption of < 2 kB per update (in text UI) is reasonable to expect now. (But it depends a lot on the length of path names and synced properties.)
It will probably fix #780, too, but I'm not going to test it.

Why two features in a single PR? Given the resources available to reviewing and testing, having two (or more) separate PRs would result in less review and testing. The included features make changes in the same process (update propagation), one in the green path and the other in the error path. It makes sense to test these changes together and the risk of masking/hiding errors due to too much change at once is minimal.

Currently there is quite a bit of common update propagation code which is duplicated in each UI implementation with slight variations. Make this code shared by all UI implementations and move it to the Uicommon module.

Currently all update actions are propagated by up front creating a separate Lwt thread for each update. All threads are kept in a single list and the sync is considered complete once all threads in this list have completed their work. This approach, creating all threads up front and keeping them in a list, can keep a lot of memory allocated because threads and their data are only collected by GC once the entire list is collected. For large syncs, with hundreds of thousands or millions of updates, the memory wasted by inactive and uncollected completed threads can reach gigabytes. This patch includes two changes to considerably reduce memory requirement. First, reduce the number of Lwt threads which are not collectable by GC by not storing the threads in a list or some other data structure. This makes detecting sync completion a bit harder. In this patch it is solved by adding a separate thread which acts sort of like a ref counter and wakes up when the counter of running threads reaches zero (or when any thread fails with an exception). Second, don't create threads that won't be started due to the limit of 'maxthreads' (essentially the worker pool size). New threads are created on demand when workers become available in the pool.

Code for managing client connections in the Remote module is scattered around and duplicated in many locations. Move the code into dedicated functions and collect most of the functions in a new sub-module. While at it, make some smaller code changes (no change in functionality) to prepare for future improvements.

... and increase its correctness just a tiny bit. In previous code, looking up connections used a less unique key than the one used when registering opened the connections. This resulted in the previous code potentially overwriting distinct connections. This is still not fully fixed. This is not really a problem at the moment because several connections are never used in parallel.

The code before recent commits never closed sockets (or remote channel fds) and remote connections on the client. On the socket server, the connected socket was closed only once the exception of lost connection had reached the topmost loop of the server. Now, when detecting a lost connection at low-level IO, the socket (or remote channel fd) that lost the connection and other fds (if any) grouped into the same "connection" are closed immediately. After that, connection cleanup is triggered (running registered [at_conn_close] handlers). Finally, an exception for lost connection is raised (as before). This means that any code catching the lost connection exception is run after all the cleanup has been executed. This is expected not to be a problem, as cleanup code will typically close open resources and purge pending tasks, so is either a duplicate of or at least not interfering with other exception-catching code.

When a connection is closed in the middle of ongoing update propagation then the exception will stop the running propagation threads (of which there can be thousands, as these are cheap Lwt threads). Add cleanup code to close resources and purge pending tasks, primarily intended to be run when a connection is unexpectedly closed. The same cleanup will also be executed when connection is closed intentionally. This cleanup has not been needed before because any uncaught exception during transport was considered a fatal error and resulted in quitting the UI. Since 3a4e158, the GUI can survive such fatal errors and will benefit from the cleanup.

Previously, client never closed remote connections. Since recent commits, client closes the connection (close related sockets and run cleanup code) when detecting lost connection (as reported by the OS). Add a function for the client to close remote connections at will.

When an uncaught exception is raised during transport, trigger a cleanup attempt which will: cancel all pending transport threads, stop current running threads as much as possible, close the remote connection (which in turn will trigger most of the relevant cleanups) and try to drain any still-running threads by letting them fail with an (ignored) exception or run to completion.

Indentation changes were excluded from a few previous commits to reduce the diff and make reviewing them easier. This commit can easily be reviewed by ignoring indentation changes (the diff becomes empty).

gdt · 2022-11-01T17:39:08Z

I'd like to see test reports, but am inclined to just hit merge 2 weeks after 2.53.0 is released if not.

tleedjarv added 9 commits October 18, 2022 13:43

Move common update propagation code to Uicommon

a4b704b

Currently there is quite a bit of common update propagation code which is duplicated in each UI implementation with slight variations. Make this code shared by all UI implementations and move it to the Uicommon module.

Fix indendation after previous commits

d3202ee

Indentation changes were excluded from a few previous commits to reduce the diff and make reviewing them easier. This commit can easily be reviewed by ignoring indentation changes (the diff becomes empty).

tleedjarv mentioned this pull request Oct 24, 2022

Make it possible to stop update propagation by user request #810

Merged

gdt merged commit 0a09a24 into bcpierce00:master Nov 8, 2022

gdt mentioned this pull request Nov 8, 2022

Initiial sync of 650,000 files in text UI can use 5kB per file, exhausting 4G of memory #352

Closed

tleedjarv deleted the connection-cleanup+new-transport branch November 8, 2022 13:42

tleedjarv mentioned this pull request Nov 13, 2022

Make repeat mode slightly fault tolerant #830

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improved parallel transport mechanism and improved cleanup #809

Improved parallel transport mechanism and improved cleanup #809

tleedjarv commented Oct 24, 2022

gdt commented Nov 1, 2022

Improved parallel transport mechanism and improved cleanup #809

Improved parallel transport mechanism and improved cleanup #809

Conversation

tleedjarv commented Oct 24, 2022

Testers needed!

gdt commented Nov 1, 2022