Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved parallel transport mechanism and improved cleanup #809

Merged
merged 9 commits into from
Nov 8, 2022

Conversation

tleedjarv
Copy link
Contributor

This PR includes two features that are logically related. See motivation for a single PR below.

The first commit (Move common update propagation code to Uicommon) is common to both features.

The second commit (Reduce memory usage during update propagation) is the first feature, inspired by #352 and #780. This covers the green path of update propagation.

The remaining commits are the second feature, which is a continuation of improvements introduced in #649 with fatal errors not stopping the GUI and #764. This covers the error path of update propagation.

The motivation and goals of these changes are: improved resource usage, fixing/prevention of resource leaks and making the UI more error-tolerant. See more detailed description in commit messages and in code comments.

For reviewers: Reviewing commit-by-commit is the simplest. Ignoring whitespace changes and detecting moves (also ignoring whitespace or indentation changes for move detection) may help.

Testers needed!

Volunteer testing is very welcome. Regarding fear of data loss/corruption, it is as safe to test as any other Unison version. In fact, there are improvements in error handling code that are expected to result in a cleaner and safer recovery from errors.

How to test? Just use Unison as normal. There are no user-visible changes as long as update propagation completes successfully. If you want to go further then you can simulate errors/interruptions during update propagation. For example, kill the server process, break the network connection, suspend the client, kill the ssh process; anything you can come up with, really. This should stop the propagation with a fatal error but should not result in uncaught exceptions or additional errors caused by the cleanup. If you're using the GUI then you should be able to continue using the GUI normally and either rescan or change the profile. Please report any deviations.

Text UI and the GTK GUI have received testing (but please test more!). The macOS native GUI has not had any testing yet.


With this merged, I think #352 can be closed. I see memory consumption reduced by > 1 kB and there have been other improvements before this. I think memory consumption of < 2 kB per update (in text UI) is reasonable to expect now. (But it depends a lot on the length of path names and synced properties.)
It will probably fix #780, too, but I'm not going to test it.

Why two features in a single PR? Given the resources available to reviewing and testing, having two (or more) separate PRs would result in less review and testing. The included features make changes in the same process (update propagation), one in the green path and the other in the error path. It makes sense to test these changes together and the risk of masking/hiding errors due to too much change at once is minimal.

Currently there is quite a bit of common update propagation code which
is duplicated in each UI implementation with slight variations. Make
this code shared by all UI implementations and move it to the Uicommon
module.
Currently all update actions are propagated by up front creating a
separate Lwt thread for each update. All threads are kept in a single
list and the sync is considered complete once all threads in this list
have completed their work. This approach, creating all threads up front
and keeping them in a list, can keep a lot of memory allocated because
threads and their data are only collected by GC once the entire list is
collected. For large syncs, with hundreds of thousands or millions of
updates, the memory wasted by inactive and uncollected completed threads
can reach gigabytes.

This patch includes two changes to considerably reduce memory
requirement. First, reduce the number of Lwt threads which are not
collectable by GC by not storing the threads in a list or some other
data structure. This makes detecting sync completion a bit harder. In
this patch it is solved by adding a separate thread which acts sort of
like a ref counter and wakes up when the counter of running threads
reaches zero (or when any thread fails with an exception).

Second, don't create threads that won't be started due to the limit of
'maxthreads' (essentially the worker pool size). New threads are created
on demand when workers become available in the pool.
Code for managing client connections in the Remote module is scattered
around and duplicated in many locations. Move the code into dedicated
functions and collect most of the functions in a new sub-module.

While at it, make some smaller code changes (no change in functionality)
to prepare for future improvements.
... and increase its correctness just a tiny bit. In previous code,
looking up connections used a less unique key than the one used when
registering opened the connections. This resulted in the previous code
potentially overwriting distinct connections. This is still not fully
fixed. This is not really a problem at the moment because several
connections are never used in parallel.
The code before recent commits never closed sockets (or remote channel
fds) and remote connections on the client. On the socket server, the
connected socket was closed only once the exception of lost connection
had reached the topmost loop of the server.

Now, when detecting a lost connection at low-level IO, the socket (or
remote channel fd) that lost the connection and other fds (if any)
grouped into the same "connection" are closed immediately. After that,
connection cleanup is triggered (running registered [at_conn_close]
handlers). Finally, an exception for lost connection is raised (as
before).

This means that any code catching the lost connection exception is run
after all the cleanup has been executed. This is expected not to be a
problem, as cleanup code will typically close open resources and purge
pending tasks, so is either a duplicate of or at least not interfering
with other exception-catching code.
When a connection is closed in the middle of ongoing update propagation
then the exception will stop the running propagation threads (of which
there can be thousands, as these are cheap Lwt threads).

Add cleanup code to close resources and purge pending tasks, primarily
intended to be run when a connection is unexpectedly closed. The same
cleanup will also be executed when connection is closed intentionally.

This cleanup has not been needed before because any uncaught exception
during transport was considered a fatal error and resulted in quitting
the UI. Since 3a4e158, the GUI can
survive such fatal errors and will benefit from the cleanup.
Previously, client never closed remote connections. Since recent
commits, client closes the connection (close related sockets and run
cleanup code) when detecting lost connection (as reported by the OS).
Add a function for the client to close remote connections at will.
When an uncaught exception is raised during transport, trigger a cleanup
attempt which will: cancel all pending transport threads, stop current
running threads as much as possible, close the remote connection (which
in turn will trigger most of the relevant cleanups) and try to drain any
still-running threads by letting them fail with an (ignored) exception
or run to completion.
Indentation changes were excluded from a few previous commits to reduce
the diff and make reviewing them easier. This commit can easily be
reviewed by ignoring indentation changes (the diff becomes empty).
@gdt
Copy link
Collaborator

gdt commented Nov 1, 2022

I'd like to see test reports, but am inclined to just hit merge 2 weeks after 2.53.0 is released if not.

@gdt gdt merged commit 0a09a24 into bcpierce00:master Nov 8, 2022
@tleedjarv tleedjarv deleted the connection-cleanup+new-transport branch November 8, 2022 13:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Stack usage grows with number of updated files
2 participants