
Binary cache: async push_success #908

Merged

Conversation


@autoantwort autoantwort commented Feb 15, 2023

This results in ~10-20% faster build times on my machine.

For example, building boost on my M1 Mac went down from 2.948 min to 2.375 min.

@autoantwort autoantwort marked this pull request as draft February 15, 2023 20:04
@autoantwort (Contributor, Author)

How or when should "upload messages" (like `Uploaded binaries to {count} {vendor}.`) be printed?

@Thomas1664 (Contributor)

Doesn't this have the same problem as #694 that the working thread might exit due to calls to check_exit or value_or_exit?

@autoantwort (Contributor, Author)

> Doesn't this have the same problem as #694 that the working thread might exit due to calls to check_exit or value_or_exit?

Kind of. In general, we need an option to decide whether a binary cache failure should be a hard error or only a warning.

@Thomas1664 (Contributor)

> Kind of. In general, we need an option to decide whether a binary cache failure should be a hard error or only a warning.

The problem is that we can almost never be sure that there isn't some nested API call that exits on failure. But it seems like #909 at least partially addresses this issue.

@autoantwort (Contributor, Author)

Yeah, but there are nearly no hard exits in the binary cache. It currently also only prints warnings.

@ras0219-msft (Contributor) left a comment

I like this direction; unblocking I/O work has great potential for making vcpkg much faster.

However, we need to be very careful about the impacts of concurrency -- deadlocks suck :(

src/vcpkg.cpp (outdated):

```diff
@@ -156,6 +157,7 @@ namespace vcpkg::Checks
     // Implements link seam from basic_checks.h
     void on_final_cleanup_and_exit()
     {
+        BinaryCache::wait_for_async_complete();
```
@ras0219-msft (Contributor)

I do not think we can do this here. This is on the critical path for ctrl-c handling and should only be used for extremely fast, emergency tear-down behavior (like restoring the console).

If there happens to be an exit anywhere in any BinaryCache implementation, this would deadlock. Importantly, this includes any sort of assertion we might want to do, like checking pointers for null.

Unfortunately, the only path forward I see is to call this (or appropriately scope the BinaryCache itself) at the relevant callers. The consequence of possibly not uploading some set of binary caches in the case of an unhandled program error (such as a permissions issue on a directory expected to be writable) is vastly preferable to deadlocks.

@autoantwort (Contributor, Author)

I have changed the BinaryCache::wait_for_async_complete() implementation so that it no longer deadlocks.

I also moved the call to Checks::exit_with_code, which is not called when ctrl+c is handled. (I would personally like to have a way to terminate vcpkg that waits until the binary cache is done, so that I don't lose progress.)

And I prefer it when built packages are uploaded to the binary caches before vcpkg exits because of an error; otherwise I have to rebuild the already-built packages later when there is no cache entry.

@ras0219-msft (Contributor) commented Mar 3, 2023

Agreed that it is desirable to finish uploading on "understood" errors, for example if a package failed to build or failed to be installed.

I was also wrong about my original assessment of a deadlock. My concern was the call path of the binary upload thread calling Checks::unreachable() or .value_or_exit(), but it seems that std::thread::join() does have a carve-out to handle this specific case: it will throw a resource_deadlock_would_occur if you try to join yourself.

I've put some other concerns below, but I don't want those to distract from my main point: We must make it as trivial / correct-by-construction as possible to guarantee that the binary cache thread NEVER attempts to wait on itself. I think the best approach for vcpkg right now is to add calls from `Install::perform()` etc. to `BinaryCache::wait_for_async_complete()` before any "user-facing" error, such as the exit guarded by `result.code != BuildResult::SUCCEEDED && keep_going == KeepGoing::NO`. This is motivated by the perspective that it's always safer to terminate than to join and risk a deadlock / race condition / etc.
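A minimal sketch of the suggested call-site pattern; the guard condition is quoted from the comment above, while the surrounding code in `Install::perform()` and the use of `Checks::exit_fail` are assumptions for illustration:

```cpp
// Sketch only: drain pending background uploads before a user-facing exit,
// so already-built packages still land in the binary cache. This runs on
// the main thread, never on the upload thread, so the wait cannot self-join.
if (result.code != BuildResult::SUCCEEDED && keep_going == KeepGoing::NO)
{
    BinaryCache::wait_for_async_complete();
    Checks::exit_fail(VCPKG_LINE_INFO);
}
```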


There's still a UB data race if the main thread and binary upload thread attempt to exit at the same time:

> Concurrently calling join() on the same thread object from multiple threads constitutes a data race that results in undefined behavior.
>
> -- https://en.cppreference.com/w/cpp/thread/thread/join

There's also a serious "scalability" problem if we ever want a second background thread for whatever reason, because BGThread A would join on BGThread B, while BGThread B tries to join on BGThread A. This might be solvable with ever more complex structures, such as a thread ownership DAG that gets threads to join only on their direct children, but I don't think the benefit is worth the cost.

@autoantwort (Contributor, Author)

The UB and the joining itself could simply be prevented by checking `if (std::this_thread::get_id() == instance->push_thread.get_id())`. My concern with the explicit approach is that it is easy to forget to call the BinaryCache's waiting function; every time you want to exit, you have to remember to call it. This seems very prone to human error.
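A minimal sketch of that guard (the member names, including the `instance` singleton pointer, follow the comment above; this is illustrative, not the PR's actual code):

```cpp
#include <thread>

struct BinaryCache
{
    std::thread push_thread;
    static BinaryCache* instance; // assumed singleton, as in the comment above

    static void wait_for_async_complete()
    {
        // Never join from the push thread itself: joining yourself cannot
        // make progress, and std::thread::join() reports it by throwing
        // std::system_error with resource_deadlock_would_occur.
        if (std::this_thread::get_id() == instance->push_thread.get_id())
        {
            return;
        }

        if (instance->push_thread.joinable())
        {
            instance->push_thread.join();
        }
    }
};

BinaryCache* BinaryCache::instance = nullptr;
```

Note this only removes the self-join; as quoted above, two foreground threads calling join() concurrently would still be a data race.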

@autoantwort (Contributor, Author) commented Mar 5, 2023

I have now implemented your request.

@autoantwort (Contributor, Author)

@ras0219-msft Is there anything left that is preventing this PR from being merged?

@autoantwort autoantwort marked this pull request as ready for review March 5, 2023 20:28
Comment on lines 98 to 102:

```cpp
std::vector<std::pair<Color, std::string>> m_published;

// buffers messages until newline is reached
// guarded by m_print_directly_lock
std::vector<std::pair<Color, std::string>> m_unpublished;
```
@autoantwort (Contributor, Author)

@BillyONeal Now that you have implemented the "error document" type stuff, am I right that this should be a DiagnosticLine etc. instead?

@BillyONeal (Member) commented Dec 26, 2024

I'm not sure yet; I ran into a problem where I'm not sure exactly how "status"-type information should be conveyed through this infrastructure. I'm working on it over here: main...BillyONeal:vcpkg-tool:message-sink-line

As part of that I realized that #1137 touches the same area and would be easier to merge first, so I'm doing that right now...

@autoantwort (Contributor, Author)

Why does a "status" need to be conveyed here at all? These are only informational messages.
Or what is the problem you are trying to solve here?

@BillyONeal (Member) commented Jan 3, 2025

Some messages, like "I am about to touch the network now", are time sensitive and under normal conditions must be printed. However, errors/warnings like "the download failed" might need to be suppressed, if a subsequent retry / alternate makes the overall process succeed.

For example, when downloading a file, in "time order" let's say this happens:

  1. We try to download the file from an asset cache. This should get a timely message because it touches the network.
  2. Attempting to contact the asset cache fails. This normally emits an error.
  3. We try to download the file from upstream. This needs a timely message because it touches the network.
  4. Download from the real upstream succeeds. We need to 'go back in time' and not emit the error from #2. This means we can't print that error when it happens; we have to buffer it until we know whether it will actually be emitted or not. But we normally must not buffer #1.
  5. We try to submit the freshly downloaded file back to the asset cache. This needs a timely message because it touches the network.
  6. Submitting back to the asset cache fails for some reason. That normally emits an error, but in this context needs to be reduced to a warning because we can continue without it.

If any of this is happening on 'background' threads, even the 'timely' messages need to be held until the next synchronization point with the thread that owns the console. This is what 'statusln' is for in my WIP. They need to share one channel, rather than just passing both MessageSink and DiagnosticContext, to handle this background case where a caller wants to keep the original time order of diagnostics and status.

@autoantwort (Contributor, Author)

> If any of this is happening on 'background' threads, even the 'timely' messages need to be held until the next synchronization point with the thread that owns the console.

This PR already does this (for the scope of the PR).

> background case where a caller wants to keep the original time order of diagnostics and status

That is already the case in this PR. But yeah, this PR doesn't know when a message ends (if it spans multiple lines), which is the whole reason the error-document-type stuff was created. But couldn't this simply be solved by buffering DiagnosticLines instead and using DiagnosticContext instead of MessageSink in this PR, or what breaks then?
Currently: every message (regardless of type) -> MessageSinkBuffer -> MessageSink
Future: every message (regardless of type) -> DiagnosticLineBuffer -> MessageSink

@BillyONeal (Member) commented Jan 4, 2025

> But couldn't this simply be solved by buffering DiagnosticLines instead and using DiagnosticContext instead of MessageSink in this PR, or what breaks then?

Then the 'inner' parts have no way to emit the 'intended to be timely' messages. What I'm doing is (a sketch follows below):

  1. Add one function, statusln, to DiagnosticContext, where these timely messages go. Normal buffering of errors won't buffer these, but background-thread stuff will. Notably, there is no non-ln version, because it needs to be a reasonable buffering point. (Embedded newlines are fine, but callers need to assume there will be a newline there.)
  2. Teach downloads.cpp et al. to use DiagnosticContext, which implied teaching system.process.cpp how to use DiagnosticContext.
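A minimal sketch of the two-channel idea described above; all names here (`DiagnosticContext`, `statusln`, `AttemptDiagnosticContext`) are modeled on the discussion, not copied from the WIP branch:

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Two channels: errors/warnings may be buffered, status lines are timely.
struct DiagnosticContext
{
    virtual void report_error(const std::string& line) = 0;
    virtual void statusln(const std::string& line) = 0; // no non-ln variant
    virtual ~DiagnosticContext() = default;
};

struct ConsoleDiagnosticContext final : DiagnosticContext
{
    void report_error(const std::string& line) override { std::fprintf(stderr, "error: %s\n", line.c_str()); }
    void statusln(const std::string& line) override { std::printf("%s\n", line.c_str()); }
};

// Wraps one retryable attempt: errors are held back so a later successful
// attempt (e.g. the authoritative download) can discard them, but statusln
// passes straight through because it announces network activity as it happens.
struct AttemptDiagnosticContext final : DiagnosticContext
{
    explicit AttemptDiagnosticContext(DiagnosticContext& inner) : m_inner(inner) { }

    void report_error(const std::string& line) override { m_errors.push_back(line); }
    void statusln(const std::string& line) override { m_inner.statusln(line); }

    void discard_errors() { m_errors.clear(); } // a later attempt succeeded

    void flush_errors() // every attempt failed; emit the held-back errors
    {
        for (const auto& e : m_errors) m_inner.report_error(e);
        m_errors.clear();
    }

private:
    DiagnosticContext& m_inner;
    std::vector<std::string> m_errors;
};
```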

@autoantwort (Contributor, Author)

> Then the 'inner' parts have no way to emit the 'intended to be timely' messages.

Iiuc, 'intended to be timely' = must be printed immediately. The inner parts that don't run in the background can print their stuff immediately via the MessageSink/DiagnosticContext, and the stuff in the background thread is never allowed to print immediately, otherwise you get messages interleaved with the build output.
So I don't understand why we need this extra "print messages immediately" channel when the only possible states are "print everything immediately" or "print nothing immediately".
Maybe I should just wait and see what your resulting code looks like 😅

@BillyONeal (Member)

> So I don't understand why we need this extra "print messages immediately" channel when the only possible states are "print everything immediately" or "print nothing immediately".

No, there's a third condition, which is step 4 in my example above. We need to buffer errors and warnings from the inner operation, because we may want to swallow them and not emit them if a subsequent attempt succeeds, but we must not buffer any of the timely status messages.

Example: https://github.com/BillyONeal/vcpkg-tool/blob/9aa671863a68ef90d0c355e4594bd9925d9df083/src/vcpkg/base/downloads.cpp#L926-L966

@autoantwort (Contributor, Author)

Aah thanks for the example! :)
But iiuc this "problem" is not caused by this PR and already existed beforehand?

@BillyONeal (Member) commented Jan 4, 2025

Yes, it has nothing to do with your PR. It's a problem I ran into trying to DiagnosticContext-ize the downloads stuff, which I want to do in order to have confidence that your PR is correct. (This way everything the background thread might touch speaks the 'can be made background thread aware' language)

BillyONeal added a commit that referenced this pull request Jan 15, 2025
…utput (#1565)

Extensive overhaul of our downloads handling and console output; @JavierMatosD and I have gone back and forth several times and yet kept introducing unintended bugs in other places, which led me to believe targeted fixes would no longer cut it.

Fixes many longstanding bugs and hopefully makes our console output for this more understandable:
* We no longer print 'error' when an asset cache misses but the authoritative download succeeds. This partially undoes #1541. It is good to print errors immediately when they happen, but if a subsequent authoritative download succeeds we need to not print those errors.
* We now always and consistently print output from x-scripts at the time it actually happens. Resolves https://devdiv.visualstudio.com/DevDiv/_workitems/edit/2300063
* We don't tell the user that proxy settings might fix a hash mismatch problem.
* We do tell the user that proxy settings might fix a download from asset cache problem.
* We now always tell the user the full command line we tried when invoking an x-script that fails.
* We don't crash if an x-script doesn't create the file we expect, or creates a file with the wrong hash.
* We now always print what we are doing *before* touching the network, so if we hang the user knows which server is being problematic. Note that this includes storing back to asset caches which we were previously entirely silent about except in case of failure.

Other changes:
* Removed debug output about asset cache configuration. The output was misleading / wrong depending on readwrite settings, and echoing to the user exactly what they said before we've interpreted it is not useful debug output. (Contrast with other `VcpkgPaths` debug output, which tends to be paths we have likely changed from something the user said.)

Other notes:
* This makes all dependencies of #908 speak `DiagnosticContext` so it will be easy to audit that the foreground/background thread behavior is correct after this.
* I did test the curl status parsing on old Ubuntu again.

Special thanks to @JavierMatosD for his help in review of the first console output attempts and for blowing the dust off this area in the first place.
Merge commits into feature/async-binary-cache-push-success:

…cache-push-success — conflicts: include/vcpkg/base/fwd/message_sinks.h, include/vcpkg/base/message_sinks.h, src/vcpkg/base/message_sinks.cpp

…cache-push-success — conflicts: src/vcpkg/commands.install.cpp, src/vcpkg/commands.set-installed.cpp

…ture/async-binary-cache-push-success — conflicts: include/vcpkg/binarycaching.h, src/vcpkg/binarycaching.cpp

… the work queue is drained before returning that no work is left.
* Restore autoantwort's only printing counts when done.
* Note which specs we are submitting in messages from the background.
@BillyONeal BillyONeal marked this pull request as ready for review February 3, 2025 18:34
@BillyONeal (Member)

@autoantwort I pushed some changes here; can you let me know if you are happy with them? Thanks!

```cpp
static ExpectedL<BinaryProviders> make_binary_providers(const VcpkgCmdArguments& args, const VcpkgPaths& paths)

void ReadOnlyBinaryCache::fetch(View<InstallPlanAction> actions)
{
    std::vector<const InstallPlanAction*> action_ptrs;
```
@BillyONeal (Member) commented Feb 3, 2025

This block is just moved up from line 2325, as these things became members of ReadOnlyBinaryCache or BinaryCache rather than being local to this file.

```cpp
    });
}

void BinaryCacheSynchronizer::add_submitted() noexcept
```
@BillyONeal (Member)

This starts meaningfully new code.

@autoantwort (Contributor, Author) left a comment

LGTM

Comment on lines +217 to +218:

```cpp
using backing_uint_t = std::conditional_t<sizeof(size_t) == 4, uint32_t, uint64_t>;
using counter_uint_t = std::conditional_t<sizeof(size_t) == 4, uint16_t, uint32_t>;
```
@autoantwort (Contributor, Author)

Why does this depend on `size_t`?

@BillyONeal (Member)

There are a lot of 64-bit machines without 32-bit atomics, and a lot of 32-bit machines without 64-bit atomics, and I wanted to choose something least likely to put us into the lockful-atomics world.
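A sketch of that reasoning; the `static_assert`s are illustrative additions, and the packing of two half-width counters into one backing integer is inferred from the paired aliases rather than quoted from the PR:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Pick the backing integer to match the platform word size, so the atomic
// holding two packed half-width counters is as likely as possible to be
// lock-free.
using backing_uint_t = std::conditional_t<sizeof(size_t) == 4, uint32_t, uint64_t>;
using counter_uint_t = std::conditional_t<sizeof(size_t) == 4, uint16_t, uint32_t>;

static_assert(sizeof(backing_uint_t) == 2 * sizeof(counter_uint_t),
              "two counters must pack exactly into the backing integer");
static_assert(std::atomic<backing_uint_t>::is_always_lock_free,
              "the synchronizer relies on a lock-free word-sized atomic");
```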

@BillyONeal BillyONeal merged commit a6289e8 into microsoft:main Feb 5, 2025
6 checks passed
@BillyONeal (Member)

Thanks for the contribution!

@BillyONeal (Member)

time results

@Neumann-A (Contributor)

Why is this so inconsistent? I would have expected less variance in the results, especially for stuff taking a day or longer.

@dg0yt (Contributor) commented Feb 14, 2025

Does the artifact size with static linkage explain most of the inconsistency? In particular when ports install executables.

@Neumann-A (Contributor)

> Does the artifact size with static linkage explain most of the inconsistency? In particular when ports install executables.

Hmm, maybe. The android triplets are more or less consistent, and the -static and -static-md triplets are also more or less consistent. @BillyONeal, do you have storage data for the different triplets?

@BillyONeal (Member)

My supposition as well is that the difference is mostly a function of binary cache size; I don't have those stats, though. For instance, the triplets with an LLVM show more improvement. The improvement for macOS seems bigger, which might be explained by not being in the same data center as the caches.

@autoantwort autoantwort deleted the feature/async-binary-cache-push-success branch February 15, 2025 15:14