Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a crash that happens when Firestore is being terminated when the last shared pointer to it is in a listener #7412

Closed
wants to merge 18 commits into from

Conversation

var-const
Copy link
Contributor

@var-const var-const commented Jan 29, 2021

If a user calls terminate and immediately disposes their reference to Firestore, it's possible that while termination is in process, the last remaining reference (more precisely, shared pointer) to Firestore is in a listener. When that listener is destroyed as part of the termination, it leads to api::Firestore being destroyed. Importantly, termination happens on the worker queue. The destructor of api::Firestore calls Dispose which presumes it's not called from the worker queue and tries to enqueue work, leading to a failing assertion and a crash.

The general problem is that FirestoreClient::Dispose may be invoked on the worker queue from TerminateInternal and needs a way to handle it gracefully. This PR largely makes Dispose a no-op in that case.

Fixes #6909.

@var-const
Copy link
Contributor Author

@wilhuff Any suggestions on different approaches/potential pitfalls/etc. are very welcome.

@var-const
Copy link
Contributor Author

@wilhuff Looks like extending the lifetime of api::Firestore isn't really necessary -- Dispose works fine when invoked from TerminateInternal as long as I use is_terminated to check whether enqueuing TerminateInternal is necessary. This approach seems simpler -- sorry if you have already started with the review.

@var-const
Copy link
Contributor Author

Hmm, I'm getting an Asan error in CI. I'll investigate and let you know when this PR is ready for another look.

@var-const
Copy link
Contributor Author

Ok, it's ready for review again. The simpler approach doesn't work, and the previous approach needed a fix.

@@ -86,6 +86,7 @@ void AsyncQueue::ExecuteBlocking(const Operation& operation) {
"ExecuteBlocking may not be called "
"before the previous operation finishes executing");

auto self = shared_from_this();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we haven't done this before? is_operation_in_progress_ = false; may crash when api::Firestore (and, consequently, the worker queue) is destroyed on the queue (in FirestoreClient::TerminateAsync).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an oversight.

I think the idea at the time was that FirestoreClient would keep the AsyncQueue alive as long as it needed to, but this doesn't account for the case where the Firestore destructor is running on the AsyncQueue.

worker_queue_->EnqueueEvenWhileRestricted([&, this, callback] {
// Make sure this `FirestoreClient` is not destroyed during
// `TerminateInternal`, so that `user_executor_` stays valid.
auto self = shared_from_this();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're depending on the shared_ptr keeping the instance alive, we should create it outside the callback. Otherwise, this can be destroyed while this lambda is awaiting execution.

It seems like both callbacks that call TerminateInternal should capture a shared_ptr<FirestoreClient> to ensure the instance lives past that point.

@@ -223,6 +225,9 @@ class FirestoreClient : public std::enable_shared_from_this<FirestoreClient> {
bool credentials_initialized_ = false;
local::LruDelegate* _Nullable lru_delegate_;
util::DelayedOperation lru_callback_;

// Used during shutdown to guarantee lifetimes.
std::weak_ptr<api::Firestore> weak_firestore_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type here doesn't actually matter since we don't need to manipulate this instance. Could this be std::weak_ptr<void>?

TerminateInternal();
});
bool enqueued = false;
// If `remote_store_` is null, it means termination has finished already. In
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no reference to remote_store_ in the code below, and remote_store_ isn't actually a good signal anyway, since it is reset asynchronously. I think you're referring to is_terminated(), right?

Second, phrasing this in terms of "the sequential order invariant" obscures what's going on. I think the issue is that EnqueueEvenWhileRestricted can't be called while already on the async queue. Restated that way it seems like we should just fix that rather than trying to introduce yet another guard on top of that.

@@ -263,7 +274,13 @@ void FirestoreClient::Dispose() {

void FirestoreClient::TerminateAsync(StatusCallback callback) {
worker_queue_->EnterRestrictedMode();
worker_queue_->EnqueueEvenWhileRestricted([this, callback] {
worker_queue_->EnqueueEvenWhileRestricted([&, this, callback] {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Default capture by reference in an asynchronously executing block is a recipe for disaster. It doesn't even seem like it's necessary. Remove?

// dangerous in the case when `Dispose` is invoked immediately after the
// termination, still on the worker queue -- it would break the sequential
// order invariant of the queue.
if (!is_terminated()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is racy. Two threads could observe !is_terminated() simultaneously and then both attempt the block below.

A better solution might be to have EnqueueEvenWhileRestricted do the right thing. This already has logic to deal with the race inherent in two callers attempting to dispose the instance. Perhaps we just need to make it EnqueueRelaxedEvenWhileRestricted?

@@ -86,6 +86,7 @@ void AsyncQueue::ExecuteBlocking(const Operation& operation) {
"ExecuteBlocking may not be called "
"before the previous operation finishes executing");

auto self = shared_from_this();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an oversight.

I think the idea at the time was that FirestoreClient would keep the AsyncQueue alive as long as it needed to, but this doesn't account for the case where the Firestore destructor is running on the AsyncQueue.

@var-const
Copy link
Contributor Author

Closing in favor of #7421.

@var-const var-const closed this Feb 2, 2021
var-const added a commit that referenced this pull request Feb 3, 2021
…last shared pointer to it is in a listener (#7421)

If a user calls terminate and immediately disposes their reference to Firestore, it's possible that while termination is in process, the last remaining reference (more precisely, shared pointer) to Firestore is in a listener. When that listener is destroyed as part of the termination, it leads to `api::Firestore` being destroyed. Importantly, termination happens on the worker queue. The destructor of `api::Firestore` calls `Dispose` which presumes it's not called from the worker queue and tries to enqueue work, leading to a failing assertion and a crash.

The solution is simply to remove the sequential order checks when the queue enters/is in restricted mode. There is a legitimate case where the Firestore destructor can run on the worker queue, as the original issue shows. The complexity of #7412 also indicates that a simpler solution is preferable.

Fixes #6909.
@firebase firebase locked and limited conversation to collaborators Mar 4, 2021
@paulb777 paulb777 deleted the varconst/terminate-fix-2 branch January 21, 2022 00:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
2 participants