Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deadlock in Foundation (pre-macOS 11) triggered by GDT code. #7171

Closed
bjhomer opened this issue Dec 15, 2020 · 11 comments
Closed

Deadlock in Foundation (pre-macOS 11) triggered by GDT code. #7171

bjhomer opened this issue Dec 15, 2020 · 11 comments
Assignees
Milestone

Comments

@bjhomer
Copy link
Contributor

bjhomer commented Dec 15, 2020

[REQUIRED] Step 1: Describe your environment

  • Xcode version: 12.2
  • Firebase SDK version: 7.3.0
  • Installation method: CocoaPods
  • Firebase Component: GoogleDataTransport

[REQUIRED] Step 2: Describe the problem

We have users reporting that our macOS app often hangs on launch on pre-macOS 11 systems. We've traced this to a deadlock happening when two threads try to do an initial access of NSUserDefaults at the same time during +load calls. See the attached backtrace for details. We're able to reproduce this on ~50% of launches on these older systems. This is probably dependent on the particular ordering and timing of +load calls, and so may be very sensitive to the presence of otherwise-unrelated files.

This makes it ultimately an Apple issue, and they seem to have fixed it in macOS 11 (Big Sur). However, our users are still running into it, and it's easy to fix by making one change to GoogleDataTransport.

Steps to reproduce:

  1. Install pod Firebase/Crashlytics in a macOS app that runs on something older than macOS 11. (We've seen it on both 10.13 and 10.15).
  2. Launch the app a few times.

What happened? How can we make the problem occur?
The main thread is doing this:

 2803 -[GDTCORLifecycle init]  (in GoogleDataTransport) + 84  [0x103530de1]
   2803 standardDefaultCenter  (in Foundation) + 68  [0x7fff34aab6ea]
     2803 -[NSFileManager _registerForUbiquityAccountChangeNotifications]  (in Foundation) + 182  [0x7fff34aab819]
       2803 _dispatch_once_wait  (in libdispatch.dylib) + 76  [0x7fff6c377dbb]
         2803 _dlock_wait  (in libdispatch.dylib) + 44  [0x7fff6c377e29]
           2803 __ulock_wait  (in libsystem_kernel.dylib) + 10  [0x7fff6c51355e]

A second thread is doing this:

2803 __37-[GDTCORUploadCoordinator startTimer]_block_invoke_2  (in GoogleDataTransport) + 41  [0x1035349f7]
  2803 +[GDTCORApplication sharedApplication]  (in GoogleDataTransport) + 46  [0x103531f34]
    2803 _dispatch_once_callout  (in libdispatch.dylib) + 20  [0x7fff6c3787de]
      2803 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff6c377658]
        2803 __38+[GDTCORApplication sharedApplication]_block_invoke  (in GoogleDataTransport) + 32  [0x103531f56]
          2803 -[GDTCORApplication init]  (in GoogleDataTransport) + 82  [0x103531fbd]
            2803 standardDefaultCenter  (in Foundation) + 68  [0x7fff34aab6ea]
              2803 -[NSFileManager _registerForUbiquityAccountChangeNotifications]  (in Foundation) + 182  [0x7fff34aab819]
                2803 _dispatch_once_callout  (in libdispatch.dylib) + 20  [0x7fff6c3787de]
                  2803 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff6c377658]
                    2803 __63-[NSFileManager _registerForUbiquityAccountChangeNotifications]_block_invoke  (in Foundation) + 26  [0x7fff34aab8c7]
                      2803 -[NSFileManager _processHasUbiquityContainerEntitlement]  (in Foundation) + 93  [0x7fff34aab96f]
                        2803 _dispatch_once_callout  (in libdispatch.dylib) + 20  [0x7fff6c3787de]
                          2803 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff6c377658]
                            2803 __56-[NSFileManager _processHasUbiquityContainerEntitlement]_block_invoke  (in Foundation) + 316  [0x7fff34aabaad]
                              2803 _dispatch_once_callout  (in libdispatch.dylib) + 20  [0x7fff6c3787de]
                                2803 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff6c377658]
                                  2803 ___initializeLibrarianFunctions_block_invoke  (in Foundation) + 285  [0x7fff34aabcd5]
                                    2803 _dispatch_once_callout  (in libdispatch.dylib) + 20  [0x7fff6c3787de]
                                      2803 _dispatch_client_callout  (in libdispatch.dylib) + 8  [0x7fff6c377658]
                                        2803 ___NSFrameworkHandleForLibrarianSymbols_block_invoke  (in Foundation) + 21  [0x7fff34aabcf1]
                                          2803 dlopen  (in libdyld.dylib) + 171  [0x7fff6c3bbd8a]
                                            2803 dlopen_internal  (in dyld) + 609  [0x1121a6e8f]
                                              2803 dyld::runInitializers(ImageLoader*)  (in dyld) + 82  [0x11219cef2]
                                                2803 ImageLoader::runInitializers(ImageLoader::LinkContext const&, ImageLoader::InitializerTimingList&)  (in dyld) + 82  [0x1121ab154]
                                                  2803 ImageLoader::processInitializers(ImageLoader::LinkContext const&, unsigned int, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&)  (in dyld) + 188  [0x
                                                    2803 ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&)  (
                                                      2803 ImageLoader::recursiveInitialization(ImageLoader::LinkContext const&, unsigned int, char const*, ImageLoader::InitializerTimingList&, ImageLoader::UninitedUpwards&) 
                                                        2803 dyld::notifySingle(dyld_image_states, ImageLoader const*, ImageLoader::InitializerTimingList*)  (in dyld) + 418  [0x11219926c]
                                                          2803 load_images  (in libobjc.A.dylib) + 239  [0x7fff6b2236a4]
                                                            2803 _os_unfair_lock_lock_slow  (in libsystem_platform.dylib) + 160  [0x7fff6c5cb9c3]
                                                              2803 __ulock_wait  (in libsystem_kernel.dylib) + 10  [0x7fff6c51355e]

A full stack trace is attached at the end of this report.

Note that the second thread is calling back into dlopen() down in Foundation, which seems to be causing a deadlock with the main thread. This secondary thread call is happening because GDTCORUploadCoordinator is being initialized from some +load call, which is triggering a Dispatch timer to check for uploads.

As far as I can tell, GDTCORUploadCoordinator is checking every 30 seconds whether it has information to report. That 30 second suggests that it is probably not critical that this check happen right at launch, while we're still loading in frameworks. Thus, I'm proposing a simple workaround: defer GDTCORUploadCoordinator's upload check for 1s after launch.

Relevant Code:

+[GDTCORFlatFileStorage load] calls +[GDTCORUploadCoordinator sharedInstance], which calls -[GDTCORUploadCoordinator startTimer]. That creates a timer with an initial deadline of DISPATCH_TIME_NOW, which makes it run during the processing of +load calls, which means it can race with other +load calls.

I have verified that I can fix it by changing the following code in -startTimer:

// Old version:
self->_timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, self->_coordinationQueue);
dispatch_source_set_timer(self->_timer, DISPATCH_TIME_NOW, self->_timerInterval,
                              self->_timerLeeway);



// New version:
dispatch_time_t deadline = dispatch_time(DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC);
self->_timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0, self->_coordinationQueue);
dispatch_source_set_timer(self->_timer, deadline, self->_timerInterval,
                              self->_timerLeeway);

Stack trace of deadlock attached here:
deadlock_sample.txt

@google-oss-bot
Copy link

I found a few problems with this issue:

  • I couldn't figure out how to label this issue, so I've labeled it for a human to triage. Hang tight.
  • This issue does not seem to follow the issue template. Make sure you provide all the required information.

@maksymmalyhin
Copy link
Contributor

@bjhomer Thank you for reporting it and your detailed investigation! The fix looks reasonable to me. Would you like to crate a PR with the fix? Your contribution is appreciated!

bjhomer added a commit to bjhomer/firebase-ios-sdk that referenced this issue Dec 15, 2020
If an upload coordinator gets created during `+load` calls, it can end up running its immediately, while +load calls are still running. It's generally a bad idea to be splitting off threads while +load calls are still running.

The upload coordinator checks every 30 seconds, and doesn't really need to run right away. Since it's easy to transitively create an upload coordinator from other calls that happen during +load, we'll just push off the initial run of the timer slightly. That makes it less likely to interfere (or deadlock) with other load calls.

See firebase#7171 for an example of a problem solved by this change.
@bjhomer
Copy link
Contributor Author

bjhomer commented Dec 15, 2020

I've added a PR to make the proposed changes:

#7173

@bjhomer
Copy link
Contributor Author

bjhomer commented Dec 15, 2020

If this is merged soon, is there any chance we could get a new version of the Firebase pod cut? We're hoping to put out a new release by tomorrow, and I'd rather not have to point to a fork of the whole SDK.

@maksymmalyhin
Copy link
Contributor

Thank you for the PR!

As for the release - we don't have any regular releases scheduled until the beginning of 2021 due to the holidays. We will have to evaluate if it is safe to release a version with the change as a patch release. If you are going to release soon, I would recommend to be ready to use the version from the fork until the next release. We will post expected release date later.

bjhomer added a commit to bjhomer/firebase-ios-sdk that referenced this issue Dec 15, 2020
If an upload coordinator gets created during `+load` calls, it can end up running its immediately, while +load calls are still running. It's generally a bad idea to be splitting off threads while +load calls are still running.

The upload coordinator checks every 30 seconds, and doesn't really need to run right away. Since it's easy to transitively create an upload coordinator from other calls that happen during +load, we'll just push off the initial run of the timer slightly. That makes it less likely to interfere (or deadlock) with other load calls.

See firebase#7171 for an example of a problem solved by this change.
@maksymmalyhin
Copy link
Contributor

... I'd rather not have to point to a fork of the whole SDK.

Just to make sure we are on the same page here. If by the whole SDK you mean entire Firebase SDK, then actually you can avoid it. E.g. for Cocoapods you can add something like the following to install only GoogleDataTransport from the fork:

pod 'GoogleDataTransport', :git=>'https://github.com/firebase/firebase-ios-sdk.git', :commit=>'<commit_hash>'

@bjhomer
Copy link
Contributor Author

bjhomer commented Dec 15, 2020

Thanks. I tried this:

  pod 'Firebase/Crashlytics'
  pod 'GoogleDataTransport', :git => 'https://github.com/bjhomer/firebase-ios-sdk.git', :commit => 'd5e421c'

But I get this error on pod install:

[!] CocoaPods could not find compatible versions for pod "nanopb":
  In snapshot (Podfile.lock):
    nanopb (= 2.30906.0, ~> 2.30906.0)

  In Podfile:
    GoogleDataTransport (from `https://github.com/bjhomer/firebase-ios-sdk.git`, commit `d5e421c9861b3c79f002f0b438f9c7b33cadc061`) was resolved to 8.1.0, which depends on
      nanopb (~> 2.30907.0)

Specs satisfying the `nanopb (= 2.30906.0, ~> 2.30906.0), nanopb (~> 2.30907.0)` dependency were found, but they required a higher minimum deployment target.

It seems like there's a dependency conflict of some kind, but I'm not familiar enough with these particular dependencies to know how to resolve it.

@maksymmalyhin
Copy link
Contributor

@bjhomer It seems like pod update should solve the issue above.

@paulb777
Copy link
Member

paulb777 commented Dec 15, 2020

Yep, our next minor release is doing a coordinated update of the nanopb dependency, so parts of the master branch are not separably compatible with the current latest.

The PR would need to be backported/cherry-picked to the release-7.3 branch to be compatible.

bjhomer added a commit to bjhomer/firebase-ios-sdk that referenced this issue Dec 15, 2020
If an upload coordinator gets created during `+load` calls, it can end up running its immediately, while +load calls are still running. It's generally a bad idea to be splitting off threads while +load calls are still running.

The upload coordinator checks every 30 seconds, and doesn't really need to run right away. Since it's easy to transitively create an upload coordinator from other calls that happen during +load, we'll just push off the initial run of the timer slightly. That makes it less likely to interfere (or deadlock) with other load calls.

See firebase#7171 for an example of a problem solved by this change.
@bjhomer
Copy link
Contributor Author

bjhomer commented Dec 15, 2020

Okay, I can do that. It looks like my PR has some failures in pod_lib_lint (macos), but I don't think I did anything that would have affected those. Do I need to do something to address those?

bjhomer added a commit to bjhomer/firebase-ios-sdk that referenced this issue Dec 15, 2020
If an upload coordinator gets created during `+load` calls, it can end up running its immediately, while +load calls are still running. It's generally a bad idea to be splitting off threads while +load calls are still running.

The upload coordinator checks every 30 seconds, and doesn't really need to run right away. Since it's easy to transitively create an upload coordinator from other calls that happen during +load, we'll just push off the initial run of the timer slightly. That makes it less likely to interfere (or deadlock) with other load calls.

See firebase#7171 for an example of a problem solved by this change.
maksymmalyhin pushed a commit that referenced this issue Dec 15, 2020
If an upload coordinator gets created during `+load` calls, it can end up running its immediately, while +load calls are still running. It's generally a bad idea to be splitting off threads while +load calls are still running.

The upload coordinator checks every 30 seconds, and doesn't really need to run right away. Since it's easy to transitively create an upload coordinator from other calls that happen during +load, we'll just push off the initial run of the timer slightly. That makes it less likely to interfere (or deadlock) with other load calls.

See #7171 for an example of a problem solved by this change.
@maksymmalyhin maksymmalyhin added this to the 7.4.0 - M87 milestone Dec 15, 2020
@maksymmalyhin
Copy link
Contributor

maksymmalyhin commented Dec 15, 2020

Fixed by #7173.

The change is scheduled for the next release which is expected to be in the beginning of 2021. Please follow the recommendations described in the comment to install the updated GoogleDataTransport SDK version before the release if needed.

@firebase firebase locked and limited conversation to collaborators Jan 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants