Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ReadClient crash when ExchangeManager is already shutdown (so trying to get SystemLayer fails) #20085

Closed
kpark-apple opened this issue Jun 28, 2022 · 15 comments
Assignees
Labels

Comments

@kpark-apple
Copy link
Contributor

Problem

While Matter stack is shutdown, which causes subscription failure callback, if the ReadClient object for the subscription is deleted after the SystemLayer object is shutdown, dereferencing null object would cause a crash.
The following backtrace was seen:

Thread 71 Crashed ↩::   Dispatch queue: com.zigbee.chip.framework.controller.workqueue
0   CHIP                          	       0x105250514 chip::SessionManager::SystemLayer() + 12 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Binaries/CHIPFramework/install/TempContent/Objects/CHIP.build/CHIP.build/out/../../../../../../../../Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/transport/SessionManager.h:178)
1   CHIP                          	       0x1051ccc60 chip::app::ReadClient::CancelLivenessCheckTimer() + 60 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Binaries/CHIPFramework/install/TempContent/Objects/CHIP.build/CHIP.build/out/../../../../../../../../Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/app/ReadClient.cpp:777)
2   CHIP                          	       0x1051ccd88 chip::app::ReadClient::~ReadClient() + 96 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Binaries/CHIPFramework/install/TempContent/Objects/CHIP.build/CHIP.build/out/../../../../../../../../Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/app/ReadClient.cpp:117)
3   CHIP                          	       0x1051cce4c chip::app::ReadClient::~ReadClient() + 32 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Binaries/CHIPFramework/install/TempContent/Objects/CHIP.build/CHIP.build/out/../../../../../../../../Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/app/ReadClient.cpp:112)
4   CHIP                          	       0x1051cce7c chip::app::ReadClient::~ReadClient() + 32 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Binaries/CHIPFramework/install/TempContent/Objects/CHIP.build/CHIP.build/out/../../../../../../../../Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/app/ReadClient.cpp:112)
5   CHIP                          	       0x104c3b708 (anonymous namespace)::SubscriptionCallback::~SubscriptionCallback() + 124 (/AppleInternal/BuildRoot/Applications/Xcode.app/Contents/Developer/Platforms/AppleTVOS.platform/Developer/SDKs/AppleTVOS16.0.Internal.sdk/usr/include/c++/v1/__memory/unique_ptr.h:57)
6   CHIP                          	       0x104c3c434 ___ZN12_GLOBAL__N_120SubscriptionCallback11ReportErrorEP7NSError_block_invoke_2 + 32 (/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/CHIPFramework/CHIPFramework-59.SHA_52006db3cfeb3f94459e096fe57597060a9dd25c/connectedhomeip/src/darwin/Framework/CHIP/CHIPDevice.mm:1563)
7   libdispatch.dylib             	       0x1867d77a4 _dispatch_call_block_and_release + 32 (/Library/Caches/com.apple.xbs/Sources/libdispatch/src/init.c:1518)
8   libdispatch.dylib             	       0x1867d92c8 _dispatch_client_callout + 20 (/Library/Caches/com.apple.xbs/Sources/libdispatch/src/object.m:560)
9   libdispatch.dylib             	       0x1867e09cc _dispatch_lane_serial_drain + 672 (/Library/Caches/com.apple.xbs/Sources/libdispatch/src/inline_internal.h:2631)
10  libdispatch.dylib             	       0x1867e1518 _dispatch_lane_invoke + 384 (/Library/Caches/com.apple.xbs/Sources/libdispatch/src/queue.c:3939)
11  libdispatch.dylib             	       0x1867ec0a0 _dispatch_workloop_worker_thread + 652 (/Library/Caches/com.apple.xbs/Sources/libdispatch/src/queue.c:6766)
12  libsystem_pthread.dylib       	       0x186a32fe4 _pthread_wqthread + 288 (/Library/Caches/com.apple.xbs/Sources/libpthread/src/pthread.c:2618)
13  libsystem_pthread.dylib       	       0x186a3a5c8 start_wqthread + 8

Proposed Solution

In ReadClient::CancelLivenessCheckTimer(), check null before dereferencing SystemLayer() to call CancelTimer().

@bzbarsky-apple
Copy link
Contributor

@mrjerryjohns @yunhanw-google Sounds like we should have canceled our timer much earlier than the destructor here....

@turon
Copy link
Contributor

turon commented Jun 29, 2022

What would the replication procedure be using chip-tool or chip-repl and all-clusters-app?

@bzbarsky-apple
Copy link
Contributor

I don't know about chip-repl, but for chip-tool I suspect this is not possible to replicate because it shuts down all its subscriptions manually before shutting down the stack.

Note that this is a crash on the client side.

@turon
Copy link
Contributor

turon commented Jun 29, 2022

Could chip-tool-darwin in interactive mode be used to replicate?
Given a known set of replication steps, the work of solving it can begin more efficiently.

@bzbarsky-apple
Copy link
Contributor

Could chip-tool-darwin in interactive mode be used to replicate?

I think that also shuts down all its subscriptions before shutting down the stack. But maybe we can hack it up to skip that step and see if that replicates... @krypton36 is that something you might have bandwidth for?

@mrjerryjohns
Copy link
Contributor

mrjerryjohns commented Jun 29, 2022

This isn't a core SDK side problem. Application logic needs to correctly sequence shutting down their respective SDK allocated objects, which includes de-allocating an active ReadClient instance.

@mrjerryjohns
Copy link
Contributor

mrjerryjohns commented Jun 29, 2022

Should InteractionModelEngine::Shutdown() call Close on all active ReadClient objects? On the one-hand, I think applications should be in business of safely orchestrating their shutdowns, and calling Close on the ReadClient from under-neath the application feels like we're subverting it's ownership.

However, it's clear that ReadClient depends on InteractionModelEngine for some services, so if the latter is no longer usable, then presumably, neither is the former. So perhaps, calling Close isn't odd...

@bzbarsky-apple
Copy link
Contributor

Application logic needs to correctly sequence shutting down their respective SDK allocated objects, which includes de-allocating an active ReadClient instance.

That's not workable given that we have read/subscribe APIs in src/controller that do not expose the ReadClient to the consumer.

@bzbarsky-apple
Copy link
Contributor

Note that in this case the ReadClient is getting Close called on it, by the way. And then it's deallocating itself, and trying to do stuff from its destructor that assumes a stack is running.

@mrjerryjohns
Copy link
Contributor

ReadClient is getting Close called on it, by the way

The destructor is what's getting called no, not Close?

@bzbarsky-apple
Copy link
Contributor

The destructor is getting called because Close got called, called the OnDone callback, and that is deleting the ReadClient.

@aajain-com aajain-com added the p1 priority 1 work label Aug 10, 2022
@bzbarsky-apple bzbarsky-apple changed the title ReadClient crash when SystemLayer is already shutdown ReadClient crash when ExchangeManager is already shutdown (so trying to get SystemLayer fails) Aug 11, 2022
@andy31415
Copy link
Contributor

V1: review - needs follow up if this is platform specific or not and reproduction steps.

In general, if a patch exists for this, it should be acceptable for 1.0 because it fixes a crash.

@bzbarsky-apple
Copy link
Contributor

V1: review - needs follow up if this is platform specific or not and reproduction steps.

This is not particularly platform-specific. Simplest reproduction steps are probably in #21811 (comment) using iOS CHIPTool (which does stack restarts in a controlled way, making this easy to reproduce) but similar steps could be designed for command-line chip-tool, I suspect, as evidenced by #22138

@bzbarsky-apple
Copy link
Contributor

With #22282 and #22245 fixed, we are in a slightly better spot here. At this point commissioner shutdown will at least nix all the secure sessions early enough, which will trigger exchange timeouts for all the things that are waiting on a response. That leaves two things that can go wrong, as far as I can tell.

  1. ReadClients that are not in the middle of waiting for a response will stick around. If they are then destroyed, this crash will be hit. Filed DeviceController shutdown should ShutdownSubscription on still-registered subscriptions. #22319 on this. This is a cross-platform issue.
  2. An API consumer could fail to delete a ReadClient synchronously from the OnDone notification. That's something that we should probably make more explicit somehow, and in particular some of the Darwin clients have that problem. Filed Queued deletion setup on Darwin leads to crashes #22320 on this, so there's a cleaner slate for the analysis and discussion.

@bzbarsky-apple
Copy link
Contributor

With #22319 and #22320 fixed, I believe this should be completely fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants