RUM-3463 feat(watchdog-termination): track app state and detect watchdog terminations #1889

ganeshnj · 2024-06-10T09:41:46Z

What and why?

We want to detect watchdog terminations in the SDK and report them to the backend as crashes.

How?

This PR sets up the building blocks for detecting a watchdog termination using a heuristic-based approach. The heuristic checks all the known ways an app can be terminated; if the termination cannot be attributed to any of them, it is considered a watchdog termination.
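To illustrate the elimination idea described above, here is a minimal sketch. It is not the code from this PR: `AppSnapshot`, its fields, and `isLikelyWatchdogTermination` are hypothetical names, and the list of ruled-out causes (debugger, crash, user termination, app or OS update, reboot) is an assumption about what such a heuristic typically covers. Only the "no previous state" and "simulator" checks appear in the diff excerpts of this review.

```swift
import Foundation

/// Hypothetical snapshot of what the SDK knew about a process run.
/// All fields are assumptions made for this illustration.
struct AppSnapshot: Equatable {
    var appVersion: String
    var osVersion: String
    var systemBootTime: TimeInterval
    var wasDebugging: Bool          // a debugger was attached
    var didCrash: Bool              // a crash report exists for that run
    var wasTerminatedByUser: Bool   // a regular termination was observed
}

/// Returns true only when no known termination cause explains why the
/// previous run ended, which is the heuristic's definition of a watchdog termination.
func isLikelyWatchdogTermination(
    previous: AppSnapshot?,
    current: AppSnapshot,
    isSimulator: Bool
) -> Bool {
    // Detection is not meaningful on simulators.
    guard !isSimulator else { return false }
    // First launch: nothing to compare against.
    guard let previous = previous else { return false }
    // Rule out every known cause, one by one.
    guard !previous.wasDebugging else { return false }
    guard !previous.didCrash else { return false }
    guard !previous.wasTerminatedByUser else { return false }
    guard previous.appVersion == current.appVersion else { return false }   // app update
    guard previous.osVersion == current.osVersion else { return false }     // OS update
    guard previous.systemBootTime == current.systemBootTime else { return false } // reboot
    // Nothing explains the termination: treat it as a watchdog termination.
    return true
}
```

Each `guard` removes one explanation, so the function only returns `true` when all known causes have been excluded. The order of the checks does not affect the result.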

Read more about the approach in the internal RFC.

The reporting part is not yet implemented in this PR.

Review checklist

Feature or bugfix MUST have appropriate tests (unit, integration)
Make sure each commit and the PR mention the Issue number or JIRA reference
Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

Run unit tests for Core, RUM, Trace, Logs, CR and WVT
Run unit tests for Session Replay
Run integration tests
Run smoke tests
Run tests for tools/

datadog-datadog-prod-us1 · 2024-06-10T09:54:22Z

Datadog Report

Branch report: ganeshnj/feat/RUM-3463-wt-detect
Commit report: 42879d9
Test service: dd-sdk-ios

✅ 0 Failed, 3351 Passed, 0 Skipped, 3m 46.76s Total Time
🔻 Test Sessions change in coverage: 10 decreased, 3 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

test DatadogCoreTests tvOS 79.11% (-0.42%) - Details
test DatadogCrashReportingTests tvOS 26.95% (-0.29%) - Details
test DatadogCrashReportingTests iOS 26.92% (-0.29%) - Details
test DatadogCoreTests iOS 73.43% (-0.25%) - Details
test DatadogTraceTests tvOS 54.72% (-0.18%) - Details

mariedm

Looks great! I have only reviewed about 2/3 so far, but I asked some open questions. Feel free to disregard the nits of course.

mariedm · 2024-06-12T10:17:46Z

Datadog/Example/ExampleAppDelegate.swift

@@ -85,6 +82,9 @@ class ExampleAppDelegate: UIResponder, UIApplicationDelegate {
        )
        RUMMonitor.shared().debug = true

+        // Enable Crash Reporting


/question: What is the reason for moving this call later?

I needed crash reporting enabled first in my testing, but that is no longer needed.

I might revisit it later when doing the full end-to-end integration, because we want RUM to be enabled first in order to get the launch message.

DatadogInternal/Sources/Context/Sysctl.swift

DatadogInternal/Sources/MessageBus/FeatureMessageReceiver.swift

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/VendorIdProvider.swift

mariedm · 2024-06-12T14:25:56Z

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationChecker.swift

+
+        guard let previous = previous else {
+            return false
+        }


tiny nit suggestion:

guard let previous else { return false }

mariedm · 2024-06-12T14:28:20Z

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationChecker.swift

+        // Watchdog Termination detection doesn't work on simulators.
+        guard deviceInfo.isSimulator == false else {
+            return false
+        }


tiny nit suggestion:

guard !deviceInfo.isSimulator else { return false }

Same for all the other guard statements below, but I also totally understand if you prefer being explicit.

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationChecker.swift

mariedm

That was fast! 🏃💨 Thank you for the changes and answers 😊

ncreated

Massive effort 🏋️🏅. Very elegant code that reads smooth and is well integrated into the whole architecture.

I only reviewed the implementation part (not the tests yet), leaving several questions and a few change requests.

On the process level, I wonder if this PR shouldn't target a feature branch instead of develop. It is missing the "send RUM error" part, which makes it not functional but it brings impact. Alternatively, we could disable it at the entry point and only enable with the final touch on the whole topic.

Datadog/Datadog.xcodeproj/project.pbxproj

ncreated · 2024-06-13T07:28:16Z

DatadogCrashReporting/Sources/Integrations/CrashReportSender.swift

+        // We use baggage message to pass the launch report instead updating the global context
+        // because under the hood, some integrations start certain features based on the launch report (e.g. `WatchdogTerminationMonitor`).
+        // If we update the global context, the integrations will keep starting the features on every update which is not desired.


question/ It feels that having LaunchReport as part of DatadogContext would be more scalable:

no requirement to enable DatadogCrashReporting after Logs or RUM;

every feature can access it when writing an event, or explicitly through featureScope.context {}.

What is the actual limitation behind the following?

If we update the global context, the integrations will keep starting the features on every update which is not desired.

It sounds like an architecture flaw that should be fixed rather than bypassed through point-to-point messaging over the bus. Would we be up for improving it as part of this / the next PR?

We are mixing two concepts here, and I would prefer not to re-architect the whole flow in this PR.

We want something to start working when something happens in some other feature.

Why? Because if we start the monitor before the launch report arrives, it will update the state on disk and we will lose the previous session date. This can't be achieved by global context sharing.

Share data

This is easily doable with both global context and point-to-point messaging.

That said, in an ideal world, I want to subscribe only to the launch report (not the context), so that my receiver is invoked only for the launch report, with additional context passed along (extra data, but not required).

Thanks for clarifying 👍. I'm okay with the proposed way as long as there is no requirement on the order of RUM.enable() vs CrashReporting.enable() - isn't there 🤔💭?

Keeping it for the next PR. My hunch is that there will be, because the SDK sends the message as soon as crash reporting is enabled, and if RUM is not enabled yet, the message will be missed.

Which may lead to a re-architecture of this.
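The ordering concern raised in this thread can be shown with a small sketch. This is not the SDK's message bus: `SketchMessageBus` and `messagesSeen(receiverConnectsFirst:)` are hypothetical names, and the sketch assumes a bus that delivers a message only to receivers already connected when it is sent, with no buffering or replay.

```swift
/// Hypothetical, simplified stand-in for a message bus: a message is delivered
/// only to receivers connected at the moment it is sent (no buffering, no replay).
final class SketchMessageBus {
    private var receivers: [(String) -> Void] = []

    func connect(_ receiver: @escaping (String) -> Void) {
        receivers.append(receiver)
    }

    func send(_ message: String) {
        for receiver in receivers { receiver(message) }
    }
}

/// Returns the messages a receiver observes, depending on whether it connects
/// before or after the single "launch report" message is sent.
func messagesSeen(receiverConnectsFirst: Bool) -> [String] {
    let bus = SketchMessageBus()
    var seen: [String] = []

    if receiverConnectsFirst {
        bus.connect { seen.append($0) }   // receiving feature enabled first
    }
    bus.send("launch report")             // sending feature enabled: message goes out once
    if !receiverConnectsFirst {
        bus.connect { seen.append($0) }   // receiving feature enabled too late
    }
    return seen
}
```

Under these assumptions, `messagesSeen(receiverConnectsFirst: true)` contains the launch report, while `messagesSeen(receiverConnectsFirst: false)` is empty. That is the ordering dependency the thread defers to the next PR.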

DatadogInternal/Sources/Context/Sysctl.swift

...dogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationAppStateManager.swift

DatadogRUM/Sources/Feature/RUMFeature.swift

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/VendorIdProvider.swift

DatadogRUM/Sources/Instrumentation/WatchdogTerminations/WatchdogTerminationChecker.swift

DatadogRUM/Sources/Integrations/LaunchReportReceiver.swift

ganeshnj · 2024-06-13T12:50:17Z

Massive effort 🏋️🏅. Very elegant code that reads smooth and is well integrated into the whole architecture.

I only reviewed the implementation part (not the tests yet), leaving several questions and a few change requests.

On the process level, I wonder if this PR shouldn't target a feature branch instead of develop. It is missing the "send RUM error" part, which makes it not functional but it brings impact. Alternatively, we could disable it at the entry point and only enable with the final touch on the whole topic.

I just disabled it by default until we finish the reporting part.

ncreated

Thanks for the changes, looks great. I gave it another look, this time focusing more on tests. I left a few suggestions, the most important one being about the mocked-core testing pattern vs our mid-term efforts. A few remarks, no blockers 👍

Once again, well done 💪 🏋️‍♂️ 🏅

DatadogInternal/Sources/Context/AppState.swift

DatadogInternal/Sources/Context/Sysctl.swift

DatadogRUM/Sources/Feature/RUMDataStore.swift

DatadogRUM/Sources/Feature/RUMFeature.swift

ncreated · 2024-06-14T09:56:27Z

...RUM/Tests/Instrumentation/WatchdogTerminations/WatchdogTerminationAppStateManagerTests.swift

+        core = PassthroughCoreMock(
+            context: .mockWith(applicationStateHistory: .mockAppInBackground()),
+            messageReceiver: sut
+        )


convention/ Ref.: #1744, we want to get rid of the pattern of testing message receivers (here sut) through integration with a mocked core. This convention proves to be error-prone and can increase flakiness due to a leaking core reference. Most importantly, it doesn't test the critical aspect of a message receiver, which is the return value of receive(message:from:), as this detail is hidden inside the PassthroughCoreMock implementation. In the long term, we plan to simplify the receive(message:from:) signature by removing the from core: parameter entirely (it is not required).

That said, the recommended convention is to create a standalone DatadogContext value and pass it to the receiver directly. For tests in this file that would mean:

// app state changes
var context: DatadogContext = .mockWith(applicationStateHistory: .mockAppInBackground())
context.applicationStateHistory.append(.init(state: .active, date: .init()))
XCTAssertFalse(sut.receive(message: .context(context), from: NOPDatadogCore()), "It should not consume the message")

// app state changes again
context.applicationStateHistory.append(.init(state: .background, date: .init()))
XCTAssertFalse(sut.receive(message: .context(context), from: NOPDatadogCore()), "It should not consume the message")

...RUM/Tests/Instrumentation/WatchdogTerminations/WatchdogTerminationAppStateManagerTests.swift

ncreated · 2024-06-14T10:03:39Z

DatadogRUM/Tests/Instrumentation/WatchdogTerminations/WatchdogTerminationCheckerTests.swift

+    // swiftlint:enable implicitly_unwrapped_optional
+
+    func testNoPreviousState_NoWatchdogTermination() throws {
+        given(isSimulator: .mockRandom())


nice/ ⭐ I'll borrow it

DatadogRUM/Tests/Instrumentation/WatchdogTerminations/WatchdogTerminationMonitorTests.swift

ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch 6 times, most recently from 0821d6f to 3e90059 Compare June 12, 2024 09:43

ganeshnj marked this pull request as ready for review June 12, 2024 09:53

ganeshnj requested review from a team as code owners June 12, 2024 09:53

ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch from ffc03c1 to fb70ce7 Compare June 12, 2024 11:48

mariedm reviewed Jun 12, 2024

mariedm previously approved these changes Jun 13, 2024

ncreated requested changes Jun 13, 2024

ganeshnj dismissed mariedm’s stale review via 32625d8 June 13, 2024 09:06

ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch 2 times, most recently from 255ff38 to a2a51d1 Compare June 13, 2024 12:14

ganeshnj requested a review from ncreated June 13, 2024 12:53

ncreated previously approved these changes Jun 14, 2024

ganeshnj dismissed ncreated’s stale review via 32a775d June 17, 2024 12:03

RUM-3463 feat(watchdog-termination): track app state and detect watchdog terminations

0358d1b

ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch from 20254d9 to 0358d1b Compare June 17, 2024 12:14

RUM-3463 feat(watchdog-termination): fix linter

a016589

ncreated approved these changes Jun 18, 2024

ganeshnj merged commit 8d1e01e into develop Jun 18, 2024
8 checks passed

ganeshnj deleted the ganeshnj/feat/RUM-3463-wt-detect branch June 18, 2024 13:11

ganeshnj mentioned this pull request Jun 21, 2024

RUM-4911 feat(watchdog-termination): send Watchdog Termination #1917

Merged

6 tasks

maxep mentioned this pull request Jul 4, 2024

Release 2.14.0 #1941

Merged

4 tasks
