RUM-3463 feat(watchdog-termination): track app state and detect watchdog terminations #1889

Merged: 2 commits merged into develop from ganeshnj/feat/RUM-3463-wt-detect on Jun 18, 2024

ganeshnj
Contributor

@ganeshnj ganeshnj commented Jun 10, 2024

What and why?

We want to detect watchdog terminations in the SDK and report them to the backend as crashes.

How?

This PR sets up the building blocks for detecting a watchdog termination using a heuristic-based approach. The heuristic checks all the known ways an app can be terminated; if the termination cannot be attributed to any of them, it is considered a watchdog termination.

Read more about the approach in the internal RFC.

The reporting part is not yet implemented in this PR.
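
As a rough illustration of the heuristic described above, the following sketch rules out known termination causes one by one. It is not the code from this PR: the `AppStateSnapshot` type, its fields, and the specific list of causes (crash, user termination, app or OS update, reboot, app in background) are assumptions made for the example. Only the "no previous state" and "simulator" checks appear in the reviewed diff.

```swift
import Foundation

/// Hypothetical snapshot of what the app knew about itself before the process ended.
/// None of these names come from the PR; they only illustrate the idea.
struct AppStateSnapshot: Equatable {
    var appVersion: String
    var osVersion: String
    var systemBootTime: TimeInterval
    var wasTerminatedByUser: Bool   // assumed to be recorded when the app is asked to terminate
    var didCrash: Bool              // assumed to come from a crash report
    var isActive: Bool              // the app was in the foreground
}

/// Returns true only when no known cause explains why the previous process ended.
func isLikelyWatchdogTermination(
    previous: AppStateSnapshot?,
    current: AppStateSnapshot,
    isSimulator: Bool
) -> Bool {
    // First launch: there is nothing to compare against.
    guard let previous = previous else { return false }
    // Detection doesn't work on simulators.
    guard !isSimulator else { return false }
    // Known causes (assumed list): crash, user termination, app update, OS update, reboot.
    guard !previous.didCrash else { return false }
    guard !previous.wasTerminatedByUser else { return false }
    guard previous.appVersion == current.appVersion else { return false }
    guard previous.osVersion == current.osVersion else { return false }
    guard previous.systemBootTime == current.systemBootTime else { return false }
    // Assumption: a backgrounded app may be reclaimed by the system, so it is not counted.
    guard previous.isActive else { return false }
    // Nothing else explains the termination.
    return true
}
```

Each `guard` removes one explanation, so the function returns `true` only when every known cause has been excluded.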

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration)
  • Make sure each commit and the PR mention the Issue number or JIRA reference
  • Add CHANGELOG entry for user facing changes

Custom CI job configuration (optional)

  • Run unit tests for Core, RUM, Trace, Logs, CR and WVT
  • Run unit tests for Session Replay
  • Run integration tests
  • Run smoke tests
  • Run tests for tools/

datadog-datadog-prod-us1 bot commented Jun 10, 2024

Datadog Report

Branch report: ganeshnj/feat/RUM-3463-wt-detect
Commit report: 42879d9
Test service: dd-sdk-ios

✅ 0 Failed, 3351 Passed, 0 Skipped, 3m 46.76s Total Time
🔻 Test Sessions change in coverage: 10 decreased, 3 increased

🔻 Code Coverage Decreases vs Default Branch (10)

This report shows up to 5 code coverage decreases.

  • test DatadogCoreTests tvOS 79.11% (-0.42%) - Details
  • test DatadogCrashReportingTests tvOS 26.95% (-0.29%) - Details
  • test DatadogCrashReportingTests iOS 26.92% (-0.29%) - Details
  • test DatadogCoreTests iOS 73.43% (-0.25%) - Details
  • test DatadogTraceTests tvOS 54.72% (-0.18%) - Details

@ganeshnj ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch 6 times, most recently from 0821d6f to 3e90059 Compare June 12, 2024 09:43
@ganeshnj ganeshnj marked this pull request as ready for review June 12, 2024 09:53
@ganeshnj ganeshnj requested review from a team as code owners June 12, 2024 09:53
@ganeshnj ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch from ffc03c1 to fb70ce7 Compare June 12, 2024 11:48
Member

@mariedm mariedm left a comment

Looks great! I have only reviewed about two-thirds so far, but I asked some open questions. Feel free to disregard the nits, of course.

@@ -85,6 +82,9 @@ class ExampleAppDelegate: UIResponder, UIApplicationDelegate {
)
RUMMonitor.shared().debug = true

// Enable Crash Reporting
Member

/question: What is the reason for moving this call later?

Contributor Author

@ganeshnj ganeshnj Jun 13, 2024

I needed crash reporting enabled first during my testing, but that is no longer needed.

Contributor Author

I might revisit it later when doing the full end-to-end integration, because we want RUM to be enabled first in order to get the launch message.


guard let previous = previous else {
    return false
}
Member

tiny nit suggestion:

guard let previous else {
    return false
}

// Watchdog Termination detection doesn't work on simulators.
guard deviceInfo.isSimulator == false else {
    return false
}
Member

tiny nit suggestion:

guard !deviceInfo.isSimulator else {
    return false
}

The same applies to all the other guard statements below, but I also totally understand if you prefer being explicit.

mariedm
mariedm previously approved these changes Jun 13, 2024
Member

@mariedm mariedm left a comment

That was fast! 🏃💨 Thank you for the changes and answers 😊

Member

@ncreated ncreated left a comment

Massive effort 🏋️🏅. Very elegant code that reads smoothly and is well integrated into the whole architecture.

I only reviewed the implementation part (not the tests yet), leaving several questions and a few change requests.

On the process level, I wonder whether this PR should target a feature branch instead of develop. It is missing the "send RUM error" part, which makes it non-functional, yet it still has an impact. Alternatively, we could disable it at the entry point and enable it only with the final touch on the whole topic.

Comment on lines +85 to +87
// We use baggage message to pass the launch report instead updating the global context
// because under the hood, some integrations start certain features based on the launch report (e.g. `WatchdogTerminationMonitor`).
// If we update the global context, the integrations will keep starting the features on every update which is not desired.
Member

question/ It feels that having LaunchReport as part of DatadogContext would be more scalable:

  • no requirement to enable DatadogCrashReporting after Logs or RUM;
  • every feature can access it when writing an event, or explicitly through featureScope.context {}.

What is the actual limitation behind the following?

If we update the global context, the integrations will keep starting the features on every update which is not desired.

It sounds like an architectural flaw that should be fixed rather than bypassed through point-to-point messaging over the bus. Would we be up for improving it as part of this or the next PR?

Contributor Author

We are mixing two concepts here, and I would prefer not to re-architect the whole flow in this PR.

  1. We want something to start working when something happens in some other feature.

Why? Because if we start the monitor before the launch report arrives, it will update the state on disk and we will lose the previous session's data. This can't be achieved by global context sharing.

  2. Share data

This is easily doable with both the global context and point-to-point messaging.

That said, in an ideal world, I want to subscribe only to the launch report (not the context), so that my receiver is invoked only for the launch report, with additional context passed along (extra data, but not required).
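
To make the first point concrete, here is a minimal sketch of a monitor that starts only when the launch report arrives, so the previous state is read before it is overwritten. `StateStore`, `Monitor`, and `onLaunchReport()` are hypothetical names for this illustration, and the in-memory store merely stands in for the state kept on disk; the PR's actual types are not shown in this conversation.

```swift
/// Hypothetical in-memory stand-in for the state persisted on disk.
final class StateStore {
    private(set) var persisted: String?
    init(persisted: String?) { self.persisted = persisted }
    func write(_ state: String) { persisted = state }
}

/// Hypothetical monitor: once started, it overwrites the persisted state.
final class Monitor {
    private let store: StateStore
    private(set) var previousState: String?
    private(set) var isStarted = false

    init(store: StateStore) { self.store = store }

    /// Called when the launch report arrives.
    func onLaunchReport() {
        // Start only once, even if the trigger is delivered repeatedly.
        guard !isStarted else { return }
        // Read the previous session's state before anything overwrites it.
        previousState = store.persisted
        isStarted = true
        // From here on, the store tracks the current session.
        store.write("current-session")
    }
}
```

If the monitor were started earlier, the `write` call would run first and the previous session's state would be gone by the time the launch report is processed.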

Member

Thanks for clarifying 👍. I'm okay with the proposed way as long as there is no requirement on the order of RUM.enable() vs CrashReporting.enable(). Or is there one 🤔💭?

Contributor Author

Keeping this for the next PR. My hunch is that there will be one, because the SDK sends the message as soon as crash reporting is enabled, and if RUM is not enabled yet, the message will be missed.

That may lead to re-architecting this.
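
The ordering concern can be reproduced with a toy bus that does not replay messages. This is only a sketch under that assumption: `MessageBus` and `deliveredMessages(registerBeforeSend:)` are hypothetical and do not reflect how the SDK's message bus is implemented.

```swift
/// Hypothetical message bus without replay: only receivers registered at send time get the message.
final class MessageBus {
    private var receivers: [(String) -> Void] = []
    func register(_ receiver: @escaping (String) -> Void) { receivers.append(receiver) }
    func send(_ message: String) { receivers.forEach { $0(message) } }
}

/// Simulates the two enable orders and returns what the late or early receiver observed.
func deliveredMessages(registerBeforeSend: Bool) -> [String] {
    let bus = MessageBus()
    var received: [String] = []
    let receiver: (String) -> Void = { received.append($0) }

    // "RUM enabled first": the receiver is registered before the report is sent.
    if registerBeforeSend { bus.register(receiver) }

    // "Crash reporting enabled": the launch report is sent immediately.
    bus.send("launch-report")

    // "RUM enabled afterwards": the receiver registers too late.
    if !registerBeforeSend { bus.register(receiver) }

    return received
}
```

With `registerBeforeSend: false` the receiver never sees the launch report, which is the situation described in the comment above.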

@ganeshnj ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch 2 times, most recently from 255ff38 to a2a51d1 Compare June 13, 2024 12:14
@ganeshnj
Contributor Author

I have disabled the feature by default until we finish the reporting part.

@ganeshnj ganeshnj requested a review from ncreated June 13, 2024 12:53
ncreated
ncreated previously approved these changes Jun 14, 2024
Member

@ncreated ncreated left a comment

Thanks for the changes, looks great. I gave it another look, this time focusing more on tests. I left a few suggestions, the most important one being on incorporating mocked core vs our mid-term efforts. A few remarks, no blockers 👍

Once again, well done 💪 🏋️‍♂️ 🏅

Comment on lines 29 to 32
core = PassthroughCoreMock(
    context: .mockWith(applicationStateHistory: .mockAppInBackground()),
    messageReceiver: sut
)
Member

convention/ Ref.: #1744, we want to get rid of the pattern of testing message receivers (here sut) through integration with a mocked core. This convention proves to be error-prone and can increase flakiness due to a leaking core reference. Most importantly, it doesn't test the critical aspect of a message receiver, which is the return value of receive(message:from:), as this detail is hidden inside the PassthroughCoreMock implementation. In the long term, we plan to simplify the receive(message:from:) signature by removing the from core: parameter entirely (it is not required).

That said, the recommended convention is to create a standalone DatadogContext value and pass it to the receiver directly. For tests in this file that would mean:

// app state changes
var context: DatadogContext = .mockWith(applicationStateHistory: .mockAppInBackground())
context.applicationStateHistory.append(.init(state: .active, date: .init()))
XCTAssertFalse(sut.receive(message: .context(context), from: NOPDatadogCore()), "It should not consume the message")

// app state changes again
context.applicationStateHistory.append(.init(state: .background, date: .init()))
XCTAssertFalse(sut.receive(message: .context(context), from: NOPDatadogCore()), "It should not consume the message")
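
Why the return value gets lost behind a pass-through mock can be shown with simplified stand-ins. Everything in this sketch (`Message`, `Receiver`, `AppStateReceiver`, `PassthroughMock`) is hypothetical and much smaller than the SDK's real types; it only illustrates the testing argument made in the comment above.

```swift
/// Hypothetical, simplified stand-ins; the SDK's real types differ.
enum Message {
    case context(appState: String)
    case payload(String)
}

protocol Receiver {
    /// Returns true when the receiver consumed the message.
    func receive(message: Message) -> Bool
}

final class AppStateReceiver: Receiver {
    private(set) var lastAppState: String?

    func receive(message: Message) -> Bool {
        guard case .context(let appState) = message else { return false }
        lastAppState = appState
        // Context updates are observed, not consumed.
        return false
    }
}

/// A pass-through mock forwards the message but drops the result,
/// so a test going through it cannot assert on the return value.
final class PassthroughMock {
    let receiver: Receiver
    init(receiver: Receiver) { self.receiver = receiver }
    func send(_ message: Message) { _ = receiver.receive(message: message) }
}
```

Calling `receive(message:)` on the receiver directly exposes the returned `Bool` to the test, whereas `PassthroughMock.send(_:)` only lets the test observe side effects.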

// swiftlint:enable implicitly_unwrapped_optional

func testNoPreviousState_NoWatchdogTermination() throws {
given(isSimulator: .mockRandom())
Member

nice/ ⭐ I'll borrow it

@ganeshnj ganeshnj force-pushed the ganeshnj/feat/RUM-3463-wt-detect branch from 20254d9 to 0358d1b Compare June 17, 2024 12:14
@ganeshnj ganeshnj merged commit 8d1e01e into develop Jun 18, 2024
8 checks passed
@ganeshnj ganeshnj deleted the ganeshnj/feat/RUM-3463-wt-detect branch June 18, 2024 13:11
@maxep maxep mentioned this pull request Jul 4, 2024
4 tasks
3 participants