From f9208786bf7f4d6a57909d74a53c7aa134136b98 Mon Sep 17 00:00:00 2001 From: Philipp Hofmann Date: Thu, 5 Dec 2024 11:44:16 +0100 Subject: [PATCH] more cleanup --- text/0143-sdk-fail-safe-mode.md | 117 +++++++++++++++++++------------- 1 file changed, 71 insertions(+), 46 deletions(-) diff --git a/text/0143-sdk-fail-safe-mode.md b/text/0143-sdk-fail-safe-mode.md index c9889d9c..359038df 100644 --- a/text/0143-sdk-fail-safe-mode.md +++ b/text/0143-sdk-fail-safe-mode.md @@ -42,7 +42,13 @@ Crashing while writing a crash report is terrible but not as fatal as continuous # Recommended Approach -After detecting a potential SDK crash via [checkpoints](#option-a1), the SDK switches to a [safe SDK mode](#option-b1), a bare minimum SDK with only essential features. When the SDK initialization fails in the safe mode, the SDK makes the initialization a [NoOp (no operation)](#option-b2), communicating this to a [failing SDK endpoint](#option-c1). To minimize the risk of staying incorrectly in the NoOp mode, the SDK implements a retry logic, which switches to the safe mode after being initialized x times in the NoOp mode. The same applies to the safe mode. The SDK switches back to a normal initialization after being initalized x times in the safe mode. While the retry mode will yield further crashes in the worst-case scenario, it ensures our users still receive some SDK crashes in the App Store Connect and the Google Play Console, which is essential for fixing the root cause. We accept this tradeoff over completely flying blind. +The recommended approach solves the three problems mentioned [above](#background): + +1. [A: Detecting a continuously crashing SDK](#a-detecting-continuous-sdk-crashes) with [Checkpoints](#option-a1). +2. [B: Minimizing the damage of a continuously crashing SDK](#b-minimizing-the-damage) with [Safe Mode](#option-b1) and [NoOp Init](#option-b2). +3. 
[C: Knowing when and why the SDK is continuously crashing](#c-knowing-when-the-sdk-is-disabled) with [Failing SDK Endpoint](#option-c1) and [Retry Logic](#option-c2). + +After detecting a potential SDK crash via [checkpoints](#option-a1), the SDK switches to a [safe SDK mode](#option-b1), a bare minimum SDK with only essential features. When the SDK initialization fails in the safe mode, the SDK makes the initialization a [NoOp (no operation)](#option-b2), communicating this to a [failing SDK endpoint](#option-c1). To minimize the risk of staying incorrectly in the NoOp mode, the SDK implements retry logic that switches to the safe mode after being initialized x times in the NoOp mode. The same applies to the safe mode. The SDK switches back to a normal initialization after being initialized x times in the safe mode. While the retry mode will yield further crashes in the worst-case scenario, it ensures our users still receive some SDK crashes in App Store Connect and the Google Play Console, which is essential for fixing the root cause. We accept this tradeoff over completely flying blind. On platforms where we can check the stacktrace to find out if the crash is caused by the SDK, we might use the stacktrace detection in addition to the checkpoints. ```mermaid --- @@ -93,9 +99,16 @@ flowchart LR sdk-init-no-op --> sdk-fail-endpoint ``` -On platforms where we can check the stacktrace to find out if the crash is caused by the SDK, we use the stacktrace detection in addition to the checkpoints. -# A: Detecting Continuous SDK Crashes +# Problems to Solve + +This section looks at the three different problems we need to solve: + +1. [Detecting Continuous SDK Crashes](#a-detecting-continuous-sdk-crashes) +2. [Minimizing the Damage](#b-minimizing-the-damage) +3. [Knowing When the SDK is Disabled](#c-knowing-when-the-sdk-is-disabled) + +## A: Detecting Continuous SDK Crashes First, we need to know when our SDKs continuously crash our customers. 
Only then can we act accordingly. We can categorize crashes into four different time categories: @@ -122,47 +135,47 @@ flowchart LR Now, let's have a look at the different scenarios for a continuously crashing SDK with different severities before we look at the potential solutions. -## Continuous Crash Scenarios +### Continuous Crash Scenarios There are different scenarios for a continuously crashing SDK with different severities. Potential solutions must cover the first two scenarios. Covering scenario 3 is still important, but we can delay it a bit: -### Scenario 1: Worst Case - SDK Continuously Crashing During App Start No Crash Reports +#### Scenario 1: Worst Case - SDK Continuously Crashing During App Start No Crash Reports The worst case scenario is a continuously crashing SDK during app start that cannot send crash events to Sentry. The reason for this could be a crash in the SDK initialization code or while sending a crash report or other data. The app is in a death spiral, meaning it continuously crashes during the app start and is unusable. Finding a strategy to escape the death spiral is vital because not only does it crash our users, but we stay in the dark and must rely on them to report the problem. This scenario is painful for our users because it takes time to realize what is crashing their app, as they must use other tools such as App Store Connect or the Google Play Store to identify the root cause. Once they identify the root cause, they must publish a new release, which can take several hours or even days. Finally, they must rely on their users to update their apps to fix the issue. Some users might lose trust in Sentry if this only happens once. 
-### Scenario 2: Almost Worst Case - SDK Continuously Crashing During App Start Can Send Crash Reports +#### Scenario 2: Almost Worst Case - SDK Continuously Crashing During App Start Can Send Crash Reports Almost as bad as scenario 1, the SDK crashes continuously during app start but can still send crashes to Sentry. Now, the SDK crash detection can identify this and alarm us, and our users see in Sentry that the Sentry SDK is crashing their app. Our users rely on Sentry to notify them but must immediately release a hotfix. There is still damage, but they most likely still trust Sentry because we informed them about the problem. Some users might again lose trust in Sentry. -### Scenario 3: SDK Continuously Crashing Shortly After SDK Init +#### Scenario 3: SDK Continuously Crashing Shortly After SDK Init Finally, a bad-case scenario is our SDK crashing continuously shortly after the app start. The app might be unusable as it constantly crashes at a specific area, or only certain features stop working. The Sentry SDK should still be able to send a crash report, so the SDK crash detection should surface this, and our users can see the crashes in their data. If the SDK can't send a crash report, this scenario can either turn into scenario 1, where sending a crash report continuously crashes, or the SDK can't send crash reports, which is also bad but out of the scope of this RFC. Similar to scenario 2, our users must release a hotfix, and they could lose trust, but it is better than [scenario 1](#continuous-crash-scenario-1). -### Scenario 4: SDK Continuously Crashing After SDK Init +#### Scenario 4: SDK Continuously Crashing After SDK Init Similar to [scenario 3](#continuous-crash-scenario-3), but the SDK crash happens after the SDK init. -## Potential False Positives +### Potential False Positives There are crashing scenarios that could look like the Sentry SDK is causing the crash, but it's the user's application code. 
We need to keep these scenarios in mind, but we can only ignore some of these and inform our users via documentation how to prevent them: -### Scenario 1: User's Application Crashes Before SDK Initialization +#### Scenario 1: User's Application Crashes Before SDK Initialization The user's application crashes before the initialization of the Sentry SDK. We can only educate our users about the importance of initializing the Sentry SDK as early as possible. We can't disable the SDK, as it never initializes. -### Scenario 2: User's Application Crashes Async During SDK Initialization +#### Scenario 2: User's Application Crashes Async During SDK Initialization The user's application crashes async during the initialization of the Sentry SDK. It could be that the Sentry SDK can send a crash report to Sentry or not. When the SDK detects a crash during its initialization, it switches to the SDK Safe Mode, which runs the SDK with the essential SDK features. If the SDK fails to finish its initialization and can't send a crash report to Sentry, it doesn't make a difference if the SDK is enabled or disabled. -### Scenario 3: User's Application Crashes Shortly After SDK Initialization +#### Scenario 3: User's Application Crashes Shortly After SDK Initialization The user's application crashes shortly after the SDK initialization. We have to ensure that we're not wrongly disabling the Sentry SDK. Still, suppose we detect that the app continuously crashes after x seconds of the Sentry SDK initialization. In that case, switching to the SDK Safe Mode might be acceptable to minimize the risk of the Sentry SDK being the root cause. Furthermore, our users will mainly be interested in the crash events, not other data such as performance or session replay. -### Scenario 4: User's Application Crashes After SDK Initialization +#### Scenario 4: User's Application Crashes After SDK Initialization This happens frequently, and we must ensure that the SDK correctly ignores this scenario. 
-## Option A1: [Preferred] Checkpoints +### Option A1: [Preferred] Checkpoints The SDK uses two checkpoints to identify if it can launch successfully. The first checkpoint is the **start init**, which marks that the SDK started initialization. The second checkpoint is the **success init** checkpoint. This checkpoint marks that the SDK could successfully initialize. Every time the SDK initializes, it validates the checkpoints from the previous SDK initialization. Therefore, the SDK must persist with the checkpoint information across SDK initializations. Based on which checkpoints were successfully persisted, the SDK can decide if it was initialized successfully on a previous run. @@ -221,7 +234,7 @@ Scenario: New SDK version inits with previous failed init As we must access checkpoint information during the application launch, we must choose an efficient way to read and write this information to not slow down our users' apps. Depending on the platform, we can use marker files or key-value stores such as [UserDefaults](https://developer.apple.com/documentation/foundation/userdefaults) on Apple or [SharedPreferences](https://developer.android.com/reference/android/content/SharedPreferences) on Android. Marker files are more efficient than reading file contents because, for these, the OS only needs to check the file's existence, which is usually a system metadata look-up. We still need to determine which approach is the most efficient. -### Continuous Crashing Scenarios +#### Continuous Crashing Scenarios Notes on [continuous crashing scenarios](#continuous-crash-scenarios): @@ -241,14 +254,14 @@ Notes on [potential false positives](#potential-false-positives): | 3. User's Application Crashes Shortly After SDK Initialization | ✅ - yes | The SDK correctly ignores this scenario. | | 4. User's Application Crashes After SDK Initialization | ✅ - yes | The SDK correctly ignores this scenario. | -### Pros +#### Pros 1. 
It can detect if the SDK crashes during its initialization for any technical setup and when the crash handlers can't capture the crash. 2. SDKs could use checkpoints to identify the failure of other critical actions, such as writing a crash report. 3. It works when the SDK is offline. 4. It can be implemented solely in the SDKs, and doesn't require any changes on the backend. -### Cons +#### Cons 1. It requires extra disk I/O and negatively impacts the SDK startup time. 2. It could incorrectly disable the SDK when the app crashes async during the initialization of the Sentry SDK. @@ -257,11 +270,11 @@ Notes on [potential false positives](#potential-false-positives): 5. It won't work when there is no disk space left. 6. The logic could get complex for hybrid SDKs. -## Option A2: SDK Crash Detection +### Option A2: SDK Crash Detection We have already used the SDK crash detection to surface continuous SDK crashes after its initialization. This only works when the SDK can send a crash report, which happens on the server. -### Continuous Crashing Scenarios +#### Continuous Crashing Scenarios Notes on [continuous crashing scenarios](#continuous-crash-scenarios): @@ -281,25 +294,24 @@ Notes on [potential false positives](#potential-false-positives): | 3. User's Application Crashes Shortly After SDK Initialization | ✅ - yes | The SDK correctly ignores this scenario. | | 4. User's Application Crashes After SDK Initialization | ✅ - yes | The SDK correctly ignores this scenario. | -### Pros +#### Pros 1. It already exists. 2. It correctly ignores all potential false positives. 3. It works for all SDK crash scenarios when the SDK can send a crash report. -### Cons +#### Cons 1. It doesn't work for SDK init crashes. 2. It only works for single tenants or self-hosted. 3. It runs on the server, so it's delayed, and we need extra functionality to communicate the failing SDK info to the SDKs. 4. It doesn't work offline. 
-## Option A3: Stacktrace Detection +### Option A3: Stacktrace Detection Before sending a crash report, the SDK identifies an SDK crash by looking at the topmost frames of the crashing thread. If the topmost frames stem from the SDK itself, it disables itself. The [SDK crash detection](https://github.com/getsentry/sentry/tree/master/src/sentry/utils/sdk_crashes) already uses this approach in the event processing pipeline. - -### Continuous Crashing Scenarios +#### Continuous Crashing Scenarios Notes on [continuous crashing scenarios](#continuous-crash-scenarios): @@ -320,13 +332,13 @@ Notes on [potential false positives](#potential-false-positives): | 4. User's Application Crashes After SDK Initialization | ✅ - yes | The stacktrace detection correctly ignores this scenario. | -### Pros +#### Pros 1. It requires little to no extra overhead. 2. It can ignore async app crashes during SDK initialization. 3. It is the most reliable option to detect if the SDK crashes. -### Cons +#### Cons 1. __Doesn't work with static linking:__ This approach doesn’t work with static linking, as the Sentry SDKs end up in the same binary as the main app. As we don’t have symbolication in release builds, we can’t reliably detect if the memory address stems from the Sentry SDK or the app. We might be able to compare addresses with known addresses of specific methods or classes, but this won’t work reliably. As with iOS, many apps use static linking, so we must use an alternative approach. 2. __Doesn't work for obfuscated code:__ For obfuscated code, detecting if a frame in the stacktrace stems from the Sentry SDK or the app can be difficult or even impossible. @@ -334,35 +346,35 @@ Notes on [potential false positives](#potential-false-positives): 4. It doesn't work when the SDK crashes during or before sending the crash report. 5. It doesn't work when the SDK crashes before installing the crash handlers. 
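The core check of Option A3, deciding whether the topmost frames of the crashing thread stem from the SDK, can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not how any Sentry SDK implements it: `is_sdk_crash`, `SDK_MODULE_PREFIXES`, and `TOP_FRAMES_TO_CHECK` are hypothetical names, and the sketch assumes symbolicated, non-obfuscated frame names, which the cons above note are not always available.

```python
from typing import Sequence

# Hypothetical values: neither the prefixes that identify SDK frames nor the
# number of topmost frames to inspect are specified in the text.
SDK_MODULE_PREFIXES = ("sentry.", "io.sentry.")
TOP_FRAMES_TO_CHECK = 3


def is_sdk_crash(frames: Sequence[str]) -> bool:
    """Return True if the topmost frames of the crashing thread stem from the SDK.

    `frames` lists fully qualified function names, crashing frame first.
    """
    top = frames[:TOP_FRAMES_TO_CHECK]
    if not top:
        return False  # no stacktrace, so nothing can be attributed to the SDK
    # Simplification: require every inspected frame to belong to the SDK.
    return all(frame.startswith(SDK_MODULE_PREFIXES) for frame in top)
```

Requiring all inspected frames to be SDK frames is a deliberately strict choice for this sketch; it reduces false positives from app code that merely calls into the SDK, at the cost of missing crashes where the SDK calls into a system library.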
-# B: Minimizing the Damage +## B: Minimizing the Damage -## Option B1: SDK Safe Mode +### Option B1: SDK Safe Mode Like Windows Safe Mode, our SDKs have a bare minimum SDK; if the SDK detects it continuously crashes, it initializes in the safe mode. The SDK only enables crash handlers, session tracking, and functionality that enriches the scope, but it doesn't enable tracing, profiling, session replay, or automatic breadcrumbs. We still need to define the exact feature set, which can vary per SDK. SDKs must clearly mark that data stems from the safe mode so that both our users and we are aware. To avoid being stuck in the Safe Mode, the SDK always switches back to normal mode for an app update and when the SDK initializes successfully x times. -### Pros +#### Pros 1. The SDK still works for crashes if there is a critical bug in most areas of the SDK. For example, if a bug in session replay continuously crashes the app shortly after SDK initialization, the SDK will still report crashes. -### Cons +#### Cons 1. If the crash occurs in the safe mode, the SDK causes an extra crash before disabling itself. -## Option B2: NoOp SDK Init +### Option B2: NoOp SDK Init The SDK makes the SDK init a NoOp (no operation) when it detects a continuous crash with one of the options of [A](#options-for-detecting-sdk-crashes). This is the last resort, and we should try to avoid this as much as we can, but it's better to have no data than crashes and no data. To minimize the risk of staying wrongly in NoOp mode and to avoid flying completely blind on the root cause, the SDK keeps track of how often the SDK was started in NoOp mode and after x times it retries to initialize. -### Pros +#### Pros 1. We stop crashing our users. -### Cons +#### Cons 1. We stop sending data due to a false positive. 
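The transitions between normal mode, Safe Mode (B1), and NoOp init (B2) can be sketched as a small state machine. This is a minimal illustration in Python, not the SDK implementation: `Mode`, `InitState`, `next_mode`, and `RETRY_THRESHOLD` (standing in for the unspecified "x") are hypothetical names, and how the state is persisted and how a failed init is detected are left out.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    SAFE = "safe"
    NOOP = "noop"


# Hypothetical stand-in for the "x" in the text; the RFC does not fix a value.
RETRY_THRESHOLD = 3


@dataclass
class InitState:
    """State the SDK would persist between launches (assumed shape)."""
    mode: Mode = Mode.NORMAL
    runs_in_mode: int = 0  # launches already completed in the current degraded mode
    app_version: str = ""


def next_mode(state: InitState, previous_init_crashed: bool, app_version: str) -> InitState:
    """Pick the mode for this launch based on the outcome of the previous launch."""
    # An app update always resets the SDK to normal mode.
    if app_version != state.app_version:
        return InitState(Mode.NORMAL, 0, app_version)

    if previous_init_crashed:
        # Step down one level: normal -> safe, safe -> NoOp.
        downgraded = Mode.SAFE if state.mode is Mode.NORMAL else Mode.NOOP
        return InitState(downgraded, 0, app_version)

    if state.mode is Mode.NORMAL:
        return state

    runs = state.runs_in_mode + 1
    if runs >= RETRY_THRESHOLD:
        # Retry one level up: NoOp -> safe, safe -> normal.
        upgraded = Mode.SAFE if state.mode is Mode.NOOP else Mode.NORMAL
        return InitState(upgraded, 0, app_version)
    return InitState(state.mode, runs, app_version)
```

Stepping one level at a time in both directions keeps the worst case bounded: a retry that crashes again costs one extra crash before the SDK falls back to the previous degraded mode.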
-## Option B3: [Discarded] Remote Kill Switch +### Option B3: [Discarded] Remote Kill Switch > 🚫 **Discard reason:** It makes sense to consider this option once we implement a remote config for all SDKs, but it's too much effort as a standalone feature. 🚫 @@ -370,7 +382,7 @@ There might be scenarios where the SDK can’t detect it’s crashing. We might The remote kill switch has to be strictly tied to SDK versions. When the SDK gets an update, it ignores the kill switch from the previous SDK version. -### Pros +#### Pros 1. It works for continuous SDK crashes after the SDK is initialized. 2. We could reenable the SDK if we disable it by mistake. @@ -379,7 +391,7 @@ The remote kill switch has to be strictly tied to SDK versions. When the SDK get 5. We could extend the logic to only disable specific integrations of the SDK. 6. We can use this logic to disable the SDK if it causes other severe issues, such as breaking the UI in the app. -### Cons +#### Cons 1. When the SDK is in a critical state and potentially causing crashes, the last thing we want to do is a web request. 2. It doesn't work for continuous SDK crashes during SDK init. @@ -387,48 +399,61 @@ The remote kill switch has to be strictly tied to SDK versions. When the SDK get 4. It requires manual action. We need to monitor our SDK crashes and input from customers continuously. 5. It requires infrastructure changes. -## Option B4: Bundling SDK versions at the same time +### Option B4: Bundling SDK versions at the same time The SDK ships with two different SDK versions. It has a wrapper for the user and then delegates the actual method calls to the duplicated SDK code. If the SDK detects it’s crashing often, it uses the fallback SDK version. No notes on [crashing scenarios](#crashing-scenarios), because we can discard this option as it has too many significant cons. -### Pros +#### Pros 1. When the SDK crashes, it can still function with the fallback SDK version. -### Cons +#### Cons 1. 
Roughly doubles the size of the SDK. 2. It requires an extra package. 3. Only a subset of customers might use this, and only high-quality aware customers might accept the tradeoff of a double-sized SDK. In fact, most high-quality aware customers most likely care about app size and will use the stable release channel. -# C: Knowing When the SDK is Disabled +## C: Knowing When the SDK is Disabled -## Option C1: Failing SDK Endpoint +### Option C1: Failing SDK Endpoint Add a unique endpoint for sending a simple HTTP request with only the SDK version and a bit of metadata, such as the DSN, to notify Sentry about failed SDKs. We must keep this logic as simple as possible, and it should hardly ever change to drastically minimize the risk of causing more damage. The HTTP request must not use other parts of the SDK, such as client, hub, or transport. The SDKs must only send this request once. As we can’t have any logic running, such as rate-limiting or client reports, it’s good to have a specific endpoint for this to reduce the potential impact on the rest of the infrastructure. -### Pros +#### Pros 1. We know when an SDK disables itself. -### Cons +#### Cons 1. Potential risk of crashing while performing this action. 2. It requires extra infrastructure. 3. We don't know why the SDK disabled itself. -## Option C2: Anomaly Detection +### Option C2: SDK Retry Logic + +When the SDK operates in either Safe Mode or NoOp Mode, it retries the next mode up (normal mode from Safe Mode, Safe Mode from NoOp Mode) after running x times in its current mode. + +#### Pros + +1. We don't fly completely blind when the SDK is in NoOp Mode. +2. We reduce the risk of falsely staying in the NoOp or Safe Mode. + +#### Cons + +1. The SDK might cause extra crashes, which could confuse users. + +### Option C3: Anomaly Detection The backend detects anomalies in our customers' session data. If there is a significant drop, we can assume that the SDK works in NoOp mode. 
The logic has to correctly detect debug and staging releases and take sampling into account. -### Pros +#### Pros 1. No SDK changes are needed, so it works even for old SDK versions. 2. This would be a useful feature for our customers even with this RFC. -### Cons +#### Cons 1. Requires backend changes. 2. It doesn't work when the SDK starts to crash for a new release for all users, as the backend won't know that there is a new release to expect data from unless users manually create the release.
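The "significant drop" check behind the anomaly detection option can be sketched as a comparison of the current session count against a recent baseline. This is a minimal illustration in Python, not a description of any existing backend logic: `has_session_drop` and `DROP_RATIO` are hypothetical, and the sketch leaves out the filtering of debug and staging releases and the handling of sampling that the text calls for.

```python
from statistics import mean
from typing import Sequence

# Hypothetical threshold: fewer than 50% of the baseline counts as a significant drop.
DROP_RATIO = 0.5


def has_session_drop(baseline_counts: Sequence[int], current_count: int) -> bool:
    """Flag a significant drop in sessions relative to a recent baseline.

    `baseline_counts` holds session counts for comparable earlier time windows;
    `current_count` is the count for the window under inspection.
    """
    if not baseline_counts:
        return False  # no history, for example a brand-new release
    baseline = mean(baseline_counts)
    return baseline > 0 and current_count < baseline * DROP_RATIO
```

Returning `False` when there is no history mirrors the second con above: without prior data for a release, the backend has nothing to compare against.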