From 88b64eb052968373f39b06895dd443caf3b2e94d Mon Sep 17 00:00:00 2001
From: Matthias Killat <matthias.killat@apex.ai>
Date: Tue, 14 Jun 2022 14:24:18 +0200
Subject: [PATCH] iox-#1032 Define error handling requirements

Signed-off-by: Matthias Killat <matthias.killat@apex.ai>
---
 doc/design/error_handling_requirements.md | 87 +++++++++++++++++++++++
 1 file changed, 87 insertions(+)
 create mode 100644 doc/design/error_handling_requirements.md

diff --git a/doc/design/error_handling_requirements.md b/doc/design/error_handling_requirements.md
new file mode 100644
index 00000000000..6ec2681e293
--- /dev/null
+++ b/doc/design/error_handling_requirements.md
@@ -0,0 +1,87 @@
+# Error Handling
+
+## Faults, Errors and Failures
+
+We use the following error terminology.
+
+1. A **fault** is a design flaw or a programming error, or it can be caused by unforeseen use (overlooking a use case is essentially also a design flaw).
+2. An **error** is the result of a fault and manifests as an incorrect or inconsistent (intermediate) result. An error may not be externally observable by the user of the system.
+3. A **failure** occurs if the system observably acts in contradiction to its specification.
+
+This terminology assumes that errors are always caused by faults. If errors are not detected, they can cause failures. Detecting and handling errors allows failures to be avoided.
+
+The goal of this document is to define error handling requirements.
+
+## Error Detection vs. Error Handling
+
+An error must be detected before it can be handled. To do so, the code must contain conditionals that check for inconsistent states, unavailability of resources, return codes of system calls, etc. This means that the error has to be anticipated by the system designer, i.e. we cannot handle unforeseen errors, since by definition they are not detected.
+
+Error handling is the execution of handling code once the error has been detected. This ranges from simple logging to sophisticated recovery or shutdown code and depends on the error in question. We distinguish between recoverable and non-recoverable error handling.
+
+### Recoverable Error Handling
+
+A recoverable error handling strategy allows the system to continue operation after the handling code has executed successfully. Exceptions are one example of such a mechanism, as is any handler code that does not result in termination. If the handling code itself fails (e.g. due to insufficient resources), no recovery is possible and the system terminates.
+
+### Non-recoverable Error Handling
+
+A non-recoverable error handling strategy always calls terminate after the handling code has executed. Handler code may be used to initiate a graceful shutdown. (Strict) asserts are an example.
+
+## Requirements
+
+In the following we list the requirements of the Error Handling (EH) strategy. The term **platform**
+refers to the usage environment of the error handler. There can be various platforms, e.g. for regular operation, testing, different operating systems, etc.
+
+### Functional
+
+1. The EH shall support error codes.
+    - configurable on a module basis
+    - uniqueness is not enforced (can be done by the layer above if desired)
+1. The EH shall support multiple Error Levels.
+    - the individual error levels are defined by the platform
+    - there has to be at least a FATAL error level
+    - if the code requires an error level that is not supported by the platform, it shall not compile
+1. The EH shall support non-recoverable error handling strategies in case of FATAL errors.
+1. The EH shall support recoverable error handling strategies for non-FATAL errors.
+1. The EH shall not use exceptions.
+1. The detection mechanism of the EH shall allow logging the error.
+    - user-defined logging shall be supported
+    - the EH will depend on logging (to some extent)
+1. The EH shall allow tracing the error to its source location (file, function, line).
+
+1.
+
+### Technical constraints
+
+1. The EH shall be uniquely defined for each platform.
+    - only one error handler can be active for a platform at any time
+    - the EH shall not be changed while the system runs
+    - the EH must be configurable at runtime in the init phase
+      (needed for tests; configuration at compile time might be sufficient if it is only required for tests)
+    - for tests it must be possible to override the FATAL error reaction that causes termination
+
+## API Proposal
+
+## Open Questions
+
+- current signature:
+`template<typename Error>`
+`errorHandler(Error, ErrorLevel)`
+
+- `RAISE_ERROR(...)`
+
+- Always fatal?
+`ASSERT(...)`
+
+- `RAISE_ERROR` delegates to `ErrorHandler` (if any)
+
+- Recoverable strategies and non-recoverable strategy, e.g. `ASSERT` (depends on level)
+
+Requirements:
+
+1. Error codes, source location, error level, minimum active level, module-specific code
+
+2. Optional code in case of error
+
+3. Definable in platform, singleton, to be set once
+
+4. Default version