# VEP-2448: VDK Run Logs: Simplified And Readable

* **Author(s):** Dilyan Marinov ([email protected]), Antoni Ivanov ([email protected])
<!-- * **Status:** draft | implementable | implemented | rejected | withdrawn | replaced -->
* **Status:** draft


To get started with this template:

- [x] **Create an issue in GitHub (if one does not exist already)**
- [x] **Make a copy of this template directory.**
Copy this template into the specs directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) created above.
- [ ] **Fill out this file as best you can.**
There are instructions as HTML comments.
At minimum, you should fill in the "Summary" and "Motivation" sections.
- [ ] **Create a PR for this VEP.**
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the VEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.

<!-- Provide a table of contents as it's helpful. -->

- [VEP-2448: VDK Run Logs: Simplified And Readable](#vep-2448-vdk-run-logs-simplified-and-readable)
  - [Summary](#summary)
  - [Glossary](#glossary)
  - [Motivation](#motivation)
    - [Logs on failures](#logs-on-failures)
    - [Logs on success](#logs-on-success)
    - [Debug mode](#debug-mode)
    - [Progress Tracking](#progress-tracking)
    - [VDK exception handling and categorization](#vdk-exception-handling-and-categorization)
  - [Requirements and goals](#requirements-and-goals)
  - [High-level design](#high-level-design)
  - [API design](#api-design)
  - [Detailed design](#detailed-design)
  - [Implementation stories](#implementation-stories)
  - [Alternatives](#alternatives)

## Summary

<!--
Short summary of the proposal. It will be used as user-focused
documentation such as release notes or a (customer facing) development roadmap.
The tone and content of the `Summary` section should be
useful for a wide audience.
-->

<!-- TODO: Revisit this section once all the other sections are completed -->
The VDK user experience around logs for data job runs is lacking in several
areas. This makes it hard to debug failed data job runs and to get useful
information out of successful ones. The following sections provide an in-depth
view of improvements to VDK run logs which address a number of user pain
points, such as noisy logs, overly verbose error messages, insufficient error
information, lack of documentation, lack of progress tracking features, etc.
Resolving these pain points will result in a better user experience.

## Glossary
<!--
Optional section which defines terms and abbreviations used in the rest of the document.
-->

## Motivation
<!--
It tells **why** we need X.
Describe why the change is important and the benefits to users.
Explain the user problem that needs to be solved.
-->

At the moment, VDK users face multiple challenges related to logging in data
job runs. Based on user feedback we've received, these can be split into five categories.

### Logs on failures

Useful information gets lost in the noise or is sometimes missing altogether.
This is especially true for user errors. Some error messages don't contain any
useful information to begin with. We might also have incentivized some bad
practices with our error message format: some people tend to write placeholder
text, e.g. "WHAT HAPPENED: An error occurred" or "COUNTERMEASURES: Check
exception". These kinds of messages are extremely annoying from a user
perspective. When a step fails, we don't point to the actual line of code where
the failure happened, and the actual root cause is sometimes buried in the
stack trace. There are also lots of meta-logs, Impala query logs, etc., which
leads us to believe that a lot of our info-level logging should actually be at
the debug level.
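
To make the contrast concrete, here is a minimal, hypothetical sketch; the
exception class and field names below are illustrative only and are not the
actual VDK error API:

```python
# Hypothetical sketch only: the exception class and message fields below are
# illustrative and not the actual VDK error API.
class DataJobError(Exception):
    def __init__(self, what_happened: str, why_it_happened: str, countermeasures: str):
        super().__init__(
            f"WHAT HAPPENED: {what_happened}\n"
            f"WHY IT HAPPENED: {why_it_happened}\n"
            f"COUNTERMEASURES: {countermeasures}"
        )


# Placeholder text like this carries no actionable information for the user:
unhelpful = DataJobError(
    what_happened="An error occurred",
    why_it_happened="An error occurred",
    countermeasures="Check exception",
)

# A useful message names the step, the root cause, and a concrete next action:
helpful = DataJobError(
    what_happened="Step 20_ingest_users.py failed on line 42",
    why_it_happened="Column 'user_id' is missing from the source table",
    countermeasures="Check the upstream schema or update the ingestion query",
)

print(unhelpful)
print(helpful)
```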

### Logs on success

Users expect little or no log output on success. Users tend to add their own
logging to signify when a step started or ended; they expect these kinds of
logs to be highlighted somehow, or to be the only logs they see.
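
As an illustration of the step-boundary logging users add themselves today,
here is a minimal sketch; it assumes only the common `run(job_input)` step
convention and standard Python logging:

```python
import logging

log = logging.getLogger(__name__)


def run(job_input):
    # Users add markers like these themselves today and expect them to stand
    # out, or to be the only INFO output, in a successful run.
    log.info("Step 10_ingest_users: started")
    # ... actual step logic would go here ...
    log.info("Step 10_ingest_users: finished successfully")
```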

### Debug mode

Some users have trouble running jobs in debug mode and getting debug-level
logs. This points to a gap in our documentation.
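
For local runs, "debug mode" essentially means lowering the logging level. The
sketch below uses only the Python standard library; documenting the actual VDK
configuration options that achieve the same thing is exactly the gap described
above:

```python
import logging

# Minimal sketch using only the standard library. The equivalent VDK
# configuration options for switching a job run to debug-level logging are
# what the documentation mentioned above needs to cover.
logging.basicConfig(level=logging.DEBUG)

log = logging.getLogger(__name__)
log.debug("Visible only when debug-level logging is enabled")
log.info("Visible with the default configuration as well")
```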

### Progress Tracking

Users want better tracking of job progress (for DAGs, steps, ingestion).

### VDK exception handling and categorization

Error messages need more clarity and categorization, and likely more structured
error reporting. Users want to know whether an error is a syntax error, a data
error, a library issue, or a platform error. Just as important is visibility
into the steps or components where failures occur, so that people can quickly
find the principal issue.
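
A minimal sketch of what such categorization could look like; the class names
and hierarchy are hypothetical and do not correspond to an existing VDK API:

```python
# Hypothetical sketch: these class names are illustrative only and do not
# correspond to an existing VDK API.
class VdkError(Exception):
    """Base class carrying the category used for reporting and aggregation."""
    category = "unknown"


class UserError(VdkError):
    """Errors caused by the job's own code or queries, e.g. a syntax error."""
    category = "user"


class DataError(VdkError):
    """Errors caused by unexpected or malformed input data."""
    category = "data"


class PlatformError(VdkError):
    """Errors caused by the underlying infrastructure or VDK itself."""
    category = "platform"


def classify(error: Exception) -> str:
    # Reporting can then group failures by category and by the step or
    # component where they occurred, so the principal issue is visible at a glance.
    return error.category if isinstance(error, VdkError) else "unknown"


print(classify(DataError("column 'user_id' is missing")))  # -> "data"
```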

Logging and error handling should be improved in ways that address all these
challenges. This will result in less time spent troubleshooting, fewer support
queries, more successful data job runs, and an overall better user experience.

## Requirements and goals
<!--
It tells **what** it is trying to achieve.
List the specific goals (functional and nonfunctional requirements). How will we
know that this has succeeded?

Specify non-goals. Clearly, the list of non-goals can't be exhaustive.
Non-goals are only features, which a contributor can reasonably assume were a goal.
One example is features that were cut during scoping.
-->
1. Data job run logs provide useful and readable information
1. Data job run logs provide progress tracking information
   1. current step
   1. ingested data
   1. the amount of resources currently consumed
   1. progress bars for local runs
1. User logs stand out
1. DAGs are traceable in the logs
   1. Logs show the current job name
   1. Logs show the parent job name for the current job
   1. DAG jobs that run in parallel can be sorted and grouped, so that the logs for a single job can be extracted
1. Debug information is at the appropriate level, e.g. Impala query meta-logs are at debug level and not info
1. Finding information for failed runs is straightforward
   1. The root cause is immediately visible from the logs.
   1. When an error occurs, relevant information is visible on top of the error log (see the sketch after this list). This includes:
      1. why it happened
      1. where it happened (data job stage and line)
      1. most common troubleshooting steps
1. Changing the logging level is straightforward
   1. Changing the logging level is documented
   1. Changing the logging level is available in the cloud
1. The above points are valid for both local and cloud data job runs.
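
The sketch below illustrates the kind of summary the requirement above asks
for at the top of an error log; the format and field names are illustrative
only and would be settled in the detailed design:

```python
# Illustrative only: one possible shape for the summary shown at the top of an
# error log; the exact format and field names would be settled in the detailed design.
ERROR_SUMMARY_TEMPLATE = """\
ERROR SUMMARY
  What happened  : {what}
  Where          : step {step}, line {line}
  Why            : {why}
  Troubleshooting: {countermeasures}
"""

print(
    ERROR_SUMMARY_TEMPLATE.format(
        what="Step failed while executing an SQL query",
        step="30_load_dim_users.sql",
        line=12,
        why="Table 'dim_users' does not exist",
        countermeasures="Create the table or fix the table name in the query",
    )
)
```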

**Out of scope**

Issues related to VDK internal cloud deployments, e.g. write permissions, data
format issues, etc. These should be forwarded to the appropriate VMware team.


## High-level design

<!--
All the remaining sections tell **how** we are solving it.

This is where we get down to the specifics of what the proposal actually is.
This should have enough detail that reviewers can understand exactly what
you're proposing, but should not include things like API designs or
implementation. What is the desired outcome and how do we measure success?

Provide a valid UML Component diagram that focuses on the architecture changes
implementing the feature. For more details on how to write UML Component Spec -
see https://en.wikipedia.org/wiki/Component_diagram#External_links.

For every new component on the diagram, explain which goals it solves.
In this context, a component is any separate software process.

-->


## API design

<!--

Describe the changes and additions to the public API (if there are any).

For all API changes:

Include a Swagger URL for HTTP APIs, whether the API is RESTful or RPC-like.
PyDoc/Javadoc (or similar) for Python/Java changes.
Explain how the system handles API violations.
-->


## Detailed design
<!--
Dig deeper into each component. The section can be as long or as short as necessary.
Consider at least the below topics but you do not need to cover those that are not applicable.

### Capacity Estimation and Constraints
* Cost of data path: CPU cost per-IO, memory footprint, network footprint.
* Cost of control plane including cost of APIs, expected timeliness from layers above.
### Availability.
* For example: is it tolerant to failures? What happens when the service stops working?
### Performance.
* Consider performance of data operations for different types of workloads.
Consider performance of control operations
* Consider performance under steady state as well under various pathological scenarios,
e.g., different failure cases, partitioning, recovery.
* Performance scalability along different dimensions,
e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc.
### Database data model changes
### Telemetry and monitoring changes (new metrics).
### Configuration changes.
### Upgrade / Downgrade Strategy (especially if it might be breaking change).
* Data migration plan (it needs to be automated or avoided - we should not require user manual actions.)
### Troubleshooting
* What are possible failure modes.
* Detection: How can it be detected via metrics?
* Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
* Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
* Testing: Are there any tests for failure mode? If not, describe why.
### Operability
* What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system.
* What are the expected SLOs (Service Level Objectives).
### Test Plan
* Unit tests are expected. But are end-to-end tests necessary? Do we need to extend vdk-heartbeat?
* Are changes in CI/CD necessary?
### Dependencies
* On what services does the feature depend? Are there new (external) dependencies added?
### Security and Permissions
* How is access control handled?
* Is encryption in transport supported and how is it implemented?
* What data is sensitive within these components? How is this data secured?
* In-transit?
* At rest?
* Is it logged?
* What secrets are needed by the components? How are these secrets secured and attained?
-->


## Implementation stories
<!--
Optionally, describe what are the implementation stories (eventually we'd create github issues out of them).
-->

## Alternatives
<!--
Optionally, describe what alternatives has been considered.
Keep it short - if needed link to more detailed research document.
-->