vmware · DeltaMichael · Aug 22, 2023 · Aug 16, 2023
diff --git a/specs/vep-2448-vdk-run-logs-simplified-and-readable/README.md b/specs/vep-2448-vdk-run-logs-simplified-and-readable/README.md
@@ -5,24 +5,6 @@
 <!-- * **Status:** draft | implementable | implemented | rejected | withdrawn | replaced -->
 * **Status:** draft
 
-
-To get started with this template:
-
-- [x] **Create an issue in Github (if one does not exists already)**
-- [x] **Make a copy of this template directory.**
-  Copy this template into the specs directory and name it
-  `NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
-  leading-zero padding) created above.
-- [ ] **Fill out this file as best you can.**
-  There are instructions as HTML comments.
-  At minimum, you should fill in the "Summary" and "Motivation" sections.
-- [ ] **Create a PR for this VEP.**
-- [ ] **Merge early and iterate.**
-  Avoid getting hung up on specific details and instead aim to get the goals of
-  the VEP clarified and merged quickly. The best way to do this is to just
-  start with the high-level sections and fill out details incrementally in
-  subsequent PRs.
-
 <!-- Provide table of content as it's helpful. -->
 
 - [VEP-2448: VDK Run Logs: Simplified And Readable](#vep-2448-vdk-run-logs-simplified-and-readable)
@@ -40,12 +22,21 @@ To get started with this template:
     - [Log Less](#log-less)
       - [Logs not at the apprpopriate level](#logs-not-at-the-apprpopriate-level)
       - [Multi-layered logging](#multi-layered-logging)
-    - [Don't Repeat Yourself (DRY)](#dont-repeat-yourself-dry)
     - [Clean Error Handling](#clean-error-handling)
     - [Progress Indicators](#progress-indicators)
   - [API design](#api-design)
   - [Detailed design](#detailed-design)
+    - [Log Structure](#log-structure-1)
+    - [Log Less](#log-less-1)
+    - [Clean Error Handling](#clean-error-handling-1)
+    - [Progress Indicators](#progress-indicators-1)
   - [Implementation stories](#implementation-stories)
+    - [Log Structure](#log-structure-2)
+    - [Log Less](#log-less-2)
+    - [Clean Error Handling](#clean-error-handling-2)
+    - [Progress Indicators](#progress-indicators-2)
+    - [Documentation](#documentation)
+    - [Promotional Materials](#promotional-materials)
   - [Alternatives](#alternatives)
 
 ## Summary
@@ -182,7 +173,7 @@ In this context, a component is any separate software process.
 
 -->
 
-Based on user feedback, we've identified 5 workstreams.
+Based on user feedback, we've identified 4 workstreams.
 
 ### Log Structure
 
@@ -238,13 +229,6 @@ Full error log /tmp/30618c1b-677b-4f96-86a3-dda26011b3d8-1690469444-20a99/error.
 ```
 `[Cloud]` Full error log should be output to stderr
 
-### Don't Repeat Yourself (DRY)
-
-`[Local][Cloud]` Remove any repeating info in success logs. Success logs should
-show which step has started or ended and the success status at the end. Note
-that locally, this could be entirely replaced by progress bars, but users should
-be able to choose between one or the other.
-
 ### Clean Error Handling
 
 `[Local][Cloud]` Do away with the log-and-throw and log-and-rethrow patterns.
@@ -315,58 +299,92 @@ Explain how does the system handle API violations.
 
 
 ## Detailed design
-<!--
-Dig deeper into each component. The section can be as long or as short as necessary.
-Consider at least the below topics but you do not need to cover those that are not applicable.
-
-### Capacity Estimation and Constraints
-    * Cost of data path: CPU cost per-IO, memory footprint, network footprint.
-    * Cost of control plane including cost of APIs, expected timeliness from layers above.
-### Availability.
-    * For example - is it tolerant to failures, What happens when the service stops working
-### Performance.
-    * Consider performance of data operations for different types of workloads.
-       Consider performance of control operations
-    * Consider performance under steady state as well under various pathological scenarios,
-       e.g., different failure cases, partitioning, recovery.
-    * Performance scalability along different dimensions,
-       e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc.
-### Database data model changes
-### Telemetry and monitoring changes (new metrics).
-### Configuration changes.
-### Upgrade / Downgrade Strategy (especially if it might be breaking change).
-  * Data migration plan (it needs to be automated or avoided - we should not require user manual actions.)
-### Troubleshooting
-  * What are possible failure modes.
-    * Detection: How can it be detected via metrics?
-    * Mitigations: What can be done to stop the bleeding, especially for already
-      running user workloads?
-    * Diagnostics: What are the useful log messages and their required logging
-      levels that could help debug the issue?
-    * Testing: Are there any tests for failure mode? If not, describe why._
-### Operability
-  * What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system.
-  * What are the expected SLOs (Service Level Objectives).
-### Test Plan
-  * Unit tests are expected. But are end to end test necessary. Do we need to extend vdk-heartbeat ?
-  * Are there changes in CICD necessary
-### Dependencies
-  * On what services the feature depends on ? Are there new (external) dependencies added?
-### Security and Permissions
-  How is access control handled?
-  * Is encryption in transport supported and how is it implemented?
-  * What data is sensitive within these components? How is this data secured?
-      * In-transit?
-      * At rest?
-      * Is it logged?
-  * What secrets are needed by the components? How are these secrets secured and attained?
--->
+
+### Log Structure
+
+Users should be able to override the default logging format structure so that
+metadata fields they don't care about are hidden. This should happen through
+standard vdk configuration, e.g. config.ini or env variables. Users should be
+able to pass something like
+
+```
+[vdk]
+log_config=[timestamp, vdk_tag, line_number]
+```
+
+in config.ini. Operators should also be able to pass the same kind of config
+when deploying the control service.
+
+https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/values.yaml#L324
+
+The user configuration should take precedence over the global one.
+
+Users should be able to call `log.info("xxx")` inside a data job and have vdk
+automatically add structure around it.
+
+Plugins should be able to declare and inject new metadata fields, e.g. DAGs
+should be able to add a `dag_name` field or a `current_job` field.
+
+The above requirements are more than likely achievable using structlog.
+
+https://www.structlog.org/en/stable/
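
A minimal sketch of the configurable format described above, for discussion only. It uses Python's standard `logging` module rather than structlog, and every name in it (`FIELD_FORMATS`, `parse_log_config`, `resolve_log_config`, `build_format`, `StaticFieldFilter`, the `example-job` tag) is hypothetical and not existing vdk API.

```python
import io
import logging

# Hypothetical mapping from configurable field names to logging format fragments.
FIELD_FORMATS = {
    "timestamp": "%(asctime)s",
    "level": "%(levelname)s",
    "logger_name": "%(name)s",
    "line_number": "%(lineno)d",
    "vdk_tag": "%(vdk_tag)s",
}


def parse_log_config(raw):
    """Parse a value such as '[timestamp, vdk_tag, line_number]' into field names."""
    return [name.strip() for name in raw.strip().strip("[]").split(",") if name.strip()]


def resolve_log_config(user_value, operator_value, default_value):
    """User configuration wins over the operator (global) one, then the default."""
    for value in (user_value, operator_value, default_value):
        if value is not None:
            return parse_log_config(value)
    return []


def build_format(fields):
    """Build a logging format string; the message itself is always appended."""
    unknown = [field for field in fields if field not in FIELD_FORMATS]
    if unknown:
        raise ValueError(f"Unknown log fields: {unknown}")
    return " ".join([f"[{FIELD_FORMATS[field]}]" for field in fields] + ["%(message)s"])


class StaticFieldFilter(logging.Filter):
    """Attach extra metadata (for example a plugin-provided field) to every record."""

    def __init__(self, **fields):
        super().__init__()
        self._fields = fields

    def filter(self, record):
        for key, value in self._fields.items():
            setattr(record, key, value)
        return True


def render_example():
    """Log one message through the configured format and return the rendered text."""
    fields = resolve_log_config("[level, vdk_tag, line_number]", "[timestamp]", None)
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(build_format(fields)))
    handler.addFilter(StaticFieldFilter(vdk_tag="example-job"))
    logger = logging.getLogger("log_structure_sketch")
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.propagate = False
    logger.setLevel(logging.INFO)
    logger.info("xxx")
    return stream.getvalue()
```

The data job code only calls `logger.info("xxx")`; the surrounding structure comes from the formatter and the filter, which is the same division of responsibility the section asks for.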
+
+### Log Less
+
+We should make sure that logs throughout vdk have the correct log level. Any
+logs that don't should be moved to the appropriate level, e.g. we seem to have
+lots of info logs that should be on the debug level.
+
+Users can already change the log level for specific modules in their
+data job with the `log_level_module` option. We should add documentation around
+this.
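
A rough sketch of what per-module log levels amount to, which may be useful when writing that documentation. It assumes the option value is a semicolon-separated list of `module=LEVEL` pairs; that format should be checked against the vdk configuration help. The helper names are hypothetical and this is not the vdk implementation.

```python
import logging


def parse_log_level_module(value):
    """Parse 'a.b=DEBUG;c=ERROR' into {'a.b': 'DEBUG', 'c': 'ERROR'}."""
    levels = {}
    for pair in value.split(";"):
        if not pair.strip():
            continue  # tolerate empty segments such as a trailing ';'
        module, separator, level = pair.partition("=")
        if not separator or not module.strip() or not level.strip():
            raise ValueError(f"Expected module=LEVEL, got: {pair!r}")
        levels[module.strip()] = level.strip().upper()
    return levels


def apply_log_levels(levels):
    """Set the level on each named logger; child loggers inherit it."""
    for module, level in levels.items():
        logging.getLogger(module).setLevel(level)
```

For example, `apply_log_levels(parse_log_level_module("my_job.steps=DEBUG;urllib3=ERROR"))` would make one module verbose while quieting a noisy dependency.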
+
+### Clean Error Handling
+
+Error messages should clearly state the problem, without placeholder text or
+repeated log output, so that users can immediately understand what went wrong.
+Users should be able to see the original exception when it is passed up the call
+stack and surfaces in user code, so they can handle it.
+
+VDK developers should be discouraged from using generic error messages like "An
+error occurred". More specific messages give users more meaningful feedback.
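
As an illustration of the alternative to log-and-rethrow, the sketch below adds context by chaining the original exception and logs only once, at the top-level boundary. `PayloadReadError`, `read_payload` and `run_step` are hypothetical names, not existing vdk code.

```python
import logging

log = logging.getLogger(__name__)


class PayloadReadError(Exception):
    """Hypothetical domain error; not an existing vdk exception."""


def read_payload(path):
    try:
        with open(path, encoding="utf-8") as payload_file:
            return payload_file.read()
    except OSError as e:
        # No log call here: add a specific message and chain the original exception.
        raise PayloadReadError(f"Cannot read payload file '{path}'") from e


def run_step(path):
    """Top-level boundary: the only place where the failure is logged."""
    try:
        read_payload(path)
        return 0
    except PayloadReadError as e:
        cause = e.__cause__
        log.error("%s (caused by %s: %s)", e, type(cause).__name__, cause)
        return 1
```

Because the original exception is kept as `__cause__`, code further up the stack can still inspect or handle it, and the message names the file instead of saying that an error occurred.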
+
+### Progress Indicators
+
+Data jobs in the CLI should display progress indicators by default instead of
+logs to stdout/stderr. Users should be given the option to switch between the
+progress indicator and logging using an option in config.ini. This switch should
+be available at the global level when deploying vdk to production environments.
+
+On error, the error message should give the root cause and line number of the
+failure. The full logs should be available inside a temp file. The link to the
+temp file should also be part of the CLI error message.
+
+We should consider implementing an asynchronous mechanism for updating progress
+indicators. This would make sure that the primary task doesn’t get interrupted
+or slowed down due to progress bar updates. This is especially relevant for
+ingestion.
+
+Because progress indicators are needed in multiple places, we likely need a
+common encapsulation for them, in the form of a Python module or plugin. The
+progress indicator should provide a notification callback mechanism or similar
+so that it can be integrated with notebooks.
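
One possible shape for such a shared component, sketched under assumptions: a hypothetical `ProgressTracker` whose `advance` call only enqueues an update, while a background thread dispatches to registered callbacks (a CLI progress bar, a notebook widget). None of these names exist in vdk today.

```python
import queue
import threading


class ProgressTracker:
    """Hypothetical progress tracker with asynchronous callback dispatch."""

    def __init__(self, total):
        self.total = total
        self._done = 0  # only touched by the dispatcher thread
        self._callbacks = []
        self._updates = queue.Queue()
        self._worker = threading.Thread(target=self._dispatch, daemon=True)
        self._worker.start()

    def add_callback(self, callback):
        """Register callback(done, total); register before the first advance()."""
        self._callbacks.append(callback)

    def advance(self, amount=1):
        """Cheap for the caller: the update is queued, not rendered inline."""
        self._updates.put(amount)

    def close(self):
        """Flush pending updates and stop the dispatcher thread."""
        self._updates.put(None)
        self._worker.join()

    def _dispatch(self):
        while True:
            amount = self._updates.get()
            if amount is None:
                return
            self._done += amount
            for callback in list(self._callbacks):
                callback(self._done, self.total)
```

An ingestion loop would call `tracker.advance()` per payload and `tracker.close()` at the end; slow rendering in a callback then delays only the dispatcher thread, not the ingestion itself.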
 
 
 ## Implementation stories
-<!--
-Optionally, describe what are the implementation stories (eventually we'd create github issues out of them).
--->
+
+### [Log Structure](https://github.com/vmware/versatile-data-kit/milestone/14)
+
+### [Log Less](https://github.com/vmware/versatile-data-kit/milestone/15)
+
+### [Clean Error Handling](https://github.com/vmware/versatile-data-kit/milestone/16)
+
+### [Progress Indicators](https://github.com/vmware/versatile-data-kit/milestone/17)
+
+### [Documentation](https://github.com/vmware/versatile-data-kit/milestone/18)
+
+### [Promotional Materials](https://github.com/vmware/versatile-data-kit/milestone/19)
+
 
 ## Alternatives
 <!--