specs: vep-2448 detailed design section #2558

Merged
merged 1 commit into from
Aug 22, 2023
170 changes: 94 additions & 76 deletions specs/vep-2448-vdk-run-logs-simplified-and-readable/README.md
@@ -5,24 +5,6 @@
<!-- * **Status:** draft | implementable | implemented | rejected | withdrawn | replaced -->
* **Status:** draft


To get started with this template:

- [x] **Create an issue in GitHub (if one does not exist already)**
- [x] **Make a copy of this template directory.**
Copy this template into the specs directory and name it
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no
leading-zero padding) created above.
- [ ] **Fill out this file as best you can.**
There are instructions as HTML comments.
At minimum, you should fill in the "Summary" and "Motivation" sections.
- [ ] **Create a PR for this VEP.**
- [ ] **Merge early and iterate.**
Avoid getting hung up on specific details and instead aim to get the goals of
the VEP clarified and merged quickly. The best way to do this is to just
start with the high-level sections and fill out details incrementally in
subsequent PRs.

<!-- Provide table of content as it's helpful. -->

- [VEP-2448: VDK Run Logs: Simplified And Readable](#vep-2448-vdk-run-logs-simplified-and-readable)
@@ -40,12 +22,21 @@ To get started with this template:
- [Log Less](#log-less)
- [Logs not at the appropriate level](#logs-not-at-the-apprpopriate-level)
- [Multi-layered logging](#multi-layered-logging)
- [Don't Repeat Yourself (DRY)](#dont-repeat-yourself-dry)
- [Clean Error Handling](#clean-error-handling)
- [Progress Indicators](#progress-indicators)
- [API design](#api-design)
- [Detailed design](#detailed-design)
- [Log Structure](#log-structure-1)
- [Log Less](#log-less-1)
- [Clean Error Handling](#clean-error-handling-1)
- [Progress Indicators](#progress-indicators-1)
- [Implementation stories](#implementation-stories)
- [Log Structure](#log-structure-2)
- [Log Less](#log-less-2)
- [Clean Error Handling](#clean-error-handling-2)
- [Progress Indicators](#progress-indicators-2)
- [Documentation](#documentation)
- [Promotional Materials](#promotional-materials)
- [Alternatives](#alternatives)

## Summary
@@ -182,7 +173,7 @@ In this context, a component is any separate software process.

-->

Based on user feedback, we've identified 5 workstreams.
Based on user feedback, we've identified 4 workstreams.

### Log Structure

@@ -238,13 +229,6 @@ Full error log /tmp/30618c1b-677b-4f96-86a3-dda26011b3d8-1690469444-20a99/error.
```
`[Cloud]` The full error log should be output to stderr.

### Don't Repeat Yourself (DRY)

`[Local][Cloud]` Remove any repeating info in success logs. Success logs should
show which step has started or ended and the success status at the end. Note
that locally, this could be entirely replaced by progress bars, but users should
be able to choose between one or the other.

### Clean Error Handling

`[Local][Cloud]` Do away with the log-and-throw and log-and-rethrow patterns.
@@ -315,58 +299,92 @@ Explain how does the system handle API violations.


## Detailed design
<!--
Dig deeper into each component. The section can be as long or as short as necessary.
Consider at least the below topics but you do not need to cover those that are not applicable.

### Capacity Estimation and Constraints
* Cost of data path: CPU cost per-IO, memory footprint, network footprint.
* Cost of control plane including cost of APIs, expected timeliness from layers above.
### Availability.
* For example - is it tolerant to failures, What happens when the service stops working
### Performance.
* Consider performance of data operations for different types of workloads.
Consider performance of control operations
* Consider performance under steady state as well under various pathological scenarios,
e.g., different failure cases, partitioning, recovery.
* Performance scalability along different dimensions,
e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc.
### Database data model changes
### Telemetry and monitoring changes (new metrics).
### Configuration changes.
### Upgrade / Downgrade Strategy (especially if it might be breaking change).
* Data migration plan (it needs to be automated or avoided - we should not require user manual actions.)
### Troubleshooting
* What are possible failure modes.
* Detection: How can it be detected via metrics?
* Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
* Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
* Testing: Are there any tests for failure mode? If not, describe why._
### Operability
* What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system.
* What are the expected SLOs (Service Level Objectives).
### Test Plan
* Unit tests are expected. But are end to end test necessary. Do we need to extend vdk-heartbeat ?
* Are there changes in CICD necessary
### Dependencies
* On what services the feature depends on ? Are there new (external) dependencies added?
### Security and Permissions
How is access control handled?
* Is encryption in transport supported and how is it implemented?
* What data is sensitive within these components? How is this data secured?
* In-transit?
* At rest?
* Is it logged?
* What secrets are needed by the components? How are these secrets secured and attained?
-->

### Log Structure

Users should be able to override the default log format so that metadata fields
they don't care about are hidden. This should happen through standard vdk
configuration, e.g. config.ini or environment variables. Users should be able to
pass something like

```
[vdk]
log_config=[timestamp, vdk_tag, line_number]
```

in config.ini. Operators should also be able to pass the same kind of config
when deploying the Control Service.

https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/values.yaml#L324

The user configuration should take precedence over the global one.
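
The precedence rule above can be sketched with the standard library alone. This is a minimal illustration, not the vdk implementation: the field names in `FIELD_FORMATS`, the default field list, and the helper names `parse_fields`, `resolve_fields` and `build_formatter` are all assumptions made for the example.

```python
import configparser
import logging

# Hypothetical mapping from field names to stdlib logging format directives.
FIELD_FORMATS = {
    "timestamp": "%(asctime)s",
    "level": "%(levelname)s",
    "logger_name": "%(name)s",
    "line_number": "%(lineno)d",
    "vdk_tag": "[VDK]",
}
# Hypothetical defaults used when neither user nor operator sets log_config.
DEFAULT_FIELDS = ["timestamp", "level", "logger_name", "line_number"]


def parse_fields(raw):
    """Parse a value such as '[timestamp, vdk_tag, line_number]' into a list."""
    return [f.strip() for f in raw.strip().strip("[]").split(",") if f.strip()]


def resolve_fields(user_ini_text=None, global_value=None):
    """User config wins over the operator-level value, which wins over defaults."""
    if user_ini_text:
        parser = configparser.ConfigParser()
        parser.read_string(user_ini_text)
        raw = parser.get("vdk", "log_config", fallback=None)
        if raw:
            return parse_fields(raw)
    if global_value:
        return parse_fields(global_value)
    return list(DEFAULT_FIELDS)


def build_formatter(fields):
    """Build a Formatter that shows only the selected fields, then the message."""
    parts = [FIELD_FORMATS[f] for f in fields if f in FIELD_FORMATS]
    return logging.Formatter(" ".join(parts + ["- %(message)s"]))
```

Unknown field names are silently ignored in this sketch; a real implementation would probably want to warn about them instead.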

Users should be able to call `log.info("xxx")` inside a data job and have vdk
automatically add structure around it.

Plugins should be able to declare and inject new metadata fields, e.g. DAGs
should be able to add a `dag_name` field or a `current_job` field.
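
One way to picture both requirements (plain `log.info` calls in user code, plus plugin-declared fields) is a standard `logging.Filter` attached to the handler. This is only a sketch under assumptions: `MetadataFilter`, its `bind` method, and the logger name are hypothetical, and the stdlib is used here purely for illustration.

```python
import io
import logging


class MetadataFilter(logging.Filter):
    """Attach extra metadata fields to every record passing through a handler."""

    def __init__(self):
        super().__init__()
        self._fields = {}

    def bind(self, **fields):
        # A plugin (e.g. a DAG plugin) would call this to declare its fields.
        self._fields.update(fields)

    def filter(self, record):
        for key, value in self._fields.items():
            setattr(record, key, value)
        return True  # never drop the record, only enrich it


stream = io.StringIO()  # stands in for stdout/stderr in this example
metadata = MetadataFilter()
metadata.bind(dag_name="-", current_job="-")  # defaults so formatting never fails

handler = logging.StreamHandler(stream)
handler.addFilter(metadata)
handler.setFormatter(
    logging.Formatter("%(levelname)s dag=%(dag_name)s job=%(current_job)s - %(message)s")
)

log = logging.getLogger("example_data_job")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

metadata.bind(dag_name="nightly", current_job="ingest_orders")
log.info("xxx")  # user code stays a plain log.info call
```

The user never passes `extra=` or formats anything by hand; the structure comes entirely from the handler configuration.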

The above requirements are more than likely achievable using structlog.

https://www.structlog.org/en/stable/

### Log Less

We should make sure that logs throughout vdk have the correct log level. Any
logs that don't should be moved to the appropriate level, e.g. we seem to have
many info logs that belong on the debug level.

Users can change the log level for specific modules in their data job with the
`log_level_module` option. We should add documentation around this.
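
To make the idea of per-module levels concrete, here is a small stdlib sketch. It assumes the option value is a semicolon-separated list of `module=LEVEL` pairs; that format and the helper name `apply_module_levels` are assumptions for illustration and should be checked against the actual vdk configuration documentation.

```python
import logging


def apply_module_levels(spec):
    """Apply levels from a spec such as 'a.b.c=INFO;foo.bar=ERROR'.

    Returns a dict of module name to numeric level for the pairs applied.
    """
    applied = {}
    for pair in filter(None, (p.strip() for p in spec.split(";"))):
        module, _, level_name = pair.partition("=")
        # getLevelName returns an int for known names, a string otherwise.
        level = logging.getLevelName(level_name.strip().upper())
        if not isinstance(level, int):
            raise ValueError(f"Unknown log level in {pair!r}")
        logging.getLogger(module.strip()).setLevel(level)
        applied[module.strip()] = level
    return applied
```

For example, `apply_module_levels("requests=WARNING; my_job.steps=DEBUG")` would quiet a noisy dependency while keeping debug output for the job's own steps.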

### Clean Error Handling

Error messages should clearly state the problem, without placeholder text or
repeated log output, so that users can directly understand what went wrong. When
an exception originates in user code and is passed up the call stack, users
should be able to see the original exception so they can handle it.

VDK developers should be discouraged from using generic error messages like "An
error occurred". Specific messages give users more meaningful feedback.
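
The sketch below illustrates one alternative to log-and-rethrow: add context once, chain the original exception with `raise ... from`, and report the failure a single time at the top-level boundary. `IngestionError`, `parse_row`, `ingest` and `run_job` are hypothetical names used only for this example.

```python
class IngestionError(Exception):
    """Hypothetical domain error used only for this illustration."""


def parse_row(raw):
    return int(raw)


def ingest(rows):
    for index, raw in enumerate(rows):
        try:
            parse_row(raw)
        except ValueError as e:
            # No logging here: add specific context and chain the original
            # exception so it stays available as __cause__.
            raise IngestionError(
                f"Row {index} has a non-numeric value {raw!r}; expected an integer."
            ) from e


def run_job(rows):
    """Report the failure once, at the top-level boundary."""
    try:
        ingest(rows)
        return "success"
    except IngestionError as e:
        return f"{e} (caused by {type(e.__cause__).__name__})"
```

Because the original `ValueError` is preserved as `__cause__`, user code that catches the error can still inspect and handle the underlying exception.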

### Progress Indicators

Data jobs in the CLI should display progress indicators by default instead of
logs to stdout/stderr. Users should be given the option to switch between the
progress indicator and logging using an option in config.ini. This switch should
also be available at the global level when deploying vdk to production
environments.

On error, the error message should give the root cause and line number of the
failure. The full logs should be available inside a temp file. The link to the
temp file should also be part of the CLI error message.
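
As a rough illustration of this error path, the following sketch reduces an exception to its root cause and location, and writes the full traceback to a temp file whose path is included in the short message. The function name `summarize_failure` and the temp file prefix are assumptions; the sketch follows only explicit `raise ... from` chains.

```python
import os
import tempfile
import traceback


def summarize_failure(exc):
    """Return (short_message, path_to_full_log) for an exception."""
    # Walk explicit `raise ... from` links to the innermost cause.
    root = exc
    while root.__cause__ is not None:
        root = root.__cause__

    # The last traceback frame of the root cause is where it was raised.
    frames = traceback.extract_tb(root.__traceback__)
    if frames:
        location = f"{os.path.basename(frames[-1].filename)}:{frames[-1].lineno}"
    else:
        location = "unknown location"

    # Write the complete, chained traceback to a temp file.
    fd, path = tempfile.mkstemp(prefix="vdk-run-", suffix="-error.log")
    with os.fdopen(fd, "w") as f:
        f.write("".join(traceback.format_exception(type(exc), exc, exc.__traceback__)))

    message = f"{type(root).__name__}: {root} (at {location})\nFull error log {path}"
    return message, path
```

The CLI would print only `message`, keeping the terminal output short while the temp file retains everything needed for debugging.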

We should consider implementing an asynchronous mechanism for updating progress
indicators. This would make sure that the primary task doesn't get interrupted
or slowed down by progress bar updates. This is especially relevant for
ingestion.
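
One possible shape for such a mechanism is a queue drained by a background thread, so the primary task only pays the cost of an enqueue. This is a sketch under assumptions, not a proposed API: the class name `AsyncProgress` and its methods are hypothetical.

```python
import queue
import threading


class AsyncProgress:
    """Collect progress updates and render them on a background thread."""

    def __init__(self, total, render):
        self.total = total
        self.done = 0
        self._render = render  # callable(done, total), runs off the main thread
        self._updates = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def advance(self, amount=1):
        # Cheap for the caller: enqueue and return immediately.
        self._updates.put(amount)

    def close(self):
        self._updates.put(None)  # sentinel: stop after pending updates
        self._worker.join()

    def _run(self):
        while True:
            amount = self._updates.get()
            if amount is None:
                return
            self.done += amount
            self._render(self.done, self.total)
```

An ingestion loop would call `advance()` per payload and `close()` at the end; slow rendering (for example terminal redraws) then never blocks the ingestion itself.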

Also, since we need progress indicators in multiple places, we likely need a
common encapsulation for them, in the form of a Python module or plugin. The
progress indicator should provide a notification callback mechanism or similar
so that it can be integrated with notebooks.
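
A minimal sketch of what such a shared component with a callback mechanism might look like follows. `ProgressTracker`, `subscribe` and `text_bar` are hypothetical names; a notebook integration would register its own callback instead of the text renderer.

```python
class ProgressTracker:
    """Track progress and notify any number of registered listeners."""

    def __init__(self, title, total):
        self.title = title
        self.total = total
        self.current = 0
        self._listeners = []

    def subscribe(self, callback):
        # callback(title, current, total) is invoked on every update.
        self._listeners.append(callback)

    def update(self, amount=1):
        self.current = min(self.total, self.current + amount)
        for callback in self._listeners:
            callback(self.title, self.current, self.total)


def text_bar(title, current, total, width=10):
    """Render a simple text progress bar, e.g. for the CLI."""
    filled = width * current // total if total else width
    return f"{title} [{'#' * filled}{'.' * (width - filled)}] {current}/{total}"
```

Keeping rendering out of the tracker means the same component can drive a CLI progress bar, a notebook widget, or plain log lines.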


## Implementation stories
<!--
Optionally, describe what are the implementation stories (eventually we'd create github issues out of them).
-->

### [Log Structure](https://github.com/vmware/versatile-data-kit/milestone/14)

### [Log Less](https://github.com/vmware/versatile-data-kit/milestone/15)

### [Clean Error Handling](https://github.com/vmware/versatile-data-kit/milestone/16)

### [Progress Indicators](https://github.com/vmware/versatile-data-kit/milestone/17)

### [Documentation](https://github.com/vmware/versatile-data-kit/milestone/18)

### [Promotional Materials](https://github.com/vmware/versatile-data-kit/milestone/19)


## Alternatives
<!--