-
Notifications
You must be signed in to change notification settings - Fork 59
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
docs/vdk-core: VEP-2448: VDK Run Logs: Simplified And Readable
Why? Part of the run logs improvement initiative #2408 What? Create VEP document and fill out Summary, Motivation and Goals sections How has this been tested? n/a What type of change are you making? docs Signed-off-by: Dilyan Marinov <[email protected]>
- Loading branch information
Dilyan Marinov
committed
Jul 24, 2023
1 parent
f9a290b
commit 24ccc12
Showing
1 changed file
with
244 additions
and
0 deletions.
There are no files selected for viewing
244 changes: 244 additions & 0 deletions
244
specs/vep-2448-vdk-run-logs-simplified-and-readable/README.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,244 @@ | ||
|
||
# VEP-2448: VDK Run Logs: Simplified And Readable | ||
|
||
* **Author(s):** Dilyan Marinov ([email protected]), Antoni Ivanov ([email protected]) | ||
<!-- * **Status:** draft | implementable | implemented | rejected | withdrawn | replaced --> | ||
* **Status:** draft | ||
|
||
|
||
To get started with this template: | ||
|
||
- [x] **Create an issue in Github (if one does not exists already)** | ||
- [x] **Make a copy of this template directory.** | ||
Copy this template into the specs directory and name it | ||
`NNNN-short-descriptive-title`, where `NNNN` is the issue number (with no | ||
leading-zero padding) created above. | ||
- [ ] **Fill out this file as best you can.** | ||
There are instructions as HTML comments. | ||
At minimum, you should fill in the "Summary" and "Motivation" sections. | ||
- [ ] **Create a PR for this VEP.** | ||
- [ ] **Merge early and iterate.** | ||
Avoid getting hung up on specific details and instead aim to get the goals of | ||
the VEP clarified and merged quickly. The best way to do this is to just | ||
start with the high-level sections and fill out details incrementally in | ||
subsequent PRs. | ||
|
||
<!-- Provide table of content as it's helpful. --> | ||
|
||
- [VEP-2448: VDK Run Logs: Simplified And Readable](#vep-2448-vdk-run-logs-simplified-and-readable) | ||
- [Summary](#summary) | ||
- [Glossary](#glossary) | ||
- [Motivation](#motivation) | ||
- [Logs on failures](#logs-on-failures) | ||
- [Logs on success](#logs-on-success) | ||
- [Debug mode](#debug-mode) | ||
- [Progress Tracking](#progress-tracking) | ||
- [VDK exception handling and categorization](#vdk-exception-handling-and--categorization) | ||
- [Requirements and goals](#requirements-and-goals) | ||
- [High-level design](#high-level-design) | ||
- [API design](#api-design) | ||
- [Detailed design](#detailed-design) | ||
- [Implementation stories](#implementation-stories) | ||
- [Alternatives](#alternatives) | ||
|
||
## Summary | ||
|
||
<!-- | ||
Short summary of the proposal. It will be used as user-focused | ||
documentation such as release notes or a (customer facing) development roadmap. | ||
The tone and content of the `Summary` section should be | ||
useful for a wide audience. | ||
--> | ||
The VDK user experience when it comes to using logs for data job runs is lacking | ||
in several areas. This makes it hard to debug failed data job runs and get | ||
useful information out of successful data job runs. The following sections | ||
provide an in-depth view of improvemnts to VDK run logs which address a number | ||
of user pain points, such as noisy logs, overly verbose error messages, | ||
insufficient error information, lack of documentation, lack of progress tracking | ||
features, etc. Resolving these pain points will result in a better user | ||
experience. | ||
|
||
## Glossary | ||
<!-- | ||
Optional section which defines terms and abbreviations used in the rest of the document. | ||
--> | ||
|
||
## Motivation | ||
<!-- | ||
It tells **why** do we need X? | ||
Describe why the change is important and the benefits to users. | ||
Explain the user problem that need to be solved. | ||
--> | ||
|
||
At the moment, VKD users face multiple challenges related to logging in data job | ||
runs. Based on user feedback we've received, these can be split into 5 categories. | ||
|
||
### Logs on failures | ||
|
||
Useful information gets lost in the noise or sometimes just | ||
missing This is especially true for user errors. There are also some error | ||
messages that just don't contain any useful information to begin with We might | ||
have incentivized some bad practices with our error message format. Some people | ||
tend to just write placeholder text, e.g. "WHAT HAPPENED: An error occurred", or | ||
"COUNTERMEASURES: Check exception". These kinds of messages are extremely | ||
annoying from an user perspective. When a step fails, we don't point to the | ||
actual line of the code where the failure happened Actual root causes are | ||
sometimes buried in the stack There's lots of meta-logs, impala query logs, etc. | ||
This leads us to believe that we have a lot of info logging that should actually | ||
be on the debug level. | ||
|
||
### Logs on success | ||
|
||
Users expect little or no log output on success. Users tend to add their own | ||
logging to signify when a step started or ended they expect these kinds of logs | ||
to be highlighted somehow, or be the only logs they see | ||
|
||
### Debug mode | ||
|
||
Some users have trouble running jobs in debug mode and getting debug level logs | ||
This points to a gap in our documentation | ||
|
||
### Progress Tracking | ||
|
||
Users want better tracking of job progress (for DAGs, steps, ingestion) | ||
|
||
### VDK exception handling and categorization | ||
|
||
There is a need for more clarity and categorization in error messages, likely | ||
more structured error reporting, such as knowing whether an error is a syntax | ||
error, data error, library issue, or platform error, as very important | ||
visibility of the steps or components where failures occur so people can quickly | ||
find out the principal issue | ||
|
||
Logging and error handling should be improved in ways that adress all these | ||
challenges. This will result in less time spent troubleshooting, fewer support | ||
queries, more data job runs that are successful and an overall better user | ||
experience. | ||
|
||
## Requirements and goals | ||
<!-- | ||
It tells **what** is it trying to achieve? | ||
List the specific goals (functional and nonfunctional requirements)? How will we | ||
know that this has succeeded? | ||
Specify non-goals. Clearly, the list of non-goals can't be exhaustive. | ||
Non-goals are only features, which a contributor can reasonably assume were a goal. | ||
One example is features that were cut during scoping. | ||
--> | ||
1. Data job run logs should keep the noise to a minimum and highlight useful information | ||
1. Debug information should be at the appropriate level, e.g. impala query meta logs | ||
1. Data job runs should provide progress tracking information out of the box, | ||
for example when the data job started, what the current stage is, ingested data, | ||
progress bars or the amount of resources currently consumed. | ||
1. User logs should be highlighted or stand out in some other way | ||
1. DAGs should be traceable in the logs | ||
1. VDK logs for data job runs (VDK Run Logs) should deliver useful information in | ||
the most straightforward way possible. | ||
1. In the case of failed data jobs runs, the root cause should be immediately visible from the logs. | ||
2. The user should be able to quickly find out what kind of error occurred, | ||
why it happened, where it happened (data job stage and line) and what the most | ||
common troubleshooting steps are. | ||
1. When troubleshooting a failed data job run, users should be able to easily | ||
add debug logs | ||
1. This should be available in the VDK documentation | ||
1. The above points should be true both for local and for cloud data job runs. | ||
|
||
**Out of scope** | ||
|
||
Issues related to VDK internal cloud deployments, e.g. write permissions, data | ||
format issues, etc. These should be forwarded to the appropriate VMWare team. | ||
|
||
|
||
## High-level design | ||
|
||
<!-- | ||
All the rest sections tell **how** are we solving it? | ||
This is where we get down to the specifics of what the proposal actually is. | ||
This should have enough detail that reviewers can understand exactly what | ||
you're proposing, but should not include things like API designs or | ||
implementation. What is the desired outcome and how do we measure success? | ||
Provide a valid UML Component diagram that focuses on the architecture changes | ||
implementing the feature. For more details on how to write UML Component Spec - | ||
see https://en.wikipedia.org/wiki/Component_diagram#External_links. | ||
For every new component on the diagram, explain which goals does it solve. | ||
In this context, a component is any separate software process. | ||
--> | ||
|
||
|
||
## API design | ||
|
||
<!-- | ||
Describe the changes and additions to the public API (if there are any). | ||
For all API changes: | ||
Include Swagger URL for HTTP APIs, no matter if the API is RESTful or RPC-like. | ||
PyDoc/Javadoc (or similar) for Python/Java changes. | ||
Explain how does the system handle API violations. | ||
--> | ||
|
||
|
||
## Detailed design | ||
<!-- | ||
Dig deeper into each component. The section can be as long or as short as necessary. | ||
Consider at least the below topics but you do not need to cover those that are not applicable. | ||
### Capacity Estimation and Constraints | ||
* Cost of data path: CPU cost per-IO, memory footprint, network footprint. | ||
* Cost of control plane including cost of APIs, expected timeliness from layers above. | ||
### Availability. | ||
* For example - is it tolerant to failures, What happens when the service stops working | ||
### Performance. | ||
* Consider performance of data operations for different types of workloads. | ||
Consider performance of control operations | ||
* Consider performance under steady state as well under various pathological scenarios, | ||
e.g., different failure cases, partitioning, recovery. | ||
* Performance scalability along different dimensions, | ||
e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc. | ||
### Database data model changes | ||
### Telemetry and monitoring changes (new metrics). | ||
### Configuration changes. | ||
### Upgrade / Downgrade Strategy (especially if it might be breaking change). | ||
* Data migration plan (it needs to be automated or avoided - we should not require user manual actions.) | ||
### Troubleshooting | ||
* What are possible failure modes. | ||
* Detection: How can it be detected via metrics? | ||
* Mitigations: What can be done to stop the bleeding, especially for already | ||
running user workloads? | ||
* Diagnostics: What are the useful log messages and their required logging | ||
levels that could help debug the issue? | ||
* Testing: Are there any tests for failure mode? If not, describe why._ | ||
### Operability | ||
* What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system. | ||
* What are the expected SLOs (Service Level Objectives). | ||
### Test Plan | ||
* Unit tests are expected. But are end to end test necessary. Do we need to extend vdk-heartbeat ? | ||
* Are there changes in CICD necessary | ||
### Dependencies | ||
* On what services the feature depends on ? Are there new (external) dependencies added? | ||
### Security and Permissions | ||
How is access control handled? | ||
* Is encryption in transport supported and how is it implemented? | ||
* What data is sensitive within these components? How is this data secured? | ||
* In-transit? | ||
* At rest? | ||
* Is it logged? | ||
* What secrets are needed by the components? How are these secrets secured and attained? | ||
--> | ||
|
||
|
||
## Implementation stories | ||
<!-- | ||
Optionally, describe what are the implementation stories (eventually we'd create github issues out of them). | ||
--> | ||
|
||
## Alternatives | ||
<!-- | ||
Optionally, describe what alternatives has been considered. | ||
Keep it short - if needed link to more detailed research document. | ||
--> |