
Commit

Update README.md with dark mode diagrams
antoniivanov committed Jul 27, 2023
1 parent 9502fac commit 5c339a3
Showing 2 changed files with 5,463 additions and 5,245 deletions.
185 changes: 98 additions & 87 deletions specs/architecture/README.md
The main goal of the Versatile Data Kit (VDK) is to enable efficient data engineering by:
* Encourage existing best practices in data engineering by making them easy to follow.
* Use only what you need - modular. And build quickly what you miss - extensible.


![](devops-data-vdk-cycle.png)
<!-- source of this picture is from https://github.com/vmware/versatile-data-kit/files/12063655/data-eng-and-devops-and-vdk.pptx -->

## [System Context Diagrams](https://c4model.com/#SystemContextDiagram)

Versatile Data Kit is conceptually composed of two independently usable components:

- **VDK SDK** - an ETL/ELT Python framework. Highly extensible; use it anywhere Python is installed.
- **Control Service** - a Kubernetes-based runtime and service for data job deployment, monitoring, and operations.


> Note: all diagrams in the document are based on https://c4model.com

<br>

The VDK SDK includes essential user tools for creating data jobs focusing on ingestion and transformation.
It can be installed on the user's machine for local development.
It follows a plugin-oriented architecture in which all functionality is built through plugins.
In this way, the VDK SDK can be rebuilt in any flavor, based on the requirements of the organization and its users.<br>
To get started, a default distribution is provided: `pip install quickstart-vdk`

The SDK is built with Python, a language renowned for its powerful libraries for data manipulation.
With the plugin-based architecture, developers can easily add functionality to the VDK SDK.
For example, they could add a plugin that allows the SDK to interact with a new type of database
or a plugin that adds a new data transformation functionality.
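
Below is a minimal sketch of what such a plugin can look like. It assumes the pluggy-style hook API that vdk-core plugins build on; the plugin class, option name, and default value are purely illustrative, not a definitive implementation.

```python
# A minimal sketch of a VDK SDK plugin, assuming the pluggy-style hook API
# used by vdk-core plugins. The class, option name, and default value are
# illustrative only.
from vdk.api.plugin.hookimpl import hookimpl


class ExamplePlugin:
    @hookimpl
    def vdk_configure(self, config_builder) -> None:
        # Register a hypothetical configuration option so that users can
        # override it through environment variables or config files.
        config_builder.add(
            key="example_option",
            default_value="enabled",
            description="Hypothetical option, shown for illustration only.",
        )
```

Packaged and registered as a setuptools entry point (e.g., under the `vdk.plugin.run` group), such a plugin is discovered automatically by the SDK at startup.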


![VDK SDK System Context Diagram](https://github.com/vmware/versatile-data-kit/assets/2536458/b006e4fe-c45a-4fab-a593-4b12368aa5e6)

<br>

The Control Service provides all the horizontal components required to configure and run [data jobs](https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job) on any cloud:
the UI and REST APIs for deployment, execution, properties, and secrets,
as well as job scheduling, logging, alerting, and monitoring.
It is tightly coupled with Kubernetes and relies on the Kubernetes ecosystem for many of its functionalities.

![VDK Control Service System Context Diagram](https://github.com/vmware/versatile-data-kit/assets/2536458/7b209157-d496-46bf-9cbf-7da337b5684e)

---

## VDK SDK Component Diagram

![VDK SDK Component Diagram](https://github.com/vmware/versatile-data-kit/assets/2536458/e6d59004-4262-4e5a-8b65-4e36c2d490c1)

Diving deeper into the VDK SDK, the entry point from the data practitioner's perspective is:
* the VDK Data Job Framework, which provides tools and interfaces to perform ETL operations on data through SQL or Python.

The Data Job Framework's `IJobInput` interface encapsulates a variety of functionalities and interfaces that provide a comprehensive toolkit:
- Ingester interface (IIngester) provides ways to ingest (or load) data into different destinations and remote stores (based on plugins).
  - Data engineers can use `send_tabular_data_for_ingestion` and `send_object_for_ingestion` to send data to remote stores (see the sketch after this list).
- Database Query and Connection interface (IManagedConnection) provides access to managed database connections and query execution.
  It is managed by VDK, so it can be configured and provides error-recovery mechanisms and parameter substitution.
  Through plugins, it is possible to collect lineage, provide query quality analysis, etc.
- There is a dedicated method `job_input.execute_query`
- or one can use `job_input.get_managed_connection()` to get the managed connection and use it like `pd.read_sql("*", con=job_input.get_managed_connection())`
- SQL steps are automatically executed using the configured managed connection
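
To make the interfaces above concrete, here is a minimal sketch of a Python data job step (e.g., a step file like `20_process_data.py`) that uses them. The table and destination names are hypothetical:

```python
# A minimal sketch of a Python data job step. VDK discovers the
# run(job_input) function and injects IJobInput, which exposes the
# ingestion and managed-connection interfaces described above.
import pandas as pd
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput) -> None:
    # Execute a query through the managed connection (hypothetical table).
    job_input.execute_query(
        "CREATE TABLE IF NOT EXISTS example_users (id INT, name VARCHAR)"
    )

    # Or use the managed connection directly, e.g. with pandas.
    df = pd.read_sql(
        "SELECT * FROM example_users", con=job_input.get_managed_connection()
    )

    # Send data to a remote store via the Ingester interface
    # (the destination table name is illustrative).
    for row in df.to_dict(orient="records"):
        job_input.send_object_for_ingestion(
            payload=row, destination_table="example_users_copy"
        )
```
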
Check out some more interesting and useful plugins in the [vdk-plugins directory](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins).

[quickstart-vdk](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/quickstart-vdk) is a distribution packaged with the most useful plugins to get started with VDK.
Users and organizations are encouraged to create their own distributions for their specific purposes.




## VDK Control Service Container Diagram


![VDK Control Service Container Diagram](https://github.com/vmware/versatile-data-kit/assets/2536458/5941a4f6-d642-4bef-9c1d-0915c987ef86)


Diving deeper into the VDK Control Service, the entry points from the data practitioner's perspective are:
* the VDK Operations UI, which is used for operating and monitoring data jobs.
* the [VDK Rest API](https://iaclqhm5xk.execute-api.us-west-1.amazonaws.com/data-jobs/swagger-ui/index.html#/) (through the VDK Control CLI or Notebook UI), which is used to deploy and configure data jobs.

From the operator's perspective:
* Operators use the provided [helm chart](https://github.com/vmware/versatile-data-kit/wiki/Installation#install-versatile-data-kit-control-service) to install and configure the Control Service deployment and the needed data infrastructure.
* Anyone can try a local demo version by installing `quickstart-vdk` and running `vdk server --install`.

#### Control Service Rest APIs

[Control Service Rest API](https://iaclqhm5xk.execute-api.us-west-1.amazonaws.com/data-jobs/swagger-ui/index.html) streamlines and manages key stages of the development lifecycle:
- Deploy the data job (with a "single click") to the Control Service, which takes care of the build, release, and deploy phases of the DevOps cycle.
- Manage (operate and monitor) the jobs through Control Service monitoring and alerting and the VDK Operations UI.

###### APIs

- Jobs API registers new jobs and stores job configuration.
- Source API uploads the job source and provides a version of the data job's source (a "release").
- Deployment API makes sure code is built and deployed with the correct configuration.
- Execution API tracks all executions, exposing needed metrics and data, and can start new executions (illustrated in the sketch after this list).
- Properties and Secrets APIs are additional services that help data job users keep state and credentials.
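
As a hedged illustration (not authoritative API documentation), the sketch below lists a data job's executions via the Execution API using Python's `requests`. The host, team, job name, and token are placeholders, and the exact path should be verified against the Swagger definition linked above.

```python
# A hedged sketch of listing a data job's executions via the Execution API.
# The host, team, job name, and OAuth2 token below are placeholders; consult
# the Control Service Swagger UI for the authoritative paths and schemas.
import requests

BASE_URL = "https://control-service.example.com/data-jobs"  # placeholder host
TEAM, JOB = "my-team", "my-job"  # hypothetical team and job names
TOKEN = "<oauth2-access-token>"  # obtained from your OAuth2 provider

response = requests.get(
    f"{BASE_URL}/for-team/{TEAM}/jobs/{JOB}/executions",
    headers={"Authorization": f"Bearer {TOKEN}"},  # OAuth2 bearer auth
    timeout=30,
)
response.raise_for_status()
for execution in response.json():
    print(execution)
```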

###### Implementation
The Rest API is implemented in Java with the Spring Boot framework.

Operators can also configure the VDK Control Service to use the existing logging and monitoring systems in the organization's ecosystem,
using environment variables or system properties.

In production, the REST API runs as a Kubernetes Deployment in a Kubernetes cluster.

###### Security
The REST API implements authentication and authorization through OAuth2.
Complex authorization logic can be achieved using an authorization webhook.
All communication between the Control Service, the VDK SDK, and the managed components (like databases) can be encrypted.

###### Integration with the Kubernetes Ecosystem

The Control Service relies heavily on the Kubernetes ecosystem for scheduling and orchestration,
logging services (for example, fluentd can be configured to export logs to a log aggregator),
and monitoring (metrics can easily be exported to Prometheus).

###### Reliability
The Control Service leverages the resilience and fault-tolerance capabilities of Kubernetes
for scheduling and orchestration (automatic restarts, load balancing).

Errors encountered during data job execution, such as connectivity issues with external systems or
unanticipated exceptions in the data job code, are tracked by the Execution API.

Upon deployment, the provided [vdk-heartbeat](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-heartbeat) tool
can be used to verify that the deployment is working correctly.

#### VDK Operations UI
The VDK Operations UI is an Angular-based web application. It relies on the VDK Rest API, and in particular on the [Rest GraphQL Query API](https://iaclqhm5xk.execute-api.us-west-1.amazonaws.com/data-jobs/swagger-ui/index.html#/Data%20Jobs/jobsQuery) for almost all read operations.
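
For illustration, here is a hedged sketch of the kind of read query the Operations UI performs. The endpoint shape and the GraphQL field names are assumptions that should be verified against the Swagger definition linked above.

```python
# A hedged sketch of a read query against the GraphQL Query API.
# The endpoint path and the field names in the query are illustrative
# assumptions; verify them against the jobsQuery Swagger definition.
import requests

BASE_URL = "https://control-service.example.com/data-jobs"  # placeholder host
TOKEN = "<oauth2-access-token>"  # placeholder credential

graphql_query = """
{
  jobs(pageNumber: 1, pageSize: 10, filter: []) {
    content {
      jobName
    }
  }
}
"""

response = requests.get(
    f"{BASE_URL}/for-team/my-team/jobs",  # hypothetical team name
    params={"query": graphql_query},
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```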

#### Builder Job
Builder Jobs are used to build and release users' data jobs.
These are system jobs used during deployment to install all dependencies and package the data job.
Builder Jobs interact with Git to read the source code and with a container registry (e.g., Docker Registry) in order to store the data job image.

When a user deploys a data job, a data job container image is created for it and deployed as a CronJob in Kubernetes.

Operators can provide a custom builder image with further capabilities
(for example, running system tests to verify data jobs, or security hardening such as checking for malicious code).

#### Data Job

A data job is deployed as a CronJob in Kubernetes. The CronJob pulls the data job image from the container registry upon each execution. Data jobs run in Kubernetes and are monitored in the cloud runtime environment via Prometheus, which uses the Kubernetes APIs to read metrics.

#### Database

CockroachDB is used to store data job metadata, including information about data job definitions and data job executions.
