specs: Introduce VEP-1739
This change introduces VEP-1739, which proposes an improvement to Versatile
Data Kit: adding support for using different python versions for data job
deployments.

Testing Done: N/A

Signed-off-by: Andon Andonov <[email protected]>
doks5 committed Mar 14, 2023 (1 parent 5a62a50, commit 3bb4163)
Showing 5 changed files with 230 additions and 0 deletions.
specs/vep-1739-multiple-python-versions/README.md (185 additions, 0 deletions)

# VEP-1739: Multiple Python Versions

* **Author(s):** Andon Andonov ([email protected])
* **Status:** draft

<!-- Provide table of content as it's helpful. -->

- [Summary](#summary)
- [Glossary](#glossary)
- [Motivation](#motivation)
- [Requirements and goals](#requirements-and-goals)
- [High-level design](#high-level-design)
- [API Design](#api-design)
- [Detailed design](#detailed-design)
- [Implementation stories](#implementation-stories)
- [Alternatives](#alternatives)

## Summary

---
Currently, when a data job is deployed, the Control Service uses vdk and data job base images that are configured once
and applied to all job deployments. If a data engineer decides they want to use a different python version for their
job, they need to ask their infrastructure administrator, or whoever is responsible for the Control Service deployment,
to change the configuration of the service and re-deploy it so that data jobs with a different python version can be
deployed. This, however, would break existing data jobs: the moment they are re-deployed, the new python version would
be applied to them, with unforeseeable consequences.

We want to allow users to deploy data jobs with different python versions without needing to re-deploy the Control
Service. To do this, we will extend the Control Service logic and API to support multiple python versions for data job
deployments.

## Glossary

---
* VDK: https://github.com/vmware/versatile-data-kit/wiki/dictionary#vdk
* Control Service: https://github.com/vmware/versatile-data-kit/wiki/dictionary#control-service
* Data Job: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job
* Data Job Deployment: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job-deployment
* Kubernetes: https://kubernetes.io/

## Motivation

---
As mentioned in the [Summary](#summary) section above, the vdk and data job base images are set per Control Service
deployment. This is not an issue in general, as it is assumed that the Versatile Data Kit administrators, responsible
for the Control Service deployment, have taken into account the data engineers' tech stack.

There are, however, situations when this might not be the case. For example, if the administrators of a Versatile Data
Kit deployment decide to keep an older python version (say 3.8) for all data job deployments, but a data engineer
working on a special use case needs a dependency that does not support anything below python 3.10, they would not be
able to deploy their data job, because the job would be built with python 3.8. To accommodate the special job, the
administrators would need to re-configure and re-deploy the Control Service. Although this may not be a big issue in
itself (setting aside the hassle of redeploying the whole Control Service for just one special job), it would break all
jobs whose dependencies rely on python versions older than 3.10, because once the service is set to 3.10, it will
deploy all data jobs with it.

In such cases, there are two main approaches that could be taken:
1) Look for a different package -- this works in most cases, as there are often multiple packages that solve the same
problem and are built for different python versions. However, depending on how specialized the problem at hand is,
there might be no alternative to a given package, or it might be necessary to use an older version of the package,
which could expose the data job to vulnerabilities patched in newer package releases.
2) Deploy a separate Control Service instance -- with this solution, a new instance of the Control Service would need
to be deployed and configured to use the newer python version. In addition, the vdk SDK would also need to be
reconfigured to point to the new Control Service instance, which may cause confusion among engineers who are not aware
that there are multiple Control Services and SDKs with different configurations. This may be acceptable if the
specialized data job has high priority, but having a completely separate Control Service instance for a single job
deployment is unreasonable.

To avoid situations where old or unsafe dependencies are used, and to avoid the need for separate Control Service
deployments, changes will be made to the Control Service API and deployment logic to allow a different python version
to be used per data job deployment. Additionally, minor changes will be made to the vdk-control-cli plugin to
facilitate the selection of a python version at job deployment.

## Requirements and goals

---
### Goals
* **Change the API to accommodate passing the python version to be used in data job deployments**
  * A data engineer developing a data job wants to use a specific python version for their job deployment. They need to
    be able to specify which python version they want to use in the config.ini file of the job, or as part of the body
    of the job deployment request in case they use the Control Service API directly and not through the vdk SDK.
* **Introduce a mechanism to configure which python versions are supported by the Control Service**
  * An administrator needs to be able to configure which python versions are supported by a Control Service deployment,
    and which vdk and job base images correspond to a certain python version (a configuration sketch follows this
    list).
* **Save the python version configuration in the Control Service's database.**
  * The python version configuration related to a specific data job deployment needs to be stored in the database
    alongside the rest of the data job's deployment configuration.
* **Add the python version used for a job's deployment to the job's cronjob spec.**
  * The python version used for a data job's deployment needs to be added as an annotation to the job's cronjob spec.
* **Update the vdk-control-cli plugin to allow it to read the python version from config.ini**
  * When a data engineer creates a data job and updates the job's config.ini file to set a specific python version to
    be used when the job is deployed, the python version needs to be read from the config file and passed to the
    Control Service.
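
As an illustration of the second goal, the configuration could map each supported python version to the pair of images
the Control Service should use. This is only a sketch; all names below (`supportedPythonVersions`, `baseImage`,
`vdkImage`, `defaultPythonVersion`, the registry URL) are assumptions, not the final design:

```yaml
# Hypothetical Control Service configuration: map each supported
# python version to the data job base image and vdk image to use.
supportedPythonVersions:
  "3.8":
    baseImage: "registry.example.com/data-job-base:3.8"
    vdkImage: "registry.example.com/vdk:3.8"
  "3.10":
    baseImage: "registry.example.com/data-job-base:3.10"
    vdkImage: "registry.example.com/vdk:3.10"
# Version applied when a deployment does not request one explicitly.
defaultPythonVersion: "3.8"
```
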
### Non-Goals
* **Extensive error handling.**
  * Some basic error handling will be added to avoid common issues with mismatched python versions, etc. However, at
    this stage it is not possible to foresee all corner cases that may arise, so extensive error handling will not be
    added as part of this initiative.
* **Python version validation at the SDK level.**
  * As there will be python version validation at the Control Service level, such validation will not be added at the
    vdk SDK level.

## High-level design

---
![high_level_design.png](diagrams/high_level_design.png)

The proposed design will introduce changes to the Control Service API and deployment logic, as well as to the database configuration and vdk SDK. Additionally, it will allow data engineers to specify what python version their job needs to be deployed with as part of the job's config.ini file.

Once set, the python version will be passed from the config.ini to the Control Service through the vdk SDK, or it could be passed directly as part of the deployment request body in case the Control Service API is called directly. If no python version is passed to the Control Service, a predefined default version will be used.
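
As an illustration, setting the python version in a data job's config.ini could look as follows. The `python_version`
option name matches the one used in the diagrams of this VEP, and the other keys are standard data job configuration;
the exact option name and validation rules are part of the detailed design:

```ini
[owner]
team = my-team

[job]
schedule_cron = 0 4 * * *
; Hypothetical new option proposed by this VEP: build and deploy
; this job with python 3.10 instead of the service-wide default.
python_version = 3.10
```

When the job is deployed with `vdk deploy`, the value would be read from config.ini and sent to the Control Service as
part of the deployment request; when it is omitted, the predefined default version applies.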

## API design

<!--
Describe the changes and additions to the public API (if there are any).
For all API changes:
Include Swagger URL for HTTP APIs, no matter if the API is RESTful or RPC-like.
PyDoc/Javadoc (or similar) for Python/Java changes.
Explain how does the system handle API violations.
-->
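
The exact API changes are still to be specified here. One possible shape, shown purely as a sketch, is to extend the
existing data job deployment request with an optional python version field; the endpoint path and existing fields below
are illustrative only, and `python_version` is the proposed new field:

```
POST /data-jobs/for-team/{team-name}/jobs/{job-name}/deployments
Content-Type: application/json

{
  "job_version": "{job-source-git-sha}",
  "enabled": true,
  "python_version": "3.10"
}
```

A request for a python version that the Control Service deployment does not support is expected to fail the basic
validation described in [Requirements and goals](#requirements-and-goals).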


## Detailed design
<!--
Dig deeper into each component. The section can be as long or as short as necessary.
Consider at least the below topics but you do not need to cover those that are not applicable.
### Capacity Estimation and Constraints
* Cost of data path: CPU cost per-IO, memory footprint, network footprint.
* Cost of control plane including cost of APIs, expected timeliness from layers above.
### Availability.
* For example - is it tolerant to failures, What happens when the service stops working
### Performance.
* Consider performance of data operations for different types of workloads.
Consider performance of control operations
* Consider performance under steady state as well under various pathological scenarios,
e.g., different failure cases, partitioning, recovery.
* Performance scalability along different dimensions,
e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc.
### Database data model changes
### Telemetry and monitoring changes (new metrics).
### Configuration changes.
### Upgrade / Downgrade Strategy (especially if it might be breaking change).
* Data migration plan (it needs to be automated or avoided - we should not require user manual actions.)
### Troubleshooting
* What are possible failure modes.
* Detection: How can it be detected via metrics?
* Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
* Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
* Testing: Are there any tests for failure mode? If not, describe why._
### Operability
* What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system.
* What are the expected SLOs (Service Level Objectives).
### Test Plan
* Unit tests are expected. But are end to end test necessary. Do we need to extend vdk-heartbeat ?
* Are there changes in CICD necessary
### Dependencies
* On what services the feature depends on ? Are there new (external) dependencies added?
### Security and Permissions
How is access control handled?
* Is encryption in transport supported and how is it implemented?
* What data is sensitive within these components? How is this data secured?
* In-transit?
* At rest?
* Is it logged?
* What secrets are needed by the components? How are these secrets secured and attained?
-->


## Implementation stories
<!--
Optionally, describe what are the implementation stories (eventually we'd create github issues out of them).
-->

## Alternatives
<!--
Optionally, describe what alternatives has been considered.
Keep it short - if needed link to more detailed research document.
-->
(Binary image file added in this commit; not displayed.)

@startuml
!include <awslib/AWSSimplified>
!include <awslib/General/User>
!include <awslib/Containers/ElasticContainerRegistry>
!include <cloudinsight/file>

caption Figure 1: High-level Design

User(engineer, "Data\n<b>Engineer", " ")
ElasticContainerRegistry(ecr, "Image\n<b>Registry", " ")


rectangle "K8s Cluster" {
component "Control Service" as cs
rectangle " Data Jobs\nBuilders/Deployments\n Namespace" as djn
cs - djn
}

rectangle "<$file>\nData Job" as data_job


engineer -- data_job : Set <b>python_version</b> in config.ini
data_job -- cs : Deploy data job

cs -- ecr : Pull data job base and \nvdk images based on \nprovided <b>python_version</b>

@enduml
(Binary image file added in this commit; not displayed.)

@startuml
actor DataEngineer as engineer
participant ControlService as cs
participant K8sNamespace_DataJobBuilders as namespace
database Database as db
database Registry as ecr

engineer -> cs : vdk create
cs -> db : Register new data job
cs -> engineer : Download sample data job
engineer -> engineer : Set <b>python_version</b> in the data job's config.ini
engineer -> cs : vdk deploy
cs -> cs : Read Deployment data
cs -> db : Update data job configuration data
cs -> ecr : Pull data job base image
cs -> ecr : Pull vdk image
cs -> namespace : Start data job builder
@enduml
