specs: Introduce VEP-1739
This change introduces VEP-1739, which proposes an improvement to Versatile
Data Kit: adding support for using different python versions for data job
deployments.

Testing Done: N/A

Signed-off-by: Andon Andonov <[email protected]>
doks5 committed Mar 14, 2023 (1 parent 5a62a50, commit 3bb4163)
Showing 5 changed files with 230 additions and 0 deletions.
specs/vep-1739-multiple-python-versions/README.md (185 additions, 0 deletions)

# VEP-1739: Multiple Python Versions

* **Author(s):** Andon Andonov ([email protected])
* **Status:** draft

<!-- Provide table of content as it's helpful. -->

- [Summary](#summary)
- [Glossary](#glossary)
- [Motivation](#motivation)
- [Requirements and goals](#requirements-and-goals)
- [High-level design](#high-level-design)
- [API Design](#api-design)
- [Detailed design](#detailed-design)
- [Implementation stories](#implementation-stories)
- [Alternatives](#alternatives)

## Summary

---
Currently, when a data job is deployed, the Control Service uses vdk and data job base images that are configured once
and applied to all job deployments. If a data engineer decides they want to use a different python version for their
job, they need to ask their infrastructure administrator, or whoever is responsible for the Control Service deployment,
to change the configuration of the service and re-deploy it so that data jobs with a different python version can be
deployed. This, however, would break existing data jobs: the moment they are re-deployed, the new python version would
be applied to them, with unforeseeable consequences.

We want to allow users to deploy data jobs with different python versions without needing to re-deploy the Control
Service. To do this, we will extend the Control Service logic and API to support multiple python versions for data job
deployments.

## Glossary

---
* VDK: https://github.com/vmware/versatile-data-kit/wiki/dictionary#vdk
* Control Service: https://github.com/vmware/versatile-data-kit/wiki/dictionary#control-service
* Data Job: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job
* Data Job Deployment: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job-deployment
* Kubernetes: https://kubernetes.io/

## Motivation

---
As mentioned in the [Summary](#summary) section above, the vdk and data job base images are set per Control Service
deployment. This is not an issue in general, as it is assumed that the Versatile Data Kit administrators, responsible
for the Control Service deployment, have taken into account the data engineers' tech stack.

There are, however, situations when this might not be the case. For example, if the administrators of a Versatile Data
Kit deployment decide to keep an older python version (say 3.8) for all data job deployments, but a data engineer
working on a special use case needs a dependency that does not support anything below python 3.10, they would not be
able to deploy their data job, because the job would be built with python 3.8. To accommodate the special job, the
administrators would need to re-configure and re-deploy the Control Service. Although this may not be a big issue in
itself (setting aside the hassle of redeploying the whole Control Service for just one special job), it would break all
jobs whose dependencies rely on python versions older than 3.10, because once the service is set to 3.10, it will
deploy all data jobs with it.

In such cases, there are two main approaches that could be taken:
1) Look for a different package -- this works in most cases, as there are often multiple packages that solve the same
problem and are built for different python versions. However, depending on how specialized the problem at hand is,
there might be no alternative to a given package, or it might be necessary to use an older version of the package,
which could expose the data job to vulnerabilities patched in newer package releases.
2) Deploy a separate Control Service instance -- with this solution, a new instance of the Control Service would need
to be deployed and configured to use the newer python version. In addition, the vdk SDK would also need to be
reconfigured to point to the new Control Service instance, which may cause confusion among engineers who are not aware
that there are multiple Control Services and SDKs with different configurations. This may be acceptable if the
specialized data job has high priority, but having a completely separate Control Service instance for a single job
deployment is unreasonable.

To avoid situations where old or unsafe dependencies are used, and to avoid the need for separate Control Service
deployments, changes will be made to the Control Service API and deployment logic to allow a different python version
to be used per data job deployment. Additionally, minor changes will be made to the vdk-control-cli plugin to
facilitate the selection of a python version at job deployment.

## Requirements and goals

---
### Goals
* **Change the API to accommodate passing the python version to be used in data job deployments**
  * A data engineer developing a data job wants to use a specific python version for their job deployment. They need to
    be able to specify which python version they want to use in the config.ini file of the job, or as part of the body
    of the job deployment request in case they use the Control Service API directly and not through the vdk SDK.
* **Introduce a mechanism to configure which python versions are supported by the Control Service**
  * An administrator needs to be able to configure which python versions are supported by a Control Service deployment,
    and which vdk and job base images correspond to a certain python version (a configuration sketch follows this
    list).
* **Save the python version configuration in the Control Service's database.**
  * The python version configuration related to a specific data job deployment needs to be stored in the database
    alongside the rest of the data job's deployment configuration.
* **Add the python version used for a job's deployment to the job's cronjob spec.**
  * The python version used for a data job's deployment needs to be added as an annotation to the job's cronjob spec.
* **Update the vdk-control-cli plugin to allow it to read the python version from config.ini**
  * When a data engineer creates a data job and updates the job's config.ini file to set a specific python version to
    be used when the job is deployed, the python version needs to be read from the config file and passed to the
    Control Service.
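
As an illustration of the second goal, the configuration could map each supported python version to the pair of images
the Control Service should use. This is only a sketch; all names below (`supportedPythonVersions`, `baseImage`,
`vdkImage`, `defaultPythonVersion`, the registry URL) are assumptions, not the final design:

```yaml
# Hypothetical Control Service configuration: map each supported
# python version to the data job base image and vdk image to use.
supportedPythonVersions:
  "3.8":
    baseImage: "registry.example.com/data-job-base:3.8"
    vdkImage: "registry.example.com/vdk:3.8"
  "3.10":
    baseImage: "registry.example.com/data-job-base:3.10"
    vdkImage: "registry.example.com/vdk:3.10"
# Version applied when a deployment does not request one explicitly.
defaultPythonVersion: "3.8"
```
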
### Non-Goals
* **Extensive error handling.**
  * Some basic error handling will be added to avoid common issues with mismatched python versions, etc. However, at
    this stage it is not possible to foresee all corner cases that may arise, so extensive error handling will not be
    added as part of this initiative.
* **Python version validation at the SDK level.**
  * As there will be python version validation at the Control Service level, such validation will not be added at the
    vdk SDK level.

## High-level design

---
![high_level_design.png](diagrams/high_level_design.png)

The proposed design will introduce changes to the Control Service API and deployment logic, as well as to the database configuration and vdk SDK. Additionally, it will allow data engineers to specify what python version their job needs to be deployed with as part of the job's config.ini file.

Once set, the python version will be passed from the config.ini to the Control Service through the vdk SDK, or it could be passed directly as part of the deployment request body in case the Control Service API is called directly. If no python version is passed to the Control Service, a predefined default version will be used.
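
As an illustration, setting the python version in a data job's config.ini could look as follows. The `python_version`
option name matches the one used in the diagrams of this VEP, and the other keys are standard data job configuration;
the exact option name and validation rules are part of the detailed design:

```ini
[owner]
team = my-team

[job]
schedule_cron = 0 4 * * *
; Hypothetical new option proposed by this VEP: build and deploy
; this job with python 3.10 instead of the service-wide default.
python_version = 3.10
```

When the job is deployed with `vdk deploy`, the value would be read from config.ini and sent to the Control Service as
part of the deployment request; when it is omitted, the predefined default version applies.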

## API design

<!--
Describe the changes and additions to the public API (if there are any).
For all API changes:
Include Swagger URL for HTTP APIs, no matter if the API is RESTful or RPC-like.
PyDoc/Javadoc (or similar) for Python/Java changes.
Explain how does the system handle API violations.
-->
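
The exact API changes are still to be specified here. One possible shape, shown purely as a sketch, is to extend the
existing data job deployment request with an optional python version field; the endpoint path and existing fields below
are illustrative only, and `python_version` is the proposed new field:

```
POST /data-jobs/for-team/{team-name}/jobs/{job-name}/deployments
Content-Type: application/json

{
  "job_version": "{job-source-git-sha}",
  "enabled": true,
  "python_version": "3.10"
}
```

A request for a python version that the Control Service deployment does not support is expected to fail the basic
validation described in [Requirements and goals](#requirements-and-goals).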


## Detailed design
<!--
Dig deeper into each component. The section can be as long or as short as necessary.
Consider at least the below topics but you do not need to cover those that are not applicable.
### Capacity Estimation and Constraints
* Cost of data path: CPU cost per-IO, memory footprint, network footprint.
* Cost of control plane including cost of APIs, expected timeliness from layers above.
### Availability.
* For example - is it tolerant to failures, What happens when the service stops working
### Performance.
* Consider performance of data operations for different types of workloads.
Consider performance of control operations
* Consider performance under steady state as well under various pathological scenarios,
e.g., different failure cases, partitioning, recovery.
* Performance scalability along different dimensions,
e.g. #objects, network properties (latency, bandwidth), number of data jobs, processed/ingested data, etc.
### Database data model changes
### Telemetry and monitoring changes (new metrics).
### Configuration changes.
### Upgrade / Downgrade Strategy (especially if it might be breaking change).
* Data migration plan (it needs to be automated or avoided - we should not require user manual actions.)
### Troubleshooting
* What are possible failure modes.
* Detection: How can it be detected via metrics?
* Mitigations: What can be done to stop the bleeding, especially for already
running user workloads?
* Diagnostics: What are the useful log messages and their required logging
levels that could help debug the issue?
* Testing: Are there any tests for failure mode? If not, describe why._
### Operability
* What are the SLIs (Service Level Indicators) an operator can use to determine the health of the system.
* What are the expected SLOs (Service Level Objectives).
### Test Plan
* Unit tests are expected. But are end to end test necessary. Do we need to extend vdk-heartbeat ?
* Are there changes in CICD necessary
### Dependencies
* On what services the feature depends on ? Are there new (external) dependencies added?
### Security and Permissions
How is access control handled?
* Is encryption in transport supported and how is it implemented?
* What data is sensitive within these components? How is this data secured?
* In-transit?
* At rest?
* Is it logged?
* What secrets are needed by the components? How are these secrets secured and attained?
-->


## Implementation stories
<!--
Optionally, describe what are the implementation stories (eventually we'd create github issues out of them).
-->

## Alternatives
<!--
Optionally, describe what alternatives has been considered.
Keep it short - if needed link to more detailed research document.
-->
(Binary image file added in this commit; not displayed.)

@startuml
!include <awslib/AWSSimplified>
!include <awslib/General/User>
!include <awslib/Containers/ElasticContainerRegistry>
!include <cloudinsight/file>

caption Figure 1: High-level Design

User(engineer, "Data\n<b>Engineer", " ")
ElasticContainerRegistry(ecr, "Image\n<b>Registry", " ")


rectangle "K8s Cluster" {
component "Control Service" as cs
rectangle " Data Jobs\nBuilders/Deployments\n Namespace" as djn
cs - djn
}

rectangle "<$file>\nData Job" as data_job


engineer -- data_job : Set <b>python_version</b> in config.ini
data_job -- cs : Deploy data job

cs -- ecr : Pull data job base and \nvdk images based on \nprovided <b>python_version</b>

@enduml
(Binary image file added in this commit; not displayed.)

@startuml
actor DataEngineer as engineer
participant ControlService as cs
participant K8sNamespace_DataJobBuilders as namespace
database Database as db
database Registry as ecr

engineer -> cs : vdk create
cs -> db : Register new data job
cs -> engineer : Download sample data job
engineer -> engineer : Set <b>python_version</b> in the data job's config.ini
engineer -> cs : vdk deploy
cs -> cs : Read Deployment data
cs -> db : Update data job configuration data
cs -> ecr : Pull data job base image
cs -> ecr : Pull vdk image
cs -> namespace : Start data job builder
@enduml
