specs: Introduce VEP-1739

This change introduces VDP-1739, which aims at proposing an improvement to Versatile Data Kit by adding support for using different python versions for data job deployments. Testing Done: N/A Signed-off-by: Andon Andonov <[email protected]>
vmware · Mar 15, 2023 · da75797 · da75797
1 parent 553ed23
commit da75797
Show file tree

Hide file tree

Showing 5 changed files with 287 additions and 0 deletions.
diff --git a/specs/vep-1739-multiple-python-versions/README.md b/specs/vep-1739-multiple-python-versions/README.md
@@ -0,0 +1,239 @@
+
+# VEP-1739: Multiple Python Versions
+
+* **Author(s):** Andon Andonov ([email protected])
+* **Status:** draft
+
+<!-- Provide table of content as it's helpful. -->
+
+- [Summary](#summary)
+- [Glossary](#glossary)
+- [Motivation](#motivation)
+- [Requirements and goals](#requirements-and-goals)
+- [High-level design](#high-level-design)
+- [API Design](#api-design)
+- [Detailed design](#detailed-design)
+- [Alternatives](#alternatives)
+
+## Summary
+
+---
+Currently, when a data job is deployed, the Control Service uses vdk and data job base images set once and applied to
+all job deployments. If a user decides they want to use a different python version for their job, they need to
+ask their infrastructure administrator, or whoever is responsible for the Control Service deployment, to change the
+configuration of the service and re-deploy it to allow for data jobs with a different python version to be deployed.
+This, however, would break existing deployed data jobs, as the moment they are re-deployed, the new python version would
+be applied with unforeseeable consequences.
+
+We want to allow users to deploy data jobs with different python versions without needing to re-deploy the Control
+Service. To do this, we will extend the Control Service logic and API to support multiple python versions for data job
+deployments.
+
+## Glossary
+
+---
+* VDK: https://github.com/vmware/versatile-data-kit/wiki/dictionary#vdk
+* Control Service: https://github.com/vmware/versatile-data-kit/wiki/dictionary#control-service
+* Data Job: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job
+* Data Job Deployment: https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job-deployment
+* Kubernetes: https://kubernetes.io/
+* User: a person who uses VDK to develop and execute data jobs. That person can be a data engineer, data scientist, or a developer.
+* Administrator (Admin): a person who sets up and maintains the VDK Control Service and VDK runtime environments.
+
+## Motivation
+
+---
+When a data job is deployed, the Control Service uses vdk and data job base images set once and applied to
+all job deployments. This is not an issue in general, as it is assumed that the Versatile Data Kit administrators, responsible
+for the Control Service deployment, have taken into account the users' tech stack.
+
+There are, however, situations when this might not be the case. For example, if the administrators of a Versatile Data
+Kit deployment decide to keep an older python version (say 3.8) for all data job deployments, but a user working
+on a special use case needs to use a dependency that does not support anything below python 3.10, they would not be able
+to deploy a data job, because the job would be build with python 3.8. To accommodate the special job, the administrators
+would need to re-configure and re-deploy the Control Service. Although this may not be a big issue (not taking into
+account the hassle of redeploying the whole Control Service for just one special job), it would break all jobs whose
+dependencies rely on python versions older than 3.10, as once set to 3.10, the Control Service will deploy all data jobs
+with it.
+
+In such cases, there are two main approaches that could be taken:
+1) Look for a different package -- this works in most cases, as there are often multiple packages that solve the same
+problems and are build for different python versions. However, depending on how specialized the problem at hand is, there
+might be no alternatives to a package, or there might be necessary to use an older version of the package which could
+expose the data job to vulnerabilities patched in the newer package releases.
+2) Deploy a separate Control Service instance -- with this solution, a new instance of the Control Service would need to be
+deployed and configured to use the newer python version. In addition, the vdk SDK would also need to be reconfigured to
+point to the new Control Service instance, which may cause more confusion among other users who may not be aware that
+there are multiple Control Services and SDKs with different configurations. This may be acceptable if the specialized
+data job is with high priority, but having a completely separate Control Service instance for a single job deployment is
+unreasonable.
+
+To avoid situations, where old or unsafe dependencies are used, or to avoid the necessity to deploy separate Control
+Service deployments, changes will be made to the Control Service API and deployment logic to allow for different python
+versions to be used per data job deployment. Additionally, minor changes will be made to the vdk-control-cli plugin to
+facilitate the selection of python version at job deployment.
+
+## Requirements and goals
+
+---
+### Goals
+* **Change API to accommodate passing the python version to be used in data job deployments**
+  * A user developing a data job wants to use specific python version for their job deployment. They need to be
+  able to specify what python version they want to use in the config.ini file (the file containing the job's configuration like job owner, cron schedule, etc.) of the job, or as part of the body of the
+  job deployment request in case they use the Control Service API directly, and not through the vdk SDK.
+* **Backwards compatibility**
+  * After implementing this proposal, backwards compatibility needs to be maintained, as to prevent issues with
+  existing Control Service and data job deployments.
+* **Introduce mechanism to configure what python versions are supported by the Control Service**
+  * An administrator needs to be able to configure what python versions are supported by a Control Service deployment,
+  and what vdk and job base images correspond to a certain python version.
+* **Save python version configuration in the Control Service's database.**
+  * The python version configuration related to a specific data job deployment needs to be stored in the database
+  alongside the rest of the data job's deployment configuration
+* **Add python version used for a job's deployment to the job's cronjob spec.**
+  * The python version used for a data job's deployment needs to be added as annotation to the job's cronjob spec.
+* **Update vdk-control-cli plugin to allow it to read the python version from config.ini**
+  * When a user creates a data job and updates the job's config.ini file to set a specific python version to be
+  used when the job is deployed, the python version needs to be read from the config file and passed to the Control
+  Service.
+### Non-Goals
+* **Python version validation at SDK level.**
+  * As there will be python version validation at the Control Service level, such validation will not be added at the
+  vdk SDK level.
+
+## High-level design
+
+---
+Thi diagram below shows a high-level design view of the changes that would be proposed
+in this VEP. The user would create a data job and set a **python_version** property in the job's _config.ini_ file. When the job is deployed, the python_version would be passed to the Control Service, which on its end would verify that the version is supported and would pull the vdk and job base images corresponding to the python version from the registry. The supported python versions would be configured by the administrator. In the final step, the pulled image links would be passed to the data job builder.
+
+![high_level_design.png](diagrams/high_level_design.png)
+
+The proposed design will introduce changes to the Control Service API and deployment logic, as well as to the database configuration and vdk SDK. Additionally, it will allow users to specify what python version their job needs to be deployed with, as part of the job's config.ini file.
+
+Once set, the python version will be passed from the config.ini to the Control Service through the vdk SDK, or it could be passed directly as part of the deployment request body in case the Control Service API is called directly. If no python version is passed to the Control Service, a predefined default version will be used.
+
+## API design
+
+---
+The public REST API of the Control Service would be changed to support a user-specified python version to be used for
+the data job deployment. This would be facilitated through the addition of an optional **python_version** property to the
+[DataJobDeployment](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/model/apidefs/datajob-api/api.yaml#L1068)
+resource schema.
+```yaml
+python_version:
+  description: Python version number to used for data job deployment
+  type: string
+  example: 3.11
+```
+The property would be optional to avoid enforcing python version selection. If the user decides to not set the property,
+a default value would be used.
+
+
+## Detailed design
+
+---
+As mentioned in the [High Level Design](#high-level-design) section above, changes to multiple components would be needed
+to allow users to deploy data jobs with different python versions. In this section we explain the design decisions made in
+greater detail. For the public API design changes that would be needed, see the [API design](#api-design) section above.
+
+### Supported Python Versions Configuration
+Currently, the only way to specify what job base image is to be used by data job builders is through the
+[datajobs.deployment.dataJobBaseImage](the only way to specify what job base image is to be used by data job builders is through the datajobs.deployment.dataJobBaseImage application property) application property. The vdk image to be used in job deployments, on the other hand,
+can be specified through a _DOCKER_VDK_BASE_IMAGE_ [environment variable](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/pipelines_control_service/src/main/resources/application-prod.properties#L40).
+This is not optimal, as the only way to change the vdk and base images is to change the configuration and re-deploy the whole Control Service.
+As mentioned in the previous sections, we want to make vdk and job base images configurable per data job deployment, not Control Service one.
+To facilitate this, we need a configuration what python versions and respective vdk and job base images are supported by the Control Service.
+
+To configure these, a few options were explored:
+
+#### Option 1: Introduce three environment variables, SUPPORTED_PYTHON_VERSIONS, JOB_BASE_IMAGE_TEMPLATE, and JOB_VDK_IMAGE_TEMPLATE.
+These can be set in the helm chart of the Control Service and in the application properties files.
+
+The env vars would be of type string, and would be used to construct the job base images passed to the job builders and the vdk images that would be used by the deployed data jobs. The _SUPPORTED_PYTHON_VERSIONS_ var would be a string of comma-separated python version numbers like **"3.7,3.8,3.9"**, the _JOB_BASE_IMAGE_TEMPLATE_ would be a string of the format "**registry.hub.docker.com/versatiledatakit/data-job-base-python-**", and the _JOB_VDK_IMAGE_TEMPLATE_ would be set as "**registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-**".
+
+After the variables are loaded, the Control Service can build the base and vdk images name. When a **python_version** is passed to the body of the request for job deployment, the Control Service would check if the version is supported (validate if the value is present in the _SUPPORTED_PYTHON_VERSIONS_), and if it is, it would load the _JOB_BASE_IMAGE_TEMPLATE_ and _JOB_VDK_IMAGE_TEMPLATE_ value and append the python_version alongside some predefined suffix.
+
+For example, given the following env var values
+```shell
+SUPPORTED_PYTHON_VERSIONS="3.7,3.8,3.9"
+JOB_BASE_IMAGE_TEMPLATE="registry.hub.docker.com/versatiledatakit/data-job-base-python-"
+JOB_VDK_IMAGE_TEMPLATE="registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-"
+```
+and **python_version** passed as "3.8", the final base image name could be something like
+```shell
+registry.hub.docker.com/versatiledatakit/data-job-base-python-3.8
+# or
+registry.hub.docker.com/versatiledatakit/data-job-base-python-3.8-latest
+```
+and the final vdk image name could be something like
+```shell
+registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.8
+# or
+registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.8-supercollider
+```
+**NOTE:** In this case, it would be very important how the base and vdk images are named, as to avoid situations where the pre-built base and vdk images in the registry have different names from the ones composed by the Control Service based on user input. Also, in the second case, the "-latest" and "-supercollider" suffixes would need to be part of the Control Service code, as users should not need add them as part of the **python_version**.
+
+In the helm chart, the configuration can easily be set in the values.yaml file as
+```yaml
+dataJobDeploymentSupportedPythonVersions: "3.7,3.8,3.9"
+dataJobDeploymentBaseImageTemplate: "registry.hub.docker.com/versatiledatakit/data-job-base-python-"
+dataJobDeploymentVdkImageTemplate: "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-"
+```
+and then in the _templates/deployment.yaml_ they can be injected as proper environment variables like so:
+```yaml
+- name: SUPPORTED_PYTHON_VERSIONS
+  value: "{{ .Values.dataJobDeploymentSupportedPythonVersions }}"
+- name: BASE_IMAGE_TEMPLATE
+  value: "{{ .Values.dataJobDeploymentBaseImageTemplate }}"
+- name: VDK_IMAGE_TEMPLATE
+  value: "{{ .Values.dataJobDeploymentVdkImageTemplate }}"
+```
+In the _application.properties_ files, the above could be consumed as
+```shell
+datajobs.deployment.supportedPythonVersions=${SUPPORTED_PYTHON_VERSIONS:3.7}
+datajobs.deployment.baseImageTemplate=${BASE_IMAGE_TEMPLATE:example.registry.com/somerepo/data-job-base-python-}
+datajobs.deployment.vdkImageTemplate=${VDK_IMAGE_TEMPLATE:example.registry.com/somerepo/data-job-vdk-python-}
+```
+
+#### Option 2: Use dictionaries to map supported python versions to corresponding base and vdk images.
+This option would utilize nested yaml dictionaries for configuration to map the supported version to a base and vdk images.
+
+In the helm chart values file, this configuration would be set like:
+```yaml
+supportedPythonVersions:
+   3.7:
+      baseImage: "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.7-latest"
+      vdkImage: "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.7"
+   3.8:
+      baseImage: "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.8-latest"
+      vdkImage: "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.8"
+   3.9:
+      baseImage: "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.9-latest"
+      vdkImage: "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.9"
+```
+And in the application properties files, these could be set as a single property as:
+```shell
+datajobs.deployment.supportedPythonVersions={'3.7':{'baseImage': 'registry.hub.docker.com/versatiledatakit/data-job-base-python-3.7-latest', 'vdkImage': 'registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.7'}, '3.8':{'baseImage': 'registry.hub.docker.com/versatiledatakit/data-job-base-python-3.8-latest', 'vdkImage': 'registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.8'}, '3.9':{'baseImage': 'registry.hub.docker.com/versatiledatakit/data-job-base-python-3.9-latest', 'vdkImage': 'registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.9'}}
+```
+**NOTE:** As the supported versions would be dynamically set, the application properties files would need to be manually updated accordingly.
+
+#### Option 3: Configure the supported python versions in the vdk_options.ini file.
+This file is already being loaded in the Control Service through the [helm chart](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/helm_charts/pipelines-control-service/templates/deployment.yaml#L171) and the [application properties](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/pipelines_control_service/src/main/resources/application-prod.properties#L62) files, and is used to store some general configurations like resource limits, etc. In the options file, the supported versions would be stored like:
+```ini
+[supported_python_versions_base_images]
+PYTHON_3.7 = "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.7-latest"
+PYTHON_3.8 = "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.8-latest"
+PYTHON_3.9 = "registry.hub.docker.com/versatiledatakit/data-job-base-python-3.9-latest"
+
+[supported_python_versions_vdk_images]
+PYTHON_3.7 = "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.7"
+PYTHON_3.8 = "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.8"
+PYTHON_3.9 = "registry.hub.docker.com/versatiledatakit/quickstart-vdk:release-3.9"
+```
+These could then be read by the [VdkOptionsReader](https://github.com/vmware/versatile-data-kit/blob/main/projects/control-service/projects/pipelines_control_service/src/main/java/com/vmware/taurus/service/deploy/VdkOptionsReader.java#L29) and propagated the where necessary.
+In options 2 and 3, custom logic would need to be implemented to remove the prefixes of the different python versions and keep only the version numbers (3.7, 3.8, 3.9) to compare them to the python_version passed by users.
+
+#### Chosen solution
+At the end, **Option 2** was selected, as it would be relatively straightforward to implement in the helm chart of the Control Service, which
+would make it easily configurable by administrators.
diff --git a/specs/vep-1739-multiple-python-versions/diagrams/high_level_design.png b/specs/vep-1739-multiple-python-versions/diagrams/high_level_design.png
diff --git a/specs/vep-1739-multiple-python-versions/diagrams/high_level_design_plantUML_source.txt b/specs/vep-1739-multiple-python-versions/diagrams/high_level_design_plantUML_source.txt
@@ -0,0 +1,27 @@
+@startuml
+!include <awslib/AWSSimplified>
+!include <awslib/General/User>
+!include <awslib/Containers/ElasticContainerRegistry>
+!include <cloudinsight/file>
+
+User(engineer, "User", " ")
+User(admin, "Administrator", " ")
+ElasticContainerRegistry(ecr, "Image\n<b>Registry", " ")
+
+
+rectangle "K8s Cluster" {
+   component "Control Service" as cs
+   rectangle "          Data Jobs\nBuilders/Deployments\n        Namespace" as djn
+   cs - djn
+}
+
+rectangle "<$file>\nData Job" as data_job
+
+
+engineer -- data_job : Set <b>python_version</b> in config.ini
+data_job -- cs : Deploy data job
+admin -- cs : Configure supported\npython versions
+
+cs -- ecr : Pull data job base and \nvdk images based on \nprovided <b>python_version</b>
+
+@enduml
diff --git a/specs/vep-1739-multiple-python-versions/diagrams/high_level_sequence.png b/specs/vep-1739-multiple-python-versions/diagrams/high_level_sequence.png
diff --git a/specs/vep-1739-multiple-python-versions/diagrams/high_level_sequence_plantUML_source.txt b/specs/vep-1739-multiple-python-versions/diagrams/high_level_sequence_plantUML_source.txt
@@ -0,0 +1,21 @@
+@startuml
+actor User as engineer
+participant ControlService as cs
+participant K8sNamespace_DataJobBuilders as namespace
+database Database as db
+database Registry as ecr
+
+engineer -> cs : vdk create
+cs -> db : Register new data job
+engineer -> engineer : Set <b>python_version</b> in the data job's config.ini
+engineer -> cs : vdk deploy
+cs -> cs : Read Deployment data
+cs -> db : Update data job configuration data
+
+cs -> ecr : Request data job base image
+activate ecr
+cs -> ecr : Request vdk image
+return Return images
+
+cs -> namespace : Start data job builder
+@enduml