vdk-meta-jobs: Undelete the package for potential future fixes (#2106)

This PR brings back the vdk-meta-jobs plugin to the repo after it was deleted once the plugin was renamed to vdk-dag. Additionally, this PR includes a necessary fix stemming from a change in the vdk-control-service-api. Testing done: pipelines Signed-off-by: Gabriel Georgiev <[email protected]>
vmware · May 25, 2023 · 6a4c358 · 6a4c358
1 parent 3bca09a
commit 6a4c358
Show file tree

Hide file tree

Showing 45 changed files with 2,446 additions and 0 deletions.
diff --git a/projects/vdk-plugins/vdk-meta-jobs/.plugin-ci.yml b/projects/vdk-plugins/vdk-meta-jobs/.plugin-ci.yml
@@ -0,0 +1,22 @@
+# Copyright 2021-2023 VMware, Inc.
+# SPDX-License-Identifier: Apache-2.0
+
+image: "python:3.7"
+
+.build-vdk-meta-jobs:
+  variables:
+    PLUGIN_NAME: vdk-meta-jobs
+  extends: .build-plugin
+
+build-py37-vdk-meta-jobs:
+  extends: .build-vdk-meta-jobs
+  image: "python:3.7"
+
+build-py311-vdk-meta-jobs:
+  extends: .build-vdk-meta-jobs
+  image: "python:3.11"
+
+release-vdk-meta-jobs:
+  variables:
+    PLUGIN_NAME: vdk-meta-jobs
+  extends: .release-plugin
diff --git a/projects/vdk-plugins/vdk-meta-jobs/README.md b/projects/vdk-plugins/vdk-meta-jobs/README.md
@@ -0,0 +1,197 @@
+# Meta Jobs
+
+Express dependencies between data jobs.
+
+A plugin for Versatile Data Kit extends its Job API with an additional feature that allows users to trigger so-called Meta Jobs.
+
+A meta job is a regular Data Job that invokes other Data Jobs using Control Service Execution API.
+In this way, there's nothing different from other data jobs except for its purpose. See [Data Job types](https://github.com/vmware/versatile-data-kit/wiki/User-Guide#data-job-types) for more info.
+
+It's meant to be a much more lightweight alternative to complex and comprehensive workflows solution (like Airflow)
+as it doesn't require to provision any new infrastructure or to need to learn new tool.
+You install a new python library (this plugin itself) and you are ready to go.
+
+Using this plugin, you can specify dependencies between data jobs as a direct acyclic graph (DAG).
+See [Usage](#usage) for more information.
+
+## Usage
+
+```
+pip install vdk-meta-jobs
+```
+
+Then one would create a single [step](https://github.com/vmware/versatile-data-kit/wiki/dictionary#data-job-step) and
+define the jobs we want to orchestrate:
+
+```python
+def run(job_input):
+    jobs = [
+        {
+        "job_name": "name-of-job",
+        "team_name": "team-of-job",
+        "fail_meta_job_on_error": True or False,
+        "arguments": {"key1": value1, "key2": value2},
+        "depends_on": [name-of-job1, name-of-job2]
+        },
+        ...
+    ]
+    MetaJobInput().run_meta_job(jobs)
+```
+
+When defining a job to be run following attributes are supported:
+* **job_name**: required, the name of the data job.
+* **team_name:**: optional, the team of the data job. If omitted , it will use the meta job's team.
+* **fail_meta_job_on_error**: optional, default is true. If true, the meta job will abort and fail if the orchestrated job fails, if false, meta job won't fail and continue.
+* **arguments**: optional, the arguments that are passed to the underlying orchestrated data job.
+* **depends_on**: required (can be empty), list of other jobs that the orchestrated job depends on. The job will not be started until depends_on job have finished.
+
+
+### Example
+
+The following example dependency graph can be implemented with below code.
+
+
+![img_2.png](img_2.png)
+
+In this example what happens is:
+* Job 1 will execute.
+* After Job 1 is completed, jobs 2,3,4 will start executing in parallel.
+* Jobs 5 and 6 will start executing after job 3 completes, but will not wait for the completion of jobs 2 and 4.
+
+
+```python
+
+from vdk.api.job_input import IJobInput
+from vdk.plugin.meta_jobs.meta_job_runner import MetaJobInput
+
+JOBS_RUN_ORDER = [
+    {
+        "job_name": "job1",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": []
+    },
+
+    {
+        "job_name": "job2",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": ["job1"]
+    },
+    {
+        "job_name": "job3",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": ["job1"]
+    },
+    {
+        "job_name": "job4",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": ["job1"]
+    },
+
+    {
+        "job_name": "job5",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": ["job3"]
+    },
+    {
+        "job_name": "job6",
+        "team_name": "team-awesome",
+        "fail_meta_job_on_error": True,
+        "arguments": {},
+        "depends_on": ["job3"]
+    },
+]
+
+
+def run(job_input: IJobInput) - > None:
+    MetaJobInput().run_meta_job(JOBS_RUN_ORDER)
+```
+
+
+### Runtime sequencing
+
+The depends_on key stores the dependencies of each job - the jobs that have to finish before it starts.
+The DAG execution starts from the jobs with empty dependency lists - they start together in parallel.
+But what happens if they are too many? It could cause server overload. In order to avoid such unfortunate situations,
+a limit in the number of concurrent running jobs is set. This limit is
+a [configuration variable](https://github.com/vmware/versatile-data-kit/blob/main/projects/vdk-plugins/vdk-meta-jobs/src/vdk/plugin/meta_jobs/meta_configuration.py#L87)
+that you are able to set according to your needs. When the limit is exceeded, the execution of the rest of the jobs
+is not cancelled but delayed until a spot is freed by one of the running jobs. What's important here is that
+although there are delayed jobs due to the limitation, the overall sequence is not broken.
+
+
+### Data Job start comparison
+
+There are 3 types of jobs right now in terms of how are they started.
+
+* Started by Schedule
+   * When the time comes for a scheduled execution of a Data Job, if the one is currently running, it will be waited
+     to finish by retrying a few times. If it is still running then, this scheduled execution will be skipped.
+* Started by the user using UI or CLI
+   * If a user tries to start a job that is already running, one would get an appropriate error immediately and a
+     recommendation to try again later.
+* **Started by a DAG Job**
+   * If a DAG job tries to start a job and there is already running such job, the approach of the DAG job would be
+     similar to the schedule - retry later but more times.
+
+### FAQ
+
+
+**Q: Will the metajob retry on Platform Error?**<br>
+A: Yes, as any other job, up to N (configurable by the Control Service) attempts for each job it is orchestrating.
+   See Control Service documentation for more information
+
+**Q: If an orchestrated job fails, will the meta job fail?**<br>
+Only if fail_meta_job_on_error flag is set to True (which is teh default setting if omited)
+
+The meta job then will fail with USER error (regardless of how the orchestrated job failed)
+
+
+**Q: Am I able to run the metajob locally?**<br>
+A: Yes, but the jobs orchestrated must be deployed to the cloud (by the Control Service).
+
+**Q: Is there memory limit of the meta job?**<br>
+A: The normal per job limits apply for any jobs orchestrated/started by the meta job.
+
+**Q: Is there execution time limit of the meta job?**<br>
+A: Yes, the meta job must finish within the same limit as any normal data job.
+The total time of all data jobs started by the meta job must be less than the limit specified.
+The overall limit is controlled by Control Service administrators
+
+**Q: Is the metajob going to fail and not trigger the remaining jobs if any of the jobs it is orchestrating fails?**<br>
+A: This is configurable by the end user in the parameter fail_meta_job_on_error
+
+**Q: Can I schedule one job to run every hour and use it in the meta job at the same time?**<br>
+A: Yes, if the job is already running, the metajob will wait for the concurrent run to finish and then trigger the job again from the meta job,
+If the job is already running as part of the meta job, the concurrent scheduled run will be skipped
+
+
+### Build and testing
+
+```
+pip install -r requirements.txt
+pip install -e .
+pytest
+```
+
+In VDK repo [../build-plugin.sh](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins/build-plugin.sh) script can be used also.
+
+
+#### Note about the CICD:
+
+.plugin-ci.yaml is needed only for plugins part of [Versatile Data Kit Plugin repo](https://github.com/vmware/versatile-data-kit/tree/main/projects/vdk-plugins).
+
+The CI/CD is separated in two stages, a build stage and a release stage.
+The build stage is made up of a few jobs, all which inherit from the same
+job configuration and only differ in the Python version they use.
+They run according to rules, which are ordered in a way such that changes to a
+plugin's directory trigger the plugin CI, but changes to a different plugin does not.
diff --git a/projects/vdk-plugins/vdk-meta-jobs/img_2.png b/projects/vdk-plugins/vdk-meta-jobs/img_2.png
diff --git a/projects/vdk-plugins/vdk-meta-jobs/requirements.txt b/projects/vdk-plugins/vdk-meta-jobs/requirements.txt
@@ -0,0 +1,12 @@
+# this file is used to provide testing requirements
+# for requirements (dependencies) needed during installation update setup.py install_requires section
+
+
+pytest
+pytest-httpserver
+urllib3
+vdk-control-api-auth
+vdk-control-service-api
+vdk-core
+vdk-plugin-control-cli
+vdk-test-utils
diff --git a/projects/vdk-plugins/vdk-meta-jobs/setup.py b/projects/vdk-plugins/vdk-meta-jobs/setup.py
@@ -0,0 +1,42 @@
+# Copyright 2021-2023 VMware, Inc.
+# SPDX-License-Identifier: Apache-2.0
+import pathlib
+
+import setuptools
+
+"""
+Builds a package with the help of setuptools in order for this package to be imported in other projects
+"""
+
+__version__ = "0.1.0"
+
+setuptools.setup(
+    name="vdk-meta-jobs",
+    version=__version__,
+    url="https://github.com/vmware/versatile-data-kit",
+    description="Express dependecies between data jobs.",
+    long_description=pathlib.Path("README.md").read_text(),
+    long_description_content_type="text/markdown",
+    install_requires=[
+        "vdk-core",
+        "graphlib-backport",
+        "vdk-control-api-auth",
+        "vdk-plugin-control-cli",
+        "vdk-control-service-api",
+        "urllib3",
+    ],
+    package_dir={"": "src"},
+    packages=setuptools.find_namespace_packages(where="src"),
+    # This is the only vdk plugin specifc part
+    # Define entry point called "vdk.plugin.run" with name of plugin and module to act as entry point.
+    entry_points={"vdk.plugin.run": ["meta-jobs = vdk.plugin.meta_jobs.dags_plugin"]},
+    classifiers=[
+        "Development Status :: 2 - Pre-Alpha",
+        "License :: OSI Approved :: Apache Software License",
+        "Programming Language :: Python :: 3.7",
+        "Programming Language :: Python :: 3.8",
+        "Programming Language :: Python :: 3.9",
+        "Programming Language :: Python :: 3.10",
+        "Programming Language :: Python :: 3.11",
+    ],
+)
diff --git a/projects/vdk-plugins/vdk-meta-jobs/src/vdk/plugin/meta_jobs/api/meta_job.py b/projects/vdk-plugins/vdk-meta-jobs/src/vdk/plugin/meta_jobs/api/meta_job.py
@@ -0,0 +1,53 @@
+# Copyright 2021-2023 VMware, Inc.
+# SPDX-License-Identifier: Apache-2.0
+import abc
+from abc import abstractmethod
+from dataclasses import dataclass
+from dataclasses import field
+from typing import List
+
+
+@dataclass
+class SingleJob:
+    """
+    This class represents a single job to be executed.
+
+    :param job_name: the name of the job.
+    :param team_name: the name of the team that owns the job.
+    :param fail_dag_on_error: boolean flag indicating whether the job should be executed.
+    :param arguments: JSON-serializable dictionary of arguments to be passed to the job.
+    :param depends_on: list of names of jobs that this job depends on.
+    """
+
+    job_name: str
+    team_name: str = None
+    fail_meta_job_on_error: bool = True
+    arguments: dict = None
+    depends_on: List[str] = field(default_factory=list)
+
+
+@dataclass
+class MetaJob(SingleJob):
+    """
+    This class represents a DAG Job, which is a single job itself and consists of single jobs - the orchestrated ones.
+
+    :param jobs: list of the orchestrated jobs
+    """
+
+    jobs: List[SingleJob] = field(default_factory=list)
+
+
+class IMetaJobInput(abc.ABC):
+    """
+    This class is responsible for the DAG job run.
+    """
+
+    @abstractmethod
+    def run_meta_job(self, meta_job: MetaJob):
+        """
+        Runs the given DAG job.
+
+        :param dag: the DAG job to be run
+        :return:
+        """
+        pass