VEP-554: Apache Airflow Integration #748
Conversation
Looks good so far. Let's make sure we finalize the Motivation and goals section as part of this PR. I wrote comments for other sections, but they can be addressed in subsequent PRs.
The idea is for the motivation to answer why we are doing this, while the goals state what we are doing.
It's very similar to a KEP. These are good examples:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1898-hardened-exec#motivation or https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1539-hugepages#motivation
I also tried to make an example with https://github.com/vmware/versatile-data-kit/pull/730/files
<!--
* Capacity Estimation and Constraints
  * Cost of data path: CPU cost per-IO, memory footprint, network footprint.
  * Cost of control plane including cost of APIs, expected timeliness from layers
Since the operators will put more load on the Control Service, we need to answer the capacity and availability questions here.
What would be the cost to the control plane (API requests) if we have N jobs and K workflows running on some average schedule S?
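A back-of-envelope estimate along those lines could look like the sketch below. Every per-task call count and interval here (one trigger call per job, a 30-second sensor poll, a 10-minute average job duration) is a hypothetical assumption for illustration, not a measured or documented value.

```python
# Rough estimate of Control Service API load from the Airflow integration.
# All constants below are ASSUMPTIONS, not measured VDK behavior.

def control_plane_requests_per_hour(
    n_jobs: int,                    # N: data jobs per workflow run
    k_workflows: int,               # K: scheduled workflows
    runs_per_hour: float,           # S: average runs per workflow per hour
    calls_per_trigger: int = 1,     # assumed: one "start execution" call per job
    poll_interval_s: int = 30,      # assumed sensor polling interval
    avg_job_duration_s: int = 600,  # assumed average job run time
) -> float:
    # Each job costs one trigger call plus one status poll per interval
    # for as long as the job runs.
    polls_per_job = avg_job_duration_s / poll_interval_s
    requests_per_run = n_jobs * (calls_per_trigger + polls_per_job)
    return k_workflows * runs_per_hour * requests_per_run

# e.g. 10 jobs per workflow, 5 workflows, each running twice an hour:
print(control_plane_requests_per_hour(10, 5, 2))  # 2100.0
```

Even modest numbers produce thousands of requests per hour, which is why the polling interval would likely dominate the capacity answer.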
) as dag:
    # [START sync_data_job]
    sync_data_job = VDKOperator(
        job_name='sync_data_job',
    )
    # [END sync_data_job]

    # [START async_data_job]
    start_async_data_job = VDKOperator(
        job_name='async_data_job',
        asynchronous=True,
    )

    async_data_job = VDKSensorOperator(
        job_name='async_data_job',
        job_execution_id=start_async_data_job.output,
    )
    # [END async_data_job]

    sync_data_job >> async_data_job
I imagined it a bit simpler.
Why can't we just do `sync_data_job >> start_async_data_job`?
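The chain the review suggests can be illustrated with a minimal stand-in for Airflow's `>>` operator. These are not the real Airflow or VDK classes; this is only a sketch of the ordering semantics (sync job, then async trigger, then the sensor waiting on it).

```python
# Minimal stand-in for Airflow task chaining via ">>".
# NOT real Airflow/VDK classes; just illustrates the dependency order
# suggested in the review: sync job -> async trigger -> sensor.

class Task:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other: "Task") -> "Task":
        # "a >> b" makes b run after a, as in Airflow
        self.downstream.append(other)
        return other  # returning the right operand allows a >> b >> c

sync_data_job = Task("sync_data_job")
start_async_data_job = Task("start_async_data_job")
wait_async_data_job = Task("wait_async_data_job")

# the simpler chain proposed in the review:
sync_data_job >> start_async_data_job >> wait_async_data_job

print([t.task_id for t in sync_data_job.downstream])  # ['start_async_data_job']
```

With this ordering the sensor still waits on the async execution id, but the sync job gates the whole chain instead of being wired to the sensor directly.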
* Configuration changes.
* Upgrade / Downgrade Strategy (especially if it might be a breaking change).
* Troubleshooting
  * What are possible failure modes.
This definitely deserves an answer. It's completely fine if it's in another PR.
Now that users would schedule jobs using Airflow: what happens if job X fails? What if it fails with a user error versus a platform error? How are restarts handled (by Airflow or by the VDK runtime)?
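One possible split between the layers is to let Airflow own restarts via its standard task arguments while classifying errors so that deterministic user-code failures fail fast. The sketch below is a hypothetical policy, not existing VDK behavior; the `retries`/`retry_delay` keys are standard Airflow task arguments, but the error classification is assumed.

```python
# Hypothetical retry policy for VDK tasks scheduled through Airflow.
# Airflow-level restarts use the standard retries/retry_delay arguments;
# the user-vs-platform error split below is an ASSUMPTION, not VDK behavior.
from datetime import timedelta

default_args = {
    "retries": 3,                        # Airflow restarts the task up to 3 times
    "retry_delay": timedelta(minutes=5), # wait between restarts
}

def should_retry(error_kind: str) -> bool:
    # assumed policy: platform errors are transient and worth retrying;
    # user (job code) errors are deterministic, so retrying wastes capacity
    return error_kind == "platform_error"

print(should_retry("platform_error"))  # True
print(should_retry("user_error"))      # False
```

The open question the review raises is whether the VDK runtime also retries internally, in which case both layers retrying would multiply the number of attempts.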
the Airflow Worker.

* Availability.
  * The availability of the VDK Provider will be managed by Airflow, since it is going
In terms of availability, our provider depends on Airflow, but also on the Control Service and the Authorization Server (per the high-level diagram).
If Airflow is down, the provider will not be available.
If the Control Service API or the Auth Server is not working, would there be any retries? How would it handle networking issues (they tend to be intermittent)? What if the requests have high latency or bandwidth is low?
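For intermittent network failures, a common answer is client-side retries with exponential backoff around the Control Service calls. The helper below is a hedged sketch under that assumption; the function name and the way the provider would wrap its HTTP client are hypothetical, not part of any published VDK API.

```python
# Sketch of client-side retries with exponential backoff for intermittent
# failures when calling the Control Service API. Names are HYPOTHETICAL;
# the real provider would wrap whatever HTTP client it uses.
import time

def call_with_retries(request_fn, max_attempts=4, base_delay_s=0.5,
                      retryable=(ConnectionError, TimeoutError)):
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the Airflow task
            # exponential backoff smooths out intermittent networking issues
            time.sleep(base_delay_s * 2 ** (attempt - 1))

# usage: a request that succeeds on the third attempt
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retries(flaky, base_delay_s=0)
print(result)  # ok
```

High latency is a separate concern: backoff only helps with failures, so slow-but-successful requests would instead need timeouts tuned against the sensor's poll interval.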
Looks good so far. If you want you can merge it and open new PR for the remaining parts.
Thank you for the review, Toni!
This VEP outlines the architectural changes required to provide VDK users with the ability to do multi-job, non-linear analytics. Support for defining dependencies between units of work (tasks or jobs) will be introduced by integrating VDK with Apache Airflow. Signed-off-by: Miroslav Ivanov [email protected]
This PR aims to address the comments from #748. Signed-off-by: Miroslav Ivanov [email protected]