VEP-554: Apache Airflow Integration #748

Merged
merged 2 commits into from
Mar 7, 2022

@mivanov1988 mivanov1988 commented Mar 1, 2022

This VEP outlines the architectural changes required to let VDK users run
multi-job, non-linear analytics. Support for defining dependencies
between units of work (tasks or jobs) will be introduced by integrating VDK with Apache
Airflow.

Signed-off-by: Miroslav Ivanov [email protected]

@antoniivanov antoniivanov left a comment

Looks good so far. Let's make sure we finalize the Motivation and Goals sections as part of this PR. I wrote comments on other sections, but they can be addressed in a subsequent PR.

The idea is for the motivation to answer why we are doing this, while the goals state what we are doing.

It's very similar to a KEP. Those are good examples:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1898-hardened-exec#motivation or https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1539-hugepages#motivation

I also tried to make an example with https://github.com/vmware/versatile-data-kit/pull/730/files

<!--
* Capacity Estimation and Constraints
* Cost of data path: CPU cost per-IO, memory footprint, network footprint.
* Cost of control plane including cost of APIs, expected timeliness from layers

Because the operators put more load on the Control Service, we need to address the capacity and availability topics here.

What would be the cost to the Control Plane (API requests) if we have N jobs and K workflows running on some average schedule S?
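
One way to make this question concrete is a back-of-the-envelope estimate. The sketch below is not from the VEP: it assumes each job run costs one start request plus periodic status polls, and every parameter name and number is hypothetical.

```python
def control_plane_requests_per_day(
    workflows,              # K: number of workflows (assumed)
    jobs_per_workflow,      # average share of the N jobs in each workflow (assumed)
    runs_per_day,           # derived from the average schedule S (assumed)
    avg_job_minutes,        # average job duration (assumed)
    poll_interval_seconds,  # how often job status is polled (assumed)
):
    """Rough count of Control Service API requests per day."""
    polls_per_job = (avg_job_minutes * 60) / poll_interval_seconds
    requests_per_job = 1 + polls_per_job  # one start request plus status polls
    return workflows * jobs_per_workflow * runs_per_day * requests_per_job


# Illustrative numbers only: 50 hourly workflows of 4 jobs, 10-minute jobs,
# status polled every 30 seconds.
estimate = control_plane_requests_per_day(
    workflows=50,
    jobs_per_workflow=4,
    runs_per_day=24,
    avg_job_minutes=10,
    poll_interval_seconds=30,
)
print(f"{estimate:.0f} requests/day, about {estimate / 86400:.2f} requests/second")
```

With these made-up inputs the polling term dominates, which suggests the poll interval is likely the main knob for Control Plane load under this model.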

Comment on lines 311 to 330
) as dag:
    # [START sync_data_job]
    sync_data_job = VDKOperator(
        job_name='sync_data_job',
    )
    # [END sync_data_job]

    # [START async_data_job]
    start_async_data_job = VDKOperator(
        job_name='async_data_job',
        asynchronous=True,
    )

    async_data_job = VDKSensorOperator(
        job_name='async_data_job',
        job_execution_id=start_async_data_job.output,
    )
    # [END async_data_job]

    sync_data_job >> async_data_job

I imagined it a bit simpler.

Why can't we just do sync_data_job >> start_async_data_job?
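
To illustrate what is at stake in the ordering, here is a minimal sketch that models only the `>>` operator. The `Task` class is a hypothetical stand-in, not the Airflow or VDK API, and the task ids are assumed. Real Airflow may additionally infer a dependency from `start_async_data_job.output`; that is not modelled here.

```python
class Task:
    """Hypothetical stand-in for an Airflow operator; it only models `>>`."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b so chains work.
        self.downstream.append(other)
        return other


def reachable(start):
    """Return the task ids reachable from `start`, in visit order."""
    seen, stack = [], [start]
    while stack:
        task = stack.pop()
        if task.task_id not in seen:
            seen.append(task.task_id)
            stack.extend(task.downstream)
    return seen


def build(explicit_start):
    sync_data_job = Task("sync_data_job")
    start_async_data_job = Task("start_async_data_job")
    async_data_job = Task("async_data_job")
    if explicit_start:
        # The start task is ordered explicitly between the other two.
        sync_data_job >> start_async_data_job >> async_data_job
    else:
        # As written in the reviewed snippet.
        sync_data_job >> async_data_job
    return reachable(sync_data_job)


as_reviewed = build(explicit_start=False)
chained = build(explicit_start=True)
print(as_reviewed)
print(chained)
```

In this simplified model the explicit `>>` edges of the reviewed snippet never mention the start task, so its ordering relative to `sync_data_job` is not stated by the chain itself.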

* Configuration changes.
* Upgrade / Downgrade Strategy (especially if it might be breaking change).
* Troubleshooting
* What are possible failure modes.

This definitely deserves an answer. It's completely fine if it's for another PR.

Now that users would schedule jobs using Airflow, what happens if job X fails? What if it fails with a user error or a platform error? How are restarts handled (by Airflow or by the VDK runtime)?

@mivanov1988 mivanov1988 force-pushed the topic/miroslavi/vep-1-vdk-apache-airflow-provider branch 2 times, most recently from 158dbbb to 3b7f925 Compare March 6, 2022 16:37
@mivanov1988 mivanov1988 changed the title [DRAFT] VEP-1: VDK Apache Airflow Provider [DRAFT] VEP-554: Apache Airflow Integration Mar 6, 2022
@mivanov1988 mivanov1988 force-pushed the topic/miroslavi/vep-1-vdk-apache-airflow-provider branch 2 times, most recently from cd29a10 to 503d91d Compare March 6, 2022 16:43
@mivanov1988 mivanov1988 changed the title [DRAFT] VEP-554: Apache Airflow Integration VEP-554: Apache Airflow Integration Mar 6, 2022
@mivanov1988 mivanov1988 force-pushed the topic/miroslavi/vep-1-vdk-apache-airflow-provider branch 2 times, most recently from c4c3e65 to 7418ffd Compare March 7, 2022 12:58
the Airflow Worker.

* Availability.
* The availability of the VDK Provider will be managed by Airflow, since it is going

In terms of availability, our provider depends on Airflow but also on the Control Service and the Authorization server (per the high-level diagram).

If Airflow is down, the provider will not be available.

If the Control Service API or the Auth Server is not working, would there be any retries? How would it handle networking issues (they tend to be intermittent)? What if request latency is high or bandwidth is low?
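
One common answer to intermittent network failures is retrying with exponential backoff and jitter. The sketch below is a generic illustration under assumptions, not the provider's behaviour: `TransientError`, `call_with_retries`, and all defaults are hypothetical names and values.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 5xx responses)."""


def call_with_retries(request, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `request()`; retry on TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from many workers hitting the same API.
            sleep(delay + random.uniform(0, delay / 2))


# Usage: a fake request that fails twice before succeeding.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("simulated timeout")
    return "ok"


delays = []
result = call_with_retries(flaky_request, sleep=delays.append)
print(result, calls["count"], len(delays))
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting. Non-transient failures, such as a rejected credential, propagate immediately because only `TransientError` is caught.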

@antoniivanov antoniivanov left a comment

Looks good so far. If you want, you can merge it and open a new PR for the remaining parts.

@mivanov1988

> Looks good so far. If you want you can merge it and open new PR for the remaining parts.

Thank you for the review, Toni!

This VEP outlines the architectural changes required to provide VDK users with the
ability to do multi-jobs non-linear analytics. The support for definition of dependencies
between units of work (tasks or jobs) will be introduced by integrating VDK with Apache
Airflow.

Signed-off-by: Miroslav Ivanov [email protected]
@mivanov1988 mivanov1988 force-pushed the topic/miroslavi/vep-1-vdk-apache-airflow-provider branch from 7418ffd to dc438d2 Compare March 7, 2022 15:42
@mivanov1988 mivanov1988 enabled auto-merge (squash) March 7, 2022 15:43
@mivanov1988 mivanov1988 merged commit c06e89b into main Mar 7, 2022
@mivanov1988 mivanov1988 deleted the topic/miroslavi/vep-1-vdk-apache-airflow-provider branch March 7, 2022 15:50
mivanov1988 added a commit that referenced this pull request Mar 9, 2022
This PR aims to address the comments from
#748.

Signed-off-by: Miroslav Ivanov [email protected]
antoniivanov pushed a commit that referenced this pull request Mar 16, 2022
This VEP outlines the architectural changes required to provide VDK users with the
ability to do multi-jobs non-linear analytics. The support for definition of dependencies
between units of work (tasks or jobs) will be introduced by integrating VDK with Apache
Airflow.

Signed-off-by: Miroslav Ivanov [email protected]
mivanov1988 added a commit that referenced this pull request Mar 22, 2022
This PR aims to address the comments from
#748.

Signed-off-by: Miroslav Ivanov [email protected]
mivanov1988 added a commit that referenced this pull request Mar 22, 2022
This PR aims to address the comments from
#748.

Signed-off-by: Miroslav Ivanov [email protected]