VEP-554: Apache Airflow Integration #748
Conversation
Looks good so far. Let's make sure we finalize the Motivation and goals section as part of this PR. I wrote comments for other sections, but they can be addressed in subsequent PRs.
The idea is for the motivation to answer why we are doing this, while the goals state what we are doing.
It's very similar to a KEP. These are good examples:
https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1898-hardened-exec#motivation or https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1539-hugepages#motivation
I also tried to make an example with https://github.com/vmware/versatile-data-kit/pull/730/files
<!--
* Capacity Estimation and Constraints
  * Cost of data path: CPU cost per-IO, memory footprint, network footprint.
  * Cost of control plane including cost of APIs, expected timeliness from layers
Since the operators will put more load on the Control Service, we need to answer the capacity and availability questions here.
What would be the cost to the control plane (API requests) if we have N jobs and K workflows running on some average schedule S?
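A back-of-envelope estimate along those lines could look like the sketch below. Every per-task call count and interval here (one trigger call per job, a 30-second sensor poll, a 10-minute average job duration) is a hypothetical assumption for illustration, not a measured or documented value.

```python
# Rough estimate of Control Service API load from the Airflow integration.
# All constants below are ASSUMPTIONS, not measured VDK behavior.

def control_plane_requests_per_hour(
    n_jobs: int,                    # N: data jobs per workflow run
    k_workflows: int,               # K: scheduled workflows
    runs_per_hour: float,           # S: average runs per workflow per hour
    calls_per_trigger: int = 1,     # assumed: one "start execution" call per job
    poll_interval_s: int = 30,      # assumed sensor polling interval
    avg_job_duration_s: int = 600,  # assumed average job run time
) -> float:
    # Each job costs one trigger call plus one status poll per interval
    # for as long as the job runs.
    polls_per_job = avg_job_duration_s / poll_interval_s
    requests_per_run = n_jobs * (calls_per_trigger + polls_per_job)
    return k_workflows * runs_per_hour * requests_per_run

# e.g. 10 jobs per workflow, 5 workflows, each running twice an hour:
print(control_plane_requests_per_hour(10, 5, 2))  # 2100.0
```

Even modest numbers produce thousands of requests per hour, which is why the polling interval would likely dominate the capacity answer.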
) as dag:
    # [START sync_data_job]
    sync_data_job = VDKOperator(
        job_name='sync_data_job',
    )
    # [END sync_data_job]

    # [START async_data_job]
    start_async_data_job = VDKOperator(
        job_name='async_data_job',
        asynchronous=True,
    )

    async_data_job = VDKSensorOperator(
        job_name='async_data_job',
        job_execution_id=start_async_data_job.output,
    )
    # [END async_data_job]

    sync_data_job >> async_data_job
I imagined it a bit simpler.
Why can't we just do `sync_data_job >> start_async_data_job`?
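The chain the review suggests can be illustrated with a minimal stand-in for Airflow's `>>` operator. These are not the real Airflow or VDK classes; this is only a sketch of the ordering semantics (sync job, then async trigger, then the sensor waiting on it).

```python
# Minimal stand-in for Airflow task chaining via ">>".
# NOT real Airflow/VDK classes; just illustrates the dependency order
# suggested in the review: sync job -> async trigger -> sensor.

class Task:
    def __init__(self, task_id: str):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other: "Task") -> "Task":
        # "a >> b" makes b run after a, as in Airflow
        self.downstream.append(other)
        return other  # returning the right operand allows a >> b >> c

sync_data_job = Task("sync_data_job")
start_async_data_job = Task("start_async_data_job")
wait_async_data_job = Task("wait_async_data_job")

# the simpler chain proposed in the review:
sync_data_job >> start_async_data_job >> wait_async_data_job

print([t.task_id for t in sync_data_job.downstream])  # ['start_async_data_job']
```

With this ordering the sensor still waits on the async execution id, but the sync job gates the whole chain instead of being wired to the sensor directly.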
* Configuration changes.
* Upgrade / Downgrade Strategy (especially if it might be a breaking change).
* Troubleshooting
  * What are possible failure modes.
This definitely deserves an answer. It's completely fine if it's in another PR.
Now that users would schedule jobs using Airflow: what happens if job X fails? What if it fails with a user error versus a platform error? How are restarts handled (by Airflow or by the VDK runtime)?
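One possible split between the layers is to let Airflow own restarts via its standard task arguments while classifying errors so that deterministic user-code failures fail fast. The sketch below is a hypothetical policy, not existing VDK behavior; the `retries`/`retry_delay` keys are standard Airflow task arguments, but the error classification is assumed.

```python
# Hypothetical retry policy for VDK tasks scheduled through Airflow.
# Airflow-level restarts use the standard retries/retry_delay arguments;
# the user-vs-platform error split below is an ASSUMPTION, not VDK behavior.
from datetime import timedelta

default_args = {
    "retries": 3,                        # Airflow restarts the task up to 3 times
    "retry_delay": timedelta(minutes=5), # wait between restarts
}

def should_retry(error_kind: str) -> bool:
    # assumed policy: platform errors are transient and worth retrying;
    # user (job code) errors are deterministic, so retrying wastes capacity
    return error_kind == "platform_error"

print(should_retry("platform_error"))  # True
print(should_retry("user_error"))      # False
```

The open question the review raises is whether the VDK runtime also retries internally, in which case both layers retrying would multiply the number of attempts.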
the Airflow Worker.

* Availability.
  * The availability of the VDK Provider will be managed by Airflow, since it is going
In terms of availability, our provider depends on Airflow, but also on the Control Service and the Authorization Server (per the high-level diagram).
If Airflow is down, the provider will not be available.
If the Control Service API or the Auth Server is not working, would there be any retries? How would it handle networking issues (they tend to be intermittent)? What if the requests have high latency or bandwidth is low?
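For intermittent network failures, a common answer is client-side retries with exponential backoff around the Control Service calls. The helper below is a hedged sketch under that assumption; the function name and the way the provider would wrap its HTTP client are hypothetical, not part of any published VDK API.

```python
# Sketch of client-side retries with exponential backoff for intermittent
# failures when calling the Control Service API. Names are HYPOTHETICAL;
# the real provider would wrap whatever HTTP client it uses.
import time

def call_with_retries(request_fn, max_attempts=4, base_delay_s=0.5,
                      retryable=(ConnectionError, TimeoutError)):
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except retryable:
            if attempt == max_attempts:
                raise  # exhausted: surface the error to the Airflow task
            # exponential backoff smooths out intermittent networking issues
            time.sleep(base_delay_s * 2 ** (attempt - 1))

# usage: a request that succeeds on the third attempt
attempts = {"n": 0}
def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

result = call_with_retries(flaky, base_delay_s=0)
print(result)  # ok
```

High latency is a separate concern: backoff only helps with failures, so slow-but-successful requests would instead need timeouts tuned against the sensor's poll interval.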
Looks good so far. If you want you can merge it and open new PR for the remaining parts.
Thank you for the review, Toni!
This VEP outlines the architectural changes required to provide VDK users with the ability to do multi-job, non-linear analytics. Support for defining dependencies between units of work (tasks or jobs) will be introduced by integrating VDK with Apache Airflow. Signed-off-by: Miroslav Ivanov [email protected]
This PR aims to address the comments from #748. Signed-off-by: Miroslav Ivanov [email protected]