-
Notifications
You must be signed in to change notification settings - Fork 14.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DatabricksWorkflowTaskGroup #39771
Add DatabricksWorkflowTaskGroup #39771
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's great to see how far you've advanced on this, @pankajkoti ! You're reimplementing the original feature in a better way. Thank you!
92b551f
to
1e911c6
Compare
f12d091
to
cc8d0a8
Compare
11c0b93
to
ffc4f25
Compare
a328ad8
to
a9125aa
Compare
docs/apache-airflow-providers-databricks/operators/workflow.rst
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pankajkoti I left some minor comments, but it looks great. It's impressive how you were able to keep the original interfaces, while being consistent with the Airflow Databricks provider. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Left some minor suggestions, most of them are just annotation improvements
tests/system/providers/databricks/example_databricks_workflow.py
Outdated
Show resolved
Hide resolved
tests/system/providers/databricks/example_databricks_workflow.py
Outdated
Show resolved
Hide resolved
tests/system/providers/databricks/example_databricks_workflow.py
Outdated
Show resolved
Hide resolved
Co-authored-by: Wei Lee <[email protected]> Co-authored-by: Tatiana Al-Chueyr <[email protected]>
@tatiana @Lee-W @phanikumv I have addressed all the review comments so far. Would appreciate another review please. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pankajkoti ! overall it looks great! left one suggestion but we probably can do it in the next PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @pankajkoti , thanks for addressing all the feedback, you made it better than the original implementation!
Some minor feedback that can be addressed in follow-up PRs:
- When using
DatabricksNotebookOperator
fromDatabricksWorkflowTaskGroup
, the Databricks Workflow Job Tasks should not be prefixed with the DAG & TaskGroup. Using the example shown in the PR description:- The task (notebook identifier) is named
workflow_notebook_1
in the Airflow Graph view - The task is currently named
example_Databricks_workflow_test_workflow_root_1234_run_workflow_notebook_1
in the Databricks Graph view - unnecessarily since tasks are not displayed as jobs in the Databricks job list. The current name in Databricks makes it hard to understand how different notebooks relate to each other in the Databricks UI.
- The task (notebook identifier) is named
- To have a clear bullet list or table showing how the concepts map between Airflow & Databricks:
- Airflow DAG - Databricks Workflow Job name prefix
- Airflow TaskGroup - Databricks Workflow Job
- Airflow Notebook task - Databricks Workflow Job Task
- Move a log line outside of the loop since it does not give meaningful data to be printed for each iteration of the for loop: Add DatabricksWorkflowTaskGroup #39771 (comment)
- Switch
Using token auth
log fromINFO
toDEBUG
(unrelated to the current PR, but it improves the log messages) - Refactor other parts of the Databricks provider to use the just introduced
Enum
RunLifeCycleState
This pull request introduces the [DatabricksWorkflowTaskGroup](https://github.com/astronomer/astro-provider-databricks/blob/main/src/astro_databricks/operators/workflow.py#L226) to the Airflow Databricks provider from the [astro-provider-databricks](https://github.com/astronomer/astro-provider-databricks/tree/main) repository. It marks another pull request aimed at contributing operators and features from that repository into the Airflow Databricks provider, the previous PR being apache#39178. The task group launches a [Databricks Workflow](https://docs.databricks.com/en/workflows/index.html) and runs the notebook jobs from within it, resulting in a [75% cost reduction](https://www.databricks.com/product/pricing) ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute) when compared to executing ``DatabricksNotebookOperator`` outside of ``DatabricksWorkflowTaskGroup``. --------- Co-authored-by: Daniel Imberman <[email protected]> Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: Wei Lee <[email protected]>
This pull request introduces the [DatabricksWorkflowTaskGroup](https://github.com/astronomer/astro-provider-databricks/blob/main/src/astro_databricks/operators/workflow.py#L226) to the Airflow Databricks provider from the [astro-provider-databricks](https://github.com/astronomer/astro-provider-databricks/tree/main) repository. It marks another pull request aimed at contributing operators and features from that repository into the Airflow Databricks provider, the previous PR being apache#39178. The task group launches a [Databricks Workflow](https://docs.databricks.com/en/workflows/index.html) and runs the notebook jobs from within it, resulting in a [75% cost reduction](https://www.databricks.com/product/pricing) ($0.40/DBU for all-purpose compute, $0.07/DBU for Jobs compute) when compared to executing ``DatabricksNotebookOperator`` outside of ``DatabricksWorkflowTaskGroup``. --------- Co-authored-by: Daniel Imberman <[email protected]> Co-authored-by: Tatiana Al-Chueyr <[email protected]> Co-authored-by: Wei Lee <[email protected]>
As part of Astronomer's internal plans and decisions, we've decided to contribute the existing functionality provided by the operators and plugins in this repository to the official Apache Airflow Databricks provider. To achieve this, we submitted the following PRs to the Airflow provider: 1. apache/airflow#39178 2. apache/airflow#39771 3. apache/airflow#40013 4. apache/airflow#40724 5. apache/airflow#39295 All functionality has now been contributed to the Airflow Databricks provider, and ongoing support will be maintained there. As a result, we're deprecating the operators and plugins in this repository. Users are encouraged to transition to the official Apache Airflow Databricks provider as soon as possible. The migration process is straightforward—simply update the import path to point to the Airflow provider and ensure that you install `apache-airflow-providers-databricks>=6.8.0`, which includes all the contributions mentioned above. closes: astronomer/issues-airflow#715
This pull request introduces the DatabricksWorkflowTaskGroup
to the Airflow Databricks provider from the astro-provider-databricks
repository.
It marks another pull request aimed at contributing
operators and features from that repository into the Airflow
Databricks provider, the previous PR being #39178.
The task group launches a Databricks Workflow
and runs the notebook jobs from within it, resulting in a
75% cost reduction ($0.40/DBU for all-purpose compute,
$0.07/DBU for Jobs compute) when compared to executing
DatabricksNotebookOperator
outside ofDatabricksWorkflowTaskGroup
.There are a few advantages to defining your Databricks Workflows in Airflow:
Below screenshots depict successful Airflow DAG runs and corresponding
successful Databricks job run
Airflow DAG view
![Screenshot 2024-05-29 at 10 14 19 PM](https://private-user-images.githubusercontent.com/10206082/334890693-e763cf18-0dd8-4635-92ca-8d63bc51e576.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzg5NTkyNjcsIm5iZiI6MTczODk1ODk2NywicGF0aCI6Ii8xMDIwNjA4Mi8zMzQ4OTA2OTMtZTc2M2NmMTgtMGRkOC00NjM1LTkyY2EtOGQ2M2JjNTFlNTc2LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDclMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA3VDIwMDkyN1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTg4ZThhMjkwMTc5YWJkZmVjNmFjZTcyMWIyZWUzYTEzYjEyZTdkYWRlMjE3ODZlYWEyNWQxZTAyZTZjN2NjOGUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0._O-kj-So_kXCiVoapMkwD9aUMYi9lGdEuvWaBWcIT_k)
Datarbricks job Graph view
Co-authored by: @dimberman @tatiana in the original repo.
Co-authored-by: Daniel Imberman [email protected]
Co-authored-by: Tatiana Al-Chueyr [email protected]
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rst
or{issue_number}.significant.rst
, in newsfragments.