Add DatabricksWorkflowTaskGroup #39771

Merged (10 commits) on May 30, 2024

pankajkoti
Member

@pankajkoti pankajkoti commented May 23, 2024

This pull request introduces the DatabricksWorkflowTaskGroup
to the Airflow Databricks provider from the astro-provider-databricks
repository.
It is the latest in a series of pull requests contributing
operators and features from that repository to the Airflow
Databricks provider; the previous one was #39178.

The task group launches a Databricks Workflow
and runs the notebook jobs from within it, resulting in a
75% cost reduction ($0.40/DBU for all-purpose compute,
$0.07/DBU for Jobs compute) when compared to executing
DatabricksNotebookOperator outside of DatabricksWorkflowTaskGroup.
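
As a quick check of the figures quoted above, the per-DBU saving can be worked out directly from the two rates. This is only a sketch: it uses the $0.40 and $0.07 rates cited in the description, ignores any difference in DBU consumption between compute types, and actual Databricks pricing varies by cloud, tier, and region.

```python
# Per-DBU rates quoted in the PR description (USD); actual pricing may differ.
ALL_PURPOSE_RATE = 0.40
JOBS_RATE = 0.07


def relative_saving(baseline: float, discounted: float) -> float:
    """Return the fractional saving of `discounted` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return (baseline - discounted) / baseline


saving = relative_saving(ALL_PURPOSE_RATE, JOBS_RATE)
print(f"Per-DBU saving: {saving:.1%}")
```

At these two rates the per-DBU saving comes out at about 82.5%, somewhat above the rounded 75% figure quoted in the description.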

There are a few advantages to defining your Databricks Workflows in Airflow:

|  | via Databricks | via Airflow |
| --- | --- | --- |
| Authoring interface | Web-based via Databricks UI | Code via Airflow DAG |
| Workflow compute pricing |  |  |
| Notebook code in source control |  |  |
| Workflow structure in source control |  |  |
| Retry from beginning |  |  |
| Retry single task |  |  |
| Task groups within Workflows |  |  |
| Trigger workflows from other DAGs |  |  |
| Workflow-level parameters |  |  |

The screenshots below show a successful Airflow DAG run and the
corresponding successful Databricks job run.

Airflow DAG view
[Screenshot 2024-05-29 at 10 14 19 PM]

Databricks job graph view
[workflow_run_databricks_graph_view]

Co-authored by: @dimberman @tatiana in the original repo.

Co-authored-by: Daniel Imberman
Co-authored-by: Tatiana Al-Chueyr


Contributor

@tatiana tatiana left a comment

It's great to see how far you've advanced on this, @pankajkoti ! You're reimplementing the original feature in a better way. Thank you!

@pankajkoti pankajkoti force-pushed the databricks-workflow-taskgroup branch from 92b551f to 1e911c6 Compare May 28, 2024 06:33
@pankajkoti pankajkoti force-pushed the databricks-workflow-taskgroup branch from f12d091 to cc8d0a8 Compare May 28, 2024 12:59
@pankajkoti pankajkoti force-pushed the databricks-workflow-taskgroup branch from 11c0b93 to ffc4f25 Compare May 28, 2024 13:21
@pankajkoti pankajkoti force-pushed the databricks-workflow-taskgroup branch from a328ad8 to a9125aa Compare May 28, 2024 17:36
@pankajkoti pankajkoti marked this pull request as ready for review May 28, 2024 21:32
@pankajkoti pankajkoti requested review from tatiana, eladkal, Lee-W, sunank200, rawwar and phanikumv and removed request for tatiana May 28, 2024 21:32
Contributor

@tatiana tatiana left a comment

@pankajkoti I left some minor comments, but it looks great. It's impressive how you were able to keep the original interfaces, while being consistent with the Airflow Databricks provider. Thank you!

Member

@Lee-W Lee-W left a comment

Looks good! I left some minor suggestions; most of them are just annotation improvements.

Co-authored-by: Wei Lee
Co-authored-by: Tatiana Al-Chueyr
@pankajkoti pankajkoti requested review from Lee-W, tatiana and phanikumv May 29, 2024 16:53
@pankajkoti
Member Author

@tatiana @Lee-W @phanikumv I have addressed all the review comments so far. Would appreciate another review please.

Member

@Lee-W Lee-W left a comment

Thanks @pankajkoti! Overall it looks great. I left one suggestion, but we can probably address it in the next PR.

Contributor

@tatiana tatiana left a comment

Hi @pankajkoti , thanks for addressing all the feedback, you made it better than the original implementation!

Some minor feedback that can be addressed in follow-up PRs:

  1. When using DatabricksNotebookOperator from DatabricksWorkflowTaskGroup, the Databricks Workflow Job Tasks should not be prefixed with the DAG & TaskGroup. Using the example shown in the PR description:
    • The task (notebook identifier) is named workflow_notebook_1 in the Airflow Graph view
    • The task is currently named example_Databricks_workflow_test_workflow_root_1234_run_workflow_notebook_1 in the Databricks Graph view. The prefix is unnecessary, since tasks are not displayed as jobs in the Databricks job list, and the current name makes it hard to understand how different notebooks relate to each other in the Databricks UI.
  2. Add a clear bullet list or table to the documentation showing how the concepts map between Airflow and Databricks:
    • Airflow DAG - Databricks Workflow Job name prefix
    • Airflow TaskGroup - Databricks Workflow Job
    • Airflow Notebook task - Databricks Workflow Job Task
  3. Move a log line outside of the for loop, since printing it on every iteration adds no meaningful information (see the linked review comment on #39771)
  4. Switch the "Using token auth" log message from INFO to DEBUG (unrelated to the current PR, but it improves the log messages)
  5. Refactor other parts of the Databricks provider to use the newly introduced RunLifeCycleState enum
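
The naming and mapping points in items 1 and 2 above can be sketched in a few lines. This is a hypothetical illustration, not the provider's implementation: the `NotebookTask` and `WorkflowSpec` classes, the `dag_id.group_id` job-name format, and the example IDs and notebook paths are all assumptions made for this sketch. Only the `name`, `tasks`, `task_key`, `notebook_task`, and `depends_on` field names are meant to follow the Databricks Jobs API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class NotebookTask:
    """One Airflow notebook task, mapped to one Databricks Workflow Job Task."""

    task_id: str
    notebook_path: str
    upstream: list[str] = field(default_factory=list)


@dataclass
class WorkflowSpec:
    """One Airflow TaskGroup, mapped to one Databricks Workflow Job."""

    dag_id: str  # Airflow DAG -> Databricks Workflow Job name prefix
    group_id: str  # Airflow TaskGroup -> Databricks Workflow Job
    tasks: list[NotebookTask]

    def job_name(self) -> str:
        # Assumed format for this sketch: DAG ID as prefix, then the TaskGroup ID.
        return f"{self.dag_id}.{self.group_id}"

    def to_job_payload(self) -> dict:
        # Task keys use the short Airflow task ID, without the DAG and
        # TaskGroup prefix, as suggested in item 1.
        return {
            "name": self.job_name(),
            "tasks": [
                {
                    "task_key": task.task_id,
                    "notebook_task": {"notebook_path": task.notebook_path},
                    "depends_on": [{"task_key": up} for up in task.upstream],
                }
                for task in self.tasks
            ],
        }


# Hypothetical IDs and paths, for illustration only.
spec = WorkflowSpec(
    dag_id="example_dag",
    group_id="example_group",
    tasks=[
        NotebookTask("workflow_notebook_1", "/Shared/notebook_1"),
        NotebookTask(
            "workflow_notebook_2",
            "/Shared/notebook_2",
            upstream=["workflow_notebook_1"],
        ),
    ],
)
payload = spec.to_job_payload()
```

Keeping the DAG and TaskGroup identifiers in the job name, and out of the task keys, means the Databricks graph view would show the same short names as the Airflow graph view.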

@pankajkoti pankajkoti merged commit 2ecf7fa into apache:main May 30, 2024
100 checks passed
@pankajkoti pankajkoti deleted the databricks-workflow-taskgroup branch May 30, 2024 09:19
fdemiane pushed a commit to fdemiane/airflow that referenced this pull request Jun 6, 2024
romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024
pankajkoti added a commit to astronomer/astro-provider-databricks that referenced this pull request Aug 8, 2024
As part of Astronomer's internal plans and decisions, we've decided to contribute the existing functionality provided by the operators and plugins in this repository to the official Apache Airflow Databricks provider. To achieve this, we submitted the following PRs to the Airflow provider:

1. apache/airflow#39178
2. apache/airflow#39771
3. apache/airflow#40013
4. apache/airflow#40724
5. apache/airflow#39295

All functionality has now been contributed to the Airflow Databricks provider, and ongoing support will be maintained there. As a result, we're deprecating the operators and plugins in this repository. Users are encouraged to transition to the official Apache Airflow Databricks provider as soon as possible. The migration process is straightforward—simply update the import path to point to the Airflow provider and ensure that you install `apache-airflow-providers-databricks>=6.8.0`, which includes all the contributions mentioned above.
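
A minimal sketch of a pre-migration check, assuming only the package name and minimum version stated above. The helper names are hypothetical, and the naive parsing handles plain `X.Y.Z` version strings only (pre-release suffixes such as `rc1` would need a proper parser like `packaging.version`). The import paths in the comments are given as the expected before and after; verify them against the provider documentation for your installed version.

```python
from __future__ import annotations

from importlib import metadata

# Expected import change (assumption; check the provider docs):
#   before: from astro_databricks import DatabricksWorkflowTaskGroup
#   after:  from airflow.providers.databricks.operators.databricks_workflow import (
#               DatabricksWorkflowTaskGroup,
#           )

PACKAGE = "apache-airflow-providers-databricks"
MINIMUM = (6, 8, 0)


def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain 'X.Y.Z' version string into a tuple of ints."""
    return tuple(int(part) for part in version.split(".")[:3])


def meets_minimum(version: str, minimum: tuple[int, ...] = MINIMUM) -> bool:
    """Compare numerically, so that '6.10.1' correctly ranks above '6.8.0'."""
    return parse_version(version) >= minimum


def installed_provider_ok() -> bool:
    """Return True if the provider is installed at MINIMUM or newer."""
    try:
        return meets_minimum(metadata.version(PACKAGE))
    except metadata.PackageNotFoundError:
        return False
```

Calling `installed_provider_ok()` before switching the imports gives an early signal that the environment still has an older provider release, or none at all.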

closes: astronomer/issues-airflow#715