Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Install a global init script to collect HMS lineage for a few weeks. #513

Merged
merged 17 commits into from
Nov 2, 2023

Conversation

dipankarkush-db
Copy link
Contributor

Fixes #324

@codecov
Copy link

codecov bot commented Oct 26, 2023

Codecov Report

Merging #513 (e13414e) into main (f9b6117) will increase coverage by 0.13%.
The diff coverage is 89.58%.

@@            Coverage Diff             @@
##             main     #513      +/-   ##
==========================================
+ Coverage   80.98%   81.12%   +0.13%     
==========================================
  Files          31       33       +2     
  Lines        3392     3438      +46     
  Branches      658      667       +9     
==========================================
+ Hits         2747     2789      +42     
- Misses        491      493       +2     
- Partials      154      156       +2     
Files Coverage Δ
...x/hive_metastore/hms_lineage_global_init_script.py 100.00% <100.00%> (ø)
src/databricks/labs/ucx/runtime.py 46.34% <0.00%> (ø)
.../databricks/labs/ucx/hive_metastore/hms_lineage.py 92.30% <92.30%> (ø)
src/databricks/labs/ucx/install.py 81.85% <90.00%> (+0.29%) ⬆️

@dipankarkush-db dipankarkush-db marked this pull request as ready for review October 27, 2023 00:50
@dipankarkush-db dipankarkush-db requested a review from a team October 27, 2023 00:50
@dipankarkush-db
Copy link
Contributor Author


image

image

return base64.b64encode(_init_script_content.encode()).decode()

def _add_global_init_script(self):
self._ws.global_init_scripts.create(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we check for existing init scripts? Can we append to existing scripts? Is there a risk?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it's an easy add

Copy link
Contributor Author

@dipankarkush-db dipankarkush-db Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @FastLee and @nfx - are you referring to adding the portion in an existing Global init scripts? If there are multiple global init script then which one to choose? Keeping it separate allows the user to disable it if needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep scripts separate, yes. But don't add lineage enabler if it's already there. Manual reproduction: run installer twice and verify only one script created by ucx

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @nfx - This is what the code does -
if spark config present and the init script is enabled - skips creating a new one
If spark config present but the init script is disabled - asks if the user wants to enable it and enables/leave as it is based on user response
If spark config is not present then creates a new enabled global init script.

tests/unit/hive_metastore/test_hms_lineage.py Outdated Show resolved Hide resolved
src/databricks/labs/ucx/hive_metastore/hms_lineage.py Outdated Show resolved Hide resolved
src/databricks/labs/ucx/runtime.py Outdated Show resolved Hide resolved
@@ -187,6 +188,14 @@ def crawl_mounts(cfg: WorkspaceConfig):
mounts.inventorize_mounts()


@task("assessment", depends_on=[setup_schema])
def enable_hms_lineage(cfg: WorkspaceConfig):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be a better fit if the init script is rolled out in the installer. It'll be simpler to support. Ask it in the installation flow as a question

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added in install.py

return base64.b64encode(_init_script_content.encode()).decode()

def _add_global_init_script(self):
self._ws.global_init_scripts.create(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, it's an easy add

Copy link
Collaborator

@nfx nfx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't mark review comments as resolved unless you have really addressed them

return created_script

def add_spark_config_for_hms_lineage(self):
created_script = self._add_global_init_script()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a mandatory check for any global init script with "spark.databricks.dataLineage.enabled" string to skip the addition of this init script

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.

self._add_sql_wh_config()
return created_script.script_id

def _add_sql_wh_config(self):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't add empty methods, they confuse people while reading the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

@dipankarkush-db dipankarkush-db temporarily deployed to account-admin October 31, 2023 16:44 — with GitHub Actions Inactive
def check_lineage_spark_config_exists(self) -> GlobalInitScriptDetailsWithContent:
for script in self._ws.global_init_scripts.list():
gscript = self._ws.global_init_scripts.get(script_id=script.script_id)
if gscript:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Too much level of nesting, rewrite with "if gscript is None: continue"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took care of the same.

if "spark.databricks.dataLineage.enabled" in base64.b64decode(gscript.script).decode("utf-8"):
return gscript

def _get_init_script_content(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keep the init script content in a file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @FastLee - took care of this.

def _install_spark_config_for_hms_lineage(self):
if (
self._prompts
and self._question(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check first if the script exists, before asking the question.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @FastLee - took care of this.

@nfx
Copy link
Collaborator

nfx commented Nov 2, 2023

merging PR, two test failures are unrelated:
#540
#541

@nfx nfx merged commit e13a399 into main Nov 2, 2023
@nfx nfx deleted the feature/add-hms-lineage branch November 2, 2023 14:44
nfx added a commit that referenced this pull request Nov 17, 2023
**Breaking changes** (existing installations need to reinstall UCX and re-run assessment jobs)

 * Switched local group migration component to rename groups instead of creating backup groups ([#450](#450)).
 * Mitigate permissions loss in Table ACLs by folding grants belonging to the same principal, object id and object type together ([#512](#512)).

**New features**

 * Added support for the experimental Databricks CLI launcher ([#517](#517)).
 * Added support for external Hive Metastores including AWS Glue ([#400](#400)).
 * Added more views to assessment dashboard ([#474](#474)).
 * Added rate limit for creating backup group to increase stability ([#500](#500)).
 * Added deduplication for mount point list ([#569](#569)).
 * Added documentation to describe interaction with external Hive Metastores ([#473](#473)).
 * Added failure injection for job failure message propagation  ([#591](#591)).
 * Added uniqueness in the new warehouse name to avoid conflicts on installation ([#542](#542)).
 * Added a global init script to collect Hive Metastore lineage ([#513](#513)).
 * Added retry set/update permissions when possible and assess the changes in the workspace ([#519](#519)).
 * Use `~/.ucx/state.json` to store the state of both dashboards and jobs ([#561](#561)).

**Bug fixes**

 * Fixed handling for `OWN` table permissions ([#571](#571)).
 * Fixed handling of keys with and without values. ([#514](#514)).
 * Fixed integration test failures related to concurrent group delete ([#584](#584)).
 * Fixed issue with workspace listing process on None type `object_type` ([#481](#481)).
 * Fixed missing group entitlement migration bug ([#583](#583)).
 * Fixed entitlement application for account-level groups ([#529](#529)).
 * Fixed assessment throwing an error when the owner of an object is empty ([#485](#485)).
 * Fixed installer to migrate between different configuration file versions ([#596](#596)).
 * Fixed cluster policy crawler to be aware of deleted policies ([#486](#486)).
 * Improved error message for not null constraints violated ([#532](#532)).
 * Improved integration test resiliency ([#597](#597), [#594](#594), [#586](#586)).
 * Introduced Safer access to workspace objects' properties. ([#530](#530)).
 * Mitigated permissions loss in Table ACLs by running appliers with single thread ([#518](#518)).
 * Running apply permission task before assessment should display message ([#487](#487)).
 * Split integration tests from blocking the merge queue ([#496](#496)).
 * Support more than one dashboard per step ([#472](#472)).
 * Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0 ([#505](#505)).
 * Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0 ([#575](#575)).
@nfx nfx mentioned this pull request Nov 17, 2023
nfx added a commit that referenced this pull request Nov 17, 2023
**Breaking changes** (existing installations need to reinstall UCX and
re-run assessment jobs)

* Switched local group migration component to rename groups instead of
creating backup groups
([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to
the same principal, object id and object type together
([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher
([#517](#517)).
* Added support for external Hive Metastores including AWS Glue
([#400](#400)).
* Added more views to assessment dashboard
([#474](#474)).
* Added rate limit for creating backup group to increase stability
([#500](#500)).
* Added deduplication for mount point list
([#569](#569)).
* Added documentation to describe interaction with external Hive
Metastores ([#473](#473)).
* Added failure injection for job failure message propagation
([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on
installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage
([#513](#513)).
* Added retry set/update permissions when possible and assess the
changes in the workspace
([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs
([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions
([#571](#571)).
* Fixed handling of keys with and without values.
([#514](#514)).
* Fixed integration test failures related to concurrent group delete
([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type`
([#481](#481)).
* Fixed missing group entitlement migration bug
([#583](#583)).
* Fixed entitlement application for account-level groups
([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is
empty ([#485](#485)).
* Fixed installer to migrate between different configuration file
versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies
([#486](#486)).
* Improved error message for not null constraints violated
([#532](#532)).
* Improved integration test resiliency
([#597](#597),
[#594](#594),
[#586](#586)).
* Introduced Safer access to workspace objects' properties.
([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with
single thread ([#518](#518)).
* Running apply permission task before assessment should display message
([#487](#487)).
* Split integration tests from blocking the merge queue
([#496](#496)).
* Support more than one dashboard per step
([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0
([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0
([#575](#575)).
pritishpai pushed a commit that referenced this pull request Nov 21, 2023
**Breaking changes** (existing installations need to reinstall UCX and
re-run assessment jobs)

* Switched local group migration component to rename groups instead of
creating backup groups
([#450](#450)).
* Mitigate permissions loss in Table ACLs by folding grants belonging to
the same principal, object id and object type together
([#512](#512)).

**New features**

* Added support for the experimental Databricks CLI launcher
([#517](#517)).
* Added support for external Hive Metastores including AWS Glue
([#400](#400)).
* Added more views to assessment dashboard
([#474](#474)).
* Added rate limit for creating backup group to increase stability
([#500](#500)).
* Added deduplication for mount point list
([#569](#569)).
* Added documentation to describe interaction with external Hive
Metastores ([#473](#473)).
* Added failure injection for job failure message propagation
([#591](#591)).
* Added uniqueness in the new warehouse name to avoid conflicts on
installation ([#542](#542)).
* Added a global init script to collect Hive Metastore lineage
([#513](#513)).
* Added retry set/update permissions when possible and assess the
changes in the workspace
([#519](#519)).
* Use `~/.ucx/state.json` to store the state of both dashboards and jobs
([#561](#561)).

**Bug fixes**

* Fixed handling for `OWN` table permissions
([#571](#571)).
* Fixed handling of keys with and without values.
([#514](#514)).
* Fixed integration test failures related to concurrent group delete
([#584](#584)).
* Fixed issue with workspace listing process on None type `object_type`
([#481](#481)).
* Fixed missing group entitlement migration bug
([#583](#583)).
* Fixed entitlement application for account-level groups
([#529](#529)).
* Fixed assessment throwing an error when the owner of an object is
empty ([#485](#485)).
* Fixed installer to migrate between different configuration file
versions ([#596](#596)).
* Fixed cluster policy crawler to be aware of deleted policies
([#486](#486)).
* Improved error message for not null constraints violated
([#532](#532)).
* Improved integration test resiliency
([#597](#597),
[#594](#594),
[#586](#586)).
* Introduced Safer access to workspace objects' properties.
([#530](#530)).
* Mitigated permissions loss in Table ACLs by running appliers with
single thread ([#518](#518)).
* Running apply permission task before assessment should display message
([#487](#487)).
* Split integration tests from blocking the merge queue
([#496](#496)).
* Support more than one dashboard per step
([#472](#472)).
* Update databricks-sdk requirement from ~=0.11.0 to ~=0.12.0
([#505](#505)).
* Update databricks-sdk requirement from ~=0.12.0 to ~=0.13.0
([#575](#575)).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Install a global init script to collect HMS lineage for a few weeks
4 participants