Extend documentation with additional steps to get started #947

Merged
merged 5 commits into from
Feb 15, 2024
Changes from 2 commits
107 changes: 84 additions & 23 deletions README.md
@@ -1,4 +1,4 @@
# ![UCX by Dataricks Labs](docs/logo-no-background.png)
# ![UCX by Databricks Labs](docs/logo-no-background.png)

Your best companion for upgrading to Unity Catalog. It helps you to upgrade all Databricks workspace assets:
Legacy Table ACLs, Entitlements, AWS instance profiles, Clusters, Cluster policies, Instance Pools, Databricks SQL warehouses, Delta Live Tables, Jobs, MLflow experiments, MLflow registry, SQL Dashboards & Queries, SQL Alerts, Token and Password usage permissions that are set on the workspace level, Secret scopes, Notebooks, Directories, Repos, Files.
@@ -18,59 +18,120 @@ UCX leverages Databricks Lakehouse platform to upgrade itself. The upgrade proce

By running the installation you install the assessment job and several upgrade jobs. The assessment and upgrade jobs are outlined in the custom-generated README.py that is created by the installer.

The custom-generated `README.py`, `config.yaml`, and other assets are placed into your Databricks workspace home folder, into a subfolder named `.ucx`. See [interactive tutorial](https://app.getreprise.com/launch/zXPxBZX/).
The custom-generated `README.py`, `config.yaml`, and other assets are placed into your Databricks workspace home folder, into a sub-folder named `.ucx`. See [interactive tutorial](https://app.getreprise.com/launch/zXPxBZX/).


Once the custom Databricks jobs are installed, begin by triggering the assessment job. The assessment job can be found under your workflows or via the active link in the README.py. Once the assessment job is complete, you can review the results in the custom-generated Databricks dashboard (linked to by the custom README.py found in the workspace folder created for you).


You will need an account, unity catalog, and workspace administrative authority to complete the upgrade process. To run the installer, you will need to setup `databricks-cli` and a credential, [following these instructions.](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) Additionally, the interim metadata and config data being processed by UCX will be stored into a Hive Metastore database schema generated at install time.
You will need an account, unity catalog, and workspace administrative authority to complete the upgrade process. To run the installer, you will need to set up `databricks-cli` and a credential, [following these instructions.](https://docs.databricks.com/en/dev-tools/cli/databricks-cli.html) Additionally, the interim metadata and config data being processed by UCX will be stored into a Hive Metastore database schema generated at install time.


For questions, troubleshooting or bug fixes, please see your Databricks account team or submit an issue to the [Databricks UCX github repo](https://github.com/databrickslabs/ucx)
For questions, troubleshooting or bug fixes, please see your Databricks account team or submit an issue to the [Databricks UCX GitHub repo](https://github.com/databrickslabs/ucx)

## Installation
### Prerequisites
1. Get trained on UC [[free instructor-led training 2x week](https://customer-academy.databricks.com/learn/course/1683/data-governance-with-unity-catalog?generated_by=302876&hash=4eab6668f83636ba44d109880002b293e8dda6dd)] [[full training schedule](https://files.training.databricks.com/static/ilt-sessions/half-day-workshops/index.html)]
2. You will need a desktop computer, running Windows, MacOS, or Linux; This computer is used to install the UCX toolkit onto the Databricks workspace, the computer will also need:
- Network access to your Databricks Workspace
- Network access to the Internet to retrieve additional Python packages (e.g. PyYAML, databricks-sdk,...) and access github.com
- Python 3.10 or later - [Windows instructions](https://www.python.org/downloads/)
- Databricks CLI with a workspace [configuration profile](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication) for workspace - [instructions](https://docs.databricks.com/en/dev-tools/cli/install.html)
- Your windows computer will need a shell environment (GitBash or ([WSL](https://learn.microsoft.com/en-us/windows/wsl/about))
1. Get trained on UC [[free instructor-led training 2x week]](https://customer-academy.databricks.com/learn/course/1683/data-governance-with-unity-catalog?generated_by=302876&hash=4eab6668f83636ba44d109880002b293e8dda6dd) [[full training schedule]](https://files.training.databricks.com/static/ilt-sessions/half-day-workshops/index.html)
2. You will need a desktop computer running Windows, macOS, or Linux. This computer is used to install the UCX toolkit onto the Databricks workspace and will also need:
- Network access to your Databricks Workspace
- Network access to the Internet to retrieve additional Python packages (e.g. PyYAML, databricks-sdk,...) and access https://github.com
- Python 3.10 or later - [Windows instructions](https://www.python.org/downloads/)
- Databricks CLI with a workspace [configuration profile](https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication) for workspace - [instructions](https://docs.databricks.com/en/dev-tools/cli/install.html)
- Your Windows computer will need a shell environment (Git Bash or [WSL](https://learn.microsoft.com/en-us/windows/wsl/about))
3. Within the Databricks Workspace you will need:
- Workspace administrator access permissions
- The ability for the installer to upload Python Wheel files to DBFS and Workspace FileSystem
- A PRO or Serverless SQL Warehouse
- The Assessment workflow will create a legacy "No Isolation Shared" and a legacy "Table ACL" jobs clusters needed to inventory Hive Metastore Table ACLS
- If your Databricks Workspace relies on an external Hive Metastore (such as glue), make sure to read the [External HMS Document](docs/external_hms_glue.md).
4. [[AWS](https://docs.databricks.com/en/administration-guide/users-groups/best-practices.html)] [[Azure](https://learn.microsoft.com/en-us/azure/databricks/administration-guide/users-groups/best-practices)] [[GCP](https://docs.gcp.databricks.com/administration-guide/users-groups/best-practices.html)] Account level Identity Setup
5. [[AWS](https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html)] [[Azure](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/create-metastore)] [[GCP](https://docs.gcp.databricks.com/data-governance/unity-catalog/create-metastore.html)] Unity Catalog Metastore Created (per region)

### Download & Install

We only support installations and upgrades through [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html), as UCX requires an installation script run to make sure all the necessary and correct configurations are in place.
- Workspace administrator access permissions
- The ability for the installer to upload Python Wheel files to DBFS and Workspace FileSystem
Review comment (Contributor): Some commands also require Account Admin permissions, AFAIK.

- A PRO or Serverless SQL Warehouse
- The Assessment workflow will create legacy "No Isolation Shared" and legacy "Table ACL" job clusters, which are needed to inventory Hive Metastore table ACLs
- If your Databricks Workspace relies on an external Hive Metastore (such as AWS Glue), make sure to read the [External HMS Document](docs/external_hms_glue.md).
4. [[AWS]](https://docs.databricks.com/en/administration-guide/users-groups/best-practices.html) [[Azure]](https://learn.microsoft.com/en-us/azure/databricks/administration-guide/users-groups/best-practices) [[GCP]](https://docs.gcp.databricks.com/administration-guide/users-groups/best-practices.html) Account level Identity Setup
5. [[AWS]](https://docs.databricks.com/en/data-governance/unity-catalog/create-metastore.html) [[Azure]](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/create-metastore) [[GCP]](https://docs.gcp.databricks.com/data-governance/unity-catalog/create-metastore.html) Unity Catalog Metastore Created (per region)
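
The local-machine prerequisites above can be checked with a short script before running the installer. This is a minimal sketch, not part of UCX: the function name `find_prerequisite_problems` is hypothetical, and it only covers the Python version and the presence of a `databricks` executable on `PATH`. Network access and the CLI configuration profile are not checked.

```python
import shutil
import sys


def find_prerequisite_problems(python_version, cli_path):
    """Return human-readable problems; an empty list means both checks passed.

    Hypothetical helper for illustration only, not part of UCX.
    """
    problems = []
    # The prerequisites ask for Python 3.10 or later.
    if tuple(python_version[:2]) < (3, 10):
        problems.append(
            f"Python 3.10 or later is required, found "
            f"{python_version[0]}.{python_version[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if cli_path is None:
        problems.append("Databricks CLI executable 'databricks' was not found on PATH")
    return problems


if __name__ == "__main__":
    found = find_prerequisite_problems(sys.version_info, shutil.which("databricks"))
    for problem in found:
        print(f"MISSING: {problem}")
    if not found:
        print("Local prerequisites look fine")
```

Passing the version and CLI path as arguments keeps the check easy to exercise without changing the local environment.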

#### Installing Databricks CLI on macOS
![macos_install_databricks](docs/macos_1_databrickslabsmac_installdatabricks.gif)

#### Install Databricks CLI via curl on Windows
![winos_install_databricks](docs/winos_1_databrickslabsmac_installdatabricks.gif)

### Download & Install

We only support installations and upgrades through [Databricks CLI](https://docs.databricks.com/en/dev-tools/cli/index.html), as UCX requires an installation script run to make sure all the necessary and correct configurations are in place.

#### Install UCX
Install UCX via Databricks CLI:
```commandline
databricks labs install ucx
```

This will start an interactive installer with a number of configuration questions:
- Select a workspace profile that has been defined in `~/.databrickscfg`
- Provide the name of the inventory database where UCX will store the assessment results. This will be in the workspace `hive_metastore`. Defaults to `ucx`
- Create a new or select an existing SQL warehouse to run assessment dashboards on. The existing warehouse must be Pro or Serverless.
- Configurations for workspace local groups migration:
- Provide a backup prefix. This is used to rename workspace local groups after they have been migrated. Defaults to `db-temp-`
- Select a workspace local groups migration strategy. UCX offers matching by name or external ID, using a prefix/suffix, or using regex.
- Provide a specific list of workspace local groups (or all groups) to be migrated.
Review comment (Contributor): Add a link to the group_name_conflict doc

- Select a Python log level, e.g. `DEBUG`, `INFO`. Defaults to `INFO`
- Provide the level of parallelism, which limits the number of concurrent operations as UCX scans the workspace. Defaults to 8.
- Select whether UCX should connect to the external HMS, if a cluster policy with external HMS is detected. Defaults to no.
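
Since the first prompt asks for a workspace profile defined in `~/.databrickscfg`, it can help to list the available profile names beforehand. The sketch below assumes the file uses the usual INI layout with one section per profile. The helper name `list_profiles` is hypothetical and the script is not part of UCX or the Databricks CLI.

```python
import configparser
from pathlib import Path


def list_profiles(cfg_text):
    """Return the profile (section) names found in .databrickscfg-style text.

    Hypothetical helper for illustration only.
    """
    # configparser normally hides the [DEFAULT] section; pointing
    # default_section at an unused name makes [DEFAULT] show up as a profile.
    # Interpolation is disabled so '%' characters in values are harmless.
    parser = configparser.ConfigParser(
        default_section="__unused__", interpolation=None
    )
    parser.read_string(cfg_text)
    return parser.sections()


if __name__ == "__main__":
    cfg_path = Path.home() / ".databrickscfg"
    if cfg_path.exists():
        for name in list_profiles(cfg_path.read_text()):
            print(name)
    else:
        print(f"{cfg_path} not found; configure the Databricks CLI first")
```

Only section names are read, so credentials stored in the file are never printed.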

After this, UCX will be installed locally and a number of assets will be deployed in the selected workspace. These assets are available under the installation folder, i.e. `/Users/<your user>/.ucx/`

![macos_install_ucx](docs/macos_2_databrickslabsmac_installucx.gif)

#### Upgrade UCX
Verify that UCX is installed
```text
databricks labs installed

Name Description Version
ucx Unity Catalog Migration Toolkit (UCX) <version>
```
Upgrade UCX via Databricks CLI:
```commandline
databricks labs upgrade ucx
```
The prompts will be similar to those described under [Installation](#install-ucx).

![macos_upgrade_ucx](docs/macos_3_databrickslabsmac_upgradeucx.gif)

#### Uninstall UCX
Uninstall UCX via Databricks CLI:
```commandline
databricks labs uninstall ucx
```

Databricks CLI will confirm a few options:
- Whether you want to remove all ucx artefacts from the workspace as well. Defaults to no.
- Whether you want to delete the inventory database in `hive_metastore`. Defaults to no.

![macos_uninstall_ucx](docs/macos_4_databrickslabsmac_uninstallucx.gif)

### Using UCX

After installation, a number of UCX workflows will be available in the workspace. `<installation_path>/README` contains further instructions and explanations of these workflows.
UCX also provides a number of command line utilities accessible via `databricks labs ucx`.

#### Understanding assessment report

After UCX assessment workflow is executed, the assessment dashboard will be populated with findings and common recommendations.
[This guide](docs/assessment.md) talks about them in more detail.

#### Synchronising workspace info
Use this command to upload the workspace config to all workspaces in the account where UCX is installed. UCX will prompt you to select an account profile that has been defined in `~/.databrickscfg`.

Review comment (Contributor): What will this allow the user to do? Is it a UCX cross-workspace installation?

```commandline
databricks labs ucx sync-workspace-info
```

#### Saving AWS instance profiles
Use to identify all instance profiles in the workspace, and map their access to S3 buckets. This requires `awscli` to be installed and configured.

Review comment (Contributor): Command is missing

## Star History

[![Star History Chart](https://api.star-history.com/svg?repos=databrickslabs/ucx&type=Date)](https://star-history.com/#databrickslabs/ucx)

## Project Support
Please note that all projects in the /databrickslabs github account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.
Please note that all projects in the databrickslabs GitHub account are provided for your exploration only, and are not formally supported by Databricks with Service Level Agreements (SLAs). They are provided AS-IS, and we do not make any guarantees of any kind. Please do not submit a support ticket relating to any issues arising from the use of these projects.

Any issues discovered through the use of this project should be filed as GitHub Issues on the Repo. They will be reviewed as time permits, but there are no formal SLAs for support.
20 changes: 10 additions & 10 deletions docs/group_name_conflict.md
@@ -11,18 +11,18 @@ During the installation process we pose the following question:
"Do you need to rename the workspace groups to match the account groups' name?"


If the answer is "Yes" a follow up question will be:
If the answer is "Yes" a follow-up question will be:
<br/>
"Choose How to rename the workspace groups:"
1. Apply a Prefix
2. Apply a Suffix
3. Use Regular Expression Substitution
4. Use Regular Expression to extract a value from the account and the workspace
4. Map using External Group ID
5. Map using External Group ID

The user then inputs the Prefix/Suffix/Regular Expression.
The install process will validate the regular expression.
The install process will register the selection as regular expression in the configuration YAML file.
The installation process will validate the regular expression.
The installation process will register the selection as a regular expression in the configuration YAML file.

We introduce 3 more parameters to the configuration yaml and the group manager:
- workspace_group_regex
@@ -34,9 +34,9 @@ When we run the migration process the regular expression substitution will be ap

Group Translation Scenarios:

| Scenario | User Input | workspace_group_regex | workspace_group_replace | account_group_regex | Example |
|----------|-----------------------------------------------------------|-----------------------|-------------------------|---------------------|----------------------------------------|
| Prefix | prefix: [Prefix] | ^ | [Prefix] | [EMPTY] | data_engineers --> prod_data_engineers |
| Suffix | suffix: [Prefix] | $ | [Suffix] | [EMPTY] | data_engineers --> data_engineers_prod |
| Substitution | Search Regex: [Regex]<br/>Replace Text:[Replacement_Text] | [WS_Regex] | [ [Replacement_Text] | [Empty] | corp_tech_data_engineers --> prod_data_engineers |
| Partial Lookup | Workspace Regex: [WS_Regex]<br/> Account Regex: [Acct Regex] | [WS_Regex]| [Empty] | [Acct_Regex] | data_engineers(12345) --> data_engs(12345) |
| Scenario       | User Input                                                   | workspace_group_regex | workspace_group_replace | account_group_regex | Example                                          |
|----------------|--------------------------------------------------------------|-----------------------|-------------------------|---------------------|--------------------------------------------------|
| Prefix         | prefix: [Prefix]                                             | ^                     | [Prefix]                | [Empty]             | data_engineers --> prod_data_engineers           |
| Suffix         | suffix: [Suffix]                                             | $                     | [Suffix]                | [Empty]             | data_engineers --> data_engineers_prod           |
| Substitution   | Search Regex: [Regex]<br/>Replace Text: [Replacement_Text]   | [WS_Regex]            | [Replacement_Text]      | [Empty]             | corp_tech_data_engineers --> prod_data_engineers |
| Partial Lookup | Workspace Regex: [WS_Regex]<br/>Account Regex: [Acct_Regex]  | [WS_Regex]            | [Empty]                 | [Acct_Regex]        | data_engineers(12345) --> data_engs(12345)       |
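
The scenarios in the table can be reproduced with Python's standard `re` module. This is a rough sketch that assumes the translation behaves like a plain `re.sub` for the first three scenarios and like a regex extraction on both names for the partial lookup. The function names are hypothetical and this is not the UCX group manager's actual code. The `prod_` and `_prod` values and the `^corp_tech_` pattern are illustrative inputs chosen to match the table's examples.

```python
import re


def rename_workspace_group(name, workspace_group_regex, workspace_group_replace):
    """Apply a regex substitution to a workspace group name (illustrative only)."""
    return re.sub(workspace_group_regex, workspace_group_replace, name)


def extract_key(name, regex):
    """Return the first capture group of the match, or the whole match if the
    pattern has no groups; None when the pattern does not match."""
    match = re.search(regex, name)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


# Prefix: '^' matches the empty string at the start of the name.
print(rename_workspace_group("data_engineers", "^", "prod_"))
# prints: prod_data_engineers

# Suffix: '$' matches the empty string at the end of the name.
print(rename_workspace_group("data_engineers", "$", "_prod"))
# prints: data_engineers_prod

# Substitution: replace a leading 'corp_tech_' with 'prod_'.
print(rename_workspace_group("corp_tech_data_engineers", "^corp_tech_", "prod_"))
# prints: prod_data_engineers

# Partial lookup: both names yield the same extracted ID, so they pair up.
id_pattern = r"\(([0-9]+)\)"
ws_key = extract_key("data_engineers(12345)", id_pattern)
acct_key = extract_key("data_engs(12345)", id_pattern)
print(ws_key == acct_key)
# prints: True
```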