The Hive Metastore migration process will upgrade the following assets:
- Tables on DBFS root
- External Tables
- Views
We don't expect this to be a "one and done" process. Table migration is typically iterative and may require several runs. The migration process is implemented as a workflow that can be invoked multiple times; each run upgrades the tables it can and reports the ones it cannot.
We keep track of the migration and provide the user with continuous feedback on the progress and status of the upgrade. This feedback is presented in the migration dashboard:
- One view per workspace summarizing the full table inventory and various counters.
- By default, we create a single catalog per HMS (`<prefix (optional)>_<workspace_name>`); this happens at the account level.
- The workspace name would be set up as part of the installation at the account level.
- Consider other mappings of environments/databases to catalogs/databases.
- The user will be able to specify a default catalog for the workspace.
- We have to annotate the status of assets that were migrated.
- We will roll up the migration status to the workspace/account level, showing the migration state.
- Aggregation of migration failures.
- View of object migration:

| Object Type | Object ID                             | Migrated | Migration Failures               |
|-------------|---------------------------------------|----------|----------------------------------|
| View        | hive_metastore.finance.transaction_vw | 1        | []                               |
| Table       | hive_metastore.finance.transactions   | 0        | ["Table uses SERDE: csv"]        |
| Table       | hive_metastore.finance.accounts       | 0        | []                               |
| Cluster     | klasd-kladef-01265                    | 0        | ["Uses Passthru authentication"] |
- By default, the target is `target_catalog/database_name`.
- The assessment will generate a mapping file/table. The file will be in CSV format (a reading sketch appears below):

| Source Database | Target Catalog | Target Database |
|-----------------|----------------|-----------------|
| finance         | de_dev         | finance         |
| hr              | de_dev         | human_resources |
| sales           | ucx-dev_ws     | sales           |

- The user can download the mapping file, override the targets, and upload it to the workspace `.ucx` folder.
- By default, we copy the table content (CTAS).
- Allow skipping individual tables/databases.
- Explore sizing tables or another threshold (recursively count bytes).
- By default, we copy the table into a managed table/managed location.
- Allow overriding the target to an external table.
- We should migrate ACLs for the tables where applicable and highlight cases where we can't (no direct translation/conflicts).
- We should consider automating ACLs based on Instance Profiles / Service Principals and other legacy security mechanisms.
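To make the mapping-file format concrete, here is a minimal sketch of reading such a CSV into simple mapping records. The column names come from the example table above; the file name `mapping.csv` and the record field names are illustrative assumptions, not the actual implementation.

```python
import csv

def load_mapping(path: str) -> list[dict[str, str]]:
    """Read the mapping CSV into simple source-to-target records."""
    with open(path, newline="") as f:
        return [
            {
                "src_database": row["Source Database"],
                "dst_catalog": row["Target Catalog"],
                "dst_database": row["Target Database"],
            }
            for row in csv.DictReader(f)
        ]

# load_mapping("mapping.csv") on the table above would yield, for example:
# [{'src_database': 'finance', 'dst_catalog': 'de_dev', 'dst_database': 'finance'}, ...]
```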
- By default, we copy the table content (CTAS)
- Allow skipping individual tables/databases
- Explore sizing tables or another threshold (recursively count bytes)
- By default, we copy the table into a managed table/managed location
- Allow overriding target to an external table
- Allow an exception list in case we want to skip certain tables
- Verify that we have the external locations for these tables
- Automate creation of External Locations (Future)
- Use SYNC to upgrade these tables "in place", using the default or overridden catalog.database destination (illustrative SQL follows below).
- Update the source table with the "upgraded_to" property
- Copy this table using CTAS or Deep Clone (consider bringing history along).
- Copy the metadata
- Skip tables as needed based on a size threshold or an exception list
- Update the source table with the "upgraded_to" property
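The following sketch shows the kind of SQL an in-place upgrade could issue: a `SYNC` of the external table into the target catalog/schema, followed by stamping the source with an `upgraded_to` property. The statements are illustrative; the exact SQL the migration workflow generates may differ.

```python
def sync_and_mark_sql(src_key: str, dst_key: str) -> list[str]:
    """Illustrative SQL for an in-place upgrade of an external table."""
    return [
        # SYNC creates/refreshes the UC table pointing at the same external location
        f"SYNC TABLE {dst_key} FROM {src_key}",
        # Record where the source table was upgraded to
        f"ALTER TABLE {src_key} SET TBLPROPERTIES ('upgraded_to' = '{dst_key}')",
    ]

for statement in sync_and_mark_sql("hive_metastore.finance.transactions",
                                   "de_dev.finance.transactions"):
    print(statement)
```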
- Make a "best effort" attempt to upgrade view
- Create a view in the new location
- Upgrade table reference to the new tables (based on the upgraded_to table property)
- Handle nested views
- Handle or highlight other cases (functions/storage references/ETC)
- Create an exception list with views failures
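As a naive illustration of the view upgrade step (not the actual implementation), the sketch below rewrites a view definition by substituting each source table reference with the target recorded for it; real view migration also has to handle nested views, functions, and other references.

```python
def rewrite_view_definition(view_ddl: str, upgraded_to: dict[str, str]) -> str:
    """Replace old HMS table references with their UC targets (naive text substitution)."""
    for old_name, new_name in upgraded_to.items():
        view_ddl = view_ddl.replace(old_name, new_name)
    return view_ddl

original = "SELECT * FROM hive_metastore.finance.transactions WHERE amount > 0"
mapping = {"hive_metastore.finance.transactions": "de_dev.finance.transactions"}
print(rewrite_view_definition(original, mapping))
```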
- We should migrate functions where possible
- Address incompatibilities
- HMS instances on multiple workspaces may point to the same assets. We need to dedupe upgrades (a detection sketch follows after this list).
- Allow running the assessment on all of the account's workspaces or on a group of workspaces.
- We have to test on Glue and other external metastores
- Create an exception list at the account level; the list should contain:
  - Tables that show up in more than one workspace (pointing to the same cloud storage location)
  - Tables that show up in more than one workspace with different metadata
  - Tables that show up in more than one workspace with different ACLs
- Addressing table conflicts/duplications requires special processing; we have the following options:
  - Define a "master" and create derivative objects as views
  - Flag and skip the dupes
  - Duplicate the data and create dupes
- Consider upgrading one workspace at a time. Highlight conflicts with prior upgrades.
- Allow workspace admins to upgrade more than one workspace.
- How do we manage/surface the potential cost of the assessment run in the case of many workspaces?
- How do we handle conflicts between workspaces?
- What mechanism do we use to map source databases to target databases?
- How do we list workspaces in Azure/AWS?
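As referenced in the dedupe item above, here is a hedged sketch of detecting tables that show up in more than one workspace by grouping inventory rows on their cloud storage location; the `InventoryRow` shape is an assumption for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class InventoryRow:
    workspace: str        # workspace where the table was assessed
    full_name: str        # hive_metastore.<database>.<table>
    location: str | None  # cloud storage path backing the table

def find_cross_workspace_duplicates(rows: list[InventoryRow]) -> dict[str, list[InventoryRow]]:
    """Return storage locations referenced by tables in two or more workspaces."""
    by_location: dict[str, list[InventoryRow]] = defaultdict(list)
    for row in rows:
        if row.location:
            by_location[row.location].append(row)
    return {
        location: entries
        for location, entries in by_location.items()
        if len({entry.workspace for entry in entries}) > 1
    }
```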
The code provided is a Python module that defines a `Grant` dataclass and a `GrantsCrawler` class. The `Grant` dataclass represents a grant of privileges in a database system, with attributes for the principal, action type, catalog, database, table, view, UDF, and flags for "any file" and "anonymous function" grants. The `GrantsCrawler` class is a crawler that fetches grants for databases, tables, views, UDFs, and anonymous functions in a Hive metastore. It uses a `TablesCrawler` and a `UdfsCrawler` to fetch table and UDF information, respectively. The `GrantsCrawler` class provides methods for fetching grants based on different parameters and returning them as an iterable of `Grant` objects. It also provides methods for getting grants for a specific table or schema. The code includes a `_type_and_key` method that normalizes the input parameters and returns a tuple of the object type and key, which is used to fetch grants for the specified object. The code also includes methods for generating SQL statements to grant and revoke privileges in Hive and Unity Catalog (UC) systems.
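A minimal sketch of what such a `Grant` record could look like, based only on the attributes described above; the field names and types are assumptions, not the exact definition in the module.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    principal: str                    # user, group, or service principal
    action_type: str                  # e.g. "SELECT", "MODIFY", "OWN"
    catalog: str | None = None
    database: str | None = None
    table: str | None = None
    view: str | None = None
    udf: str | None = None
    any_file: bool = False            # grant on ANY FILE
    anonymous_function: bool = False  # grant on ANONYMOUS FUNCTION
```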
The module includes two classes, `ExternalLocations` and `Mounts`, which inherit from `CrawlerBase`. `ExternalLocations` is a class for crawling and managing external locations used by tables in a Databricks workspace. It has methods for creating a list of external locations based on tables in a given schema and a method for generating Terraform definitions for any missing external locations. The class has an `_external_locations` method that filters and processes the external locations based on certain conditions.

`Mounts` is a class for managing mounts in a Databricks workspace. It has methods for listing and deduplicating mounts, as well as a method for creating a snapshot of the current mounts. The `_deduplicate_mounts` method removes any duplicate mounts based on their name and source.
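A small sketch of the deduplication idea described for `_deduplicate_mounts`: keep one entry per (name, source) pair. The `Mount` fields shown are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Mount:
    name: str    # e.g. "/mnt/landing"
    source: str  # e.g. "s3://bucket/landing"

def deduplicate_mounts(mounts: list[Mount]) -> list[Mount]:
    """Keep the first occurrence of each (name, source) pair, preserving order."""
    seen: set[tuple[str, str]] = set()
    unique: list[Mount] = []
    for mount in mounts:
        key = (mount.name, mount.source)
        if key not in seen:
            seen.add(key)
            unique.append(mount)
    return unique
```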
The module includes two dataclasses, `Rule` and `TableToMigrate`, which encapsulate information about the source and target tables for migration. The `Rule` dataclass includes information about the source and target catalog, schema, and table names, as well as a method for generating the unique key for the target table in Unity Catalog (UC) and in the Hive Metastore (HMS). The `TableToMigrate` dataclass includes a `Table` object representing the source table and a `Rule` object representing the migration rule for that table.

At the heart of the module is the `TableMapping` class, which is the main class for managing table mappings. The `TableMapping` class includes several methods for managing the table mappings, such as loading and saving the mappings to a file, skipping tables and schemas, and checking whether a table is already migrated or marked to be skipped. The `TableMapping` class is initialized with an `Installation` object, a `WorkspaceClient` object, and a `SqlBackend` object, which are used to interact with Unity Catalog, the workspace, and to execute SQL queries.
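A sketch of a `Rule`-like dataclass with the fields and key helpers described above; the attribute names are illustrative assumptions rather than the exact definition in the module.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    catalog_name: str  # target UC catalog
    src_schema: str    # source HMS database
    dst_schema: str    # target UC schema
    src_table: str
    dst_table: str

    @property
    def as_uc_table_key(self) -> str:
        """Unique key of the target table in Unity Catalog."""
        return f"{self.catalog_name}.{self.dst_schema}.{self.dst_table}"

    @property
    def as_hms_table_key(self) -> str:
        """Unique key of the source table in the Hive Metastore."""
        return f"hive_metastore.{self.src_schema}.{self.src_table}"
```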
The `TablesMigrate` class is designed for migrating tables from one schema to another within a Databricks workspace. This class requires instances of `TablesCrawler`, `WorkspaceClient`, `SqlBackend`, and `TableMapping` as inputs. The `migrate_tables` method is responsible for migrating tables and takes an optional argument `what` to filter tables based on their type. This method internally calls the `_migrate_table` method, which migrates the actual table and determines the appropriate migration method based on the table's type. The `_migrate_external_table`, `_migrate_dbfs_root_table`, and `_migrate_view` methods are used to migrate external tables, DBFS root tables, and views, respectively.

The `_init_seen_tables`, `_table_already_upgraded`, `_get_tables_to_revert`, and `_revert_migrated_table` methods are used for managing the state of the migration process. The `_init_seen_tables` method initializes the list of tables that have been seen during the migration process. The `_table_already_upgraded` method checks whether a table has already been upgraded. The `_get_tables_to_revert` method retrieves the list of tables that can be reverted. The `_revert_migrated_table` method reverts the migration of a table. The `is_upgraded` method checks whether a table has been upgraded. The `print_revert_report` method generates a report of the tables that can be reverted.
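A simplified sketch of the dispatch idea behind `_migrate_table`: pick a migration strategy from the table's type and honor the optional `what` filter. The `What` values and strategy names here are assumptions for illustration, not the actual implementation.

```python
from enum import Enum, auto

class What(Enum):
    EXTERNAL_SYNC = auto()
    DBFS_ROOT_DELTA = auto()
    VIEW = auto()

def pick_strategy(table_what: What, what: What | None = None) -> str | None:
    """Return the strategy that would handle the table, or None if filtered out."""
    if what is not None and table_what is not what:
        return None
    return {
        What.VIEW: "_migrate_view",
        What.DBFS_ROOT_DELTA: "_migrate_dbfs_root_table",
        What.EXTERNAL_SYNC: "_migrate_external_table",
    }[table_what]

assert pick_strategy(What.VIEW) == "_migrate_view"
assert pick_strategy(What.VIEW, what=What.EXTERNAL_SYNC) is None
```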
The `TableMove` class is a newly developed feature that enables moving or aliasing tables and views from one schema to another within UC. This class requires an instance of `WorkspaceClient` and `SqlBackend` as inputs and provides two primary methods: `move_tables` and `alias_tables`. The `move_tables` method moves tables to a new schema, while the `alias_tables` method creates aliases of tables and views in a different schema.

The `_move_table`, `_alias_table`, and `_move_view` methods perform the actual movement, aliasing, and recreation of the table or view in the destination schema, taking into account any dependencies or permissions associated with the object. The `_reapply_grants` method reapplies the grants on the migrated table or view, ensuring that the necessary permissions are maintained. The `_recreate_table` and `_recreate_view` methods recreate the table or view in the destination schema, including any dependencies or permissions associated with the object.
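To contrast the two primary operations, the sketch below shows illustrative SQL only (assumptions, not the statements `TableMove` actually issues): a move copies the object into the destination schema and drops the source, while an alias leaves the source in place and exposes it through a view.

```python
def move_table_sql(src: str, dst: str) -> list[str]:
    """Illustrative move: clone into the destination schema, then drop the source."""
    return [
        f"CREATE TABLE {dst} DEEP CLONE {src}",  # or CTAS, depending on the table
        f"DROP TABLE {src}",
    ]

def alias_table_sql(src: str, dst: str) -> list[str]:
    """Illustrative alias: a view in the destination schema over the source table."""
    return [f"CREATE VIEW {dst} AS SELECT * FROM {src}"]

print(move_table_sql("de_dev.finance.accounts", "de_prod.finance.accounts"))
print(alias_table_sql("de_dev.finance.accounts", "de_prod.finance.accounts"))
```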
The Table Size Crawler is a new feature of the data crawler system that calculates the size of tables in a Hive Metastore. The `TableSizeCrawler` class inherits from `CrawlerBase` and is initialized with a SQL execution backend and a schema name. This class uses the `TablesCrawler` class to obtain a snapshot of tables and then iterates over them to calculate the size of each table using the `_safe_get_table_size` method, which queries the Spark SQL engine.

The `TableSizeCrawler` class has several methods, including `snapshot`, `_try_load`, and `_crawl`. The `snapshot` method returns a list of `TableSize` objects representing the snapshot of tables, filtered to include only those with a non-null size. The `_try_load` method tries to load table information from the database and raises a `TABLE_OR_VIEW_NOT_FOUND` error if the table cannot be found. The `_crawl` method crawls and lists tables using the `tables_crawler` object and calculates the size of DBFS root tables, skipping tables that are not of type `TABLE` or are not DBFS root tables.
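A hedged sketch of a "safe" size lookup in the spirit of `_safe_get_table_size`: ask Spark for the size and swallow lookup failures. It assumes an active `SparkSession`; `DESCRIBE DETAIL` is one way to read `sizeInBytes` for a Delta table and is not necessarily what the crawler uses.

```python
from pyspark.sql import SparkSession

def safe_get_table_size(spark: SparkSession, table_key: str) -> int | None:
    """Return the table size in bytes, or None when it cannot be determined."""
    try:
        detail = spark.sql(f"DESCRIBE DETAIL {table_key}").collect()[0]
        return detail["sizeInBytes"]
    except Exception:
        # e.g. TABLE_OR_VIEW_NOT_FOUND, or a format without size metadata
        return None
```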
The `TablesCrawler` is designed for crawling and listing tables within the Hive Metastore. It can fetch detailed information about each table, including the table's name, external location, and storage format. This information can be used to better understand the structure and contents of the tables in the Databricks workspace.

The `Table` class has the following methods:
- `is_delta`: Returns `True` if the table format is DELTA, else `False`.
- `key`: Returns the key of the table in the format `<catalog>.<database>.<name>`.
- `kind`: Returns the kind of the table, either 'TABLE' or 'VIEW'.
- `sql_alter_to`: Returns an SQL command to alter the table to a target table key.
- `sql_alter_from`: Returns an SQL command to alter the table from a target table key with a workspace ID.
- `sql_unset_upgraded_to`: Returns an SQL command to unset the `upgraded_to` property of the table.
- `is_dbfs_root`: Returns `True` if the table location is in the DBFS root directory, else `False`.
- `is_format_supported_for_sync`: Returns `True` if the table format is supported for synchronization, else `False`.
- `is_databricks_dataset`: Returns `True` if the table location is a Databricks dataset, else `False`.
- `what`: Returns the type of the table based on its properties.
- `sql_migrate_external`: Returns an SQL command to migrate an external table.
- `sql_migrate_dbfs`: Returns an SQL command to migrate a table located in DBFS.
- `sql_migrate_view`: Returns an SQL command to migrate a view.
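A small sketch of a few of the helpers listed above, with assumed field names; the real `Table` dataclass may differ.

```python
from dataclasses import dataclass

@dataclass
class Table:
    catalog: str
    database: str
    name: str
    table_format: str          # e.g. "DELTA", "PARQUET", "CSV"
    location: str | None = None

    @property
    def key(self) -> str:
        """Key of the table in the format <catalog>.<database>.<name>."""
        return f"{self.catalog}.{self.database}.{self.name}"

    @property
    def is_delta(self) -> bool:
        """True if the table format is DELTA, else False."""
        return self.table_format.upper() == "DELTA"

    @property
    def is_dbfs_root(self) -> bool:
        """Assumption: any dbfs:/ location outside dbfs:/mnt/ counts as DBFS root."""
        if not self.location:
            return False
        return self.location.startswith("dbfs:/") and not self.location.startswith("dbfs:/mnt/")
```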