Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE]: Connect _migrate_table_create_ctas to migrate-tables workflow #332

Closed
2 of 5 tasks
Tracked by #103
nfx opened this issue Sep 29, 2023 · 4 comments · Fixed by #1621
Closed
2 of 5 tasks
Tracked by #103

[FEATURE]: Connect _migrate_table_create_ctas to migrate-tables workflow #332

nfx opened this issue Sep 29, 2023 · 4 comments · Fixed by #1621
Assignees
Labels
feat/workflow triggered as a Databricks Job managed by UCX migrate/managed go/uc/upgrade Upgrade Managed Tables and Jobs

Comments

@nfx
Copy link
Collaborator

nfx commented Sep 29, 2023

most of the code is done in:

Diff:

def _migrate_table_create_ctas(self, src_table: Table, rule: Rule, grants: list[Grant] | None = None):
table_migrate_sql = self._get_create_ctas_sql(src_table, rule)
logger.debug(f"Migrating table (Create Like) {src_table.key} to using SQL query: {table_migrate_sql}")
self._backend.execute(table_migrate_sql)
self._backend.execute(src_table.sql_alter_to(rule.as_uc_table_key))
self._backend.execute(src_table.sql_alter_from(rule.as_uc_table_key, self._ws.get_workspace_id()))
return self._migrate_acl(src_table, rule, grants)
def _get_create_ctas_sql(self, src_table: Table, rule: Rule) -> str:
create_sql = (
f"CREATE TABLE IF NOT EXISTS {escape_sql_identifier(rule.as_uc_table_key)} "
f"AS SELECT * FROM {src_table.safe_sql_key}"
)
return create_sql

Tasks:

  • Copy this table using CTAS
  • Copy the Metadata with CREATE TABLE ... LIKE ...
  • Skip tables as needed based on size threshold or an exception list
  • Update the source table with upgraded_to property
  • Link it to the workflow as task
@nfx nfx added enhancement New feature or request migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step labels Sep 29, 2023
@nfx nfx changed the title Migrate non-Parquet/non-Delta tables from mounted locations into external locations Migrate MANAGED and EXTERNAL non-Parquet/non-Delta tables Sep 29, 2023
@nfx nfx added the migrate/managed go/uc/upgrade Upgrade Managed Tables and Jobs label Sep 29, 2023
@pohlposition
Copy link
Contributor

Can we split issues between MANAGED (step:managed tables) and EXTERNAL (step:sync tables).

This maps to go/uc/upgrade

@pohlposition pohlposition added this to the non-prioritised milestone Oct 2, 2023
@nfx nfx added this to UCX Oct 3, 2023
@nfx nfx moved this to Triage in UCX Oct 3, 2023
@pohlposition pohlposition changed the title Migrate MANAGED and EXTERNAL non-Parquet/non-Delta tables Migrate MANAGED non-Parquet/non-Delta tables Oct 5, 2023
@pohlposition
Copy link
Contributor

I'm changing the name of this issue from:
Migrate MANAGED and EXTERNAL non-Parquet/non-Delta tables

to

Migrate MANAGED non-Parquet/non-Delta tables

because:

  1. It separates the workstreams to align to upgrade steps
  2. The focus on cloning is applicable to managed tables, but not external tables.

@pohlposition
Copy link
Contributor

pohlposition commented Oct 5, 2023

One additional consideration to make with Managed tables, which is different to external tables, is that it needs to be executed as much as possible as an atomic transaction:

BEGIN TRANSACTION

  1. "Lock old table to prevent writes". (maybe stop a streaming job or something else)
  2. Clone Table
  3. Update WRITE references to new table
  4. Update READ references to new table
  5. "DEPRECATE" old managed table - Revoke permissions; maybe rename it with a prefix or something
    COMMIT TRANSACTION

At some point, after an acceptable retention period, you will want to purge that old table's data too.

This is to prevent data from being written to the old table while the upgrade is being performed.

Additionally, you want to prevent data from being written to the old table after you have upgraded to the new table.

@pohlposition
Copy link
Contributor

One option we may want to provide customers is to convert the managed table into an external table first (still using HMS).

If that completes successfully, then that external table can be upgraded more easily using our external table upgrade process.

External tables are easier to upgrade because you can upgrade their metadata independent of the job. You don't need to worry about the atomic transaction so much.

@pohlposition pohlposition modified the milestones: non-prioritised, 3 months Oct 5, 2023
@nfx nfx moved this from Triage to Quarter Backlog in UCX Oct 5, 2023
@pohlposition pohlposition removed the migrate/external go/uc/upgrade SYNC EXTERNAL TABLES step label Oct 28, 2023
@nfx nfx moved this from Quarter Backlog to Triage in UCX Dec 6, 2023
@nfx nfx moved this from Triage to Month Backlog in UCX Dec 6, 2023
@pohlposition pohlposition moved this from Month Backlog to Active Backlog in UCX Apr 8, 2024
@pohlposition pohlposition moved this from Active Backlog to Month Backlog in UCX Apr 8, 2024
@nfx nfx unassigned FastLee Apr 15, 2024
@nfx nfx added feat/workflow triggered as a Databricks Job managed by UCX and removed enhancement New feature or request labels Apr 22, 2024
@nfx nfx removed this from the 3 months milestone Apr 22, 2024
@nfx nfx changed the title Migrate MANAGED non-Parquet/non-Delta tables [FEATURE]: Connect _migrate_table_create_ctas to migrate-tables workflow Apr 23, 2024
@HariGS-DB HariGS-DB self-assigned this May 1, 2024
github-merge-queue bot pushed a commit that referenced this issue May 7, 2024
…ate_table workflow (#1621)

## Changes

There are 3 workflows for migrating tables.

migrate_tables: covers EXTERNAL_SYNC, DBFS_ROOT_DELTA, VIEW
migrate-external-hiveserde-tables-in-place-experimental: covers
EXTERNAL_HIVESERDE and VIEW
migrate-external-tables-ctas: covers EXTERNAL_NO_SYNC,
EXTERNAL_HIVESERDE and VIEW

- This change adds the remaining scenario DBFS_ROOT_NON_DELTA to the
migrate_tables workflow
 - Also adds AclMigrationWhat.PRINCIPAL to the ACL migration strategy



Resolves #332 

### Functionality 

- [ ] added relevant user documentation
- [ ] added new CLI command
- [ ] modified existing command: `databricks labs ucx ...`
- [ ] added a new workflow
- [X] modified existing workflow: `...`
- [ ] added a new table
- [ ] modified existing table: `...`

### Tests
<!-- How is this tested? Please see the checklist below and also
describe any other relevant tests -->

- [ ] manually tested
- [X] added unit tests
- [X] added integration tests
- [ ] verified on staging environment (screenshot attached)

---------

Co-authored-by: vuong-nguyen <[email protected]>
Co-authored-by: Vuong <[email protected]>
@nfx nfx closed this as completed in #1621 May 7, 2024
@github-project-automation github-project-automation bot moved this from Month Backlog to Archive in UCX May 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feat/workflow triggered as a Databricks Job managed by UCX migrate/managed go/uc/upgrade Upgrade Managed Tables and Jobs
Projects
Archived in project
4 participants