[590] Add Hudi HMS Catalog Sync Implementation #648

Merged
merged 1 commit into from
Feb 19, 2025

@vamsikarnika vamsikarnika commented Feb 13, 2025

Important Read

  • Please ensure the GitHub issue is mentioned at the beginning of the PR

What is the purpose of the pull request

Adds support for syncing Hudi tables to the HMS catalog.

Brief change log

  • Added an implementation of HudiHMSCatalogTableBuilder to create and refresh Hudi tables synced to HMS
  • Added an implementation of HMSCatalogPartitionSyncOperations to sync Hudi partitions to HMS

Verify this pull request

This change added tests and can be verified as follows:

  • Added unit tests
  • Manually verified the change by running a job locally.

@vamsikarnika vamsikarnika force-pushed the hudi_hms_catalog_sync_v2 branch from c778a19 to 619672e Compare February 13, 2025 11:09
@@ -43,7 +43,7 @@
import org.apache.xtable.model.storage.CatalogType;
import org.apache.xtable.model.storage.TableFormat;

-public class HMSCatalogSyncClientTestBase {
+public class HMSCatalogSyncTestBase {
Author

Renamed this class since it is used not only by the tests for the HMS client classes but also by the partition sync classes. This naming convention also matches GlueCatalogSyncTestBase.

@@ -45,6 +45,7 @@ datasets:
- sourceCatalogTableIdentifier:
tableIdentifier:
hierarchicalId: "source-database-1.source-1"
partitionSpec: "cs_sold_date_sk:VALUE"
Contributor

If the source is not Hudi, then this shouldn't be required, right?

Author

@vamsikarnika vamsikarnika Feb 17, 2025

We also need partitionSpec when the target is HUDI, since during catalog sync we convert the target table to a source table before syncing to catalogs.
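
To make the partitionSpec value above concrete, here is a minimal sketch of how a string such as "cs_sold_date_sk:VALUE" could be split into a field name and a transform type. It assumes the spec is a comma-separated list of `field:TYPE` entries with an optional third `:format` segment. `PartitionFieldSpec` and `PartitionSpecSketch` are hypothetical names for illustration only; this is not the parser used in the PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one parsed partition field (not an XTable class). */
final class PartitionFieldSpec {
  final String sourceField;
  final String transformType;
  final String format; // null when the entry has no format segment

  PartitionFieldSpec(String sourceField, String transformType, String format) {
    this.sourceField = sourceField;
    this.transformType = transformType;
    this.format = format;
  }
}

/** Hypothetical parser; assumes "field:TYPE[:format]" entries separated by commas. */
final class PartitionSpecSketch {
  static List<PartitionFieldSpec> parse(String partitionSpec) {
    List<PartitionFieldSpec> fields = new ArrayList<>();
    if (partitionSpec == null || partitionSpec.trim().isEmpty()) {
      return fields; // an absent spec is treated as an unpartitioned table
    }
    for (String entry : partitionSpec.split(",")) {
      // Limit of 3 keeps any ':' inside the format segment intact.
      String[] parts = entry.trim().split(":", 3);
      if (parts.length < 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
        throw new IllegalArgumentException("Expected field:TYPE[:format], got: " + entry);
      }
      String format = parts.length == 3 ? parts[2] : null;
      fields.add(new PartitionFieldSpec(parts[0], parts[1], format));
    }
    return fields;
  }
}
```

With the value from the config above, `PartitionSpecSketch.parse("cs_sold_date_sk:VALUE")` yields a single entry whose field is `cs_sold_date_sk` and whose transform type is `VALUE`.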

tableProperties.put(HUDI_METADATA_CONFIG, "true");
Map<String, String> sparkTableProperties =
    HudiCatalogTableUtils.getSparkTableProperties(
        partitionFields, "", hmsCatalogConfig.getSchemaLengthThreshold(), schema);
Contributor

The sparkVersion is passed as an empty string; should it have a value?

Author

sparkVersion is an optional metadata field. It's not actually used by Spark when reading the table. Since we don't have a clear way of fetching the Spark version, I have set it to empty for now.
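
As an illustration of why an empty sparkVersion can be harmless, the sketch below builds a Spark-style table property map in which the version entry is simply omitted when the string is empty, and the schema JSON is split into chunks no longer than a length threshold. `SparkTablePropertiesSketch` is a hypothetical class, the property key names are assumptions based on the convention Spark data source tables commonly use, and partition column properties are left out. It is not the body of `HudiCatalogTableUtils.getSparkTableProperties`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; key names are assumptions, not taken from the PR. */
final class SparkTablePropertiesSketch {
  static Map<String, String> build(
      String sparkVersion, int schemaLengthThreshold, String schemaJson) {
    if (schemaLengthThreshold <= 0) {
      throw new IllegalArgumentException("schemaLengthThreshold must be positive");
    }
    Map<String, String> props = new LinkedHashMap<>();
    props.put("spark.sql.sources.provider", "hudi");

    // The version is optional metadata: skip the entry when it is empty.
    if (sparkVersion != null && !sparkVersion.isEmpty()) {
      props.put("spark.sql.create.version", sparkVersion);
    }

    // Split the schema so that no single property value exceeds the threshold.
    int numParts = 0;
    for (int start = 0; start < schemaJson.length(); start += schemaLengthThreshold) {
      int end = Math.min(start + schemaLengthThreshold, schemaJson.length());
      props.put("spark.sql.sources.schema.part." + numParts, schemaJson.substring(start, end));
      numParts++;
    }
    props.put("spark.sql.sources.schema.numParts", String.valueOf(numParts));
    return props;
  }
}
```

Calling `build("", threshold, schemaJson)` produces the same map as a call with a real version, minus the single version entry.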

private CatalogPartitionSyncOperations mockHMSPartitionSyncOperations;

void setupCommonMocks() {
mockHMSPartitionSyncOperations =
Contributor

I still find the naming confusing here. Why do we call the actual instance the mock?

@vamsikarnika vamsikarnika force-pushed the hudi_hms_catalog_sync_v2 branch from c3543fa to 95f0329 Compare February 18, 2025 06:23
Comment on lines 90 to 91
CatalogPartitionSyncOperations hmsCatalogPartitionSyncOperations =
    new HMSCatalogPartitionSyncOperations(metaStoreClient, hmsCatalogConfig);
Contributor

You can avoid this line by passing new HMSCatalogPartitionSyncOperations(metaStoreClient, hmsCatalogConfig) inline. Is this called only during _init? If so, it should be part of the same function to avoid misuse of getPartitionSyncTool.

Author

Yes, this is only used from the _init method now. I've moved it to the same function.

@vamsikarnika vamsikarnika force-pushed the hudi_hms_catalog_sync_v2 branch from 887aab8 to 2b4eef2 Compare February 19, 2025 09:04
@vinishjail97 vinishjail97 merged commit f194f4c into apache:main Feb 19, 2025
2 checks passed
3 participants