
Fix spurious warnings and bogus index when reflecting Iceberg tables #520

Open · metadaddy wants to merge 1 commit into master from fix-iceberg-indexes-return-partitions
Conversation

metadaddy (Member)

Description

Previously, during reflection, TrinoDialect.get_indexes() called _get_partitions(), which executed

SELECT * FROM {schema}."{table_name}$partitions"

and assumed that, if the query succeeded, the table in hand was a Hive table. The problem (#518) is that the same query also succeeds for Iceberg tables, resulting in a spurious index (containing no columns) being added to the table, along with a series of SQLAlchemy warnings that the index keys could not be located in the table's column names.

This PR adds a check to TrinoDialect.get_indexes() to ensure that the catalog in hand is a Hive catalog before calling _get_partitions().
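For illustration, here is a minimal sketch of the shape of the guard; the _is_hive_catalog helper name, the _get_partitions call signature, and the literal index structure are assumptions for this sketch, not the PR's actual code:

def get_indexes(self, connection, table_name, schema=None, **kw):
    # Only the Hive connector's "$partitions" table reflects real
    # partition columns, so skip partition reflection for other catalogs.
    if not self._is_hive_catalog(connection):  # hypothetical helper
        return []
    partitions = self._get_partitions(connection, table_name, schema)
    if not partitions:
        return []
    # SQLAlchemy's reflection contract: a list of index dicts.
    return [{"name": "partition", "column_names": partitions, "unique": False}]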

I looked at creating a test method, but I couldn't see how to do so. Instead, I bench-tested the fix against Hive and Iceberg catalogs with the test app from #518 (a minimal sketch of such a script appears after the output below). The output for an Iceberg table is:

Calling reflect()

Listing tables:
Table name: drivestats

For a Hive table it is:

Calling reflect()

Listing tables:
Table name: drivestats
	Index name: partition
		Column name: drivestats.year
		Column name: drivestats.month
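
The test app itself is not reproduced here, but a minimal reflection script along the following lines (the connection URL is a placeholder) produces output of that shape:

from sqlalchemy import MetaData, create_engine

# Placeholder URL: point it at a Hive or Iceberg catalog to compare.
engine = create_engine("trino://user@localhost:8080/hive/default")

print("Calling reflect()")
metadata = MetaData()
metadata.reflect(bind=engine)

print()
print("Listing tables:")
for table in metadata.tables.values():
    print(f"Table name: {table.name}")
    for index in table.indexes:
        print(f"\tIndex name: {index.name}")
        for column in index.columns:
            print(f"\t\tColumn name: {column}")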

Non-technical explanation

This PR fixes spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables.

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

* Fix spurious warnings and a bogus index being added to the metadata when reflecting Iceberg tables. ({issue}`518`)

cla-bot added the cla-signed label on Jan 15, 2025
metadaddy (Member, Author)

I'll see if I can figure out a way to make it work on 351.

metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from c396080 to ac86862 on January 16, 2025
metadaddy (Member, Author) commented on Jan 16, 2025

The 351 check passes now. This is the query that the code uses to check whether the catalog's connector is Hive. If it returns 1, then it's Hive.

SELECT
    COUNT(*)
FROM "system"."metadata"."table_properties"
WHERE "catalog_name" = :catalog_name
  AND "property_name" = 'bucketing_version'

If there's a better way to do this that passes the 351 test, I'd be happy to change the code!
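
For illustration, the check might be issued through SQLAlchemy along these lines; the standalone function and its name are assumptions for this sketch, not the PR's actual code:

from sqlalchemy import text

HIVE_CHECK = text(
    """
    SELECT COUNT(*)
    FROM "system"."metadata"."table_properties"
    WHERE "catalog_name" = :catalog_name
      AND "property_name" = 'bucketing_version'
    """
)

def is_hive_catalog(connection, catalog_name):
    # 'bucketing_version' is a Hive connector table property, so a
    # count of 1 means the catalog's connector is Hive.
    return connection.execute(HIVE_CHECK, {"catalog_name": catalog_name}).scalar() == 1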

hashhar (Member) commented on Jan 16, 2025

For reference, the earlier alternative was reading from "system"."metadata"."catalogs". I think for this case it's fine to let the warnings happen on versions that don't have system.metadata.catalogs. I don't like relying on table properties, since nothing guarantees that the property won't get added to Iceberg, for example.

For the test, you can then make it "pass" via mark.skipif (or whatever it's called).

cc: @damian3031 @ebyhr @dungdm93 what do you think?

metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch 3 times, most recently from 264ea95 to 35343f1 on January 24, 2025
metadaddy (Member, Author)

@hashhar I realized I could query "system"."information_schema"."columns" to see whether the connector_name column is present. So the warning is still shown on old versions, and all tests pass without any additional @pytest.mark.skipif annotations.
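
A sketch of what that capability probe might look like (the exact predicate and function name are assumptions; the PR's actual query may differ):

from sqlalchemy import text

HAS_CONNECTOR_NAME = text(
    """
    SELECT COUNT(*)
    FROM "system"."information_schema"."columns"
    WHERE "table_schema" = 'metadata'
      AND "table_name" = 'catalogs'
      AND "column_name" = 'connector_name'
    """
)

def supports_connector_name(connection):
    # On old Trino versions the connector_name column is absent, so the
    # probe returns 0 and the dialect keeps its previous behavior
    # (including the warnings) instead of issuing a failing query.
    return connection.execute(HAS_CONNECTOR_NAME).scalar() == 1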

metadaddy force-pushed the fix-iceberg-indexes-return-partitions branch from 35343f1 to 4d1f272 on January 24, 2025