Core: add variant type support #11831

Open: wants to merge 4 commits into main

@aihuaxu aihuaxu commented Dec 19, 2024

This PR adds the API and core module changes required for Variant support, including:

  • Add an isVariantType() method for the variant type.
  • Add variant support to schema visitors such as AssignFreshIds, ReassignIds, and PruneColumns, and to TypeUtil.
  • Add variant support to Avro projection.
  • Add test coverage for the schema visitors, TypeUtil, and Avro projection.

Part of: #10392

@aihuaxu aihuaxu marked this pull request as ready for review December 19, 2024 23:00
aihuaxu commented Dec 19, 2024

cc @rdblue, @RussellSpitzer, @flyrain and @JonasJ-ap. This adds the core changes needed to support the variant type.

@@ -166,6 +169,10 @@ public static String toJson(Schema schema, boolean pretty) {

private static Type typeFromJson(JsonNode json) {
  if (json.isTextual()) {
    if (VARIANT.equalsIgnoreCase(json.asText())) {
Contributor

I think fromPrimitiveString should handle this.

@aihuaxu aihuaxu requested a review from rdblue December 21, 2024 04:45
@aihuaxu aihuaxu force-pushed the variant-type-core branch 2 times, most recently from b276d3f to fe6038a Compare December 21, 2024 16:42
@shohamyamin

@aihuaxu This is a very important feature that will open up many more options for Iceberg. Thank you for your contribution.

@@ -61,6 +61,14 @@ private Types() {}
  private static final Pattern DECIMAL =
      Pattern.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)");

  public static Type typeFromTypeString(String typeString) {
Contributor

Variant isn't a primitive, but it is still better to reuse fromPrimitiveString. Callers are using this to parse type names, not to restrict parsing to only primitive types.

Contributor

@rdblue rdblue Jan 24, 2025

I also think that if we want to have an additional method to avoid confusion (i.e. "I don't want to fromPrimitiveString that because I need support for variant") then we can add a synonym. In that case, I think there's probably a better name, like fromString. That's not great because it implies that it would support structs, maps, and lists, but so does typeFromTypeString so it's at least more direct. Unfortunately, we don't have a word in Iceberg that means a type that is expressed in a single string (vs a JSON-defined type).

I would probably just leave this as fromPrimitiveString and not worry about it.
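
As an illustration of the idea in this thread, here is a minimal sketch of a single string parser that recognizes variant alongside primitive type names. The ParsedType, ParsedPrimitive, ParsedDecimal, ParsedVariant, and TypeStrings names are hypothetical stand-ins, not Iceberg's classes, and only three primitive names are handled. The decimal pattern is the one shown in the diff above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-ins for illustration; not Iceberg's real classes.
interface ParsedType {
  String name();
}

final class ParsedPrimitive implements ParsedType {
  private final String name;

  ParsedPrimitive(String name) {
    this.name = name;
  }

  @Override
  public String name() {
    return name;
  }
}

final class ParsedDecimal implements ParsedType {
  private final int precision;
  private final int scale;

  ParsedDecimal(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }

  @Override
  public String name() {
    return "decimal(" + precision + ", " + scale + ")";
  }
}

final class ParsedVariant implements ParsedType {
  @Override
  public String name() {
    return "variant";
  }
}

final class TypeStrings {
  // Same pattern as in the diff above.
  private static final Pattern DECIMAL =
      Pattern.compile("decimal\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)");

  private TypeStrings() {}

  // Parses a type that is written as a single string, including variant.
  static ParsedType fromTypeString(String typeString) {
    String lower = typeString.toLowerCase(Locale.ROOT);
    switch (lower) {
      case "variant":
        return new ParsedVariant();
      case "int":
      case "long":
      case "string":
        return new ParsedPrimitive(lower);
      default:
        Matcher decimal = DECIMAL.matcher(lower);
        if (decimal.matches()) {
          return new ParsedDecimal(
              Integer.parseInt(decimal.group(1)), Integer.parseInt(decimal.group(2)));
        }
        throw new IllegalArgumentException("Cannot parse type string: " + typeString);
    }
  }
}
```

In this sketch, callers that only need to turn a type name into a type call one method, whether or not the result is primitive; nested types such as structs are rejected because they are not expressed as a single string.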

Contributor Author

Instead of adding a new typeFromTypeString(), I changed fromPrimitiveString() to fromTypeString() to avoid confusion. Let me know if that works for you.

Contributor Author

I think fromPrimitiveString is a little misleading, so I prefer fromTypeString. Let me know and I can change it back.

Contributor Author

It's a public API, and renaming it would actually cause an incompatibility. I have renamed it back to fromPrimitiveString() now.

Contributor Author

@aihuaxu aihuaxu Jan 30, 2025

Actually, this will introduce an incompatibility because the return type changes. Let me know if that works, or maybe we can introduce a new method, fromTypeString(), as I did initially.

old: method org.apache.iceberg.types.Type.PrimitiveType org.apache.iceberg.types.Types::fromPrimitiveString(java.lang.String)
new: method org.apache.iceberg.types.Type org.apache.iceberg.types.Types::fromPrimitiveString(java.lang.String)
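
To make the incompatibility concrete: a JVM method descriptor includes the return type, so code compiled against the old signature cannot link to a method whose return type has changed. The sketch below shows one way to avoid that, by keeping fromPrimitiveString with its original return type and adding a separate fromTypeString. SketchType, SketchPrimitive, SketchVariant, and SketchTypes are hypothetical names for illustration, not Iceberg's classes, and this is not necessarily how the PR resolves the issue.

```java
import java.lang.invoke.MethodType;
import java.util.Locale;

// Hypothetical stand-ins for illustration; not Iceberg's real classes.
interface SketchType {}

final class SketchPrimitive implements SketchType {
  final String name;

  SketchPrimitive(String name) {
    this.name = name;
  }
}

final class SketchVariant implements SketchType {}

final class SketchTypes {
  private SketchTypes() {}

  // New entry point: may return non-primitive types such as variant.
  static SketchType fromTypeString(String typeString) {
    if ("variant".equalsIgnoreCase(typeString)) {
      return new SketchVariant();
    }
    return new SketchPrimitive(typeString.toLowerCase(Locale.ROOT));
  }

  // Existing entry point keeps its original return type, so callers compiled
  // against it still link. It rejects anything that is not primitive.
  static SketchPrimitive fromPrimitiveString(String typeString) {
    SketchType type = fromTypeString(typeString);
    if (type instanceof SketchPrimitive) {
      return (SketchPrimitive) type;
    }
    throw new IllegalArgumentException("Not a primitive type: " + typeString);
  }

  // Builds the JVM descriptor for a method taking a String and returning returnType.
  static String descriptor(Class<?> returnType) {
    return MethodType.methodType(returnType, String.class).toMethodDescriptorString();
  }
}
```

Comparing SketchTypes.descriptor(SketchPrimitive.class) with SketchTypes.descriptor(SketchType.class) gives two different strings, which mirrors the old and new signatures reported above.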

Contributor

I'd be +1 on fromTypeString.

@@ -56,6 +56,10 @@ class BuildAvroProjection extends AvroCustomOrderSchemaVisitor<Schema, Schema.Fi
  @Override
  @SuppressWarnings("checkstyle:CyclomaticComplexity")
  public Schema record(Schema record, List<String> names, Iterable<Schema.Field> schemaIterable) {
    if (current.isVariantType()) {
Contributor

I think that the corresponding visit method should be updated to call a new visitor variant method. That will be cleaner.

The visitor should look for the variant logical type, so we will need to implement the logical type.
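
A rough sketch of the dispatch suggested here, assuming a toy schema model rather than Avro's Schema class: the static visit method checks for a variant logical type first and routes to a dedicated variant callback, so record() implementations need no variant check of their own. Node, NodeVisitor, and DescribeVisitor are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified schema node; not Avro's Schema class.
final class Node {
  final String kind;        // "record" or "primitive"
  final String logicalType; // for example "variant", or null when absent
  final List<Node> fields;

  Node(String kind, String logicalType, List<Node> fields) {
    this.kind = kind;
    this.logicalType = logicalType;
    this.fields = fields;
  }
}

abstract class NodeVisitor<T> {
  abstract T record(Node record, List<T> fieldResults);

  abstract T primitive(Node primitive);

  // Dedicated callback for nodes that carry the variant logical type.
  abstract T variant(Node variant);

  static <T> T visit(Node node, NodeVisitor<T> visitor) {
    // Check the logical type before the physical kind: a variant is stored
    // as a record, but it should not be visited as an ordinary record.
    if ("variant".equals(node.logicalType)) {
      return visitor.variant(node);
    }
    if ("record".equals(node.kind)) {
      List<T> results = new ArrayList<>();
      for (Node field : node.fields) {
        results.add(visit(field, visitor));
      }
      return visitor.record(node, results);
    }
    return visitor.primitive(node);
  }
}

// Example visitor that describes the shape it sees.
final class DescribeVisitor extends NodeVisitor<String> {
  @Override
  String record(Node record, List<String> fieldResults) {
    return "record" + fieldResults;
  }

  @Override
  String primitive(Node primitive) {
    return "primitive";
  }

  @Override
  String variant(Node variant) {
    return "variant";
  }
}
```

With this shape, the variant node's inner fields are never passed to record(), which keeps the variant handling in one place.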

Contributor Author

You are right. This is a workaround that uses the Iceberg variant type until the logical type is added in Avro. Let me add a comment for this.

@aihuaxu aihuaxu force-pushed the variant-type-core branch 3 times, most recently from e0a430f to 1b56bf5 Compare January 30, 2025 05:12
@github-actions github-actions bot added the data label Jan 30, 2025
@aihuaxu aihuaxu requested a review from rdblue January 30, 2025 05:20
@aihuaxu aihuaxu force-pushed the variant-type-core branch 2 times, most recently from 7c3f60a to 760ed7d Compare January 30, 2025 16:47
assertThat(variantSchema.getType()).isEqualTo(org.apache.avro.Schema.Type.RECORD);
assertThat(variantSchema.getFields().size()).isEqualTo(2);
assertThat(variantSchema.getField("metadata")).isNotNull();
assertThat(variantSchema.getField("value")).isNotNull();
Contributor

Should we have stronger assertions on the types of these fields (they should be bytes)?
