Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Clarify partition equality #9125

Merged
merged 7 commits into from
Dec 4, 2023

Conversation

emkornfield
Copy link
Contributor

Based on conversation on mailing list: https://lists.apache.org/thread/z5y4qgf1hpc5y5krptb71lx0q1yx9gtx

I made an assumption on how floating point partitions should be considered for equality which I can remove if this is inccorect.

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Nov 21, 2023
format/spec.md Outdated
@@ -305,6 +305,10 @@ The source column, selected by id, must be a primitive type and cannot be contai

Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

Two partition specs are considered compatible with each other if they have the same number of fields
and for each corresponding field, the fields have the same source column ID, transform definition
and partition name. Writers must not create a new parition spec if there already exists a compatible partition
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: extra space before 'Writers'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to make sure that this is a bit more limited. For the purpose of deduplication, we want this "compatible" definition. But if we have two specs with different partition-field-id values, they are NOT compatible. We only want this to mean if you can use an existing one (instead of assigning new IDs) you must do so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed spacing.

@rdblue I'm not sure if I understood this comment exactly. I added a sentence below to capture not using a new partition field ids below (I'm not sure if this is what you were referring to or if the original text you believed was too broad). If I didn't capture your intent would mind making some suggested edits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now. Thanks!

format/spec.md Outdated
@@ -305,6 +305,10 @@ The source column, selected by id, must be a primitive type and cannot be contai

Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

Two partition specs are considered compatible with each other if they have the same number of fields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'considered compatible': is it compatibility or rather equality? (Also at end of L310)

format/spec.md Outdated
@@ -607,6 +611,8 @@ Notes:

1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
3. Floating point partition values are considered equal if there IEEE 754 floating-point “single format” bit layout
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: 'if their IEEE'?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? I don't think you should delete based on floating point values since there are so many ways for it to go wrong. Maybe we should disallow deleting by floating points instead of clarifying this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed typo.

@rdblue this is a separate clarification from #8981 for equality deletes floating points are already prohibited (as they are for a primary key). For partitions it appears identity transforms are allowed to apply to floating point numbers, this tries to clarify how comparisons should be done in this scenario (I assume we can't exclude floating point numbers from identity transforms at this point?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think that we can exclude floating point numbers from identity transforms. I think it's fine though, as long as equality is well understood.

format/spec.md Outdated
@@ -607,6 +611,8 @@ Notes:

1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
3. Floating point partition values are considered equal if there IEEE 754 floating-point “single format” bit layout
are equal (the equivelant of calling `Float.floatToIntBits`` in Java). The avro specification encodes all floating point values in this format.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1: `Float.floatToIntBits`` I haven't checked how it would look like when opening a page but for me it seems that there is a double closing backtick while having a single opening one.

2: nit: There is an extra space before 'The avro'

3: nit: avro -> Avro

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the comment about Avro. What is the relevance here? All serialization formats should preserve the bits they are passed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed typos I think.

@rdblue I clarified that NaNs are also normalized which is why I included this detail (since manifests are required to be in Avro the bits should be directly compatible. I can drop that if you don't think it is worth mentioning.)

format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated Show resolved Hide resolved
format/spec.md Outdated
@@ -607,6 +614,8 @@ Notes:

1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
3. Floating point partition values are considered equal if their IEEE 754 floating-point “single format” bit layout
are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all all floating point values are encoded in this format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why would we define when NaN values are equal? Aren't they never equal?

Copy link
Contributor Author

@emkornfield emkornfield Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't define NaN values as equal, as I read the spec there is no way to specify positition delete files for values that have an NaN partition value, if that is already the case can be I can change this to match the reference implementation.

format/spec.md Outdated Show resolved Hide resolved
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clarification makes sense to me.

Not for this PR, but the same principle also applies to schemas and sort-orders.

format/spec.md Outdated Show resolved Hide resolved
Co-authored-by: Fokko Driesprong <[email protected]>
@emkornfield
Copy link
Contributor Author

emkornfield commented Dec 1, 2023

Not for this PR, but the same principle also applies to schemas and sort-orders.

@Fokko could you clarify what you mean by schemas?

I thought for sorting the following part of the specification applies:

Sorting floating-point numbers should produce the following behavior: -NaN < -Infinity < -value < -0 < 0 < value < Infinity < NaN. This aligns with the implementation of Java floating-point types comparisons.

Would it make sense in this PR to make more a distinguishing statement only for joining delete files?

I guess you mean the deduplication logic. I can add those in a follow-up.

@rdblue rdblue merged commit 1ed1b4b into apache:main Dec 4, 2023
2 checks passed
@rdblue
Copy link
Contributor

rdblue commented Dec 4, 2023

Thanks for the update, @emkornfield!

lisirrx pushed a commit to lisirrx/iceberg that referenced this pull request Jan 4, 2024
devangjhabakh pushed a commit to cdouglas/iceberg that referenced this pull request Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants