From e6b2138d004d04ee8db3a7d808bd32981f247dbf Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 21 Nov 2023 12:11:08 -0800
Subject: [PATCH 1/7] Spec: Clarify partition equality

---
 format/spec.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index 27e2762f7724..21be451bfb2c 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -305,6 +305,10 @@ The source column, selected by id, must be a primitive type and cannot be contai

 Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

+Two partition specs are considered compatible with each other if they have the same number of partition columns
+and for each corresponding partition field in the spec, it has the same source column ID, transform definition
+and partition name. Writers must not create a new parition spec if there already exists a compatible partition
+spec defined in the table.
 #### Partition Transforms

@@ -595,7 +599,7 @@ Delete files that match the query filter must be applied to data files at read t
     - The data file's partition (both spec and partition values) is equal to the delete file's partition
 * An _equality_ delete file must be applied to a data file when all of the following are true:
     - The data file's data sequence number is _strictly less than_ the delete's data sequence number
-    - The data file's partition (both spec and partition values) is equal to the delete file's partition _or_ the delete file's partition spec is unpartitioned
+    - The data file's partition (both spec id and partition values) is equal to the delete file's partition _or_ the delete file's partition spec is unpartitioned

 In general, deletes are applied only to data files that are older and in the same partition, except for two special cases:

@@ -607,6 +611,8 @@ Notes:

 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
+3. Floating point partition values are considered equal if there IEEE 754 floating-point “single format” bit layout
+are equal (the equivelant of calling `Float.floatToIntBits`` in Java). The avro specification encodes all floating point values in this format.
 #### Snapshot Reference

From 73163598a08b7e44ed682c2e8c371eebb9298c45 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Tue, 21 Nov 2023 12:15:14 -0800
Subject: [PATCH 2/7] Fix some terminology

---
 format/spec.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 21be451bfb2c..1b875dea8c7c 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -305,8 +305,8 @@ The source column, selected by id, must be a primitive type and cannot be contai

 Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

-Two partition specs are considered compatible with each other if they have the same number of partition columns
-and for each corresponding partition field in the spec, it has the same source column ID, transform definition
+Two partition specs are considered compatible with each other if they have the same number of fields
+and for each corresponding field, the fields have the same source column ID, transform definition
 and partition name. Writers must not create a new parition spec if there already exists a compatible partition
 spec defined in the table.
From ebd343ffb5eaf4eff054bb80f51a473d361ef291 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Fri, 24 Nov 2023 08:05:53 -0800
Subject: [PATCH 3/7] Address some comments

---
 format/spec.md | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 1b875dea8c7c..55d6bfa9d267 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -305,11 +305,14 @@ The source column, selected by id, must be a primitive type and cannot be contai

 Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

-Two partition specs are considered compatible with each other if they have the same number of fields
+Two partition specs are considered equivalent with each other if they have the same number of fields
 and for each corresponding field, the fields have the same source column ID, transform definition
-and partition name. Writers must not create a new parition spec if there already exists a compatible partition
+and partition name. Writers must not create a new parition spec if there already exists a compatible partition
 spec defined in the table.

+Partition field IDs must be reused if an existing partition spec
+contains an equivalent field.
+
 #### Partition Transforms

 | Transform name | Description | Source types | Result type |
@@ -612,7 +615,7 @@ Notes:
 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
 3. Floating point partition values are considered equal if there IEEE 754 floating-point “single format” bit layout
-are equal (the equivelant of calling `Float.floatToIntBits`` in Java). The avro specification encodes all floating point values in this format.
+are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification encodes all floating point values in this format.

 #### Snapshot Reference

From a96e3d140ab9f8d8dd66838671133558758942ba Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Fri, 24 Nov 2023 08:06:34 -0800
Subject: [PATCH 4/7] one more type

---
 format/spec.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index 55d6bfa9d267..ab5db87e8e98 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -614,7 +614,7 @@ Notes:

 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
-3. Floating point partition values are considered equal if there IEEE 754 floating-point “single format” bit layout
+3. Floating point partition values are considered equal if their IEEE 754 floating-point “single format” bit layout
 are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification encodes all floating point values in this format.

 #### Snapshot Reference

From 64cfbcee5c8803fbdd648b854872474cc89df0f2 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Fri, 24 Nov 2023 08:07:25 -0800
Subject: [PATCH 5/7] clarify sentence

---
 format/spec.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index ab5db87e8e98..846b21ec57de 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -615,7 +615,7 @@ Notes:
 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
 3. Floating point partition values are considered equal if their IEEE 754 floating-point “single format” bit layout
-are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification encodes all floating point values in this format.
+are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all all floating point values are encoded in this format.
 #### Snapshot Reference

From d9490bb23e2dcfe2fb243bc9d7bdbb0f2450a0b7 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Sun, 26 Nov 2023 18:04:57 -0800
Subject: [PATCH 6/7] address some more comments

---
 format/spec.md | 11 +++--------
 1 file changed, 3 insertions(+), 8 deletions(-)

diff --git a/format/spec.md b/format/spec.md
index 846b21ec57de..abaf3f2e4e71 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -305,13 +305,9 @@ The source column, selected by id, must be a primitive type and cannot be contai

 Partition specs capture the transform from table data to partition values. This is used to transform predicates to partition predicates, in addition to transforming data values. Deriving partition predicates from column predicates on the table data is used to separate the logical queries from physical storage: the partitioning can change and the correct partition filters are always derived from column predicates. This simplifies queries because users don’t have to supply both logical predicates and partition predicates. For more information, see Scan Planning below.

-Two partition specs are considered equivalent with each other if they have the same number of fields
-and for each corresponding field, the fields have the same source column ID, transform definition
-and partition name. Writers must not create a new parition spec if there already exists a compatible partition
-spec defined in the table.
+Two partition specs are considered equivalent with each other if they have the same number of fields and for each corresponding field, the fields have the same source column ID, transform definition and partition name. Writers must not create a new partition spec if there already exists an equivalent partition spec defined in the table.

-Partition field IDs must be reused if an existing partition spec
-contains an equivalent field.
+Partition field IDs must be reused if an existing partition spec contains an equivalent field.
 #### Partition Transforms

@@ -614,8 +610,7 @@ Notes:

 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
-3. Floating point partition values are considered equal if their IEEE 754 floating-point “single format” bit layout
-are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all all floating point values are encoded in this format.
+3. Floating point partition values are considered equal if their IEEE 754 floating-point "single format" bit layout are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all all floating point values are encoded in this format.

 #### Snapshot Reference

From 21e3b21cbcf3ddf2752318062a7f607ce4a46126 Mon Sep 17 00:00:00 2001
From: emkornfield
Date: Thu, 30 Nov 2023 22:21:20 -0800
Subject: [PATCH 7/7] Update format/spec.md

Co-authored-by: Fokko Driesprong

---
 format/spec.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/format/spec.md b/format/spec.md
index abaf3f2e4e71..5d6dded5ee76 100644
--- a/format/spec.md
+++ b/format/spec.md
@@ -610,7 +610,7 @@ Notes:

 1. An alternative, *strict projection*, creates a partition predicate that will match a file if all of the rows in the file must match the scan predicate. These projections are used to calculate the residual predicates for each file in a scan.
 2. For example, if `file_a` has rows with `id` between 1 and 10 and a delete file contains rows with `id` between 1 and 4, a scan for `id = 9` may ignore the delete file because none of the deletes can match a row that will be selected.
-3. Floating point partition values are considered equal if their IEEE 754 floating-point "single format" bit layout are equal with NaNs normalized to have only the the most significant mantissa bit set (the equivelant of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all all floating point values are encoded in this format.
+3. Floating point partition values are considered equal if their IEEE 754 floating-point "single format" bit layouts are equal, with NaNs normalized to have only the most significant mantissa bit set (the equivalent of calling `Float.floatToIntBits` or `Double.doubleToLongBits` in Java). The Avro specification requires all floating point values to be encoded in this format.

 #### Snapshot Reference
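Reviewer illustration (not part of the patch or the spec): the floating point rule in note 3 can be sketched in Java roughly as follows. The class and method names are hypothetical and chosen only for this example; the only JDK calls relied on are `Float.floatToIntBits`, `Float.intBitsToFloat`, `Double.doubleToLongBits` and `Double.longBitsToDouble`. The sketch assumes that "equal" means equality of the NaN-normalized bit patterns, as the note describes.

```java
public class PartitionFloatEquality {

    // Hypothetical helper: compares two float partition values by bit layout.
    // Float.floatToIntBits collapses every NaN to the canonical NaN (0x7fc00000),
    // so NaNs with different payloads compare as equal here.
    public static boolean floatPartitionValuesEqual(float a, float b) {
        return Float.floatToIntBits(a) == Float.floatToIntBits(b);
    }

    // Same idea for double values, using the canonical NaN 0x7ff8000000000000L.
    public static boolean doublePartitionValuesEqual(double a, double b) {
        return Double.doubleToLongBits(a) == Double.doubleToLongBits(b);
    }

    public static void main(String[] args) {
        // A NaN whose payload differs from Float.NaN.
        float otherNaN = Float.intBitsToFloat(0x7fc00001);

        System.out.println(floatPartitionValuesEqual(Float.NaN, otherNaN)); // true
        System.out.println(Float.NaN == otherNaN);                          // false

        // Bitwise comparison distinguishes signed zeros, unlike ==.
        System.out.println(floatPartitionValuesEqual(0.0f, -0.0f));         // false
        System.out.println(0.0f == -0.0f);                                  // true

        double otherDoubleNaN = Double.longBitsToDouble(0x7ff8000000000001L);
        System.out.println(doublePartitionValuesEqual(Double.NaN, otherDoubleNaN)); // true
    }
}
```

Note the two places where this differs from Java's `==`: all NaNs are treated as one partition value, while `0.0` and `-0.0` are treated as distinct values.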