Spec: Support geo type #10981
a096921 to 19f24a4
format/spec.md
Outdated
XZ2 is based on the paper [XZ-Ordering: A Space-Filling Curve for Objects with Spatial Extensions].

Notes:
1. Resolution must be a positive integer. Defaults to TODO
@jiayuasu do you have any suggestion for default here?
12 sounds fine. CC @Kontinuation
GeoMesa uses a high XZ2 resolution when working with key-value stores such as Accumulo and HBase, but a resolution that high is not always appropriate for partitioning data (for instance, GeoMesa on FileSystems).
XZ2 resolution 11 to 12 works for city-scale data but will generate too many partitions for country-scale or world-scale data. I'd like a smaller default value such as 7, to be safe across various kinds of data.
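To make the partition-count concern concrete, here is a rough sketch that counts the cells of a quadtree down to a given depth. It assumes that an XZ2 index at resolution `g` can address every quadtree cell at levels 0 through `g`, which gives only an upper bound; the number of partitions actually produced depends on how the data is distributed. The function name `max_xz2_cells` is illustrative and not part of any library.

```python
def max_xz2_cells(resolution: int) -> int:
    """Upper bound on distinct quadtree cells for levels 0..resolution."""
    if resolution < 1:
        raise ValueError("resolution must be a positive integer")
    # Level i of a quadtree has 4**i cells; summing levels 0..g is a geometric series.
    return (4 ** (resolution + 1) - 1) // 3


for g in (7, 11, 12):
    print(g, max_xz2_cells(g))
```

Under this assumption, resolution 7 allows at most 21845 cells, while resolution 12 allows about 22 million.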
format/spec.md
Outdated
| **`struct`** | `group` | | |
| **`list`** | `3-level list` | `LIST` | See Parquet docs for 3-level representation. |
| **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. |
| **`geometry`** | `binary` | `GEOMETRY` | WKB format, see Appendix G. Logical type annotation optional for supported Parquet format versions [1]. |
I could add this section later too, once it's implemented (same for ORC below).
[Appendix G](#appendix-g)
19f24a4 to d7096e4
format/spec.md
Outdated
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For Geometry type, this is a Point composed of the min value of each dimension in all Points in the Geometry. |
How does this work? Does Iceberg need to interpret each WKB to produce this value? Will it be provided by Parquet?
Yes, once we switch to Geometry logical type from Parquet we will get these stats from Parquet.
Should we mention that it is the Parquet type `BoundingBox`?
yea will add a footnote here
@szehon-ho BTW, the reason we had separate bbox statistics in havasu was to stay compatible with existing Iceberg tables. Since this PR adds native geometry support, `lower_bound` and `upper_bound` are good choices.
I thought that the bounds were stored as WKB-encoded points (according to Appendix D and G), and WKB encodes dimensions of geometries in the header. It is more consistent to make bound values the same type/representation as the field data type.
More sophisticated coverings in Parquet statistics cannot be easily mapped to `lower_bounds` and `upper_bounds`, so do we simply use the `bbox` statistics and ignore the `coverings` for now?
I think it is OK since these two bounds are optional; if they are not present, the file still follows the spec.
Does it support different dimensions like XY, XYZ, XYM, XYZM? If yes, how can we tell if the binary is for XYZ or XYM?
We should say: "For Geometry type, this is a WKB-encoded Point composed of the min value of each dimension in all Points in the Geometry."
Then we don't have to worry about the Z and M values.
CC @szehon-ho
> We should say For Geometry type, this is a WKB-encoded Point composed of the min value of each dimension in all Points in the Geometry. Then we don't have to worry about the Z and M value.

@jiayuasu @Kontinuation @wgtmac Done, thanks.
> Should we mention that it is the parquet type BoundingBox?

Actually, looking again after some time, I'm not sure how to mention this here, as it is file-format specific. This is an optional field, set only if the file type is Parquet and bounding_box is set, but that's an implementation detail.
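For readers following the thread, the "WKB-encoded point composed of the min value of each dimension" can be sketched as below. This is only an illustration under stated assumptions: it handles XY coordinates only (Z and M would need different WKB type codes), it assumes the component points of all geometries have already been extracted, and the function names are hypothetical rather than taken from the Iceberg or Parquet code.

```python
import struct


def lower_bound_wkb(points):
    """Encode the per-dimension minimum of XY points as a little-endian WKB Point."""
    xs, ys = zip(*points)
    # 1 = little-endian byte order, 1 = WKB geometry type Point, then x and y as doubles.
    return struct.pack("<BIdd", 1, 1, min(xs), min(ys))


def upper_bound_wkb(points):
    """Encode the per-dimension maximum of XY points as a little-endian WKB Point."""
    xs, ys = zip(*points)
    return struct.pack("<BIdd", 1, 1, max(xs), max(ys))


coords = [(30.0, 10.0), (-5.0, 42.5), (12.0, -3.0)]
print(struct.unpack("<BIdd", lower_bound_wkb(coords)))
print(struct.unpack("<BIdd", upper_bound_wkb(coords)))
```

Note that the lower bound here is (-5.0, -3.0), which is not one of the input points: each dimension is minimized independently.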
format/spec.md
Outdated
| **`void`** | Always produces `null` | Any | Source type or `int` |
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| **`identity`** | Source value, unmodified | Any | Source type |
Except for `geometry`?
Maybe that's fine if it is comparable, but practically people will always use `xz2`, right? I'm not sure, but I wonder whether there are implications (e.g., too expensive, or very high cardinality) that would lead us not to recommend using the original GEO value in the partition spec.
Yea, I think it's possible (it's just the WKB value after all). You are right, I'm not sure there is a good use case. We have to get the WKB in any case; I'm not sure it's that expensive, but I can check. I guess cardinality is the same consideration as for any other type (uuid, for example), and we let the user choose?
format/spec.md
Outdated
| Transform name | Description | Source types | Result type |
|-------------------|--------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| **`identity`** | Source value, unmodified | Any | Source type |
| **`bucket[N]`** | Hash of value, mod `N` (see below) | `int`, `long`, `decimal`, `date`, `time`, `timestamp`, `timestamptz`, `timestamp_ns`, `timestamptz_ns`, `string`, `uuid`, `fixed`, `binary` | `int` |
Are we going to support bucketing on GEO?
I think it's possible; again, I'm not sure of the utility. Geo boils down to just WKB bytes.
I feel that the argument for `identity` can apply here as well. In that case, we can support it, but it's the users' call whether to use it.
We may want to change this to be like `identity`, using `Any except [...]`.
I would not include geo as a source column for bucketing because there is not a clear definition of equality for geo. The hash would depend on the structure of the object and weird things happen when two objects are "equal" (for some definition) but have different hash values.
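One way to see the equality problem: the same point has more than one valid WKB encoding, and a hash over the raw bytes treats them as different values. The sketch below builds little-endian and big-endian WKB for `POINT (30 10)` by hand. SHA-256 is used purely as a stand-in; Iceberg's bucket transform uses a different hash (Murmur3), and the helper names are hypothetical.

```python
import hashlib
import struct

# Two valid WKB encodings of POINT (30 10): little-endian and big-endian.
wkb_le = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
wkb_be = struct.pack(">BIdd", 0, 1, 30.0, 10.0)


def decode_point(wkb: bytes):
    """Decode an XY WKB Point, honoring the byte-order flag in the first byte."""
    order = "<" if wkb[0] == 1 else ">"
    _, _, x, y = struct.unpack(order + "BIdd", wkb)
    return (x, y)


def stand_in_hash(data: bytes) -> str:
    # SHA-256 is only a stand-in; any hash over the raw bytes shows the same effect.
    return hashlib.sha256(data).hexdigest()


print(decode_point(wkb_le) == decode_point(wkb_be))    # same geometry
print(stand_in_hash(wkb_le) == stand_in_hash(wkb_be))  # different bytes, different hash
```

Byte order is only the simplest case; vertex order or ring start point in polygons would cause the same mismatch.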
format/spec.md
Outdated
@@ -198,6 +199,9 @@ Notes:
- Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
- Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
3. Character strings must be stored as UTF-8 encoded byte arrays.
4. Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution.
- When we say `OGC:CRS84`, the value you put in this field should be the following PROJJSON string (see GeoParquet spec)
{
  "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
  "type": "GeographicCRS",
  "name": "WGS 84 longitude-latitude",
  "datum": {
    "type": "GeodeticReferenceFrame",
    "name": "World Geodetic System 1984",
    "ellipsoid": {
      "name": "WGS 84",
      "semi_major_axis": 6378137,
      "inverse_flattening": 298.257223563
    }
  },
  "coordinate_system": {
    "subtype": "ellipsoidal",
    "axis": [
      {
        "name": "Geodetic longitude",
        "abbreviation": "Lon",
        "direction": "east",
        "unit": "degree"
      },
      {
        "name": "Geodetic latitude",
        "abbreviation": "Lat",
        "direction": "north",
        "unit": "degree"
      }
    ]
  },
  "id": {
    "authority": "OGC",
    "code": "CRS84"
  }
}
- Both the `crs` and `crs_kind` fields are optional. But when the `CRS` field is present, the `crs_kind` field must be present too. In this case, since we hard-code the `crs` field in this phase, we need to set the `crs_kind` field (string) to `PROJJSON`.
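As a small illustration of how the shorthand `OGC:CRS84` relates to the PROJJSON document, the sketch below reads the `id` member and joins authority and code. It uses a trimmed-down PROJJSON string for brevity, and the helper `crs_identifier` is hypothetical; it is not part of GeoParquet, Parquet, or Iceberg.

```python
import json

# Trimmed-down PROJJSON: only the members needed for this illustration.
projjson = """
{
  "type": "GeographicCRS",
  "name": "WGS 84 longitude-latitude",
  "id": {"authority": "OGC", "code": "CRS84"}
}
"""


def crs_identifier(projjson_text: str):
    """Return 'AUTHORITY:CODE' from a PROJJSON document, or None if it has no id."""
    doc = json.loads(projjson_text)
    ident = doc.get("id")
    if not ident:
        return None
    return f"{ident['authority']}:{ident['code']}"


print(crs_identifier(projjson))
```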
Should we include this example in the parquet spec as well?
BTW, should we advise accepted forms or values for CRS and Edges?
@wgtmac We should add this value to the Parquet spec for sure. CC @zhangfengcdt
@szehon-ho There is another situation mentioned in the GeoParquet spec: if the CRS field is present but its value is null, it means the data is in an `unknown CRS`. This happens sometimes because the writer cannot find, or has lost, the CRS info. Do we want to support this? I think we can use the `empty string` to cover this case.
> accepted forms or values for CRS and Edges

If we borrow the conclusion from the Parquet Geometry proposal, then the C, T, E fields are as follows:
C is a string. Based on what I understand from this PR, @szehon-ho made this a required field, which is fine.
T is optional and a string. Currently, the only allowed value is `PROJJSON`. When it is not provided, it defaults to `PROJJSON` too.
E is a string. The only allowed value is `PLANAR` in this phase. Based on what I understand from this PR, @szehon-ho made this a required field, which is fine. @szehon-ho According to our meeting with Snowflake, I think maybe we can allow `SPHERICAL` too? We can add in the spec that it is currently unsafe to perform partition transforms or bounding box filtering when E = `SPHERICAL` because they are built on `PLANAR` edges. It is the reader's responsibility to decide whether to use partition transforms or bounding box filtering.
- Does it make sense to include the following CRS84 example from the Parquet Geometry PR?
/**
 * Coordinate Reference System, i.e. mapping of how coordinates refer to
 * precise locations on earth. Writers are not required to set this field.
 * Once crs is set, crs_encoding field below MUST be set together.
 * For example, "OGC:CRS84" can be set in the form of PROJJSON as below:
 * {
 *   "$schema": "https://proj.org/schemas/v0.5/projjson.schema.json",
 *   "type": "GeographicCRS",
 *   "name": "WGS 84 longitude-latitude",
 *   "datum": {
 *     "type": "GeodeticReferenceFrame",
 *     "name": "World Geodetic System 1984",
 *     "ellipsoid": {
 *       "name": "WGS 84",
 *       "semi_major_axis": 6378137,
 *       "inverse_flattening": 298.257223563
 *     }
 *   },
 *   "coordinate_system": {
 *     "subtype": "ellipsoidal",
 *     "axis": [
 *       {
 *         "name": "Geodetic longitude",
 *         "abbreviation": "Lon",
 *         "direction": "east",
 *         "unit": "degree"
 *       },
 *       {
 *         "name": "Geodetic latitude",
 *         "abbreviation": "Lat",
 *         "direction": "north",
 *         "unit": "degree"
 *       }
 *     ]
 *   },
 *   "id": {
 *     "authority": "OGC",
 *     "code": "CRS84"
 *   }
 * }
 */
- It is ok to have them all fixed to default values for this phase.
@jiayuasu I put it in the example (if you render the page). Let me know if it's not what you meant.
Note that this example is non-standard. A CRS JSON encoding is in progress at OGC, but we don't know yet what it will look like. The only standards at this time are GML, WKT, or simply a reference to the EPSG database or OGC registry (e.g. https://www.opengis.net/def/crs/OGC/0/CRS84 or `"urn:ogc:def:crs:OGC::CRS84"`).
> Coordinate Reference System, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84".

Maybe remove the word "precise"? Not all CRS are precise. The proposed default, `OGC:CRS84`, has an uncertainty of about 2 meters (unrelated to floating point precision). Maybe it would be worth changing the sentence to: `Defaults to "OGC:CRS84", which provides an accuracy of about 2 meters.`
Another reason for removing the "precise" word is that the CRS alone is not sufficient for really precise locations. We also need the epoch, which is a topic that has not been covered at all in this PR.
Removed precise
format/spec.md
Outdated
@@ -190,6 +190,7 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C, T, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the Geometry subclasses defined therein; coordinate reference system C [4], coordinate reference system type T [5], edges E [6] | C, T, E are fixed. Encoded as WKB, see Appendix G. |
What syntax should an engine use to create the geometry type? Does it require C/T/E to appear in the type?
Related to above comment, I think these will all be optional (take a default value if not specified).
Hi all, FYI, I have unfortunately encountered some problems while remote and probably can't update this. I will come back to it after I get back home in two weeks.
75326dc to 0591f68
@jiayuasu @Kontinuation @wgtmac @flyrain @rdblue sorry for the delay, as I only got access now. Updated the PR.
format/spec.md
Outdated
Notes:

1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`).
2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical).
3. Character strings must be stored as UTF-8 encoded byte arrays.
4. Crs (coordinate reference system) is a mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution.
5. Crs-encoding (coordinate reference system encoding) is the type of crs field. Must be set if crs is set. Defaults to "PROJJSON". Fixed and cannot be changed by schema evolution.
6. Edges is the interpretation for non-point geometries in geometry object, i.e. whether an edge between points represent a straight cartesian line or the shortest line on the sphere. Defaults to "planar". Fixed and cannot be changed by schema evolution.
Can we maybe explicitly mention here that both "planar" and "spherical" are supported as edge type enum values?
@dmitrykoval I was debating this.
I guess we talked about it before, but in the Java reference implementation we cannot easily do pruning (file level, row level, or partition level) because the JTS library and XZ2 only support non-spherical edges. We would need new metrics types, new Java libraries, and new partition transform proposals if we wanted to support it in the Java reference implementation.
But if we want to support it, I'm OK to list it here and have checks that just skip pruning for spherical geometry columns.
@flyrain @jiayuasu @Kontinuation does it make sense?
I see. I think if "planar" is the default edge type, then there shouldn't be many changes to the planar geometry code path, except for additional checks to skip some partitioning/pruning cases, right?
Regarding the reference implementation of the "spherical" type, do we need to fully support it from day one, or can we maybe mark it as optional in the initial version of the spec? For example, it would work if the engine supports it, but by default, we would fall back to the planar edge type?
We could list `spherical` as an allowed edge type here. Maybe just note that it is not safe to perform partition transforms or lower_bound/upper_bound filtering when the edge is `spherical`. We did the same in the `Parquet Geometry` PR.
Yea, I forgot to mention explicitly that in Iceberg, pruning is always an optional feature for reads, so no issue.
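The "skip pruning for spherical columns" idea from this thread could look roughly like the sketch below. It is not taken from the Iceberg code base: the function names, the `overlaps_query` callback, and the plain-string edge values are all assumptions made for illustration.

```python
def can_use_bounds_for_pruning(edges: str) -> bool:
    """Bounds and XZ2 are built on planar edges, so only planar columns are safe to prune."""
    return edges.lower() == "planar"


def plan_files(files, edges, overlaps_query):
    """Return the files to scan; pruning is optional, so falling back to all files is always valid."""
    if not can_use_bounds_for_pruning(edges):
        return list(files)
    # overlaps_query is a hypothetical callback that checks a file's bounds against the query.
    return [f for f in files if overlaps_query(f)]


print(plan_files(["a.parquet", "b.parquet"], "spherical", lambda f: False))
print(plan_files(["a.parquet", "b.parquet"], "planar", lambda f: f == "a.parquet"))
```

Because pruning only narrows the set of files to read, skipping it for spherical edges costs performance but never correctness.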
> Crs-encoding (coordinate reference system encoding) is the type of crs field.

- Please change "is the type" to "is the encoding". The CRS type is something completely different from the information put in this field.
- (minor detail) "crs" should be upper-case "CRS" as it is an acronym for "Coordinate Reference System".

> Defaults to "PROJJSON".

I suggest removing this specification. PROJJSON is non-standard and may never become a standard (we don't know yet what OGC's decision will be). I don't think that we want to tie Iceberg forever to a format specific to one project.
format/spec.md
Outdated
| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column |
| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column |
| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts |
| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file. |
> For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file.

As we are finishing the PoC on the Parquet side, the remaining issue is what value to write to the min_value/max_value fields of statistics and page index. To give some context, Parquet requires the min_value/max_value fields to be set for the page index, and statistics are used to generate the page index. The C++ PoC omits the min_value/max_value values, and the Java PoC pretends geometry values are plain binary values while collecting the stats. Should we do similar things here? Then the Iceberg code could directly consume min_value/max_value from statistics instead of issuing another call to get the specialized `GeometryStatistics`, which is designed for advanced purposes.
@wgtmac do you mean that Iceberg uses the Parquet Geometry GeometryStatistics or Parquet Geometry uses the min_value/max_value idea from Iceberg?
I mean the latter. The `ColumnOrder` of the new geometry type is `undefined`, as specified at https://github.com/apache/parquet-format/pull/240/files#diff-834c5a8d91719350b20995ad99d1cb6d8d68332b9ac35694f40e375bdb2d3e7cR1144. It means that the min_value/max_value fields are meaningless and should not be used. I'm not sure if it is a good idea to set min_value/max_value fields in the same way as lower_bounds/upper_bounds of Iceberg.
I suggest defining the sort order of geometry columns as WKB-encoded points in the parquet format spec. This is the most simple yet useful way of defining the min and max bounds for geometry columns, and the sort order is better to be well-defined rather than left undefined.
I agree that it is better to explicitly define the column order than being undefined. If we go with this approach, the format PR and two PoC impls need to reflect this change, which might get more complicated.
Is there anything we need to change for this specific line? As long as we can get it from Parquet in some way, we are OK here, but is the format of the lower/upper bound still OK?
No, I was thinking if Parquet could do better by doing similar things in the future.
b459eaf to 1ee5fad
format/spec.md
Outdated
@@ -200,12 +200,16 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C, CE, E)`** | An object of the simple feature geometry model as defined by Appendix G; This may be any of the geometry subclasses defined therein; crs C [4], crs-encoding CE [5], edges E [6] | C, CE, E are fixed, and if unset will take default values. |
I think maybe we should just link out for the requirements here since it's a bit complicated.
The description as well could be
Simple feature geometry Appendix G, Parameterized by ....
I also don't think we should allow it to be unset ... can we just require that a subclass is always picked? We could recommend a set of defaults for engines to set on field creation but I'm not sure we need to be that opinionated here.
@@ -1506,6 +1523,8 @@ This serialization scheme is for storing single values as individual binary valu
| **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format |
| **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format |
| **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format |
| **`geometry`** | **`JSON string`** | `POINT (30 10)` | Stored using WKT representation, see [Appendix G](#appendix-g-geospatial-notes) |
| **`geography`** | **`JSON string`** | `POINT (30 10)` | Stored using WKT representation, see [Appendix G](#appendix-g-geospatial-notes) |
Like above, this table covers arbitrary `geometry` and `geography` objects. I think it is fine to use WKT here and not have a separate way to store points, but we should be aware that the implication is that WKT is required for any JSON representation.
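A minimal sketch of what "JSON string holding WKT" means in practice, using only the standard library. It assumes the WKT text is already available as a plain string; producing WKT from WKB would need a geometry library and is not shown here.

```python
import json

wkt = "POINT (30 10)"

# Serialize the WKT text as a JSON single value: a JSON string, quotes included.
encoded = json.dumps(wkt)
decoded = json.loads(encoded)

print(encoded)
print(decoded == wkt)
```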
7d06e1f to c6bd5ce
LGTM
format/spec.md
Outdated
@@ -205,15 +205,40 @@ Supported primitive types are defined in the table below. Primitive types added
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed |
| | **`fixed(L)`** | Fixed-length byte array of length L | |
| | **`binary`** | Arbitrary-length byte array | |
| [v3](#version-3) | **`geometry(C)`** | Geometry features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | |
| [v3](#version-3) | **`geography(C, A)`** | Geometry features from [OGC – Simple feature access][1001]. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C and edge-interpolation algorithm A. If not specified, C is `OGC:CRS84`. | |
Just a minor clarification: since we are not specifying the default value for A, if the user provides a single argument geography specialization, it would be treated as `geography(A)`, correct?
That's a good point. What do you think about making 'spherical' the default? cc @jiayuasu @mkaravel @paleolimbot as well
I think making 'spherical' a default is a good idea.
Done.
Thanks @szehon-ho!
No objections from me in either direction!
LGTM!
lgtm
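For readers following this thread, the agreed defaults can be sketched as follows. This is only an illustration under stated assumptions: the class names `GeometryType` and `GeographyType` and the string rendering are hypothetical, not the Iceberg API or its schema serialization. Only the default values `OGC:CRS84` and `spherical` come from the discussion.

```python
from dataclasses import dataclass

# Defaults taken from the thread; everything else here is hypothetical.
DEFAULT_CRS = "OGC:CRS84"
DEFAULT_ALGORITHM = "spherical"


@dataclass(frozen=True)
class GeometryType:
    """geometry(C): edges are always linear/planar, so only a CRS parameter."""
    crs: str = DEFAULT_CRS

    def __str__(self) -> str:
        return f"geometry({self.crs})"


@dataclass(frozen=True)
class GeographyType:
    """geography(C, A): CRS plus edge-interpolation algorithm."""
    crs: str = DEFAULT_CRS
    algorithm: str = DEFAULT_ALGORITHM

    def __str__(self) -> str:
        return f"geography({self.crs}, {self.algorithm})"


# Omitting A falls back to 'spherical'; omitting both uses both defaults.
only_crs = GeographyType(crs="srid:4326")
no_args = GeographyType()
```

With a default for A, a single-argument specialization no longer leaves the algorithm undefined, which addresses the question raised at the top of the thread.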
For details on how to serialize a schema to JSON, see Appendix C.

[1001]: <https://portal.ogc.org/files/?artifact_id=25355> "OGC Simple feature access"

##### CRS
What does `Custom CRS` mean? Is it any non-default CRS value?
Can CRS values be of the format `$authority:$identifier`, similar to that of the default value?
Yes, `$authority:$identifier` is the intent.
format/spec.md
Outdated
The default CRS value `OGC:CRS84` means that the objects must be stored in longitude, latitude based on the WGS84 datum.

Custom CRS values can be specified by a string of the format `$type:$content`, where `type` is one of the following values:
Nit: I think the idea is that this is an interpolated string, but we don't use `$` elsewhere, so people may embed the character. What about simply "`type:identifier`, where `type` is ..."
format/spec.md
Outdated
@@ -205,15 +205,40 @@ Supported primitive types are defined in the table below. Primitive types added | |||
| | **`uuid`** | Universally unique identifiers | Should use 16-byte fixed | | |||
| | **`fixed(L)`** | Fixed-length byte array of length L | | | |||
| | **`binary`** | Arbitrary-length byte array | | | |||
| [v3](#version-3) | **`geometry(C)`** | Geometry features from [OGC – Simple feature access][1001]. Edge-interpolation is always linear/planar. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C. If not specified, C is `OGC:CRS84`. | | | |||
| [v3](#version-3) | **`geography(C, A)`** | Geometry features from [OGC – Simple feature access][1001]. See [Appendix G](#appendix-g-geospatial-notes). Parameterized by CRS C and edge-interpolation algorithm A. If not specified, C is `OGC:CRS84` and A is `spherical`. | |
Is it accurate to use `Geometry features` for the `geography` type? Or is `Geometry features` a standard term?
Fixed to use geospatial, as discussed offline
e0cfa18 to 03fefb7
03fefb7 to b7a49a4
LGTM. Thanks for updating this!
@@ -468,7 +494,7 @@ Partition field IDs must be reused if an existing partition spec contains an equ

 | Transform name | Description | Source types | Result type |
 |-------------------|--------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|-------------|
-| **`identity`** | Source value, unmodified | Any except for `variant` | Source type |
+| **`identity`** | Source value, unmodified | Any except for `geometry`, `geography`, and `variant` | Source type |
Thanks!
The default CRS value `OGC:CRS84` means that the objects must be stored in longitude, latitude based on the WGS84 datum.

Custom CRS values can be specified by a string of the format `type:identifier`, where `type` is one of the following values:

* `srid`: [Spatial reference identifier](https://en.wikipedia.org/wiki/Spatial_reference_system#Identifier), `identifier` is the SRID itself. |
There is a discrepancy in the description here. CRS values are said to be specified in two formats:
- `srid:identifier`
- `projjson:identifier`
The default value of `OGC:CRS84` doesn't follow this pattern. If `OGC:CRS84` is the identifier itself, should it be written as `srid:OGC:CRS84`?
@redblackcoder `OGC:CRS84` is not an identifier. It corresponds to the following projjson string: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#ogccrs84-details
It is also equivalent to `srid:4326`, but with the axis order flipped to lon/lat.
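One way to read the rules under discussion is sketched here: `OGC:CRS84` is treated as a special literal rather than as a `type:identifier` pair, and everything else must start with `srid:` or `projjson:`. The function `parse_crs`, the `Crs` tuple, and the `"default"` kind label are hypothetical names for illustration; splitting on the first colon only is an assumption, made so that an identifier such as `OGC:CRS84` could itself contain a colon.

```python
from typing import NamedTuple, Optional

DEFAULT_CRS = "OGC:CRS84"
KNOWN_KINDS = ("srid", "projjson")  # the two types named in this thread


class Crs(NamedTuple):
    kind: str        # "default", "srid", or "projjson" ("default" is a made-up label)
    identifier: str


def parse_crs(value: Optional[str]) -> Crs:
    """Parse a CRS parameter string (hypothetical helper, not an Iceberg API)."""
    # A missing value or the literal default is handled before any splitting,
    # because "OGC:CRS84" does not follow the type:identifier pattern.
    if value is None or value == DEFAULT_CRS:
        return Crs("default", DEFAULT_CRS)
    # Split on the first colon only, so identifiers may contain colons.
    kind, sep, identifier = value.partition(":")
    if not sep or kind not in KNOWN_KINDS or not identifier:
        raise ValueError(f"unsupported CRS string: {value!r}")
    return Crs(kind, identifier)
```

Under this reading, `EPSG:4326` would be rejected, because `EPSG` is not one of the listed types. That is the gap the following comments probe.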
Isn't SRID database-specific? In the open specification, the columns are as follows: `SRID`, `AUTH_NAME`, `AUTH_SRID`, `SRTEXT`.
`OGC:CRS84` corresponds to `AUTH_NAME`=`OGC` and `AUTH_SRID`=`CRS84`.
I'm not sure what SRID:4326 means from the `AUTH_NAME` and `AUTH_SRID` perspective.
Does `SRID:4326` mean `EPSG:4326`? Is it common to assume EPSG as the default authority for the codes?
> Does SRID:4326 mean EPSG:4326? Is it common to assume EPSG as the default authority for the codes?

Not exactly. SRID usually means "the primary key in the context of those data". In a spatial database, this is the primary key of the `spatial_ref_sys` table. Theoretically, this is independent of EPSG codes. However, it is a common practice to use the same numerical values for convenience.
For mapping SRID to EPSG code in a spatial database, we need to look at the other columns. If and only if the value of `AUTH_NAME` is `EPSG`, then the value of `AUTH_SRID` is the EPSG code. But we could also have the `IAU` value in the authority column, in which case the authority code is for some astronomical body (Moon, Mars, Jupiter, etc.).
Note that even when `SRID` = 4326, `AUTH_NAME` = "EPSG" and `AUTH_SRID` = 4326, we do not really have "SRID:4326" = "EPSG:4326" because the authoritative definition of the latter has (latitude, longitude) axis order. Any other axis order is not EPSG:4326, but rather some derivative of EPSG:4326. Therefore, `SRID:4326` should rather be understood as "related to EPSG:4326" instead of "same as EPSG:4326".
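The lookup described in this comment can be sketched as follows, assuming an in-memory stand-in for a `spatial_ref_sys` table. The rows and the function name `epsg_code_for` are made up for illustration; in particular, the `IAU` row uses arbitrary numbers and does not claim to be a real registry entry.

```python
from typing import Dict, NamedTuple, Optional


class SpatialRefSysRow(NamedTuple):
    srid: int        # primary key, local to this database
    auth_name: str   # authority, e.g. "EPSG" or "IAU"
    auth_srid: int   # code within that authority


# Hypothetical table contents, for illustration only.
TABLE: Dict[int, SpatialRefSysRow] = {
    4326: SpatialRefSysRow(4326, "EPSG", 4326),
    900001: SpatialRefSysRow(900001, "IAU", 30100),
}


def epsg_code_for(srid: int, table: Dict[int, SpatialRefSysRow]) -> Optional[int]:
    """Return the EPSG code for an SRID, or None if the row is not EPSG-backed."""
    row = table.get(srid)
    if row is None:
        return None
    # Only when AUTH_NAME is EPSG does AUTH_SRID hold an EPSG code.
    # Even then, the stored axis order may differ from the authoritative
    # EPSG definition, so the result is "related to", not "same as".
    return row.auth_srid if row.auth_name == "EPSG" else None
```

Without such a table, an SRID on its own carries no authority, which is why a bare number cannot be resolved unambiguously.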
Thanks for the explanation.
In the absence of any spec for a `spatial_ref_sys` table in Iceberg, should the spec stick with `AUTH_NAME:AUTH_SRID` as the identifier when using the `srid` type as the custom CRS? Currently, it omits `AUTH_NAME` from the identifier and assumes `AUTH_SRID` to be unique across all authorities.
For `geometry` and `geography` types, `lower_bounds` and `upper_bounds` are both points of the following coordinates X, Y, Z, and M (see [Appendix G](#appendix-g-geospatial-notes)) which are the lower / upper bound of all objects in the file. For the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that `x >= xmin` OR `x <= xmax`. In geographic terminology, the concepts of `xmin`, `xmax`, `ymin`, and `ymax` are also known as `westernmost`, `easternmost`, `southernmost` and `northernmost`, respectively. For `geography` types, these points are further restricted to the canonical ranges of [-180 180] for X and [-90 90] for Y.
Do `lower_bounds` and `upper_bounds` mean the bounding box for the geospatial data points in the file? Can it be expanded to explain that the bounds are based on a bounding box and that these values are corners of the bounding box? Maybe this can be added to the Appendix.
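Reading the quoted paragraph as a bounding box whose corners are the lower and upper bound points, the X wraparound rule can be sketched like this. The function names are hypothetical, only X and Y are shown, and the test is on a single coordinate pair rather than a full geometry.

```python
from typing import Tuple

Point2D = Tuple[float, float]  # (x, y); Z and M are omitted in this sketch


def x_in_range(x: float, xmin: float, xmax: float) -> bool:
    """Check X against the bounds, allowing xmin > xmax (wraparound)."""
    if xmin <= xmax:
        return xmin <= x <= xmax
    # Wraparound case from the quoted text: x >= xmin OR x <= xmax.
    return x >= xmin or x <= xmax


def point_may_match(x: float, y: float, lower: Point2D, upper: Point2D) -> bool:
    """Return True if (x, y) falls inside the box given by its two corners."""
    (xmin, ymin), (xmax, ymax) = lower, upper
    # Only X may wrap; Y is always an ordinary closed interval.
    return x_in_range(x, xmin, xmax) and ymin <= y <= ymax


# A box crossing the antimeridian: westernmost 170, easternmost -170.
lower, upper = (170.0, -20.0), (-170.0, 20.0)
inside = point_may_match(179.0, 10.0, lower, upper)
outside = point_may_match(0.0, 10.0, lower, upper)
```

In this example the box covers longitudes from 170 eastward across the antimeridian to -170, so longitude 179 is inside and longitude 0 is not.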
This is the spec change for #10260.
It is also based closely on the decisions taken in the corresponding Parquet proposal: apache/parquet-format#240