Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Proof-of-concept Parquet GEOMETRY logical type implementation #43196

Closed
wants to merge 29 commits into from

Conversation

paleolimbot
Copy link
Member

Rationale for this change

In apache/parquet-format#240 a GEOMETRY logical type for Parquet is proposed with a proof-of-concept Java implementation ( apache/parquet-java#1379 ). This is a PR to explore what an implementation would look like in C++.

What changes are included in this PR?

Work in progress!

Are these changes tested?

They will be!

Are there any user-facing changes?

Yes! (And will eventually be documented)

@paleolimbot paleolimbot force-pushed the parquet-geometry branch 3 times, most recently from aa5424d to b34b1c2 Compare August 12, 2024 19:06
@@ -24,6 +24,7 @@
#include <string>
#include <utility>

#include "parquet/geometry_util.h"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use forward declaration? This is a public header.

}
};

class WKBBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move them into arrow/util instead of parquet only?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h to avoid making it public.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Aug 17, 2024
@jiayuasu
Copy link
Member

@paleolimbot Hey Dewey, I know you have been busy at other projects. We really want to speed up the process of the Parquet Geometry PR since the proposal and Java POC is already done. Do you have an estimated timeline for the C++ POC implementation?

If you have too much on your plate, we will be happy to take over or help with the C++ POC.

@paleolimbot
Copy link
Member Author

Sorry for the slow reply, I've been out of office! You are of course welcome to continue on top of this PR...I'm back in office Thursday and can definitely review. I think the main remaining portion is the Arrow IO and serde of the statistics.

@jiayuasu
Copy link
Member

jiayuasu commented Sep 3, 2024

@paleolimbot Thanks for the message. We will soon create a PR against your branch!

crs = type.GEOMETRY.crs;
}

LogicalType::GeometryEdges::edges edges = LogicalType::GeometryEdges::UNKNOWN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we throw when edges is UNKNOWN? It is a required field by design.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was emulating how enum parameters are handled in the date/time types here (but I may not have been!)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in fb134d3

}

LogicalType::GeometryEncoding::geometry_encoding encoding =
LogicalType::GeometryEncoding::UNKNOWN;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, this is a required field by design.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed in fb134d3

@@ -135,6 +186,7 @@ class PARQUET_EXPORT EncodedStatistics {
bool has_max = false;
bool has_null_count = false;
bool has_distinct_count = false;
bool has_geometry_statistics = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know other fields already have has_xxx flags before C++11. Should we use std::optional<EncodedGeometryStatistics> to make it simpler?

Copy link
Member Author

@paleolimbot paleolimbot Sep 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My sense would be to handle changing that pattern with a dedicated PR but I have no opinions either way 🙂

return std::static_pointer_cast<TypedStatistics<DType>>(Statistics::Make(
descr, encoded_min, encoded_max, num_values, null_count, distinct_count,
has_min_max, has_null_count, has_distinct_count, pool));
int64_t distinct_count, const EncodedGeometryStatistics& geometry_statistics,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this will break downstream users as it is a public API.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would adding the two new arguments to the end of the argument list (with default values) make this a non-breaking change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to make it ABI-compatible or source-compatible? If ABI compatibility is required we must add a new override version of this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think API-compatible is sufficient. Please correct me if I'm wrong @pitrou

bounder_.ReadGeometryTypes(other.bounder_.WkbTypes());
}

void Update(const ByteArray* values, int64_t num_values, int64_t null_count) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a UpdateSpaced variant which deals with inputs with validity bits, like what TypeStatistics::UpdateSpaced does.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added UpdateSpaced in fb134d3, will add tests for it later.

}
};

class WKBBuffer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking twice, I think we can keep this in the parquet module. But it should be renamed to geometry_util_internal.h to avoid making it public.

class WKBGenericSequenceBounder {
public:
WKBGenericSequenceBounder()
: xy_(chunk_),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to keep 8 variants?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is me being fancy and trying to avoid branching 🙂 . A runtime switch is probably fine for a POC and it's possibly worth benchmarking before introducing this complexity.

@wgtmac
Copy link
Member

wgtmac commented Sep 5, 2024

Thanks @paleolimbot for initiating the PoC! I've left some inline comments on this PR.

Thanks @jiayuasu and @Kontinuation to continue the effort! Let's work together to make it to the finish line.

@paleolimbot
Copy link
Member Author

Closing in favour of #43977 !

paleolimbot pushed a commit to paleolimbot/arrow that referenced this pull request Feb 6, 2025
paleolimbot pushed a commit to paleolimbot/arrow that referenced this pull request Mar 10, 2025
Geometry value writer could make use of the geometry statistics class to
populate geometry statistics

Geometry column writer now populates correct statistics

format/tidy

Run clang-tidy

Added a test that writes and reads a parquet file containing a geometry column

Remove redundant include

Fix problems found by reviewers

Try to make it build properly on other platforms

Address review comments in apache#43196

Resolve compile errors for MSVC

Expose getters in GeometryStatistics, Change geometry_types from
std::vector<uint32_t> to std::vector<int32_t>, several other minor fixes.

Add test case for UpdateSpaced, don't generate min/max stats for geometry
columns.

Support covering

MakeStatistics and Statistics::Make should not be a breaking change

ColumnIndex, as well as some other fixes and refacturings

Fix compiler warnings on AMD platforms as well as sanitizer warnings

Remove all newly added include directives

include cmath for std::isnan

Test writing WKB encoded geometries using WriteArrow

Change the sort order of geometry from unknown to unsigned; resolved several review comments

Add generate_covering_ member to be explicit that' we'll generate the coverings from bounding box when populating the encoded statistics

Refactor unscoped enums in geometry_util_internal to enum classes

Revert more special case handling for unknown sort order

Fix WKB covering test to take native endianness into consideration

min/max of geometry columns are the WKB representation of lower-left and upper-right points

Address latest review comments

A better implementation of geometry min/max statistics

Update the code to accomodate the latest changes of the standard:
1. Remove metadata property of geometry logical types
2. Remove covering from geometry statistics

Fix problem decoding WKB geometries with more than 32 coordinates

Re-implemented geometry statistics according to the updated spec:
1. geometry statistics moved out of statistics, it is now a field of column metadata
2. geometry statistics is removed from page index

Revert some unnecessary changes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants