Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add spatial support #12053

Draft
wants to merge 14 commits into
base: main
Choose a base branch
from
Draft

Conversation

wraymo
Copy link

@wraymo wraymo commented Jan 10, 2025

Summary:
This draft PR introduces spatial data support (#11814) to Velox, including the following key updates:

  • Added support for geometry types within Velox.
  • Introduced the GEOS library as a dependency for spatial operations.
  • Implemented serialization/deserialization for geometry types using GEOS.
  • Added cast operators to convert between VARCHAR (WKT) or VARBINARY (WKB) and geometry types.
  • Implemented some spatial functions, including ST_Point and ST_Contains.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 10, 2025
Copy link

netlify bot commented Jan 10, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit f48cabc
🔍 Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/679ac5fd343f2c0008456dd9

Copy link
Collaborator

@czentgr czentgr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First initial thoughts.

include_guard(GLOBAL)

# GEOS Configuration
set(VELOX_GEOS_BUILD_VERSION 3.11.1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this as a non-bundled dependency as well to the setup scripts?
Also please update the README in this folder.


void registerGeometryType();

} // namespace facebook::velox
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a new line.

namespace facebook::velox {

/// Represents Geometry as a string.
class GeometryType : public VarcharType {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose we use varchar here because we serialize the GEOS object into a string via the StringWriter, right?
This would then be the current internal representation and handled like a regular varchar in exchanges and such.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I understand, I can update this to VarbinaryType because both types are serialized into variable-width blocks, and in Velox, we can use StringWriter for handling both.

@jagill
Copy link

jagill commented Jan 29, 2025

Sorry for the delay, @wraymo! I'll get to this by the end of the week. Thanks for your patience.

@jagill
Copy link

jagill commented Jan 31, 2025

Thanks @wraymo for this work! Since this is just a draft, I'll keep most of my comments high-level.

  1. To keep the commit history clean, please use git commit --amend and git rebase freely on your local machine, then use git push --force to have this PR always have a clean commit history. It's great to separate out the commits into clean, functionally complete blocks. For example, for geospatial I'd have one commit that does all the CMake stuff, one that defines the basic type structure (include de/serialization), one that adds a function with the basic functional skeleton, etc. I'm not focused on the actual specifics for the commits, just that the commits should be easily understandable on their own to someone looking at this PR for the first time.

  2. The CMake bundling is getting close to ready. If you want, we can separate that out into it's own PR, and rebase this PR on top of that. Then we can ship that once it's done while we discuss the rest!

  3. Let's use Varbinary as the parent type of Geometry, with HyperLogLog as a model, including the way it does serialization/deserialization.

  4. For code organization, let's make a separate folder velox/functions/prestosql/types/geo to hold most of the logic. We can revisit this later.

I want to emphasize this is a great draft, and you don't have to do everything yourself! I'm happy to take on this work if you don't have the time for any or all of it, but if you wanted to start finalizing parts of the PR we can start shipping initial bits.

@wraymo
Copy link
Author

wraymo commented Jan 31, 2025

Thanks @wraymo for this work! Since this is just a draft, I'll keep most of my comments high-level.

  1. To keep the commit history clean, please use git commit --amend and git rebase freely on your local machine, then use git push --force to have this PR always have a clean commit history. It's great to separate out the commits into clean, functionally complete blocks. For example, for geospatial I'd have one commit that does all the CMake stuff, one that defines the basic type structure (include de/serialization), one that adds a function with the basic functional skeleton, etc. I'm not focused on the actual specifics for the commits, just that the commits should be easily understandable on their own to someone looking at this PR for the first time.
  2. The CMake bundling is getting close to ready. If you want, we can separate that out into it's own PR, and rebase this PR on top of that. Then we can ship that once it's done while we discuss the rest!
  3. Let's use Varbinary as the parent type of Geometry, with HyperLogLog as a model, including the way it does serialization/deserialization.
  4. For code organization, let's make a separate folder velox/functions/prestosql/types/geo to hold most of the logic. We can revisit this later.

I want to emphasize this is a great draft, and you don't have to do everything yourself! I'm happy to take on this work if you don't have the time for any or all of it, but if you wanted to start finalizing parts of the PR we can start shipping initial bits.

Thanks @jagill for the thoughtful feedback! That all makes sense. I'll use git commit --amend if that's the preferred way.

Although this is a draft PR, we've deployed it in one of our clusters alongside a basic local spatial join (using a nested loop) and tested it on several queries. So far, the results match those from the Java workers, which is promising.

Regarding splitting this PR, my plan aligns closely with this issue—seems like we're on the same page! For the first PR, do you prefer introducing the Geometry type (modeled like HyperLogLogType and adding a test in velox/functions/prestosql/types/tests), or should we first add GEOS as a dependency in the CMake files? If the latter, what parts are missing from this PR?

For code organization, would it make sense to place serialization and related utility functions in velox/functions/prestosql/geo, similar to how velox/functions/prestosql/json is structured?

I have plenty of time to work on this and am happy to contribute however I can. Let me know how you'd like to proceed, and I'll push updates accordingly.

Thanks again for your guidance and for offering to help!

@jagill jagill self-assigned this Jan 31, 2025
@jagill
Copy link

jagill commented Jan 31, 2025

Although this is a draft PR, we've deployed it in one of our clusters alongside a basic local spatial join (using a nested loop) and tested it on several queries. So far, the results match those from the Java workers, which is promising.

This is great to hear!

For the first PR, do you prefer introducing the Geometry type ..., or should we first add GEOS as a dependency ...?

For steps:

  1. Let's first add GEOS as a dependency, because we'll need that for almost anything interesting.
  2. Then let's add the geometry and very basic serialization/deserialization. There will probably be a bunch of discussion about some of the choices, in particular the API. For serialization/deserialization, we'll probably consider several options. But for v0, maybe we can just use built-in WKB? This is not efficient! But it will unlock the ability to start adding Geospatial functions.
  3. In parallel, we can...
    A. start adding geospatial functions, and
    B. interested parties can think about better serialization.

For what's needed for the GEOS dependency, let's make sure:

  1. GEOS import and geospatial capabilities are gated by a flag VELOX_ENABLE_GEO.
  2. If it's enabled, people can either use a bundled version or a non-bundled version. The flag geos_SOURCE determines this: the values can be AUTO , SYSTEM, or BUNDLED (AUTO is for cmake to figure out what works best). But let's test this.

For code organization, would it make sense to place serialization and related utility functions in velox/functions/prestosql/geo, similar to how velox/functions/prestosql/json is structured?

Yes, this sounds good!

I have plenty of time to work on this and am happy to contribute however I can. Let me know how you'd like to proceed, and I'll push updates accordingly.

Great to have you contributing! Let's start with the Geos import. We'll have some people more familiar with Velox build give feedback.

@wraymo
Copy link
Author

wraymo commented Feb 4, 2025

@jagill Thanks for your comment! I just submitted a PR to add GEOS as an optional dependency. My current PR serializes data in Shapefile format. Should we use Shapefile first, like the Java worker, to avoid compatibility issues (e.g. if the coordinator sends Shapefile-serialized data to a Velox worker)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants