Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Geometry extension with GeoArrow (non-binary) geometry #703

Open
kylebarron opened this issue Sep 11, 2024 · 3 comments
Open

Geometry extension with GeoArrow (non-binary) geometry #703

kylebarron opened this issue Sep 11, 2024 · 3 comments

Comments

@kylebarron
Copy link

👋 I'm a core contributor to GeoArrow, but I'm new to substrait and just reading through the docs.

I was surprised to see that a geospatial extension was already started/added in #543, #552, #554, #556.

But the geometry type's structure is defined as binary, which is (seems) incompatible with GeoArrow's layout of nested lists/structs.

types:
- name: geometry
structure: "BINARY"

What's the right way to handle this? Some potential implementations like DuckDB/Postgres would indeed have an opaque binary blob, but newer GeoArrow-based implementations would not have a binary object. Should GeoArrow be creating its own substrait extension that build on GeoArrow data types? That would mean multiple substrait geospatial extensions that use binary blobs or native geometries as a base.

For reference, I'm quite interested in geospatial support in substrait, with a long term goal of building a geospatial substrait executor backed by DataFusion and geoarrow-rs once DataFusion gets user-defined type support apache/datafusion#11513.

cc @jorisvandenbossche, @paleolimbot as fellow GeoArrow contributors and since they commented in #543

@kylebarron kylebarron changed the title Geometry extension with non-binary geometry Geometry extension with GeoArrow (non-binary) geometry Sep 11, 2024
@paleolimbot
Copy link

I think I was vaguely responsible for pushing the initial bit towards BINARY in review, which I did because the number of types explodes if you give each geometry_type/dimension its own type. Because many (most, all?) implementations right now have a "geometry" type where each element could be any geometry type/dimension, we would still need those type/function definitions to exist.

It would be very cool to define the geometry type and/or dimension-specific types and overloads of the functions that give a more precise output type (for example, intersection(point, point) will always be a point). I forget whether the existing yaml syntax for defining these things made this impossible or just made it very verbose.

@westonpace
Copy link
Member

What's the right way to handle this? Some potential implementations like DuckDB/Postgres would indeed have an opaque binary blob, but newer GeoArrow-based implementations would not have a binary object. Should GeoArrow be creating its own substrait extension that build on GeoArrow data types? That would mean multiple substrait geospatial extensions that use binary blobs or native geometries as a base.

The "structure" field in Substrait is (to the best of my knowledge) not used much yet. Types in Substrait are a little different than types in other libraries because Substrait doesn't really need to know how to create arrays. In fact, Substrait really only cares about types with regards to functions "what functions are allowed for a given type?"

Some examples:

  • INT32 and INT8 are different Substrait types because the add(200, 200) function behaves differently between the two types (it gives an error for INT8 and 400 for INT32.
  • STRING and DICTIONARY are not different types because all functions should behave the same

In practice, there is one other spot where we need to know some details about the type, and that is when we need to create a type literal (e.g. if we want to represent the plan SELECT x + 2 FROM my_table then we need some way of encoding 2 into protobuf). The structure can be used for this but we can also use a protobuf Any message. I'm not sure anyone is using user defined types in Substrait extensively enough to worry too much about the structure field yet. I've personally been using the protobuf Any approach in arrow-cpp (details here).

So, if you have a better suggestion, go ahead and change it. I'd encourage anything other than BINARY which gives zero information on how to interpret the value. It actually doesn't matter what an engine is using under the hood. For example, if we change it to nested list / struct and some implementation is using opaque binary then that implementation should still be able to consume the plan (it will just need to convert literals from the nested list / struct representation to the opaque blob representation).

For reference, I'm quite interested in geospatial support in substrait, with a long term goal of building a geospatial substrait executor backed by DataFusion and geoarrow-rs once DataFusion gets user-defined type

I'd recommend reviewing the available functions and making sure you agree with them.

Is it common to include geospatial literals in queries? If so, we might want to canonicalize a message that can be used as a geospatial literal. I'm not enitrely sure where it would go but would encourage you to share some protobuf you come up with and we can find a spot.

It would be very cool to define the geometry type and/or dimension-specific types and overloads of the functions that give a more precise output type (for example, intersection(point, point) will always be a point). I forget whether the existing yaml syntax for defining these things made this impossible or just made it very verbose.

User defined types should be parameterizable if that is what you are after. I'm not entirely sure I understand though.

@kylebarron
Copy link
Author

Thanks! This is really insightful background.

So, if you have a better suggestion, go ahead and change it. I'd encourage anything other than BINARY which gives zero information on how to interpret the value. It actually doesn't matter what an engine is using under the hood. For example, if we change it to nested list / struct and some implementation is using opaque binary then that implementation should still be able to consume the plan (it will just need to convert literals from the nested list / struct representation to the opaque blob representation).

So if I understand correctly, this structure field is really the structure of literals? Nested list/struct that matches GeoArrow would be my own preference, especially because for a single literal that's also the logical coordinate structure of GeoJSON, which most systems should be able to parse.

I'd recommend reviewing the available functions and making sure you agree with them.

I figure that the substrait functions should more or less mirror PostGIS functions (built on GEOS). At a glance the existing substrait functions seem reasonable.

Is it common to include geospatial literals in queries? If so, we might want to canonicalize a message that can be used as a geospatial literal. I'm not enitrely sure where it would go but would encourage you to share some protobuf you come up with and we can find a spot.

I think it's pretty common. Say you want to find the distance of all geometries from a specific point, then you want to include that point in your input query.

I haven't worked with protobuf before. I figure there's some way to represent something GeoJSON-like in protobuf (nested lists of variable-size-list coordinates).

It would be very cool to define the geometry type and/or dimension-specific types and overloads of the functions that give a more precise output type (for example, intersection(point, point) will always be a point). I forget whether the existing yaml syntax for defining these things made this impossible or just made it very verbose.

That would be really cool. Sounds like maybe a good deal of work to stabilize, but if these functions only have to be defined once and can be pretty stable, perhaps it's worth it. But for now we can work with generically-typed geometry objects.

User defined types should be parameterizable if that is what you are after. I'm not entirely sure I understand though.

I think we'd need some parameterization to declare whether a geometry is a point, LineString, Polygon, etc. And maybe some type-level parameterization about the Coordinate Reference System, but maybe that's a black hole 🕳️ .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants