-
I'm wondering whether the recommendations should be a bit more nuanced.
-
@cholmes Just modified my data to add all those things in. Take a look and let me know if that matches.
-
Thanks for kicking this off and for the initial comparisons. For Overture, since we use Spark + Sedona I wanted something that just worked natively there rather than a separate process. Here's what I came up with after some experimentation:
We try to target ~1GB files. As you mentioned, it's nothing fancy and pretty naive, but at the time it seemed good enough for a first cut, so I'm pleased it seems to be performing well. It amounts to a distributed global sort on the level 15 geohash. There's nothing special about geohash (vs. quadkeys or S2) other than it was easily available in Sedona.
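A rough sketch of what a geohash global sort like this can look like in PySpark + Sedona (illustrative only, not the actual Overture pipeline; the paths, partition count, and writer format are assumptions):

```python
# Illustrative sketch: distributed global sort on a geohash, written out so
# that each output file covers a contiguous range of geohashes.
from pyspark.sql.functions import expr
from sedona.spark import SedonaContext

config = SedonaContext.builder().getOrCreate()
sedona = SedonaContext.create(config)

df = sedona.read.format("geoparquet").load("s3://bucket/buildings-raw/")

(df
 .withColumn("geohash", expr("ST_GeoHash(geometry, 15)"))
 .repartitionByRange(2000, "geohash")   # pick a count that yields ~1GB files
 .sortWithinPartitions("geohash")
 .drop("geohash")
 .write.format("geoparquet")
 .save("s3://bucket/buildings-sorted/"))
```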
The file size skew issue came up when I experimented with recursive quadtree partitioning. If you keep splitting until a partition is below a certain threshold, you eventually end up with some partitions holding just a handful of rows and others holding a hundred thousand. I abandoned that approach after that; it was also really slow. I don't think we put much thought into target file size or row group size either. Something like a k-d tree is probably ideal here (and maybe what Fused is doing?). It would be interesting to come up with a framework for understanding the relative performance of all these different approaches, but there are so many tradeoffs, and they all depend on the downstream use case, that it will be hard to really rank them. I think improving acquisition of the parquet metadata will drive the next big jump in performance. I started a discussion here: OvertureMaps/data#219, although maybe STAC-geoparquet is a better way.
-
I put together a (currently impractical, since the geography extension isn't published yet) reproducible notebook with some strategies. Those are DuckDB/S2-based (sort by cell and write by file size, or partition by cell), but I think the same idea applies to a Hilbert sort or a geohash sort. It's tricky to get both the file size constraints and the spatial partitioning constraints satisfied (unless your data is homogeneous in space, which is basically never the case). I'd love to see a Sedona version of this too; I'm just not that good at Sedona yet 😬 https://gist.github.com/paleolimbot/4bbfaf9dd79a306e21e59156004c7e33 I think ultimately a multi-pass approach could get us both.
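For a flavour of the "sort spatially, cut files at a size target" half of that, here's a rough DuckDB equivalent using the published spatial extension's ST_Hilbert in place of S2 (paths, the size target, and the row group size are placeholders). Note it only bounds file size; the spatial footprint of each file is whatever the sort happens to produce, which is exactly the tension described above:

```python
# Rough sketch: Hilbert-sort the rows, then let the Parquet writer start a new
# file roughly every ~1GB. Paths and thresholds are placeholders.
import duckdb

con = duckdb.connect()
con.sql("INSTALL spatial; LOAD spatial;")

con.sql("""
    COPY (
        SELECT *
        FROM 'buildings.parquet'
        ORDER BY ST_Hilbert(
            geometry,
            (SELECT ST_Extent(ST_Envelope_Agg(geometry)) FROM 'buildings.parquet')
        )
    )
    TO 'buildings_hilbert' (
        FORMAT PARQUET,
        COMPRESSION ZSTD,
        ROW_GROUP_SIZE 100000,
        FILE_SIZE_BYTES 1000000000  -- ~1GB per file
    );
""")
```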
-
Very interested indeed; I would like to convince my organization to publish the WDPA as GeoParquet. I hope it will be easy to produce it in the most performant manner! https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA About this dataset: 300,000+ (multi)polygons and 629,323,141 vertices, equivalent to a 2GB GDB file. Some places are much more complex than others; for instance, southern Chile, western Canada, Germany, Switzerland, Austria, Sweden... have the highest vertex density per km² and are slower to process.
-
My default universal approach would be:
-
Over the Thanksgiving break I hacked on a new QGIS plugin to download GeoParquet data. I have sections for Overture, Source Cooperative and Hugging Face, and can add more. But I realized that there is a wide variety in what people are actually hosting online, and that the performance of downloading specific areas could be improved in many of them.
I'm going to work on a PR to the main repo to point people at 'publishing best practices'. I'm thinking of also making a little CLI tool to quickly 'test' the best practices, and perhaps include it in the plugin too. On the last call we talked a bit about them, and we'll refine in the PR, but the quick version is: use GeoParquet 1.1, include the bbox struct, compress with zstd, and spatially sort the rows within each file.
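A rough sketch of the sort of checks such a CLI might run, using pyarrow (the specific checks and names here are illustrative, not a settled spec):

```python
# Illustrative checks against the 'quick version' best practices above.
import json
import pyarrow.parquet as pq

def check_geoparquet(path: str) -> dict:
    meta = pq.ParquetFile(path).metadata
    kv = meta.metadata or {}
    geo = json.loads(kv.get(b"geo", b"{}"))

    # Compression of the first column chunk in the first row group.
    compression = meta.row_group(0).column(0).compression if meta.num_row_groups else None

    return {
        "geoparquet_version": geo.get("version"),
        "has_bbox_covering": any(
            "covering" in col for col in geo.get("columns", {}).values()
        ),
        "compression": compression,
        "num_row_groups": meta.num_row_groups,
        "rows_per_row_group": meta.num_rows / max(meta.num_row_groups, 1),
    }

print(check_geoparquet("buildings.parquet"))
```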
And then the other main advice is 'spatially partition the files', but there haven't been great guidelines. So I'm hoping to use this discussion topic to gather experience from more people on what's working, how it performs, and what we want to try next. I'll give some quick overviews of what I saw:
Overture
Generally performed best, in my limited testing. @jwass said he'd give an overview of what they're doing, so I won't do a deep dive. He did mention it's still pretty naive, with no 'balancing', so some of the partitions are quite big and some are quite small. But the fact that partitions are done and everything else is right puts it ahead of most.
s3://overturemaps-us-west-2/release/2024-11-13.0/theme=buildings/type=building/*
Vida
I was surprised this didn't perform as well as Overture, but didn't have time to dig into why not (and it may have just been some user error / randomness). They use the admin-partitioned geoparquet technique I wrote about, which I still find promising, and also split by S2 cells for anything over ~2GB (I believe).
s3://us-west-2.opendata.source.coop/vida/google-microsoft-osm-open-buildings/geoparquet/by_country/country_iso=*/*.parquet is partitioned by country.
s3://us-west-2.opendata.source.coop/vida/google-microsoft-osm-open-buildings/geoparquet/by_country_s2/country_sio=*/*.parquet adds the S2 split.
The latter wasn't working quite right with glob search due to a schema difference, so I mostly used the former, which may have been why it was slower, as much of my testing was in the US, which is 20GB.
Fused on Source (overture and FSQ)
This seems to be one of the better spatial partitions, though I'm not that sure of the details. But they seem a bit more weighted / dynamically made, with smaller boxes in more populous areas.
Hopefully they can share more about the partitioning strategy. I haven't yet done much performance testing with it, as their first release did not have the bbox struct (but perhaps I just entirely missed it?) and I only added the non-bbox reading code recently. Their most recent release seems to have the bbox struct, so I'll start testing it out soon.
They do still mark the geoparquet version as 1.0.0-beta.1, but I think with the bbox struct in there it doesn't really matter.
s3://us-west-2.opendata.source.coop/fused/fsq-os-places/2024-12-03/places/*
US Structures by Wherobots
https://source.coop/wherobots/usa-structures/geoparquet
s3://us-west-2.opendata.source.coop/wherobots/usa-structures/geoparquet/*.parquet
This one for sure gets a 'pass', as it was Matt's first publishing of this data, but it makes me wonder about the defaults in Sedona.
It is GeoParquet 1.1, but with no bbox struct and no zstd, though it does seem to be spatially sorted internally. It's got a top-level hive partition by building type, which would be useful for speeding up any queries with the building type in the WHERE clause, but for my tool I'd first have it go across all of them. It looks like a regular grid is used for the partitioning, which works for residential, but some of the other types end up with files of around 700KB, which seems less optimal.
Foursquare on Hugging Face
The initial release did not do GeoParquet, but the second did 1.1 with bbox, zstd and spatially sorted files. But each file is still global, so they did not yet get to partitioning spatially. They do seem quite interested in doing that for the next release, and are part of the reason I wanted to get this discussion going. The performance in my plugin is noticeably worse than with spatially partitioned data, so it's a good data point that partitions of any sort seem to help.
GBIF on Source
https://source.coop/cboettig/gbif/2024-10-01
Another one I didn't get a chance to dig into, as it didn't have the bbox struct so it missed my initial cut. But it's got 11 levels of H3 indexes built into every row, so it should be possible to do some fast queries leveraging those.
Notes / thoughts
Please add any others! I'm sure I missed some.
I didn't manage to do much testing with the different Source Cooperative ones, as something was off. The source.coop HTTP URLs work, but you can't do the glob (*) matching over HTTP. The S3 links aren't displayed quite right on source.coop right now. I did get the Vida S3 URL, and it works in my plugin, but I had a blip where it wasn't working in DuckDB. It works now, but I only had time to add the URL for one Fused one. Will try to add more later.
I do think it'd be good to get to some 'best practice' for balanced partitioning of files, with a reference implementation, and ideally start to get code for it into any libraries that offer partitioning.
The most straightforward seems to be an s2 or quadkey option where you can just set a 'max size' (ideally based on file size, but could just be number of rows), and then it'd split the cells until it gets down to a cell size where all are below the max size threshold. And I remain interested in the admin-boundary partitions, for a more natural file that people would download directly.
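As a sketch of that splitting idea (quadkey prefixes, with row counts standing in for file size; the threshold, depth, and function names are made up):

```python
# Recursively split a quadkey cell until every partition is under MAX_ROWS.
# Row counts stand in for the real target, which would ideally be file size.
MAX_ROWS = 1_000_000
MAX_DEPTH = 15

def split_cells(row_quadkeys, prefix=""):
    """Return {quadkey_prefix: row_count} with every partition <= MAX_ROWS."""
    if len(row_quadkeys) <= MAX_ROWS or len(prefix) >= MAX_DEPTH:
        return {prefix: len(row_quadkeys)}
    # Group rows by the next quadkey digit, i.e. the four child cells.
    children = {}
    for qk in row_quadkeys:
        children.setdefault(qk[: len(prefix) + 1], []).append(qk)
    partitions = {}
    for child_prefix, rows in children.items():
        partitions.update(split_cells(rows, child_prefix))
    return partitions

# Usage: compute a deep (level 15) quadkey per feature up front, then:
# partitions = split_cells(list_of_level15_quadkeys)
```

This is also where the skew mentioned above shows up: a dense cell's children can still range from a handful of rows to just under the threshold.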
I do think it'd be good to get some straightforward 'default' way that we recommend, and then also work to experiment and actually compare. I do think leveraging the hive partitions remains intriguing - especially having the client calculate its own country_iso, quadkey, s2 or h3 and including that in the WHERE clause, as sketched below. It seems like it'd cut out a lot of metadata queries asking lots of files about their bounds. And then some sort of 'index file' might be worth exploring - perhaps STAC, perhaps another GeoParquet file. (Though then I start thinking about building overviews into an index file so we can display zoomed-out data, but that's another whole topic.)
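For example, the "client calculates its own key" idea could look something like this (the bucket layout, the quadkey partition column, the zoom level, and the bbox struct are all hypothetical here):

```python
# Hypothetical layout: files hive-partitioned as .../quadkey=<zoom 9 key>/*.parquet,
# each with a GeoParquet 1.1 bbox struct column. The client computes which
# quadkeys cover its area of interest and filters on both.
import duckdb
import mercantile

aoi = (-122.52, 37.70, -122.35, 37.83)   # (west, south, east, north)
tiles = mercantile.tiles(*aoi, zooms=9)
quadkey_list = ",".join(f"'{mercantile.quadkey(t)}'" for t in tiles)

con = duckdb.connect()
con.sql("INSTALL httpfs; LOAD httpfs;")
result = con.sql(f"""
    SELECT *
    FROM read_parquet('s3://bucket/buildings/quadkey=*/*.parquet', hive_partitioning = true)
    WHERE quadkey IN ({quadkey_list})
      AND bbox.xmin < {aoi[2]} AND bbox.xmax > {aoi[0]}
      AND bbox.ymin < {aoi[3]} AND bbox.ymax > {aoi[1]}
""").df()
```

The quadkey filter prunes whole files before any per-file metadata requests, and the bbox filter then lets the reader skip row groups within the files that remain.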
Please weigh in with any experience you have with partitioning, even if you were just playing around, so we can all learn from each other.