Cloud-hosted & partitioned files best practices #251

cholmes · 2024-12-05T05:55:13Z

cholmes
Dec 5, 2024
Maintainer

Over the thanksgiving break I hacked on a new QGIS plugin to download GeoParquet data. I have sections for Overture, Source Cooperative and Hugging Face, and can add more. But I realized that there is a wide variety in what people are actually hosting online, and that the performance of downloading specific areas could be improved in many.

I'm going to work on a PR to the main repo to point people at 'publishing best practices'. Thinking of also making a little CLI tool to quickly 'test' the best practices, and perhaps include it in the plugin too. On the last call we talked a bit about them, and we'll refine in the PR, but the quick version is:

Use GeoParquet 1.1 with the bbox struct, and spatially organize the data (r-tree, geohash, whatever is easy with your tool) for efficient spatial querying (found 1.0-beta.1, 1.0-dev, and 1.1's without BBOX)
Use ZSTD for compression (I found a lot of snappy and even some uncompressed - smaller files will download faster)
Use a row group size between 100,000 and 200,00 rows per group (there's more subtleties with this one, but there were seveal with more than 1,000,000 per row group, which doesn't work great.

And then the other main advice is 'spatially partition the files', but there's not been great guidelines. So I'm hoping to use this discussion topic to gather experience from more people on what's working, how it performs, what we want to try next. I'll give some quick overviews of what I saw:

Overture

Generally performed best, in my limited testing. @jwass said he'd give an overview of what they're doing, so I won't do a deep dive. He did mention it's still pretty naive, with no 'balancing', so some of the partitions are quite big and some are quite small. But the fact that partitions are done and everything else is right puts it ahead of most.

s3://overturemaps-us-west-2/release/2024-11-13.0/theme=buildings/type=building/*

Vida

I was surprised this didn't perform as well as Overture, but didn't have time to dig into why not (and it may have just been some user error / random). They use the admin-partitioned geoparquet technique I wrote about, which I still find promising, and also split by s2 cells for any thing over ~2gb (I believe).

s3://us-west-2.opendata.source.coop/vida/google-microsoft-osm-open-buildings/geoparquet/by_country/country_iso=*/*.parquet is by country
s3://us-west-2.opendata.source.coop/vida/google-microsoft-osm-open-buildings/geoparquet/by_country_s2/country_sio=*/*.parquet

The latter wasn't working quite right with glob search due to a schema difference, so I mostly used the former, which may have been why it was slower, as much of my testing was in the US, which is 20gb.

Fused on Source (overture and FSQ)

This seems to be one of the better spatial partitions, though I'm not that sure of the details. But they seem a bit more weighted / dynamically made, with smaller boxes in more populous areas.

Hopefully they can share more about the partitioning strategy. I haven't yet done much performance testing with it, as their first release did not have the bbox struct (but perhaps I just entirely missed it?) and I only added the non-bbox reading code recently. Their most recent release seems to have the bbox struct, so I'll start testing it out soon.

They do still mark the geoparquet version as 1.0.0-beta.1, but I think with the bbox struct in there it doesn't really matter.

s3://us-west-2.opendata.source.coop/fused/fsq-os-places/2024-12-03/places/*

US Structures by Wherobots

https://source.coop/wherobots/usa-structures/geoparquet

s3://us-west-2.opendata.source.coop/wherobots/usa-structures/geoparquet/*.parquet

This one for sure gets a 'pass', as it was Matt's first publishing of this data, but it makes me wonder about the defaults in Sedona.

It is GeoParquet 1.1, but no bbox struct, and not zstd, but it does seem to be spatially sorted internally. It's got a top level hive partition by building type, which would be useful to speed any queries with the building type in the WHERE clause, but for writing my tool I'd first have it go across all of them. It looks like a regular grid for the partitioning, which works for residential, but some of the other types will have like 700kb files, which seems less optimal.

Foursquare on Hugging Face

The initial release did not do GeoParquet, but the second did 1.1 with bbox, zstd and spatially sorted files. But each file is still global, so they did not yet get to partitioning spatially. They do seem quite interested in doing that for the next release, and are part of the reason I wanted to get this discussion going. The performance in my plugin is definitely less good compared to spatially partitioned data, so it's a good data point that partitions of any sort seem to help.

GBIF on Source

https://source.coop/cboettig/gbif/2024-10-01

Another one I didn't get a chance to dig into as it didn't have bbox struct so it missed my initial cut. But it's got 11 levels of h2 indexes built into every row, so should be able to do some fast queries leveraging those.

Notes / thoughts

Please add any others! I'm sure I missed some.

I didn't manage to do much testing with the different source ones, as something was off. The source.coop urls http url's work, but you can't do the glob (*) matching on http. The s3 links aren't displayed quite right on source.coop right now. I did get the Vida s3 url, and it works in my plugin, but I had a blip where it wasn't working in DuckDB. It now works, but I only had time to add the url for one fused one. Will try to add more later.

I do think it'd be good to get to some 'best practice' for balanced partitioning of files, with a reference implementation, and ideally start to get code for it into any libraries that offer partitioning.

The most straightforward seems to be an s2 or quadkey option where you can just set a 'max size' (ideally based on file size, but could just be number of rows), and then it'd split the cells until it gets down to a cell size where all are below the max size threshold. And I remain interested in the admin-boundary partitions, for a more natural file that people would download directly.

I do think it'd be good to get some straight forward 'default' way that we recommend, and then also work to experiment and actually compare. I do think leveraging the hive partitions remains intriguing - especially having the client calculate its own country_iso, quadkey, s2 or h3 and including that in the WHERE clause. It seems like it'd cut out a lot of metadata queries asking lots of files about their bounds. And then some sort of 'index file' might be worth exploring - perhaps STAC, perhaps another Geoparquet. (Though then I start thinking about building overviews into an index file so we can display zoomed out data, but that's another whole topic).

Please sound in with any experience you have with partitioning, even if you were just playing around, so we can all learn from each other.

m-mohr · 2024-12-05T13:28:46Z

m-mohr
Dec 5, 2024
Collaborator

I'm wondering whether the recommendations should be a bit more nuanced.
The recommended values (esp. row group size) are likely different for large scale data processing compared to browser consumption. Probably some experiments needed, I'd guess something in the lower five digit numbers might be good. Compression with ZStd seems to be an okay compromise for browsers, at least there's a pure JS encoder available. But this might also need some experiments.
(I acknowledge that browser consumption is probably more a niche usecase right now, but I could see that become more viable in the future.)

6 replies

rbavery Dec 5, 2024

It might be helpful to call out that specifically for stac-geoparquet, defining smaller row groups (10s or low hundreds of rows) might be helpful if those row groups are used to dictate how much raster data is fetched in batch. Going with the above suggestion, 100,000 256x256x3 uint8 rasters (very small) still adds up to a bit over 18Gb

cholmes Dec 5, 2024
Maintainer Author

Interesting - my default for working with stac-geoparquet isn't to load any raster data, but I get that it's easy with sedona / wherobots. Though are the row groups used to dictate how much raster data is fetched in a batch? I think yours is the only implementation I know that does this...

I would say this recommendation would belong in stac-geoparquet - and would be great to have there. You want to just do a PR to that repo?

TomAugspurger Dec 5, 2024
Maintainer

if those row groups are used to dictate how much raster data is fetched in batch

IMO, that wouldn't be a good assumption for clients to make. I think that clients consuming these files will generally have more information about what they're doing and so can adjust the batch size by splitting "large" row groups if they're doing a memory-intensive operation like loading rasters.

kylebarron Dec 5, 2024
Maintainer

10s to 100s is extremely low as a suggestion for stac geoparquet, especially with the general behavior that STAC is only a reference to raster data and doesn't include it directly

rbavery Dec 5, 2024

yours is the only implementation I know that does this...

just to be clear we aren't implementing stac-geoparquet for anything currently, but as the spec matures and tooling around it matures I think we should.

I see your points, leaving it to the client to split large row groups makes sense.

mbforr · 2024-12-05T19:16:36Z

mbforr
Dec 5, 2024

@cholmes Just modified my data to add all those things in. Take a look and let me know if that matches.

9 replies

mbforr Dec 6, 2024

@cholmes new files coming in now. Let me know how these look - I think my math is right for splitting everything but let me know. And I might need to adjust the row sizing since it is disk size not row size.

mbforr Dec 6, 2024

Here is my new code:

df_p.repartitionByRange(10, "geohash") \
    .sortWithinPartitions("geohash") \
    .drop("geohash15") \
    .write \
    .format("geoparquet") \
    .option("geoparquet.version", "1.1.0") \
    .option("geoparquet.crs", projjson) \
    .option("geoparquet.covering", "bbox") \
    .save(user_uri + "usa_structures", mode='overwrite', compression='zstd')```

cholmes Dec 6, 2024
Maintainer Author

Awesome - it works! Will add it to the next release of the plugin. Not quite as fast as Overture (maybe it's faster with less than 2 gigs?) but seems to be under a minute.

cholmes Dec 7, 2024
Maintainer Author

I think things might go faster if the files were more like squares, see:

They overlap a bunch, so I fear that many requests need to fully go through 2-3 of the files since the buildings 'might' be within the bbox. If they were boxes then more of the files could be completely skipped, since the bounds wouldn't match at all.

cholmes Dec 12, 2024
Maintainer Author

@mbforr - I included the dataset in the latest release (it's not yet approved on qgis plugin distribution, but you can get it on github). It is a good bit slower than overture, so would be great to try to see about getting the boxes right. And would be really valuable to document the right call in Sedona / make a tutorial for best practices of distributing large data with sedona.

jwass · 2024-12-05T22:09:07Z

jwass
Dec 5, 2024
Collaborator

Thanks for kicking this off and for the initial comparisons.

For Overture, since we use Spark + Sedona I wanted something that just worked natively there rather than a separate process. Here's what I came up with after some experimentation:

df
        .withColumn("geohash15", expr("ST_GeoHash(geometry,15)"))
        .repartitionByRange(n, col("geohash15"))
        .sortWithinPartitions("geohash15")
        .drop("geohash15")
    )

We try to target ~1GB files and the n here is the estimate of how many 1GB files to create.

As you mentioned, it's nothing fancy and pretty naive but at the time seemed to be good enough for a first cut, so I'm pleased it seems to be performing well. It amounts to a distributed global sort on the level 15 geohash. There's nothing special about geohash (vs. quadkeys or S2) other than it was easily available in Sedona.

He did mention it's still pretty naive, with no 'balancing', so some of the partitions are quite big and some are quite small.

The file size skew issue was when I experimented with a recursive quadtree partitioning. If you keep splitting until a partition is below a certain threshold, you eventually end up with partitions with just a handful of rows and others with a hundred thousand. I abandoned that approach after that. It was also really slow. The repartitionByRange above takes care of keeping the output files relatively similar sizes.

I don't think we put much thought into target file size or row group size either.

Something like a kd-tree is probably ideal here (and maybe what Fused is doing?). It would be interesting to come up with a framework to understand relative performance of all these different approaches but there are so many tradeoffs and they all depend on the downstream use case it'll be hard to really rank them.

I think improving acquisition of the parquet metadata will drive the next big jump in performance. I started a discussion here: OvertureMaps/data#219 although maybe STAC-geoparquet is a better way.

0 replies

paleolimbot · 2024-12-05T22:11:28Z

paleolimbot
Dec 5, 2024
Collaborator

I put together a (currently impractical, since the geography extension isn't published yet) reproducible notebook with some strategies. Those are DuckDB/S2-based (sort by cell and write by file size, partition by cell) but I think the same idea applies to a hilbert sort or a geo hash sort...it's tricky to get both file size constraints and spatial partitioning constraints satisfied (unless your data is homogenous in space, which is like never the case).

I'd love to see a Sedona version of this too I'm just not that good at Sedona yet 😬

https://gist.github.com/paleolimbot/4bbfaf9dd79a306e21e59156004c7e33

I think ultimately a multi-pass approach could get us both:

Do an initial partitioning by s2 cell at a sort-of-correct level
Split big files at a higher s2 cell level, merge smaller files into a parent s2 cell
Do it again until constraints are satisfied.

11 replies

paleolimbot Dec 11, 2024
Collaborator

I summarised a few strategies in a post (mostly for my own future self!): https://dewey.dunnington.ca/post/2024/partitioning-strategies-for-bigger-than-memory-spatial-data/ . I think adaptive S2 cells are great personally because they're super fast to construct but if even file sizes are important, there's no beating the KD-tree!

jiayuasu Dec 11, 2024

If you talk about load balanced partition, KDB tree is the best as explained in our research paper for Apache Sedona:

https://jiayuasu.github.io/files/talk/jia-icde19-tutorial.pdf
https://jiayuasu.github.io/files/paper/GeoSpark_Geoinformatica_2018.pdf

paleolimbot Dec 11, 2024
Collaborator

Nice! I'll follow up!

jwass Dec 11, 2024
Collaborator

@jiayuasu Is there a way in Sedona to generate the kdb-tree or r-tree and write the data clustered by those partitions? I know this is mainly used for spatial joins but this is what I was trying to get at here apache/sedona#1268 when I was first looking at this.

cholmes Dec 12, 2024
Maintainer Author

+1 to @jwass - I'm going to start working on the 'guide' and it'd be nice to point at one or two tools that will do things well, so having the clear Sedona answer to this would be great. And a DuckDB option would be good too. I guess in the immediate term I can point at the balanced kdtree code, and @paleolimbot's post. Could it make sense to have a simple DuckDB extension that just packages up those bits of code? Would love to make it easy for anyone to use DuckDB and write out good partitions.

WCMC-vblanque · 2024-12-15T00:54:02Z

WCMC-vblanque
Dec 15, 2024

Very interested indeed, I would like to convince my organization to publish the WDPA as geoparquet. I hope it will be easy to produce it in the most performant manner!

https://www.protectedplanet.net/en/thematic-areas/wdpa?tab=WDPA

About this dataset: 300 000+ (multi) polygons and 629 323 141 vertex equivalent to 2GB GDB file. Some places are much more complex than others, for instance, South of Chile, Western Canada, Germany, Switzerland, Austria, Sweden... have highest density of vertex/sqkm2 ... and make it slower to process

0 replies

jaakla · 2024-12-18T06:42:33Z

jaakla
Dec 18, 2024

My default universal approach would be:

admin partitions: country + optional next level for larger countries, states or regions. With nutiteq things back in the day we ended up using about 1000 global units with reasonable balance for OSM data (sans small countries, but these are cute anyways).
geom sorting/clustering: Hilbert curves.

1 reply

cholmes Dec 18, 2024
Maintainer Author

Thanks for the feedback, I'll try at least mention those.

For admin partitions, do you have any recommended dataset to use for the admin partitions? (Or anyone else on the thread) I used geoBoundaries as it was cc licensed. And how did you get to 1000 global units? Did you do varying levels of admin boundaries? I tried Overture early on but it had some holes in the admin boundaries, and I had trouble getting OSM to work.

I do think it'd be great to have at least 2-3 good tools (ie DuckDB, Sedona, GeoPandas, maybe GDAL/OGR) that can offer a couple 'mathematical' ones (seems like kd-tree and/or adaptive s2), and also admin-boundary ones (perhaps with different options for what to use for the boundaries, plus options to further sub-divide (admin level 2, one of the 'mathematical' algorithms, etc)).

With admin boundaries it'd also be great to standardize a bit on the column. Like what is the column name that you partition on, and what do you call the countries? We started one in fiboa - https://github.com/fiboa/administrative-division-extension but it should be more general than fiboa. ISO country codes seem like the good answer, and then just decide between 2 letter and 3 letter.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cloud-hosted & partitioned files best practices #251

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments 27 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Cloud-hosted & partitioned files best practices #251

cholmes Dec 5, 2024 Maintainer

Overture

Vida

Fused on Source (overture and FSQ)

US Structures by Wherobots

Foursquare on Hugging Face

GBIF on Source

Notes / thoughts

Replies: 6 comments · 27 replies

m-mohr Dec 5, 2024 Collaborator

cholmes Dec 5, 2024 Maintainer Author

TomAugspurger Dec 5, 2024 Maintainer

kylebarron Dec 5, 2024 Maintainer

cholmes Dec 6, 2024 Maintainer Author

cholmes Dec 7, 2024 Maintainer Author

cholmes Dec 12, 2024 Maintainer Author

jwass Dec 5, 2024 Collaborator

paleolimbot Dec 5, 2024 Collaborator

paleolimbot Dec 11, 2024 Collaborator

paleolimbot Dec 11, 2024 Collaborator

jwass Dec 11, 2024 Collaborator

cholmes Dec 12, 2024 Maintainer Author

cholmes Dec 18, 2024 Maintainer Author

cholmes
Dec 5, 2024
Maintainer

Replies: 6 comments 27 replies

m-mohr
Dec 5, 2024
Collaborator

cholmes Dec 5, 2024
Maintainer Author

TomAugspurger Dec 5, 2024
Maintainer

kylebarron Dec 5, 2024
Maintainer

cholmes Dec 6, 2024
Maintainer Author

cholmes Dec 7, 2024
Maintainer Author

cholmes Dec 12, 2024
Maintainer Author

jwass
Dec 5, 2024
Collaborator

paleolimbot
Dec 5, 2024
Collaborator

paleolimbot Dec 11, 2024
Collaborator

paleolimbot Dec 11, 2024
Collaborator

jwass Dec 11, 2024
Collaborator

cholmes Dec 12, 2024
Maintainer Author

cholmes Dec 18, 2024
Maintainer Author