Experimental support for GDAL 3.6 columnar API #2036
Conversation
Thanks, interesting! Is it feasible to use libgdal without needing nanoarrow? I think https://gdal.org/development/rfc/rfc86_column_oriented_api.html is implemented locally within GDAL, without requiring access to Arrow itself?
You will need either arrow or nanoarrow to get R objects... nanoarrow is brand new/in development and will be zero-dependency for expressly this use case! GDAL includes conversions to NumPy arrays for its Python bindings that are similar in spirit and dependency footprint to the converters that I'm writing in apache/arrow-nanoarrow#65 .
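For context, a minimal sketch of the kind of conversion being discussed, using today's nanoarrow R package API (`basic_array_stream()` and `convert_array_stream()` are current names and weren't necessarily settled at the time of this exchange):

```r
# Build an ArrowArrayStream from an in-memory chunk and pull it back into
# an R data.frame; an OGR-provided stream would go through the same path.
library(nanoarrow)

stream <- basic_array_stream(
  list(data.frame(x = 1:3, y = c("a", "b", "c")))
)
convert_array_stream(stream)
#>   x y
#> 1 1 a
#> 2 2 b
#> 3 3 c
```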
src/stars.cpp (outdated)

```diff
@@ -509,7 +509,7 @@ void CPL_write_gdal(NumericMatrix x, CharacterVector fname, CharacterVector driv
 	eType = GDT_Byte; // #nocov
 #if GDAL_VERSION_NUM >= 3070000
 	else if (Type[0] == "Int8")
-		eType = GDT_Int8; // #nocov
+		eType = GDT_Byte; // #nocov
```
This doesn't look good IMO, see #2033 - do you actually run into this code? That would mean the #if condition is wrong.
I'm sure it's wrong... but yes, this failed to compile for me with GDT_Int8.
I fixed this - it's because I'm running from the latest dev branch, which has moved on to 3.7.xx versioning because there's a release candidate.
The GDT_Int8 implementation has now landed in GDAL master.
Is @edzer's point to do with the WIP RFC87, rspatial/terra#885 (comment)?
It may also be related to my specific installation, where there may be some old headers sitting in the include path!
We can also comment out the whole …
I'm sure one or more of the missing features will eat into this considerably, but in the initial test I did, reading ~400,000 lines is about 4 times faster. This might have to do with the use of …

```r
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.7.0dev-908498a4d8, PROJ 6.3.1; sf_use_s2() is TRUE

# curl::curl_download(
#   "https://github.com/paleolimbot/geoarrow-data/releases/download/v0.0.1/nshn_water_line.gpkg",
#   "nshn_water_line.gpkg"
# )

system.time(
  tbl1 <- read_sf("nshn_water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed
#>  20.264   1.355  21.619

system.time(
  tbl2 <- read_sf("nshn_water_line.gpkg", use_stream = TRUE)
)
#>    user  system elapsed
#>   5.960   0.568   5.421
```
Still a ways to go here, but progress! Some TODOs: …

I'm currently testing using …
Still faster, but I switched it to use sf's WKB reader instead. I think the speed is due to the underlying Arrow driver being fast for GPKG (not nanoarrow). Many of the tests fail if you turn the default to use the stream (and one segfaults)... there's no way to implement …

```r
library(sf)
#> Linking to GEOS 3.12.0, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE

# curl::curl_download(
#   "https://github.com/geoarrow/geoarrow-data/releases/download/latest-dev/ns-water-water_line.gpkg",
#   "ns-water-water_line.gpkg"
# )

system.time(
  tbl1 <- read_sf("ns-water-water_line.gpkg", use_stream = FALSE)
)
#>    user  system elapsed
#>   7.266   0.556   7.866

system.time(
  tbl2 <- read_sf("ns-water-water_line.gpkg", use_stream = TRUE)
)
#> Simple feature collection with 483268 features and 33 fields
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 215869.1 ymin: 4790519 xmax: 781792.9 ymax: 5237312
#> Projected CRS: NAD_1983_CSRS_2010_UTM_20_Nova_Scotia + CGVD2013(CGG2013) height
#>    user  system elapsed
#>   3.107   0.629   3.222
```

Created on 2023-09-21 with reprex v2.0.2
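For reference, a minimal sketch of the sf WKB reader mentioned above, using standard sf API (nothing PR-specific): `st_as_binary()` produces a list of raw vectors classed `"WKB"`, and `st_as_sfc()` parses it back into geometries.

```r
library(sf)

# Round-trip a point through WKB
wkb <- st_as_binary(st_sfc(st_point(c(0, 1))))
class(wkb)
#> [1] "WKB"
st_as_sfc(wkb, crs = 4326)
```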
Looks great - trains, airports and planes are always the best places for some decent development!
@mdsumner Gave this a go and found a segfault for a POINT (one that doesn't exist in Python): https://gist.github.com/mdsumner/21c5e74f8565487e3304dedf596a10c4 . There's also one segfault in the tests (for something with M geometries, but I haven't had a chance to try it with the debugger yet).
This still needs tests for …
Also `dialect` for when running SQL; that's just a string input and doesn't have any layer-cleanup implications (so you can run SQLite st_* functions against drivers whose default is OGRSQL, or vice versa, for example).
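To illustrate: `read_sf()` already forwards a `query` string to GDAL, while the `dialect` argument suggested above is hypothetical here (it did not exist in sf at the time of writing). The `pts.gpkg` file is assumed to be a layer like the one created in the benchmark below.

```r
library(sf)

# `query` is an existing st_read()/read_sf() argument forwarded to GDAL
read_sf("pts.gpkg", query = "SELECT * FROM pts LIMIT 5", use_stream = TRUE)

# Hypothetical: a `dialect` argument would select GDAL's SQL dialect, e.g.
# letting SQLite st_* functions run where the driver's default is OGRSQL
# read_sf("pts.gpkg", query = "SELECT ST_Buffer(geom, 1) AS geom FROM pts",
#         dialect = "SQLITE")
```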
Thanks, great to see this feature! Did you try to do some benchmarks on synthetic data? In the example below, the differences in timings are quite small. What could be the reason (or is this expected on this dataset)?

Benchmark:

```r
library("sf")

n = 500000
df = data.frame(x = rnorm(n), y = rnorm(n),
                col_1 = sample(c(TRUE, FALSE), n, replace = TRUE), # logical
                col_2 = sample(letters, n, replace = TRUE),        # character
                col_3 = runif(n),                                  # double
                col_4 = sample(1:100, n, replace = TRUE))          # integer

## points ##
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")

bench::mark(
  check = FALSE, iterations = 10,
  stream = read_sf("pts.gpkg", use_stream = TRUE),
  non_stream = read_sf("pts.gpkg", use_stream = FALSE)
)
#>   expression    min median `itr/sec` mem_alloc
#> 1 stream       1.8s  2.14s     0.491    87.8MB
#> 2 non_stream   1.7s  1.79s     0.545    72.5MB

## buffers ##
buff = st_buffer(pts, dist = 1000)
write_sf(buff, "buffers.gpkg")

bench::mark(
  check = FALSE, iterations = 5,
  stream = read_sf("buffers.gpkg", use_stream = TRUE),
  non_stream = read_sf("buffers.gpkg", use_stream = FALSE)
)
#>   expression    min median `itr/sec` mem_alloc
#> 1 stream      4.22s  5.97s     0.181    1.93GB
#> 2 non_stream  4.21s  6.23s     0.182    1.91GB
```
This doesn't work for me:

```r
n = 10
df = data.frame(x = rnorm(n), y = rnorm(n)) # without attributes
pts = st_as_sf(df, coords = c("x", "y"), crs = "EPSG:2180")
write_sf(pts, "pts.gpkg")
x = read_sf("pts.gpkg", use_stream = TRUE)
#> Error in st_as_sfc.WKB(x) :
#>   cannot read WKB object from zero-length raw vector
```
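For what it's worth, a minimal sketch of the apparent failure mode (not the PR's eventual fix): the stream path seems to hand sf a zero-length raw vector for the geometry, which sf's WKB reader rejects with exactly this error.

```r
library(sf)

# A WKB list containing an empty raw payload reproduces the error above
wkb <- structure(list(raw(0)), class = "WKB")
st_as_sfc(wkb)
#> Error: cannot read WKB object from zero-length raw vector
```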
Make an issue! I'll collect any stream-reading related issues into a follow-up PR in the next week or so.
The initial PR is mostly about connecting the wires. I expect that optimizing the WKB reader will help (I'm planning to do so in the next week)... other than that, even with equivalent performance, this PR is more about long-term maintainability: it offloads the entire OGR-to-R conversion into code that sf doesn't have to maintain. There are a number of pending performance improvements there too (e.g., ALTREP for strings) that sf can, in the long term, get for free.
Good to hear, I will wait for further news! In which repository should I create the issue?
here. |
This PR is a proof of concept for future support of the new columnar access API introduced in GDAL 3.6. The API exposes a pull-style iterator as an `ArrowArrayStream`. The pyogrio package has a PR up to support this and has seen a ~2x improvement on their test data set (geopandas/pyogrio#155). I've spent some time implementing conversions for Arrow C data interface objects (apache/arrow-nanoarrow#65) and am curious to see if we can get any of that speed in R too!

To give it a try: …

All sorts of options aren't supported yet, but I'm mostly trying to find out whether or not the nanoarrow conversions are going to be helpful in this context. I will follow up here with more experiments!
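The installation instructions in the original description don't survive the extraction above, but based on the examples later in the thread, exercising the feature (once the PR branch is installed) looks like the sketch below; the file name is a placeholder.

```r
library(sf)

# use_stream = TRUE is the argument added by this PR: it asks GDAL for an
# ArrowArrayStream and converts the columnar batches to a data.frame
# instead of looping over OGR features one at a time.
tbl <- read_sf("some_file.gpkg", use_stream = TRUE)
```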