Add additional data wrangling methods #6
Hi Adrian, thank you for the suggestions. Similar suggestions have been mentioned before on the original repository (h2oai#175) and also here (#3). Those suggestions mostly cover pivot/unpivot/long-to-wide/wide-to-long. I am happy to look into adding some of these benchmarks since they are repeatedly suggested, but since I have other obligations at DuckDB Labs, I cannot be confident about when I would eventually get to them. However, if you would like to open a PR and collaborate with the previously mentioned issue authors, I will happily review and merge it! A WIP PR with just the SQL queries for pivot/unpivot would be a great place to start, I think. Since this repository isn't too cluttered and the idea seems popular, I imagine it would invite some collaboration. The SQL queries could then translate easily to all systems that support SQL (DuckDB, ClickHouse, Spark).
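For illustration, here is a minimal sketch of what such queries could look like from R using DuckDB's PIVOT/UNPIVOT statements. The table and column names below are hypothetical, not taken from any existing benchmark:
# hypothetical long-format table: one row per (id, variable) pair
conn <- DBI::dbConnect(duckdb::duckdb())
long <- data.frame(id = rep(1:3, each = 2),
                   variable = rep(c("height", "weight"), 3),
                   value = rnorm(6))
duckdb::duckdb_register(conn, "long", long)
# long-to-wide: one output column per distinct value of `variable`
wide <- DBI::dbGetQuery(conn, "PIVOT long ON variable USING first(value) GROUP BY id")
duckdb::duckdb_register(conn, "wide", wide)
# wide-to-long: fold the generated columns back into name/value pairs
DBI::dbGetQuery(conn, "UNPIVOT wide ON height, weight INTO NAME variable VALUE value")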
@Tmonster I have a ton of data.table code assembled in various packages. I could put that together rather quickly. I recall putting something together for DuckDB in a comment on the previous repo for lags and moving averages; I can scrounge that up as well.
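For context, a rough sketch of the kind of lag and moving-average code that comment covered, written both in data.table and as DuckDB window functions (the column names and window sizes here are hypothetical):
library(data.table)
dt <- data.table(date = seq(as.Date("2020-01-01"), by = "day", length.out = 30),
                 value = cumsum(rnorm(30)))
# one-row lag and 7-row trailing moving average
dt[, `:=`(lag1 = shift(value, 1),
          ma7  = frollmean(value, 7))]
# the same computed as window functions in DuckDB
conn <- DBI::dbConnect(duckdb::duckdb())
duckdb::duckdb_register(conn, "dt", dt)
DBI::dbGetQuery(conn, "
  SELECT date, value,
         lag(value, 1) OVER (ORDER BY date) AS lag1,
         avg(value) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7
  FROM dt
  ORDER BY date")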
Hi Adrian, it looks like the original author Jan Gorecki opened a PR with rolling functions. Was this your comment from the previous repo? Feel free to continue on that PR. I can then add another tab next to group by & join to show rolling functions. We can start with just a few queries as well; as interest grows, we can slowly add more.
I opened a draft PR so we can work out the scope of that task. Each task needs to be carefully designed, so we need a separate issue (a draft PR is even better, since it can already carry code) for each new task to discuss its scope.
@Tmonster Yes, I was interested in this back then and still am today. This request could make this repo more of a one-stop shop: not only for benchmarking, but also for helping people navigate between frameworks more easily and use optimal code (versus whatever they happen to find on Stack Overflow).
I think adding a rolling or asof join, similar to the asof benchmark test below, would be helpful for time-series work. For DuckDB, I believe performance in this benchmark has improved significantly with the new ASOF join support.
start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# calendar with record every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "1 min"))
N <- 1e3
# table of data
data <- data.frame(
  date = start + as.difftime(seq(from = 10000, to = NROW(calendar), length.out = N), units = "mins"),
  value = seq(0, 1, length.out = N)
)
# xts (calendar.xts is an empty xts series that carries only the index)
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])
# data.table
data.table::setDTthreads(8)
data.dt <- data.table::data.table(data, key = "date")
calendar.dt <- data.table::data.table(calendar, key = "date")
# duckdb as native tables
conn <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(conn, "data", data)
DBI::dbWriteTable(conn, "calendar", calendar)
# duckdb as virtual (registered) tables
duckdb::duckdb_register(conn, "data_v", data)
duckdb::duckdb_register(conn, "calendar_v", calendar)
microbenchmark::microbenchmark(
# xts
merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
# data.table
data.dt[calendar.dt, on = "date", roll = TRUE],
# basic dplyr
dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
# duckdb as native table
DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)"),
# duckdb as virtual table
DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)")
)
> Unit: milliseconds
> expr min lq mean median uq max neval
> merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left") 6.967171 7.381661 9.022994 7.522721 11.87473 15.49173 100
> data.dt[calendar.dt, on = "date", roll = TRUE] 26.923634 27.432936 29.005822 27.736459 29.99972 35.15939 100
> dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))) 36.935875 37.654380 50.074952 42.531821 44.32565 183.35196 100
> DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)") 13.386008 15.339514 17.653180 16.566275 19.68496 25.52381 100
> DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)") 12.720578 15.155957 18.751557 16.164065 19.65825 171.19706 100
# InMemoryDatasets (juliads)
using Dates, InMemoryDatasets, BenchmarkTools
start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2021-01-01T01:00:00")
d = start_:Dates.Minute(1):end_
# calendar every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
issorted!(calendar, :date)
N = Int(1e3)
data = InMemoryDatasets.Dataset(
    date = unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))),
    value = range(0, 1, N)
)
issorted!(data, :date)
@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)
> BenchmarkTools.Trial: 283 samples with 1 evaluation.
> Range (min … max): 17.076 ms … 19.199 ms ┊ GC (min … max): 0.00% … 8.70%
> Time (median): 17.450 ms ┊ GC (median): 0.00%
> Time (mean ± σ): 17.719 ms ± 608.662 μs ┊ GC (mean ± σ): 2.06% ± 3.16%
@rdavis120 I think it is a good idea. IMO it makes sense to create a new issue and discuss the scope of a "rolling/asof join" task there. This issue is so generic that discussing details won't work well here; just linking from here will be sufficient.
Looking for feedback on rolling functions in #9. Whoever is interested in this benchmark task, please subscribe to that PR.
Thank you DuckDB team for keeping this benchmark going!!!
I see there are a lot of variations on group-bys and joins; however, I think it would be highly beneficial to incorporate additional data wrangling methods. A few that come to mind (others should feel free to add to this list) include:
- pivot/unpivot (long-to-wide and wide-to-long reshaping)
- rolling/window functions (lags, moving averages)
- rolling or asof joins
I believe a broader set of operations serves several purposes. For one, I would like to know whether a particular framework can actually perform the operation. Secondly, I would like to see benchmarks of its performance. Lastly, I think it would be a huge community benefit to see what the actual code looks like to get the best performance, which isn't always available through documentation or Stack Overflow.
Thanks in advance,
Adrian