
Add additional data wrangling methods #6

Open · AdrianAntico opened this issue Apr 20, 2023 · 8 comments

@AdrianAntico

Thank you DuckDB team for keeping this benchmark going!!!

I see there are a lot of variations on group-bys and joins; however, I think it would be highly beneficial to incorporate additional data wrangling methods. A few that come to mind (others should add to this list) include:

  • Unions
  • Subsetting data
  • Sampling data
  • Rolling joins (see data.table)
  • Pivots long and wide
  • Rolling / windowing operations by groups over time, such as lags and moving averages (see the sketch after this list)
  • Differencing data by groups based on a time column
  • Updating records in a data frame / table
  • Categorical encoding methods: target encoding, James-Stein encoding
  • Column type conversions
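
To make the rolling/windowing item concrete, here is a minimal data.table sketch of grouped lags and trailing moving averages; the dt/id/date/value names are made up for illustration and are not part of the benchmark:

library(data.table)

# hypothetical table: one value per id per day
dt <- data.table(
  id    = rep(1:3, each = 100),
  date  = rep(seq(as.Date("2020-01-01"), by = "day", length.out = 100), 3),
  value = rnorm(300)
)
setkey(dt, id, date)

# lag-1 and a 7-row trailing moving average, computed by group
dt[, lag1 := shift(value, 1L), by = id]
dt[, ma7  := frollmean(value, 7L), by = id]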

I believe a broader set of operations serves several purposes. For one, I would like to know whether a particular framework can actually do the operation. Secondly, I would like to see benchmarks of their performance. Lastly, I think it would be a huge community benefit to see what the actual code ends up looking like to get the greatest performance, which isn't always available through documentation or Stack Overflow.

Thanks in advance,
Adrian

@Tmonster (Collaborator)

Hi Adrian,

Thank you for the suggestions. Similar suggestions have been made before, on the original repository (h2oai#175) and also here (#3). Those suggestions mostly cover pivot/unpivot/long-to-wide/wide-to-long.

I am happy to look into adding some of these benchmarks since they are repeatedly suggested, but I have other obligations at DuckDB Labs, so I can't say when I would get to them. However, if you would like to open a PR and collaborate with the authors of the previously mentioned issues, I will happily review and merge it!

A WIP PR with just the SQL queries for pivot/unpivot would be a great place to start, I think. Since this repository isn't too cluttered and the idea seems popular, I imagine it would invite some collaboration. The SQL queries could then be translated easily to all the systems that support SQL (DuckDB, ClickHouse, Spark).
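
For illustration, a pivot/unpivot pair in DuckDB's SQL dialect (0.8+) could look roughly like the following; the sales_long and sales_wide tables and their columns are hypothetical, and conn is an open DuckDB connection as in the asof example further down:

# long -> wide: one column per distinct month, aggregating amount per product
DBI::dbGetQuery(conn, "PIVOT sales_long ON month USING sum(amount) GROUP BY product")

# wide -> long: fold the month columns back into (month, amount) pairs
DBI::dbGetQuery(conn, "UNPIVOT sales_wide ON jan, feb, mar INTO NAME month VALUE amount")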

@jangorecki mentioned this issue Apr 24, 2023
@AdrianAntico (Author)

@Tmonster I have a ton of data.table code assembled in various packages. I could put those together rather quickly. I recall putting something together for DuckDB in a comment in the previous repo for lags and moving averages. I can also scrounge that up.
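
(For context, a rough DuckDB counterpart of grouped lags and moving averages uses window functions. This is only a sketch, assuming the same hypothetical dt(id, date, value) table from the issue body has been registered with DuckDB and conn is an open connection:)

# lag-1 and a 7-row trailing moving average per id, via SQL window functions
DBI::dbGetQuery(conn, "
  SELECT id, date, value,
         LAG(value) OVER (PARTITION BY id ORDER BY date) AS lag1,
         AVG(value) OVER (PARTITION BY id ORDER BY date
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS ma7
  FROM dt
")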

@Tmonster (Collaborator) commented May 8, 2023

Hi Adrian,

It looks like the original author Jan Gorecki opened a PR with rolling functions. Was this your comment from the previous repo?

Feel free to continue on this PR. I can then add another tab next to group by and join to show rolling functions. We can start with just a few queries as well and slowly add more as interest grows.

@jangorecki

I opened a draft PR so we can work out the scope of that task. Each task needs to be carefully designed, so we need a separate issue (a draft PR is even better, since it can already contain code) for each new task to discuss its scope.
I haven't received any feedback there (yet). I invite any party interested in this task to look into it, comment, or LGTM it so we can proceed with its design.

@AdrianAntico (Author)

@Tmonster Yes, I was interested in this back then and still am to this day. This request could make this repo a one-stop shop, not only for benchmarking but also for helping others navigate between frameworks more easily using optimal code (versus whatever they happen to find on Stack Overflow).

@era127 commented Jun 21, 2023

I think adding a rolling or asof join, similar to this asof benchmark test, would be helpful for time-series work. For DuckDB, I believe performance on this benchmark has improved significantly with the new ASOF JOIN support.

start <- as.POSIXct("2020-01-01")
end <- as.POSIXct("2021-01-01")
# calendar with a record every minute
calendar <- data.frame(date = seq(from = start, to = end, by = "+1 min"))
N <- 1e3
# table of data
data <- data.frame(
  date  = start + as.difftime(seq(from = 10000, to = NROW(calendar), length.out = N), units = "mins"),
  value = seq(0, 1, length.out = N)
)

# xts
calendar.xts <- xts::xts(, order.by = calendar[["date"]])
data.xts <- xts::xts(data[["value"]], order.by = data[["date"]])

# data.table
data.table::setDTthreads(8)
data.dt <- data.table::data.table(data, key = "date")
calendar.dt <- data.table::data.table(calendar, key = "date")

# duckdb as native tables
conn <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(conn, "data", data)
DBI::dbWriteTable(conn, "calendar", calendar)
# duckdb as virtual tables
duckdb::duckdb_register(conn, "data_v", data)
duckdb::duckdb_register(conn, "calendar_v", calendar)

microbenchmark::microbenchmark(
  # xts
  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)), join = "left"),
  # data.table
  data.dt[calendar.dt, on = "date", roll = TRUE],
  # basic dplyr
  dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >= date))),
  # duckdb as native table
  DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)"),
  # duckdb as virtual table
  DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)")
)


> Unit: milliseconds
>                                                                                  expr       min        lq      mean    median       uq       max neval
>  merge(calendar.xts, zoo::na.locf(merge(calendar.xts, data.xts)),      join = "left")  6.967171  7.381661  9.022994  7.522721 11.87473  15.49173   100
>                                        data.dt[calendar.dt, on = "date", roll = TRUE] 26.923634 27.432936 29.005822 27.736459 29.99972  35.15939   100
>     dplyr::left_join(calendar, data, by = dplyr::join_by(closest(date >=      date))) 36.935875 37.654380 50.074952 42.531821 44.32565 183.35196   100
>       DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data USING(date)") 13.386008 15.339514 17.653180 16.566275 19.68496  25.52381   100
>     DBI::dbGetQuery(conn, "SELECT * FROM calendar ASOF LEFT JOIN data_v USING(date)") 12.720578 15.155957 18.751557 16.164065 19.65825 171.19706   100

# InMemoryDatasets (juliads)
using Dates, InMemoryDatasets, BenchmarkTools
start_ = Dates.DateTime("2020-01-01T00:00:00")
end_ = Dates.DateTime("2021-01-01T01:00:00")
d = start_:Dates.Minute(1):end_
# calendar every minute
calendar = InMemoryDatasets.Dataset(date=collect(d))
issorted!(calendar, :date)
N = Int(1e3)
data = InMemoryDatasets.Dataset(date=unique(start_ .+ Dates.Minute.(trunc.(range(10000, Dates.Minute(end_ - start_).value, N)))), value=range(0, 1, N))
issorted!(data, :date)
@benchmark InMemoryDatasets.closejoin(calendar, data, on=:date)

> BenchmarkTools.Trial: 283 samples with 1 evaluation.
>  Range (min … max):  17.076 ms … 19.199 ms  ┊ GC (min … max): 0.00% … 8.70%
>  Time  (median):     17.450 ms               ┊ GC (median):    0.00%
>  Time  (mean ± σ):   17.719 ms ± 608.662 μs  ┊ GC (mean ± σ):  2.06% ± 3.16%

@jangorecki

@rdavis120 I think it is a good idea. IMO it makes sense to create a new issue and discuss the scope of a "rolling/asof join" task there. This issue is so generic that discussing details won't work well here; just linking from here will be sufficient.

@jangorecki

Looking for feedback about rolling functions in #9. Whoever is interested in this benchmark task, please subscribe to that PR.
