Suggestion: Include reshape benchmarks #3

grantmcdermott · 2023-04-14T15:26:52Z

Stoked to see this back up and running!

(As an aside, the relentless performance gains of DuckDB are truly impressive.)

Two suggestions:

1. Please consider the collapse R package (link). In my own set of benchmarks, collapse is typically at or near the top of various groupby operations for datasets on the order of 0.5-5 GB. (I haven't tested larger than that, and should also say that it doesn't support join operations yet.) I can add a PR if interested. Closed via [WIP] New solution: r-collapse #33.
2. There was talk over at the old repo of adding a set of reshape benchmarks. Personally, I think this would be great to have. See: reshape task (pivot, unpivot) h2oai/db-benchmark#175

Thanks again for all the effort in resurrecting this.

Tmonster · 2023-04-18T08:00:52Z

Hi Grant, thank you for the suggestion!

I currently don't have a lot of bandwidth to add a whole new solution to the benchmark, but if you want to open a PR that adds the necessary setup-collapse.sh, ver-collapse.sh, upg-collapse.sh, groupby-collapse.R, and join-collapse.R, then I'd be happy to review. A good place to start would be copying the files in the dplyr folder in the benchmark and just changing the imported libraries. That will probably get you more than halfway.

See repro.sh for steps to run the benchmark either locally or on an AWS instance. If no errors are thrown for the 0.5GB & 5GB datasets I'd be happy to merge your PR and re-run the benchmark to include results for collapse.

Tmonster · 2023-04-18T08:55:20Z

As for the reshaping benchmarks, I think it's a great idea!

It would take a while to include those queries in the benchmark, however, as I would need to:

Create new queries and datasets. (Although I believe the group by datasets could work well for this)
Create new reshape-solution.* scripts for each of the solutions that support reshaping functionality
Modify the report generation code to include reshape results

I would like to do a re-work of the report generation code, as it was hard to track down bugs while re-running the benchmark. As mentioned in h2oai#175, however, I would be happy to review or collaborate on any PRs that help maintain and improve the benchmark!
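
For anyone picking this up, here is a minimal sketch of what a reshape task (unpivot, then pivot) and its timing could look like. It is only an illustration in plain Python, not the benchmark's actual script layout: the helper names (unpivot, pivot, timed) and the tiny in-memory dataset are hypothetical, and the column names id1, v1 and v2 are assumed to loosely mirror the group by datasets.

```python
import time
from collections import defaultdict


def unpivot(rows, id_col, value_cols):
    """Wide to long: emit one row per (id, variable, value)."""
    return [
        {id_col: row[id_col], "variable": col, "value": row[col]}
        for row in rows
        for col in value_cols
    ]


def pivot(long_rows, id_col):
    """Long to wide: emit one row per id, with one column per variable."""
    wide = defaultdict(dict)
    for r in long_rows:
        wide[r[id_col]][r["variable"]] = r["value"]
    return [{id_col: key, **cols} for key, cols in wide.items()]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical toy data; a real run would load a benchmark dataset instead.
wide_rows = [{"id1": i, "v1": i % 5, "v2": i * 2} for i in range(1000)]

long_rows, t_unpivot = timed(unpivot, wide_rows, "id1", ["v1", "v2"])
round_trip, t_pivot = timed(pivot, long_rows, "id1")

print(f"unpivot: {len(long_rows)} rows in {t_unpivot:.6f}s")
print(f"pivot:   {len(round_trip)} rows in {t_pivot:.6f}s")
```

Checking that the pivot of the unpivoted data reproduces the original rows is one cheap way to confirm that each solution's reshape queries compute the same thing before comparing timings.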

SebKrantz · 2023-09-18T07:29:42Z

collapse author here. Thanks @grantmcdermott and @vincentarelbundock for the initiative! I'm happy with adding collapse to the benchmarks, and also happy for any suggested code, but would like to wait for the pending v2.0 release (which includes implementations of table joins and reshaping). I will also ensure the benchmarking code is equivalent to other DBMS (collapse has some unfavorable defaults e.g. sort = TRUE, na.rm = TRUE, nthreads = 1). I expect v2.0 to be released within 1 month, and will then get back to this and submit a comprehensive PR, integrating what was suggested here.

vincentarelbundock · 2023-09-18T11:46:16Z

Sounds good @SebKrantz.

You may want to use my PR as a starting point since most of the setup and group-by stuff is close to done.

FYI, the dplyr and data.table benchmarks use na.rm=TRUE, but you are right that the sort and nthreads arguments may need to be adjusted.

grantmcdermott changed the title ~~Suggestions~~ Suggestions: Include r-collapse and reshape benchmarks Apr 14, 2023

Tmonster added the Solution Include new solution label Apr 19, 2023

Tmonster mentioned this issue Apr 24, 2023

Add additional data wrangling methods #6

Open

vincentarelbundock mentioned this issue Sep 17, 2023

[WIP] New solution: r-collapse #33

Merged

grantmcdermott changed the title ~~Suggestions: Include r-collapse and reshape benchmarks~~ Suggestion: Include reshape benchmarks Jan 5, 2024
