
I created a benchmark but DuckDB run times are super slow and not sure why #74

Open
AdrianAntico opened this issue May 5, 2024 · 6 comments


@AdrianAntico

Here's a link to the repo: https://github.com/AdrianAntico/Benchmarks

I use 3 datasets: one with a million rows, one with 10M, and one with 100M. I am currently just running a sum aggregation, with a varying number of grouping variables and aggregated numeric variables. One main difference between those datasets and the ones used here is that I make use of a Date type column in all aggregations. It also seems that Polars has a harder time with that data type. I'm showing data.table to be the fastest for all queries except the single Date type aggregation (that's where DuckDB wins).
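For concreteness, here's a rough sketch of the kind of query I'm timing, written with data.table; the column names (Date, GroupVar, Value) are placeholders, not the actual columns in my repo:

library(data.table)

# small stand-in table; the real benchmark uses 1M / 10M / 100M rows
n <- 1e6
dt <- data.table(
  Date     = as.Date("2023-01-01") + sample(0:364, n, replace = TRUE),
  GroupVar = sample(LETTERS, n, replace = TRUE),
  Value    = rnorm(n)
)

# sum aggregation grouped by the Date column plus one categorical column
agg <- dt[, .(Value = sum(Value)), by = .(Date, GroupVar)]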

I copied some of the code from this repo. I'm hoping someone can take a look, because the results were a bit unexpected.

@jangorecki

jangorecki commented May 5, 2024

Check the compilation flags; I'm not sure whether it matters for DuckDB, but it does for other tools, for example -O3. If it is relevant, it should be included in the setup scripts.
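For example, on Linux/macOS one way to force -O3 for a source install in R is via ~/.R/Makevars. Just a sketch; whether the prebuilt Windows binary is affected by such flags is a separate question:

# assumption: there is no existing ~/.R/Makevars that needs to be preserved
dir.create("~/.R", showWarnings = FALSE)
writeLines(c("CFLAGS = -O3", "CXXFLAGS = -O3"), "~/.R/Makevars")

# compile the package locally instead of installing the prebuilt binary
install.packages("duckdb", type = "source")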

@Tmonster
Collaborator

Hi @AdrianAntico, I took a look at this last Wednesday. For the 1-million-row case I got similar results, but for 10 million and 100 million rows I got different results, which I've attached. I ran these tests on a c6id.8xlarge machine.
10M Rows Benchmark
100M Rows Benchmark

Thinking about it now, it's possible the difference in results comes from network storage. DuckDB will bring the whole CSV into memory when the table is created. Maybe Pandas and datatable lazily create the data frame?
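To make the distinction concrete, here's a rough sketch of the two loading patterns (the file name and column names below are placeholders, not the benchmark's actual files):

library(DBI)

con <- dbConnect(duckdb::duckdb())

# Materialize: the whole CSV is parsed into an in-memory table up front,
# so the load cost is paid at CREATE TABLE time, before any query runs
dbExecute(con, "CREATE TABLE bench AS SELECT * FROM read_csv_auto('data_100M.csv')")
dbGetQuery(con, "SELECT GroupVar, SUM(Value) AS Value FROM bench GROUP BY GroupVar")

# Scan directly: the CSV is read while the query executes,
# so the scan cost lands inside the query timing instead
dbGetQuery(con, "SELECT GroupVar, SUM(Value) AS Value FROM read_csv_auto('data_100M.csv') GROUP BY GroupVar")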

@Tmonster
Collaborator

When I get the time, I'll run everything locally and see what happens.

@AdrianAntico
Author

@Tmonster I ran the DuckDB.R script and noticed that on the 100M-row dataset multithreading wasn't occurring. At least, it appears that way compared to Polars and data.table when I watched the same runs. I did issue the statements to enable it using the code below, but I think the standard installation of DuckDB in R (install.packages("duckdb")) isn't enabling multithreading by default, or at least that's my best guess given @jangorecki's comment about compilation flags. I did check DuckDB's website on installation, and it appears I'd have to build from source and modify some flags, but it wasn't clear from the documentation what those should be, or even whether I need to do that to get multithreading (assuming this is true).

library(DBI)   # dbConnect()/dbExecute() come from DBI; duckdb provides the driver

con = dbConnect(duckdb::duckdb())
ncores = parallel::detectCores()
# tell DuckDB to use all detected hardware threads
invisible(dbExecute(con, sprintf("PRAGMA THREADS=%d", ncores)))
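One quick way to confirm the setting took effect (not from the benchmark scripts, just a sanity check):

# should report the number of threads DuckDB will actually use
dbGetQuery(con, "SELECT current_setting('threads') AS threads")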

So it appears that the installation method, machine type, dataset, local vs. cloud storage, and perhaps operating system can all play a pretty significant role in these timings.

I'm using Windows, with an AMD Ryzen CPU (32 cores / 64 threads) and 256 GB of memory, running locally at home. So, with a standard installation of each data frame package and my setup, the timings I'm getting differ from what you're seeing. Also, I think the Date type column in the aggregation gave Polars some trouble. Perhaps there is an opportunity for you to utilize the datasets I'm using, as they represent a more typical real-world dataset, or at least to consider adding a Date type column to the aggregation (and join) operations. I can always add the code to create a 1Bn-row dataset as well.

@jangorecki

I would start by installing Linux on your machine and checking there. In my opinion, Windows has historically not been very reliable at resource utilization.

@AdrianAntico
Author

I prefer to stick with Windows, as that is the most common operating system people use to work with data, and I'm looking to benchmark the common usage scenario. If DuckDB can be installed in an easy way that takes advantage of multithreading, then I will install it that way. Currently, I'm not finding a clear path to installing it on Windows in such a fashion, which means most users are simply installing it the standard way; for example, R users will just run install.packages("duckdb"). If there were a way to install it for multithreaded use, such as install.packages("duckdb", multithread=TRUE) or something like that, then I could easily see most users taking advantage of multithreading. As far as I know, DuckDB is the only one of the packages available on Windows that requires this special setup. I still haven't replicated this for Julia, so I may be wrong about their packages.

What confuses me is why it is a challenge to enable multithreading for DuckDB when it's available with a basic install of the other packages.
