I created a benchmark but DuckDB run times are super slow and not sure why #74
Comments
Check compilation flags; I'm not sure whether it is relevant for DuckDB, but for other tools it does matter. For example:
Hi @AdrianAntico, I took a look at this last Wednesday. For the 1 million-row case I got similar results, but for 10 million and 100 million I got different results, which I've attached. I ran these tests on a c6id.8xlarge machine. Thinking about it now, it's possible the difference in results comes from network storage. DuckDB will bring the whole CSV into memory when the table is created; maybe Pandas and data.table lazily create the data frame?
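To illustrate what I mean by bringing the CSV into memory, here is a minimal sketch in R (the file path is made up, and the view-based alternative is only an illustration of a lazier approach, not what the benchmark scripts do):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# Eager: the whole CSV is scanned and materialized when the table is created.
dbExecute(con, "CREATE TABLE bench AS SELECT * FROM read_csv_auto('data_100M.csv')")

# Lazier alternative: a view defers the scan until a query actually touches the data.
dbExecute(con, "CREATE VIEW bench_v AS SELECT * FROM read_csv_auto('data_100M.csv')")

dbDisconnect(con, shutdown = TRUE)
```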
When I get the time I'll run everything locally and see what happens.
@Tmonster I ran the DuckDB.R script and noticed that on the 100M-row dataset multithreading wasn't occurring. At least, it appears that way compared to Polars and data.table when I did the same thing. I did set the statements on DuckDB to enable it using the code below, but I think the native installation of DuckDB in R (install.packages("duckdb")) isn't enabling multithreading by default, or at least that's my best guess given @jangorecki's comment about flag setting. I did check DuckDB's website on installation, and it appears I'd have to build from source and modify some flags, but it wasn't clear in the documentation what those should be, or even whether I need to do that at all to get multithreading (assuming this is true).
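Roughly, the kind of settings I mean look like this (a sketch with placeholder values, using standard duckdb/DBI calls, not the exact statements from my script):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb::duckdb())

# Placeholder values: set to the machine's core count and available memory.
dbExecute(con, "PRAGMA threads=32;")
dbExecute(con, "PRAGMA memory_limit='200GB';")

# Confirm the thread setting actually took effect.
dbGetQuery(con, "SELECT current_setting('threads') AS threads")
```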
So it appears that the installation method, machine type, dataset, local vs. cloud, and perhaps operating system can all play a pretty significant role in these timings. I'm using Windows, with an AMD Ryzen CPU (32 cores / 64 threads) and 256 GB of memory, run locally at home. So, with a native installation of each data frame package and my setup, the timings I'm getting are different from what you're seeing. Also, I think the Date type column in the aggregation gave Polars some trouble. Perhaps there is an opportunity for you to utilize the datasets I'm using, as they represent a more typical real-world dataset, or at least consider adding a Date type column to the aggregation (and join) operations. I can always add the code to create a 1Bn-row dataset as well.
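As a sketch of what generating such a dataset with a Date column could look like (the column names, cardinalities, and file name here are illustrative, not the exact schema in my repo):

```r
library(data.table)

n <- 1e6  # scale up to 1e8 or 1e9 for the larger benchmark datasets

dt <- data.table(
  DateVar  = sample(seq(as.Date("2020-01-01"), as.Date("2023-12-31"), by = "day"), n, replace = TRUE),
  GroupVar = sample(sprintf("G%03d", 1:100), n, replace = TRUE),
  Value1   = rnorm(n),
  Value2   = runif(n)
)

fwrite(dt, "benchmark_data.csv")
```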
I would start by adding Linux to your machine and checking there. In my opinion, Windows used to be not very reliable in resource utilization.
I prefer to stick with Windows, as that is the most common operating system people use to work with data, and I'm looking to benchmark the common usage scenario. If DuckDB can be installed in an easy way that takes advantage of multithreading, then I will install it that way. Currently, I'm not finding a clear path to installing it on Windows in such a fashion, which means most users are simply installing it natively; for example, R users will simply run install.packages("duckdb"). What confuses me is why it is a challenge to enable multithreading for DuckDB when it's available under a basic install of the other packages.
Here's a link to the repo: https://github.com/AdrianAntico/Benchmarks
I use 3 datasets: one with a million rows, one with 10M, and one with 100M. I am currently just running a sum aggregation, with a varying number of grouping variables and aggregated numeric variables. One main difference between those datasets and the ones used here is that I make use of a Date type column in all aggregations. It also seems that Polars has a harder time with that data type. I'm showing data.table to be the fastest for all queries except the single Date type aggregation (which is where DuckDB wins).
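For illustration, the aggregations are of this general shape (a sketch with made-up column names and toy data, not the exact code from the repo):

```r
library(data.table)

# Toy data standing in for the benchmark tables (the real ones have 1M-100M rows).
DT <- data.table(
  DateVar  = as.Date("2023-01-01") + sample(0:364, 1e5, replace = TRUE),
  GroupVar = sample(LETTERS[1:10], 1e5, replace = TRUE),
  Value1   = rnorm(1e5)
)

# Sum aggregation with a Date column among the grouping variables.
agg <- DT[, .(Sum1 = sum(Value1)), by = .(DateVar, GroupVar)]

# The DuckDB equivalent over the same table would be along the lines of:
# SELECT DateVar, GroupVar, SUM(Value1) AS Sum1 FROM bench GROUP BY DateVar, GroupVar
```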
I copied some of the code from this repo. I'm hoping someone can take a look, because the results were a bit unexpected.