-
🤔 I tried this with the ClickBench dataset (https://github.com/ClickHouse/ClickBench#data-loading), but Polars didn't seem to support it.
When I tried with a parquet file from TPCH scale factor 10, I was able to replicate your results. I think the issue is that Polars reads the file in parallel while DataFusion reads it with a single thread. I will file an issue about this.
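To illustrate the single-threaded versus parallel distinction in general terms, here is a minimal sketch using only the Rust standard library. It is not DataFusion or Polars code: `decode_chunk`, `read_single_threaded`, and `read_parallel` are hypothetical names, and summing integers merely stands in for decoding one independent piece of a file (for example, a Parquet row group).

```rust
use std::thread;

/// Hypothetical stand-in for decoding one independent chunk of a file.
fn decode_chunk(chunk: &[u64]) -> u64 {
    chunk.iter().sum()
}

/// Processes the chunks one after another on the calling thread.
fn read_single_threaded(chunks: &[Vec<u64>]) -> u64 {
    chunks.iter().map(|c| decode_chunk(c)).sum()
}

/// Processes each chunk on its own scoped thread, then combines the results.
fn read_parallel(chunks: &[Vec<u64>]) -> u64 {
    thread::scope(|s| {
        // Spawn one worker per chunk; scoped threads may borrow `chunks`.
        let handles: Vec<_> = chunks
            .iter()
            .map(|c| s.spawn(move || decode_chunk(c)))
            .collect();
        // Wait for every worker and add up the partial results.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    // Eight synthetic chunks of 1000 values each.
    let chunks: Vec<Vec<u64>> = (0..8u64)
        .map(|i| (i * 1000..(i + 1) * 1000).collect())
        .collect();
    // Both strategies must produce the same answer; only wall-clock time differs.
    assert_eq!(read_single_threaded(&chunks), read_parallel(&chunks));
    println!("total = {}", read_parallel(&chunks));
}
```

Both functions return the same result. When the chunks are large and independent, the parallel version can use several cores, which is the kind of gap suspected above.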
-
Thank you @alamb. If it helps, I can add one observation: I was monitoring memory usage with Task Manager on Windows. With Polars, memory usage increases steadily until the data is completely loaded. With DataFusion, loading proceeds at the same speed as Polars until about half the dataset is loaded, then it slows down and continues at that lower speed until fully loaded. I can provide Task Manager screenshots if that would help.
-
Is it expected that loading a ~6.5 GB parquet file into memory shows a huge difference between Polars and DataFusion?
DataFusion's `.cache()` method takes ~2 minutes. Loading the same data with Polars takes ~15 s. A minimal working example is below.
Both are run with `cargo run`, in case `--release` makes a difference.
Code:
Cargo.toml
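Separately from the example above, and since the build profile is raised as a possible factor, here is a minimal sketch of a timing helper that reports the profile alongside the elapsed time. It uses only the standard library. `timed` and `build_profile` are hypothetical helper names, the closure is a placeholder for the actual load call, and the profile check assumes the default Cargo settings, where `debug_assertions` is enabled for `cargo run` and disabled for `cargo run --release`.

```rust
use std::time::{Duration, Instant};

/// Reports the likely build profile, assuming default Cargo profile settings
/// (debug assertions on in dev builds, off in release builds).
fn build_profile() -> &'static str {
    if cfg!(debug_assertions) {
        "debug"
    } else {
        "release"
    }
}

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let value = f();
    (value, start.elapsed())
}

fn main() {
    // Placeholder workload; in a real comparison this closure would wrap the load call.
    let (rows, elapsed) = timed(|| (0..1_000_000u64).count());
    println!("[{}] loaded {} rows in {:?}", build_profile(), rows, elapsed);
}
```

Printing the profile next to each measurement makes it clear whether two timings were taken under the same optimization settings.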