Script for Benchmarking Exercise #5641
Comments
This seems like a duplicate of the h2o benchmarks, which appear to be continued by DuckDB at https://duckdblabs.github.io/db-benchmark/
That benchmark only focuses on GroupBy and JoinTable.
I would advise using Linux for benchmarking.
If you use the cloud, you will normally use Linux. I suggest that the benchmark be tested on the following categories, selecting all records except in the filter test: compare CSV file sizes of 1 GB, 10 GB and 100 GB, plus the corresponding Parquet files.
Benchmarking this way makes the results easier for business users to understand.
People told me that R data.table is very fast. For a better benchmark design, the query should cover the complete process of ReadFile -> Query -> WriteFile.

question = "sum v3 count by id1:id6" # q10

My machine has only 32 GB of RAM and 8 cores; the file size is 67 GB. The 179 seconds below covers reading and writing the file.

D:\Peaks>do GroupBySumCount
CurrentSetting{StreamMB(1000)Thread(100)}
GroupBy{1000MillionRows.csv | Ledger, Account, PartNo,Project,Contact,Unit Code, D/C,Currency => Count()Sum(Base Amount)~ Table}
WriteFile{Table ~ PeaksGroupByAllTextColumns.csv}
Duration: 179.065 seconds
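For reference, q10 expressed in data.table as a complete ReadFile -> Query -> WriteFile run could look roughly like the sketch below; the input file name, output file name and thread count are placeholders, not taken from the benchmark itself.

library(data.table)
setDTthreads(10)                                          # placeholder thread count

t0  <- Sys.time()
DT  <- fread("input.csv")                                 # ReadFile (hypothetical path)
res <- DT[, .(v3 = sum(v3), count = .N), by = id1:id6]    # q10: sum v3 count by id1:id6
fwrite(res, "q10_result.csv")                             # WriteFile (hypothetical path)
print(Sys.time() - t0)                                    # end-to-end duration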
If one is interested in query time, then why include reading and writing? If you look carefully at the issues in h2oai's project, you should find suggestions to include the complete workflow rather than atomic operations, but it definitely should not be the default. It is better to present timings for atomic operations and let people do the maths themselves according to their use cases.
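If atomic timings are preferred, the same workflow can be timed per stage and the total derived from the parts; a minimal sketch, with the same placeholder file names as above:

library(data.table)

t_read  <- system.time(DT  <- fread("input.csv"))                                # ReadFile
t_query <- system.time(res <- DT[, .(v3 = sum(v3), count = .N), by = id1:id6])   # Query only
t_write <- system.time(fwrite(res, "q10_result.csv"))                            # WriteFile
timings <- c(read = t_read[["elapsed"]], query = t_query[["elapsed"]], write = t_write[["elapsed"]])
print(timings)       # per-stage elapsed seconds
print(sum(timings))  # end-to-end total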
What the end user feels is the total processing time from pressing the return key to seeing the file. It has become popular for software to support streaming, where ReadFile -> Query -> WriteFile run in parallel, so in fact there is no way to measure the time of the query alone. Certain software spends extensive time on extraction to build a better cube to support faster queries; if data extraction time is not counted, that is obviously not fair.
Below is my first time using R data.table. 100,000 rows * 100 files requires 8.787 seconds, while DuckDB requires 1.078 seconds. I plan to publish new benchmarks, but I am concerned that the R data.table script is not optimized for fastest performance. The real benchmark will use 3,000 files. The coming benchmark is an extension of my recently published benchmark https://youtu.be/gnIh6r7Gwh4 Would you help to optimize it?

Sample data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Benchmark20230602/1.csv

library(data.table)
s <- Sys.time()
setDTthreads(10)
e <- Sys.time()
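The body of the script above is truncated, so the following is only a guess at what the 100-file run might look like in idiomatic data.table; the file naming (1.csv .. 100.csv), the grouping columns (taken from the Peaks GroupBy command earlier in the thread) and the output file name are assumptions.

library(data.table)
setDTthreads(10)

s <- Sys.time()
files <- sprintf("%d.csv", 1:100)                 # assumed naming: 1.csv .. 100.csv
DT  <- rbindlist(lapply(files, fread))            # read and stack the 100 files
res <- DT[, .(Count = .N, SumBaseAmount = sum(`Base Amount`)),
          by = .(Ledger, Account, PartNo, Project, Contact, `Unit Code`, `D/C`, Currency)]
fwrite(res, "DataTableGroupByAllTextColumns.csv") # hypothetical output name
e <- Sys.time()
print(e - s)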
I have tested 3,000 files with data.table and it triggers out-of-memory. Does it support streaming?
data.table doesn't support streaming, no. |
Streaming is not on the roadmap for data.table itself. There is an example of streaming data in R with data.table in this old video: https://m.youtube.com/watch?v=rvT8XThGA8o Handling data bigger than memory in data.table is on the roadmap, using mmap. AFAIR it is high priority together with long vectors, yet it is not a trivial change.
Using mmap for development, the code may be OS dependent. In fact streaming is simple and reliable. You can have a look at my code.
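data.table has no streaming mode, but an aggregation like the one above does not need all 3,000 files in memory at once; a rough chunked workaround is to aggregate each file on its own and then re-aggregate the partial results. A sketch under the same assumed file naming and column names:

library(data.table)

files <- sprintf("%d.csv", 1:3000)                # assumed naming: 1.csv .. 3000.csv
partial <- lapply(files, function(f) {
  DT <- fread(f)                                  # only one file is held in memory at a time
  DT[, .(Count = .N, BaseAmount = sum(`Base Amount`)),
     by = .(Ledger, Account, Project, Currency)]  # partial aggregate for this file
})
res <- rbindlist(partial)[, .(Count = sum(Count), BaseAmount = sum(BaseAmount)),
                          by = .(Ledger, Account, Project, Currency)]   # combine partials
fwrite(res, "GroupByChunked.csv")                 # hypothetical output name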
Closing as there is nothing here to act on.
I am preparing a benchmark for some high-performance software such as Polars, DuckDB and yours.
I am finding it difficult to build a script similar to the workflow below; I have spent a lot of time learning Polars scripting.
So I want to seek help from each software developer to provide a script similar to the ETL workflow below.
You can find my use cases with runtimes for the Windows/Linux CLI here: https://github.com/hkpeaks/peaks-consolidation/releases
D:\Peaks>do FilterByDifferentCompareOperators100M.txt
Development runtime for testing only
Build Date: 23-05-18 | Expiry Date: 23-08-31
Report Comment: github.com/hkpeaks/peaks-consolidation
Select{100MillionRows.csv | Ledger(=L99,<L20)Project(>B25,<B23)~ Table}
Total Bytes: 7216385229 | Total Batches of Stream: 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Table(14 x 9092000)
Select{Currency(!=C06)}
Table(14 x 8096300)
Select{Account(<=11000, >=18000)}
Table(14 x 2477000)
Select{Quantity(Float100..300,Float600..900)}
Table(14 x 1247600)
Select{Contact(C32..C39)}
Table(14 x 929400)
Select{Contact(!=C33)~ Table2}
Table2(14 x 809100)
WriteFile{Table2 ~ PeaksFilterByDifferentCompareOperators100M.csv}
PeaksFilterByDifferentCompareOperators100M.csv(14 x 809100)
Duration: 9.702 seconds
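A data.table sketch of the same filter chain is shown below. It assumes that comma-separated conditions inside one column are OR'd, that ".." ranges are inclusive, that the string columns compare lexicographically, and that each successive Select applies to the output of the previous one; the output file name is a placeholder.

library(data.table)

DT <- fread("100MillionRows.csv")
DT <- DT[(Ledger == "L99" | Ledger < "L20") & (Project > "B25" | Project < "B23")]  # Select 1
DT <- DT[Currency != "C06"]                                                         # Select 2
DT <- DT[Account <= 11000 | Account >= 18000]                                       # Select 3
DT <- DT[(Quantity >= 100 & Quantity <= 300) | (Quantity >= 600 & Quantity <= 900)] # Select 4
DT <- DT[Contact >= "C32" & Contact <= "C39"]                                       # Select 5
DT <- DT[Contact != "C33"]                                                          # Select 6
fwrite(DT, "DataTableFilterByDifferentCompareOperators100M.csv")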