Script for Benchmarking Exercise #5641

Closed
hkpeaks opened this issue May 21, 2023 · 14 comments

@hkpeaks commented May 21, 2023

I am preparing a benchmark for some high-performance software such as Polars, DuckDB and yours.
I am finding it difficult to build a script similar to the workflow below; I have already spent a lot of time learning Polars scripting.
So I would like to ask each software's developers to provide a script matching the ETL workflow below.
You can find my use cases, with runtimes for the Windows/Linux CLI, here: https://github.com/hkpeaks/peaks-consolidation/releases

D:\Peaks>do FilterByDifferentCompareOperators100M.txt
Development runtime for testing only
Build Date: 23-05-18 | Expiry Date: 23-08-31
Report Comment: github.com/hkpeaks/peaks-consolidation

Select{100MillionRows.csv | Ledger(=L99,<L20)Project(>B25,<B23)~ Table}
Total Bytes: 7216385229 | Total Batches of Stream: 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Table(14 x 9092000)

Select{Currency(!=C06)}
Table(14 x 8096300)

Select{Account(<=11000, >=18000)}
Table(14 x 2477000)

Select{Quantity(Float100..300,Float600..900)}
Table(14 x 1247600)

Select{Contact(C32..C39)}
Table(14 x 929400)

Select{Contact(!=C33)~ Table2}
Table2(14 x 809100)

WriteFile{Table2 ~ PeaksFilterByDifferentCompareOperators100M.csv}
PeaksFilterByDifferentCompareOperators100M.csv(14 x 809100)

Duration: 9.702 seconds
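
For reference, a rough data.table equivalent of this filter chain might look like the sketch below. It is only an illustration: it assumes that the comma-separated conditions inside one Peaks column filter are OR-ed, that string comparisons are lexicographic, and that 100MillionRows.csv has the column names shown in the log; the exact Peaks semantics would need to be confirmed.

# Rough data.table sketch of the filter chain above (illustration only, not the Peaks engine).
# Assumptions: comma-separated conditions within one Peaks column filter are OR-ed,
# string comparisons are lexicographic, and the CSV headers match the names in the log.
library(data.table)

DT <- fread("100MillionRows.csv")

DT  <- DT[(Ledger == "L99" | Ledger < "L20") & (Project > "B25" | Project < "B23")]
DT  <- DT[Currency != "C06"]
DT  <- DT[Account <= 11000 | Account >= 18000]
DT  <- DT[(Quantity >= 100 & Quantity <= 300) | (Quantity >= 600 & Quantity <= 900)]
DT  <- DT[Contact >= "C32" & Contact <= "C39"]
DT2 <- DT[Contact != "C33"]

fwrite(DT2, "PeaksFilterByDifferentCompareOperators100M.csv")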

@ben-schwen (Member)

This seems like a duplicate of the h2o benchmarks, which appear to be continued by DuckDB at https://duckdblabs.github.io/db-benchmark/

@hkpeaks (Author) commented May 22, 2023

That benchmark focuses only on GroupBy and JoinTable.
None of the software can complete the JoinTable test for a data size of 50GB,
but Peaks can do it using 32GB of memory, so I suspect the benchmark may use an incorrectly configured script for the join.
I want to run other benchmarks as well, each focusing on filter, distinct, sorting, add column, etc.

@jangorecki (Member)

I would advise using Linux for benchmarking.

@hkpeaks (Author) commented May 22, 2023

If running in the cloud, people will normally use Linux.

I suggest testing the benchmark on the following categories, selecting all records except in the filter test.
Scripts should be provided by each software's developers (as they know how to configure their script for best performance).

Compare CSV file sizes of 1GB, 10GB and 100GB, plus the equivalent Parquet files:

  • Select all records for Sorting (covering 3+ sort columns)
  • Select all records for GroupBy (covering 3+ GroupBy columns; aggregate columns: count, min, max, median, sum)
  • Select all records for Distinct (covering 3+ Distinct columns)
  • Select all records for JoinTable (full join, covering a composite key of 3+ columns)
  • Different comparison operators (>, <, >=, <=, !=, ==, ..) filtering matched and unmatched rows, where .. is a range operator;
    output matched and unmatched rows to files = 100% of the original records (a data.table sketch of this follows below)

Benchmarking this way is more easily understood by business users.
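
As a concrete illustration of the last category, a minimal data.table sketch could look like this (the input file name and the Account predicate are hypothetical, chosen only to show the matched/unmatched split):

library(data.table)

DT  <- fread("input.csv")                           # hypothetical input file
hit <- DT$Account >= 11000 & DT$Account <= 18000    # example filter predicate, illustrative only

fwrite(DT[hit],  "matched.csv")                     # rows that satisfy the filter
fwrite(DT[!hit], "unmatched.csv")                   # the rest; together they equal 100% of the input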

@hkpeaks (Author) commented May 22, 2023

People told me that R data.table is very fast, but based on
https://h2oai.github.io/db-benchmark/
the time reported for this query is 3:25, 3.00, which is much longer than I expected. It was run on a 40-core CPU with 125GB of memory,
but the time does not seem to include reading and writing the file.

For a better benchmark design, the query should cover the complete ReadFile -> Query -> WriteFile process.
Peaks Consolidation supports a streaming model in which ReadFile -> Query -> WriteFile run in parallel, so it can join a 1-billion-row table using only 32GB of memory.

# Excerpt from the db-benchmark data.table groupby script (q10); x and the logging
# variables (task, data_name, solution, ver, git, fun, cache, on_disk) are defined
# earlier in that script.
question = "sum v3 count by id1:id6" # q10
t = system.time(print(dim(ans<-x[, .(v3=sum(v3, na.rm=TRUE), count=.N), by=id1:id6])))[["elapsed"]]
m = memory_usage()
chkt = system.time(chk<-ans[, .(sum(v3), sum(bit64::as.integer64(count)))])[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
rm(ans)
# Second timed run of the same query.
t = system.time(print(dim(ans<-x[, .(v3=sum(v3, na.rm=TRUE), count=.N), by=id1:id6])))[["elapsed"]]
m = memory_usage()
chkt = system.time(chk<-ans[, .(sum(v3), sum(bit64::as.integer64(count)))])[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
print(head(ans, 3))
print(tail(ans, 3))
rm(ans)

@hkpeaks (Author) commented May 22, 2023

My machine has only 32GB of memory and 8 cores, and the file size is 67GB; the 179 seconds below covers reading and writing the file.
The test above used 125GB of memory and 40 cores with a 50GB file. The totals are similar, but that test does not seem to cover writing the file.

D:\Peaks>do GroupBySumCount
Development runtime for testing only
Build Date: 23-05-18 | Expiry Date: 23-08-31
Report Comment: github.com/hkpeaks/peaks-consolidation

CurrentSetting{StreamMB(1000)Thread(100)}

GroupBy{1000MillionRows.csv | Ledger, Account, PartNo,Project,Contact,Unit Code, D/C,Currency => Count()Sum(Base Amount)~ Table}
Total Bytes: 72163851129 | Total Batches of Stream: 72
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 Table(10 x 99696)

WriteFile{Table ~ PeaksGroupByAllTextColumns.csv}
PeaksGroupByAllTextColumns.csv(10 x 99696)

Duration: 179.065 seconds

@jangorecki (Member)

If one is interested in query time, then why include reading and writing? If you look carefully at the issues in h2oai's project, you should find suggestions to include the complete workflow rather than atomic operations, but that definitely should not be the default. It is better to present timings for atomic operations and let people do the maths themselves according to their use cases.

@hkpeaks (Author) commented May 22, 2023

What the end user perceives as time is the total processing time from pressing the return key to seeing the output file. It is becoming popular for software to support streaming, where ReadFile -> Query -> WriteFile run in parallel; in that case there is no way to measure the query time alone. Some software spends extensive time on extraction to build a better cube to support faster queries; if that data-extraction time is not counted, the comparison is obviously not fair.

@hkpeaks (Author) commented Jun 5, 2023

Below is my first time using R data.table. 100,000 rows * 100 files requires 8.787 seconds, while DuckDB requires 1.078s. I plan to publish new benchmarks, but I am concerned that the R data.table script is not optimized for best performance. The real benchmark will use 3,000 files. The coming benchmark is an extension of my recently published benchmark https://youtu.be/gnIh6r7Gwh4

Would you help to optimize it?

Sample data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Benchmark20230602/1.csv

library(data.table)

s <- Sys.time()

setDTthreads(10)
temp <- list.files("D:/Benchmark/Input/Copy100", full.names = TRUE, pattern = "\\.csv$")
DT <- rbindlist(lapply(temp, fread), fill = TRUE)
DT <- DT[Ledger >= 'L30' & Ledger <= 'L70', .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
fwrite(DT, 'output/DataTableGroupByCSV.csv')

e <- Sys.time()
print(paste0("R-Data.Table FilterGroupBy CSV Time = ", round(e - s, 3)))
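
One possible direction for tightening the script above, offered only as a sketch that reuses the same path, thread count and column names: read just the needed columns with fread's select argument and filter each file as it is read, so less data is concatenated before the GroupBy.

# Sketch of a possible optimization of the script above (not a tuned reference script):
# read only the columns used and filter per file before combining.
library(data.table)
setDTthreads(10)

s <- Sys.time()

files <- list.files("D:/Benchmark/Input/Copy100", full.names = TRUE, pattern = "\\.csv$")
cols  <- c("Ledger", "Account", "DC", "Currency", "Base_Amount")

DT <- rbindlist(lapply(files, function(f)
  fread(f, select = cols)[Ledger >= "L30" & Ledger <= "L70"]))

res <- DT[, .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
fwrite(res, "output/DataTableGroupByCSV.csv")

e <- Sys.time()
print(paste0("R-Data.Table FilterGroupBy CSV Time = ", round(e - s, 3)))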

@hkpeaks (Author) commented Jun 6, 2023

I have tested data.table with 3,000 files, and it triggers an out-of-memory error:
duckdblabs/db-benchmark#13 (comment)
"Error: cannot allocate vector of size 1.1 Gb Execution halted"

Does it support streaming?

@avimallu (Contributor) commented Jun 6, 2023

data.table doesn't support streaming, no.

@jangorecki (Member) commented Jun 6, 2023

Streaming is not on the roadmap for data.table itself. There is an example of streaming data in R with data.table in this old video: https://m.youtube.com/watch?v=rvT8XThGA8o
You may try to adapt it to your use case.

Handling data bigger than memory in data.table is on the roadmap, using mmap. AFAIR it is high priority together with long vectors, yet it is not a trivial change.
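
Until something like that lands, one workaround for the 3,000-file case is to process the files in batches and combine partial aggregates, so the full input never has to sit in memory at once. A sketch under an assumed folder name, reusing the column names from the earlier script:

# Batch-wise workaround sketch (assumed folder name; column names taken from the script above).
library(data.table)

files   <- list.files("D:/Benchmark/Input/Copy3000", full.names = TRUE, pattern = "\\.csv$")
batches <- split(files, ceiling(seq_along(files) / 100))     # 100 files per batch

partials <- lapply(batches, function(fs) {
  DT <- rbindlist(lapply(fs, fread), fill = TRUE)
  DT[Ledger >= "L30" & Ledger <= "L70",
     .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
})

# Sums of per-batch sums equal the sum over all rows, so this matches a single big GroupBy.
res <- rbindlist(partials)[, .(Total_Base_Amount = sum(Total_Base_Amount)),
                           by = .(Ledger, Account, DC, Currency)]
fwrite(res, "output/DataTableGroupByCSV_batched.csv")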

@hkpeaks (Author) commented Jun 7, 2023

Using mmap for the implementation may make the code OS-dependent.

In fact, streaming is simple and reliable. You can have a look at my code:

pola-rs/polars#9201 (comment)

@jangorecki (Member)

Closing, as there is nothing here to act on.
