Script for Benchmarking Exercise #5641

Closed
hkpeaks opened this issue May 21, 2023 · 14 comments

@hkpeaks commented May 21, 2023

I am preparing a benchmark for some high-performance software such as Polars, DuckDB and yours.
I am finding it difficult to build a script similar to the workflow below; I have already spent a lot of time learning Polars scripting.
So I would like to ask each software's developers to provide a script matching the ETL workflow below.
You can find my use cases, with runtimes for the Windows/Linux CLI, here: https://github.com/hkpeaks/peaks-consolidation/releases

D:\Peaks>do FilterByDifferentCompareOperators100M.txt
Development runtime for testing only
Build Date: 23-05-18 | Expiry Date: 23-08-31
Report Comment: github.com/hkpeaks/peaks-consolidation

Select{100MillionRows.csv | Ledger(=L99,<L20)Project(>B25,<B23)~ Table}
Total Bytes: 7216385229 | Total Batches of Stream: 14
1 2 3 4 5 6 7 8 9 10 11 12 13 14 Table(14 x 9092000)

Select{Currency(!=C06)}
Table(14 x 8096300)

Select{Account(<=11000, >=18000)}
Table(14 x 2477000)

Select{Quantity(Float100..300,Float600..900)}
Table(14 x 1247600)

Select{Contact(C32..C39)}
Table(14 x 929400)

Select{Contact(!=C33)~ Table2}
Table2(14 x 809100)

WriteFile{Table2 ~ PeaksFilterByDifferentCompareOperators100M.csv}
PeaksFilterByDifferentCompareOperators100M.csv(14 x 809100)

Duration: 9.702 seconds
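
For reference, a rough data.table equivalent of this filter chain might look like the sketch below. It is only an illustration: it assumes that the comma-separated conditions inside one Peaks column filter are OR-ed, that string comparisons are lexicographic, and that 100MillionRows.csv has the column names shown in the log; the exact Peaks semantics would need to be confirmed.

# Rough data.table sketch of the filter chain above (illustration only, not the Peaks engine).
# Assumptions: comma-separated conditions within one Peaks column filter are OR-ed,
# string comparisons are lexicographic, and the CSV headers match the names in the log.
library(data.table)

DT <- fread("100MillionRows.csv")

DT  <- DT[(Ledger == "L99" | Ledger < "L20") & (Project > "B25" | Project < "B23")]
DT  <- DT[Currency != "C06"]
DT  <- DT[Account <= 11000 | Account >= 18000]
DT  <- DT[(Quantity >= 100 & Quantity <= 300) | (Quantity >= 600 & Quantity <= 900)]
DT  <- DT[Contact >= "C32" & Contact <= "C39"]
DT2 <- DT[Contact != "C33"]

fwrite(DT2, "PeaksFilterByDifferentCompareOperators100M.csv")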

@ben-schwen (Member)

This seems like a duplicate of the h2o benchmarks, which appear to be continued by DuckDB at https://duckdblabs.github.io/db-benchmark/

@hkpeaks (Author) commented May 22, 2023

That benchmark focuses only on GroupBy and JoinTable.
None of the software can complete the JoinTable test for a data size of 50GB,
but Peaks can do it using 32GB of memory, so I suspect the benchmark may use an incorrectly configured script for the join.
I want to run other benchmarks as well, each focusing on filter, distinct, sorting, add column, etc.

@jangorecki (Member)

I would advise using Linux for benchmarking.

@hkpeaks (Author) commented May 22, 2023

If running in the cloud, people will normally use Linux.

I suggest testing the benchmark on the following categories, selecting all records except in the filter test.
Scripts should be provided by each software's developers (as they know how to configure their script for best performance).

Compare CSV file sizes of 1GB, 10GB and 100GB, plus the equivalent Parquet files:

  • Select all records for Sorting (covering 3+ sort columns)
  • Select all records for GroupBy (covering 3+ GroupBy columns; aggregate columns: count, min, max, median, sum)
  • Select all records for Distinct (covering 3+ Distinct columns)
  • Select all records for JoinTable (full join, covering a composite key of 3+ columns)
  • Different comparison operators (>, <, >=, <=, !=, ==, ..) filtering matched and unmatched rows, where .. is a range operator;
    output matched and unmatched rows to files = 100% of the original records (a data.table sketch of this follows below)

Benchmarking this way is more easily understood by business users.
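
As a concrete illustration of the last category, a minimal data.table sketch could look like this (the input file name and the Account predicate are hypothetical, chosen only to show the matched/unmatched split):

library(data.table)

DT  <- fread("input.csv")                           # hypothetical input file
hit <- DT$Account >= 11000 & DT$Account <= 18000    # example filter predicate, illustrative only

fwrite(DT[hit],  "matched.csv")                     # rows that satisfy the filter
fwrite(DT[!hit], "unmatched.csv")                   # the rest; together they equal 100% of the input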

@hkpeaks (Author) commented May 22, 2023

People told me that R data.table is very fast, but based on
https://h2oai.github.io/db-benchmark/
the time reported for this query is 3:25, 3.00, which is much longer than I expected. It was run on a 40-core CPU with 125GB of memory,
but the time does not seem to include reading and writing the file.

For a better benchmark design, the query should cover the complete ReadFile -> Query -> WriteFile process.
Peaks Consolidation supports a streaming model in which ReadFile -> Query -> WriteFile run in parallel, so it can join a 1-billion-row table using only 32GB of memory.

# Excerpt from the db-benchmark data.table groupby script (q10); x and the logging
# variables (task, data_name, solution, ver, git, fun, cache, on_disk) are defined
# earlier in that script.
question = "sum v3 count by id1:id6" # q10
t = system.time(print(dim(ans<-x[, .(v3=sum(v3, na.rm=TRUE), count=.N), by=id1:id6])))[["elapsed"]]
m = memory_usage()
chkt = system.time(chk<-ans[, .(sum(v3), sum(bit64::as.integer64(count)))])[["elapsed"]]
write.log(run=1L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
rm(ans)
# Second timed run of the same query.
t = system.time(print(dim(ans<-x[, .(v3=sum(v3, na.rm=TRUE), count=.N), by=id1:id6])))[["elapsed"]]
m = memory_usage()
chkt = system.time(chk<-ans[, .(sum(v3), sum(bit64::as.integer64(count)))])[["elapsed"]]
write.log(run=2L, task=task, data=data_name, in_rows=nrow(x), question=question, out_rows=nrow(ans), out_cols=ncol(ans), solution=solution, version=ver, git=git, fun=fun, time_sec=t, mem_gb=m, cache=cache, chk=make_chk(chk), chk_time_sec=chkt, on_disk=on_disk)
print(head(ans, 3))
print(tail(ans, 3))
rm(ans)

@hkpeaks (Author) commented May 22, 2023

My machine has only 32GB of memory and 8 cores, and the file size is 67GB; the 179 seconds below covers reading and writing the file.
The test above used 125GB of memory and 40 cores with a 50GB file. The totals are similar, but that test does not seem to cover writing the file.

D:\Peaks>do GroupBySumCount
Development runtime for testing only
Build Date: 23-05-18 | Expiry Date: 23-08-31
Report Comment: github.com/hkpeaks/peaks-consolidation

CurrentSetting{StreamMB(1000)Thread(100)}

GroupBy{1000MillionRows.csv | Ledger, Account, PartNo,Project,Contact,Unit Code, D/C,Currency => Count()Sum(Base Amount)~ Table}
Total Bytes: 72163851129 | Total Batches of Stream: 72
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 Table(10 x 99696)

WriteFile{Table ~ PeaksGroupByAllTextColumns.csv}
PeaksGroupByAllTextColumns.csv(10 x 99696)

Duration: 179.065 seconds

@jangorecki (Member)

If one is interested in query time, then why include reading and writing? If you look carefully at the issues in h2oai's project, you should find suggestions to include the complete workflow rather than atomic operations, but that definitely should not be the default. It is better to present timings for atomic operations and let people do the maths themselves according to their use cases.

@hkpeaks (Author) commented May 22, 2023

What the end user perceives as time is the total processing time from pressing the return key to seeing the output file. It is becoming popular for software to support streaming, where ReadFile -> Query -> WriteFile run in parallel; in that case there is no way to measure the query time alone. Some software spends extensive time on extraction to build a better cube to support faster queries; if that data-extraction time is not counted, the comparison is obviously not fair.

@hkpeaks (Author) commented Jun 5, 2023

Below is my first time using R data.table. 100,000 rows * 100 files requires 8.787 seconds, while DuckDB requires 1.078s. I plan to publish new benchmarks, but I am concerned that the R data.table script is not optimized for best performance. The real benchmark will use 3,000 files. The coming benchmark is an extension of my recently published benchmark https://youtu.be/gnIh6r7Gwh4

Would you help to optimize it?

Sample data: https://github.com/hkpeaks/peaks-consolidation/blob/main/Benchmark20230602/1.csv

library(data.table)

s <- Sys.time()

setDTthreads(10)
temp <- list.files("D:/Benchmark/Input/Copy100", full.names = TRUE, pattern = "\\.csv$")
DT <- rbindlist(lapply(temp, fread), fill = TRUE)
DT <- DT[Ledger >= 'L30' & Ledger <= 'L70', .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
fwrite(DT, 'output/DataTableGroupByCSV.csv')

e <- Sys.time()
print(paste0("R-Data.Table FilterGroupBy CSV Time = ", round(e - s, 3)))
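
One possible direction for tightening the script above, offered only as a sketch that reuses the same path, thread count and column names: read just the needed columns with fread's select argument and filter each file as it is read, so less data is concatenated before the GroupBy.

# Sketch of a possible optimization of the script above (not a tuned reference script):
# read only the columns used and filter per file before combining.
library(data.table)
setDTthreads(10)

s <- Sys.time()

files <- list.files("D:/Benchmark/Input/Copy100", full.names = TRUE, pattern = "\\.csv$")
cols  <- c("Ledger", "Account", "DC", "Currency", "Base_Amount")

DT <- rbindlist(lapply(files, function(f)
  fread(f, select = cols)[Ledger >= "L30" & Ledger <= "L70"]))

res <- DT[, .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
fwrite(res, "output/DataTableGroupByCSV.csv")

e <- Sys.time()
print(paste0("R-Data.Table FilterGroupBy CSV Time = ", round(e - s, 3)))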

@hkpeaks (Author) commented Jun 6, 2023

I have tested data.table with 3,000 files, and it triggers an out-of-memory error:
duckdblabs/db-benchmark#13 (comment)
"Error: cannot allocate vector of size 1.1 Gb Execution halted"

Does it support streaming?

@avimallu (Contributor) commented Jun 6, 2023

data.table doesn't support streaming, no.

@jangorecki (Member) commented Jun 6, 2023

Streaming is not on the roadmap for data.table itself. There is an example of streaming data in R with data.table in this old video: https://m.youtube.com/watch?v=rvT8XThGA8o
You may try to adapt it to your use case.

Handling data bigger than memory in data.table is on the roadmap, using mmap. AFAIR it is high priority together with long vectors, yet it is not a trivial change.
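
Until something like that lands, one workaround for the 3,000-file case is to process the files in batches and combine partial aggregates, so the full input never has to sit in memory at once. A sketch under an assumed folder name, reusing the column names from the earlier script:

# Batch-wise workaround sketch (assumed folder name; column names taken from the script above).
library(data.table)

files   <- list.files("D:/Benchmark/Input/Copy3000", full.names = TRUE, pattern = "\\.csv$")
batches <- split(files, ceiling(seq_along(files) / 100))     # 100 files per batch

partials <- lapply(batches, function(fs) {
  DT <- rbindlist(lapply(fs, fread), fill = TRUE)
  DT[Ledger >= "L30" & Ledger <= "L70",
     .(Total_Base_Amount = sum(Base_Amount)), by = .(Ledger, Account, DC, Currency)]
})

# Sums of per-batch sums equal the sum over all rows, so this matches a single big GroupBy.
res <- rbindlist(partials)[, .(Total_Base_Amount = sum(Total_Base_Amount)),
                           by = .(Ledger, Account, DC, Currency)]
fwrite(res, "output/DataTableGroupByCSV_batched.csv")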

@hkpeaks (Author) commented Jun 7, 2023

Using mmap for the implementation may make the code OS-dependent.

In fact, streaming is simple and reliable. You can have a look at my code:

pola-rs/polars#9201 (comment)

@jangorecki (Member)

Closing, as there is nothing here to act on.
