From cb4413a4cc8bd19c48da5ce4f4a09d617dad72c9 Mon Sep 17 00:00:00 2001
From: Gabor Szarnyas
Date: Sat, 7 Dec 2024 12:37:08 +0100
Subject: [PATCH] Nits

---
 _posts/2022-10-28-lightweight-compression.md  | 26 +++++++++----------
 _posts/2023-08-04-adbc.md                     |  2 +-
 ...-05-csv-files-dethroning-parquet-or-not.md |  2 +-
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/_posts/2022-10-28-lightweight-compression.md b/_posts/2022-10-28-lightweight-compression.md
index f1b5c3c2904..244e153375f 100644
--- a/_posts/2022-10-28-lightweight-compression.md
+++ b/_posts/2022-10-28-lightweight-compression.md
@@ -19,18 +19,18 @@ Column store formats, such as DuckDB's native file format or [Parquet]({% post_u
 
 DuckDB added support for compression [at the end of last year](https://github.com/duckdb/duckdb/pull/2099). As shown in the table below, the compression ratio of DuckDB has continuously improved since then and is still actively being improved. In this blog post, we discuss how compression in DuckDB works, and the design choices and various trade-offs that we have made while implementing compression for DuckDB's storage format.
 
-| Version                | Taxi   | On Time      | Lineitem | Notes          | Date           |
-|:-----------------------|-------:|-------------:|---------:|:---------------|:---------------|
-| DuckDB v0.2.8          | 15.3GB | 1.73GB       | 0.85GB   | Uncompressed   | July 2021      |
-| DuckDB v0.2.9          | 11.2GB | 1.25GB       | 0.79GB   | RLE + Constant | September 2021 |
-| DuckDB v0.3.2          | 10.8GB | 0.98GB       | 0.56GB   | Bitpacking     | February 2022  |
-| DuckDB v0.3.3          | 6.9GB  | 0.23GB       | 0.32GB   | Dictionary     | April 2022     |
-| DuckDB v0.5.0          | 6.6GB  | 0.21GB       | 0.29GB   | FOR            | September 2022 |
-| DuckDB dev             | 4.8GB  | 0.21GB       | 0.17GB   | FSST + Chimp   | `now()`        |
-| CSV                    | 17.0GB | 1.11GB       | 0.72GB   |                |                |
-| Parquet (Uncompressed) | 4.5GB  | 0.12GB       | 0.31GB   |                |                |
-| Parquet (Snappy)       | 3.2GB  | 0.11GB       | 0.18GB   |                |                |
-| Parquet (ZSTD)         | 2.6GB  | 0.08GB       | 0.15GB   |                |                |
+| Version                | Taxi   | On Time      | `lineitem` | Notes          | Date           |
+|:-----------------------|-------:|-------------:|-----------:|:---------------|:---------------|
+| DuckDB v0.2.8          | 15.3GB | 1.73GB       | 0.85GB     | Uncompressed   | July 2021      |
+| DuckDB v0.2.9          | 11.2GB | 1.25GB       | 0.79GB     | RLE + Constant | September 2021 |
+| DuckDB v0.3.2          | 10.8GB | 0.98GB       | 0.56GB     | Bitpacking     | February 2022  |
+| DuckDB v0.3.3          | 6.9GB  | 0.23GB       | 0.32GB     | Dictionary     | April 2022     |
+| DuckDB v0.5.0          | 6.6GB  | 0.21GB       | 0.29GB     | FOR            | September 2022 |
+| DuckDB dev             | 4.8GB  | 0.21GB       | 0.17GB     | FSST + Chimp   | `now()`        |
+| CSV                    | 17.0GB | 1.11GB       | 0.72GB     |                |                |
+| Parquet (Uncompressed) | 4.5GB  | 0.12GB       | 0.31GB     |                |                |
+| Parquet (Snappy)       | 3.2GB  | 0.11GB       | 0.18GB     |                |                |
+| Parquet (ZSTD)         | 2.6GB  | 0.08GB       | 0.15GB     |                |                |
 
 ## Compression Intro
 
@@ -180,7 +180,7 @@ ORDER BY row_group_id;
 ```
 
 | row_group_id | column_name        | column_id | segment_type | count | compression  |
-|--------------|:-------------------|-----------|:-------------|-------|:-------------|
+|-------------:|:-------------------|----------:|:-------------|------:|:-------------|
 | 4            | extra              | 13        | FLOAT        | 65536 | Chimp        |
 | 20           | tip_amount         | 15        | FLOAT        | 65536 | Chimp        |
 | 26           | pickup_latitude    | 6         | VALIDITY     | 65536 | Constant     |
diff --git a/_posts/2023-08-04-adbc.md b/_posts/2023-08-04-adbc.md
index 89770428439..7fed18dd064 100644
--- a/_posts/2023-08-04-adbc.md
+++ b/_posts/2023-08-04-adbc.md
@@ -96,7 +96,7 @@ with con.cursor() as cursor:
 
 ## Benchmark ADBC vs ODBC
 
-In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the Lineitem table of TPC-H SF 1. You can find the repository with the code used to run this benchmark [here](https://github.com/pdet/connector_benchmark).
+In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the `lineitem` table of TPC-H SF 1. You can find the repository with the code used to run this benchmark [here](https://github.com/pdet/connector_benchmark).
 
 | Name        | Time (s) |
 |-------------|---------:|
diff --git a/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md b/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
index 5926e68f40e..ebd33532fde 100644
--- a/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
+++ b/_posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
@@ -160,7 +160,7 @@ However, it is still important to consider this in the comparison. In practice,
 
 We will run two different TPC-H queries on our files.
 
-**Query 01.** First, we run TPC-H Q01. This query operates solely on the `Lineitem` table, performing an aggregation and grouping with a filter. It filters on one column and projects 7 out of the 16 columns from `Lineitem`.
+**Query 01.** First, we run TPC-H Q01. This query operates solely on the `lineitem` table, performing an aggregation and grouping with a filter. It filters on one column and projects 7 out of the 16 columns from `lineitem`.
 Therefore, this query will stress the filter pushdown, which is [supported by the Parquet reader]({% link docs/data/parquet/overview.md %}#partial-reading) but not the CSV reader, and the projection pushdown, which is supported by both.