Commit

Merge pull request #4292 from szarnyasg/nits-20241207c
Nits
szarnyasg authored Dec 7, 2024
2 parents 9e95429 + cb4413a commit e6e23b4
Showing 3 changed files with 15 additions and 15 deletions.
26 changes: 13 additions & 13 deletions _posts/2022-10-28-lightweight-compression.md
@@ -19,18 +19,18 @@ Column store formats, such as DuckDB's native file format or [Parquet]({% post_u

DuckDB added support for compression [at the end of last year](https://github.com/duckdb/duckdb/pull/2099). As shown in the table below, the compression ratio of DuckDB has continuously improved since then and is still actively being improved. In this blog post, we discuss how compression in DuckDB works, and the design choices and various trade-offs that we have made while implementing compression for DuckDB's storage format.

-| Version                | Taxi   | On Time | Lineitem | Notes          | Date           |
-|:-----------------------|-------:|--------:|---------:|:---------------|:---------------|
-| DuckDB v0.2.8          | 15.3GB | 1.73GB  | 0.85GB   | Uncompressed   | July 2021      |
-| DuckDB v0.2.9          | 11.2GB | 1.25GB  | 0.79GB   | RLE + Constant | September 2021 |
-| DuckDB v0.3.2          | 10.8GB | 0.98GB  | 0.56GB   | Bitpacking     | February 2022  |
-| DuckDB v0.3.3          | 6.9GB  | 0.23GB  | 0.32GB   | Dictionary     | April 2022     |
-| DuckDB v0.5.0          | 6.6GB  | 0.21GB  | 0.29GB   | FOR            | September 2022 |
-| DuckDB dev             | 4.8GB  | 0.21GB  | 0.17GB   | FSST + Chimp   | `now()`        |
-| CSV                    | 17.0GB | 1.11GB  | 0.72GB   |                |                |
-| Parquet (Uncompressed) | 4.5GB  | 0.12GB  | 0.31GB   |                |                |
-| Parquet (Snappy)       | 3.2GB  | 0.11GB  | 0.18GB   |                |                |
-| Parquet (ZSTD)         | 2.6GB  | 0.08GB  | 0.15GB   |                |                |
+| Version                | Taxi   | On Time | `lineitem` | Notes          | Date           |
+|:-----------------------|-------:|--------:|-----------:|:---------------|:---------------|
+| DuckDB v0.2.8          | 15.3GB | 1.73GB  | 0.85GB     | Uncompressed   | July 2021      |
+| DuckDB v0.2.9          | 11.2GB | 1.25GB  | 0.79GB     | RLE + Constant | September 2021 |
+| DuckDB v0.3.2          | 10.8GB | 0.98GB  | 0.56GB     | Bitpacking     | February 2022  |
+| DuckDB v0.3.3          | 6.9GB  | 0.23GB  | 0.32GB     | Dictionary     | April 2022     |
+| DuckDB v0.5.0          | 6.6GB  | 0.21GB  | 0.29GB     | FOR            | September 2022 |
+| DuckDB dev             | 4.8GB  | 0.21GB  | 0.17GB     | FSST + Chimp   | `now()`        |
+| CSV                    | 17.0GB | 1.11GB  | 0.72GB     |                |                |
+| Parquet (Uncompressed) | 4.5GB  | 0.12GB  | 0.31GB     |                |                |
+| Parquet (Snappy)       | 3.2GB  | 0.11GB  | 0.18GB     |                |                |
+| Parquet (ZSTD)         | 2.6GB  | 0.08GB  | 0.15GB     |                |                |

## Compression Intro

@@ -180,7 +180,7 @@ ORDER BY row_group_id;
```

| row_group_id | column_name | column_id | segment_type | count | compression |
-|--------------|:-------------------|-----------|:-------------|-------|:-------------|
+|-------------:|:-------------------|----------:|:-------------|------:|:-------------|
| 4 | extra | 13 | FLOAT | 65536 | Chimp |
| 20 | tip_amount | 15 | FLOAT | 65536 | Chimp |
| 26 | pickup_latitude | 6 | VALIDITY | 65536 | Constant |
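The segment listing above is produced by querying DuckDB's storage metadata; only the tail of the query (`ORDER BY row_group_id;`) survives in this hunk. A minimal sketch of such a query, assuming DuckDB's `pragma_storage_info()` table function (exact column names can vary between versions), applied to a hypothetical table named `taxi`:

```sql
-- Sketch: list per-segment compression choices for the 'taxi' table.
-- pragma_storage_info() exposes one row per stored segment.
SELECT row_group_id, column_name, column_id, segment_type, count, compression
FROM pragma_storage_info('taxi')
ORDER BY row_group_id;
```

Each row group stores its columns in independent segments, which is why different row groups of the same column can end up with different compression schemes (e.g., Chimp for floats, Constant for all-valid validity masks).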
2 changes: 1 addition & 1 deletion _posts/2023-08-04-adbc.md
@@ -96,7 +96,7 @@ with con.cursor() as cursor:

## Benchmark ADBC vs ODBC

-In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the Lineitem table of TPC-H SF 1. You can find the repository with the code used to run this benchmark [here](https://github.com/pdet/connector_benchmark).
+In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the `lineitem` table of TPC-H SF 1. You can find the repository with the code used to run this benchmark [here](https://github.com/pdet/connector_benchmark).

| Name | Time (s) |
|-------------|---------:|
2 changes: 1 addition & 1 deletion _posts/2024-12-05-csv-files-dethroning-parquet-or-not.md
@@ -160,7 +160,7 @@ However, it is still important to consider this in the comparison. In practice,

We will run two different TPC-H queries on our files.

-**Query 01.** First, we run TPC-H Q01. This query operates solely on the `Lineitem` table, performing an aggregation and grouping with a filter. It filters on one column and projects 7 out of the 16 columns from `Lineitem`.
+**Query 01.** First, we run TPC-H Q01. This query operates solely on the `lineitem` table, performing an aggregation and grouping with a filter. It filters on one column and projects 7 out of the 16 columns from `lineitem`.

Therefore, this query will stress the filter pushdown, which is [supported by the Parquet reader]({% link docs/data/parquet/overview.md %}#partial-reading) but not the CSV reader, and the projection pushdown, which is supported by both.
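For reference, the shape of Q01 is roughly as follows (abbreviated from the TPC-H specification; the full query computes a few more aggregate expressions, and the filter date is parameterized):

```sql
-- Abbreviated TPC-H Q01: a single filter column (l_shipdate) that filter
-- pushdown can exploit, and a grouped aggregation over a handful of the
-- 16 lineitem columns, which projection pushdown narrows the scan to.
SELECT
    l_returnflag,
    l_linestatus,
    sum(l_quantity)      AS sum_qty,
    sum(l_extendedprice) AS sum_base_price,
    avg(l_discount)      AS avg_disc,
    count(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus;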

