Fix grammar and add section
AlSchlo committed Feb 24, 2024
1 parent 8403b5d commit 19ef143
Showing 1 changed file with 3 additions and 1 deletion.

docs/src/datafusion.md
@@ -107,7 +107,9 @@ As per [Leis 2015](https://15721.courses.cs.cmu.edu/spring2024/papers/16-costmod

Our base cardinality estimation scheme is inspired by [Postgres](https://www.postgresql.org/docs/current/planner-stats-details.html). We use roughly the same four per-column statistics as Postgres: the column's most common values, its number of distinct values, its fraction of nulls, and a distribution of its values. Our base predicate (filter or join) selectivity formulas are also the same as Postgres's. This contrasts with [Microsoft SQLServer](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd535534(v=sql.100)?redirectedfrom=MSDN), for instance, which uses very different per-column statistics and predicate selectivity formulas. Our statistics are not exactly the same as Postgres's, though. For one, while Postgres uses a simple equi-height histogram, we use the more advanced T-Digest data structure to model the distribution of values. Additionally, Postgres samples its tables to build its statistics, whereas we do a full sequential scan of all tables. This full scan remains efficient because our sketches have low time complexity, and we implemented our sketching algorithms to be easily parallelizable.

-We obtain our statistics with highly parallel and minimal memory footprint using probabilistic algorithms, which trade off accuracy for scalability. Specifically:
+## Statistics
+
+We obtain our statistics with high parallelism and a minimal memory footprint, using probabilistic algorithms that trade off accuracy for scalability. Specifically:

1. The distribution of each column (i.e., CDF) is computed using the TDigest algorithm designed by [Ted Dunning et al.](https://arxiv.org/pdf/1902.04023.pdf), rather than traditional equi-width histograms. TDigests can be seen as dynamically resizable histograms that offer particularly high precision at the tails of the distribution.
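To make the "parallel, low-memory" claim concrete: T-Digests built independently over different partitions of a column can be merged into one digest, so the full sequential scan parallelizes naturally. Below is a minimal sketch assuming the community `tdigest` Rust crate and its merge-style API (`new_with_size`, `merge_unsorted`, `merge_digests`, `estimate_quantile`); this is not optd's own T-Digest implementation, and the crate API is an assumption here.

```rust
// Cargo.toml (assumed): tdigest = "0.2" -- a community crate, not optd's implementation.
use tdigest::TDigest;

fn main() {
    // Pretend each inner Vec is one partition of a column produced by the scan.
    let partitions: Vec<Vec<f64>> = vec![
        (0..1_000).map(f64::from).collect(),
        (1_000..2_000).map(f64::from).collect(),
        (2_000..3_000).map(f64::from).collect(),
    ];

    // Sketch each partition independently; this is the step that
    // parallelizes trivially (e.g., one task per partition).
    let partials: Vec<TDigest> = partitions
        .into_iter()
        .map(|p| TDigest::new_with_size(100).merge_unsorted(p))
        .collect();

    // Merge the partial digests into one digest for the whole column
    // (merge_digests is assumed from the crate's API).
    let column = TDigest::merge_digests(partials);

    // The merged sketch answers quantile (i.e., CDF) queries for the column.
    println!("p50 ≈ {}", column.estimate_quantile(0.50));
    println!("p99 ≈ {}", column.estimate_quantile(0.99));
}
```

Note that the memory footprint is bounded by the digest's compression parameter (100 centroids here), not by the number of rows scanned, which is what makes the full-table scan affordable.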

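Returning to the Postgres-style statistics described above, here is a hypothetical, self-contained sketch (illustrative names and types, not optd's actual code) of how the per-column statistics feed the classic equality-selectivity formula: an MCV hit returns its observed frequency, and any other literal gets an even share of the leftover probability mass.

```rust
use std::collections::HashMap;

/// Hypothetical per-column statistics mirroring the kinds described above;
/// the value-distribution sketch (a T-Digest in optd) is omitted because
/// this example only exercises equality predicates.
struct ColumnStats {
    /// Most common values mapped to their frequency (fraction of all rows).
    mcvs: HashMap<i64, f64>,
    /// Estimated number of distinct values in the column.
    n_distinct: u64,
    /// Fraction of rows in which the column is NULL.
    null_frac: f64,
}

impl ColumnStats {
    /// Postgres-style estimate for the selectivity of `col = value`.
    fn eq_selectivity(&self, value: i64) -> f64 {
        // An MCV hit uses its observed frequency directly.
        if let Some(&freq) = self.mcvs.get(&value) {
            return freq;
        }
        // Otherwise, spread the remaining (non-MCV, non-NULL) mass
        // evenly over the remaining distinct values.
        let mcv_mass: f64 = self.mcvs.values().sum();
        let rest = self.n_distinct.saturating_sub(self.mcvs.len() as u64);
        if rest == 0 {
            return 0.0;
        }
        ((1.0 - self.null_frac - mcv_mass) / rest as f64).max(0.0)
    }
}

fn main() {
    let stats = ColumnStats {
        mcvs: HashMap::from([(42, 0.30), (7, 0.10)]),
        n_distinct: 100,
        null_frac: 0.05,
    };
    // An MCV hit returns its stored frequency.
    assert_eq!(stats.eq_selectivity(42), 0.30);
    // A non-MCV literal gets an even share of the leftover mass:
    // (1 - 0.05 - 0.30 - 0.10) / (100 - 2) ≈ 0.0056.
    println!("sel(col = 13) ≈ {:.4}", stats.eq_selectivity(13));
}
```

In the same spirit, the textbook equi-join estimate derived from these statistics is 1 / max(n_distinct_left, n_distinct_right), refined in Postgres by matching the two columns' MCV lists.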
