Fix grammar and add section
AlSchlo committed Feb 24, 2024
1 parent 8403b5d commit 19ef143
Showing 1 changed file with 3 additions and 1 deletion.

docs/src/datafusion.md
@@ -107,7 +107,9 @@ As per [Leis 2015](https://15721.courses.cs.cmu.edu/spring2024/papers/16-costmod

Our base cardinality estimation scheme is inspired by [Postgres](https://www.postgresql.org/docs/current/planner-stats-details.html). We use roughly the same four per-column statistics as Postgres: the column's most common values, its number of distinct values, its fraction of nulls, and a distribution of its values. Our base predicate (filter or join) selectivity formulas are also the same as Postgres's. This contrasts with [Microsoft SQLServer](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd535534(v=sql.100)?redirectedfrom=MSDN), for instance, which uses very different per-column statistics and predicate selectivity formulas. Our statistics are not exactly the same as Postgres's, though. For one, while Postgres uses a simple equi-height histogram, we use the more advanced T-Digest data structure to model the distribution of values. Additionally, Postgres samples its tables to build its statistics, whereas we do a full sequential scan of all tables. This full scan remains efficient because our sketches have low time complexity, and we implemented our sketching algorithms to be easily parallelizable.

-We obtain our statistics with highly parallel and minimal memory footprint using probabilistic algorithms, which trade off accuracy for scalability. Specifically:
+## Statistics
+
+We obtain our statistics with high parallelism and a minimal memory footprint, using probabilistic algorithms that trade off accuracy for scalability. Specifically:

1. The distribution of each column (i.e., CDF) is computed using the TDigest algorithm designed by [Ted Dunning et al.](https://arxiv.org/pdf/1902.04023.pdf), rather than traditional equi-width histograms. TDigests can be seen as dynamically resizable histograms that offer particularly high precision at the tails of the distribution.
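To make the "parallel, low-memory" claim concrete: T-Digests built independently over different partitions of a column can be merged into one digest, so the full sequential scan parallelizes naturally. Below is a minimal sketch assuming the community `tdigest` Rust crate and its merge-style API (`new_with_size`, `merge_unsorted`, `merge_digests`, `estimate_quantile`); this is not optd's own T-Digest implementation, and the crate API is an assumption here.

```rust
// Cargo.toml (assumed): tdigest = "0.2" -- a community crate, not optd's implementation.
use tdigest::TDigest;

fn main() {
    // Pretend each inner Vec is one partition of a column produced by the scan.
    let partitions: Vec<Vec<f64>> = vec![
        (0..1_000).map(f64::from).collect(),
        (1_000..2_000).map(f64::from).collect(),
        (2_000..3_000).map(f64::from).collect(),
    ];

    // Sketch each partition independently; this is the step that
    // parallelizes trivially (e.g., one task per partition).
    let partials: Vec<TDigest> = partitions
        .into_iter()
        .map(|p| TDigest::new_with_size(100).merge_unsorted(p))
        .collect();

    // Merge the partial digests into one digest for the whole column
    // (merge_digests is assumed from the crate's API).
    let column = TDigest::merge_digests(partials);

    // The merged sketch answers quantile (i.e., CDF) queries for the column.
    println!("p50 ≈ {}", column.estimate_quantile(0.50));
    println!("p99 ≈ {}", column.estimate_quantile(0.99));
}
```

Note that the memory footprint is bounded by the digest's compression parameter (100 centroids here), not by the number of rows scanned, which is what makes the full-table scan affordable.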

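Returning to the Postgres-style statistics described above, here is a hypothetical, self-contained sketch (illustrative names and types, not optd's actual code) of how the per-column statistics feed the classic equality-selectivity formula: an MCV hit returns its observed frequency, and any other literal gets an even share of the leftover probability mass.

```rust
use std::collections::HashMap;

/// Hypothetical per-column statistics mirroring the kinds described above;
/// the value-distribution sketch (a T-Digest in optd) is omitted because
/// this example only exercises equality predicates.
struct ColumnStats {
    /// Most common values mapped to their frequency (fraction of all rows).
    mcvs: HashMap<i64, f64>,
    /// Estimated number of distinct values in the column.
    n_distinct: u64,
    /// Fraction of rows in which the column is NULL.
    null_frac: f64,
}

impl ColumnStats {
    /// Postgres-style estimate for the selectivity of `col = value`.
    fn eq_selectivity(&self, value: i64) -> f64 {
        // An MCV hit uses its observed frequency directly.
        if let Some(&freq) = self.mcvs.get(&value) {
            return freq;
        }
        // Otherwise, spread the remaining (non-MCV, non-NULL) mass
        // evenly over the remaining distinct values.
        let mcv_mass: f64 = self.mcvs.values().sum();
        let rest = self.n_distinct.saturating_sub(self.mcvs.len() as u64);
        if rest == 0 {
            return 0.0;
        }
        ((1.0 - self.null_frac - mcv_mass) / rest as f64).max(0.0)
    }
}

fn main() {
    let stats = ColumnStats {
        mcvs: HashMap::from([(42, 0.30), (7, 0.10)]),
        n_distinct: 100,
        null_frac: 0.05,
    };
    // An MCV hit returns its stored frequency.
    assert_eq!(stats.eq_selectivity(42), 0.30);
    // A non-MCV literal gets an even share of the leftover mass:
    // (1 - 0.05 - 0.30 - 0.10) / (100 - 2) ≈ 0.0056.
    println!("sel(col = 13) ≈ {:.4}", stats.eq_selectivity(13));
}
```

In the same spirit, the textbook equi-join estimate derived from these statistics is 1 / max(n_distinct_left, n_distinct_right), refined in Postgres by matching the two columns' MCV lists.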
