Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add tdigest data structure for statistics #71

Merged
merged 14 commits into from
Feb 24, 2024
Merged

feat: add tdigest data structure for statistics #71

merged 14 commits into from
Feb 24, 2024

Conversation

AlSchlo
Copy link
Contributor

@AlSchlo AlSchlo commented Feb 17, 2024

This pull request implements a simplified version of Ted Dunning's TDigest algorithm for efficient quantile/cdf computations.

It supports a fully parallelizable, memory-bounded computation scheme, along with an easy API.

Simplified for two reasons:

  • The linear interpolation is only done for the quantile, not the cdf.
  • Linear interpolation in general could be done more efficiently if we leverage unit-weighted centroids at the edges, as described in the paper.

Both are marked as TODOs in the code for future work.

Most open-source implementations found online had bugs, were incomplete, or were too complex (i.e., poorly written).

Includes unit tests on uniform and weighted distributions.

Next steps: Implementing the HyperLogLog algorithm for NDistinct.

@AlSchlo AlSchlo changed the title Tdigest feat: add tdigest data structure for statistics Feb 17, 2024
@AlSchlo AlSchlo marked this pull request as ready for review February 17, 2024 04:56
@AlSchlo AlSchlo requested review from yliang412 and skyzh February 17, 2024 05:04
Copy link
Member

@yliang412 yliang412 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't read the paper in 100% detail, but overall LGTM!

Could we keep the license consistent with the one for the entire repository (LICENSE)? I also suggest that you could add a short description for the gungnir crate under the structure section in the repo README.


// The TDigest structure for the statistical aggregator to query quantiles.
pub struct TDigest {
centroids: Vec<Centroid>, // A sorted array of Centroids, according to their mean.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use doc comments i.e. /// above the structs and fields so the comments show up in the documentation. For example,

/// The TDigest structure for the statistical aggregator to query quantiles.
pub struct TDigest {
    /// A sorted array of Centroids, according to their mean.
    centroids: Vec<Centroid>, 
...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where did I change the License? I don't think I added anything about the license.
Do you mean the cool header at the top of each Gungnir™ file? 😂

Copy link
Member

@yliang412 yliang412 Feb 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yea. I was talking about those. Those are cool though lol.

gungnir/src/stats/tdigest.rs Outdated Show resolved Hide resolved
@AlSchlo AlSchlo merged commit f390059 into main Feb 24, 2024
1 check failed
@AlSchlo AlSchlo deleted the tdigest branch February 24, 2024 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants