feat: add tdigest data structure for statistics #71
gungnir/src/stats/tdigest.rs
Outdated
// The TDigest structure for the statistical aggregator to query quantiles.
pub struct TDigest {
    centroids: Vec<Centroid>, // A sorted array of Centroids, according to their mean.
You could use doc comments (i.e. ///) above the structs and fields so the comments show up in the documentation. For example,
/// The TDigest structure for the statistical aggregator to query quantiles.
pub struct TDigest {
    /// A sorted array of Centroids, according to their mean.
    centroids: Vec<Centroid>,
    ...
}
Where did I change the License? I don't think I added anything about the license.
Do you mean the cool header at the top of each Gungnir™ file? 😂
Oh yea. I was talking about those. Those are cool though lol.
This pull request implements a simplified version of Ted Dunning's TDigest algorithm for efficient quantile and CDF computations.
It supports a fully parallelizable, memory-bounded computation scheme, along with an easy API.
Simplified for two reasons:
Both are marked as TODOs in the code for future work.
Most open-source implementations found online had bugs, were incomplete, or were too complex (i.e., poorly written).
Includes unit tests on uniform and weighted distributions.
Next steps: Implementing the HyperLogLog algorithm for NDistinct.
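To illustrate what "fully parallelizable, memory-bounded" can mean for a digest like this, here is a minimal sketch, not the code from this pull request. The names SketchDigest, compression, from_values, merge, and quantile are hypothetical, and the compression step uses a plain uniform weight cap per centroid rather than the scale function from Dunning's algorithm, so tail accuracy is worse than a real t-digest. The idea it shows: each worker builds a digest over its own partition, and the digests are then merged and recompressed.

```rust
/// A cluster of points summarized by its mean and total weight.
#[derive(Clone, Debug)]
pub struct Centroid {
    pub mean: f64,
    pub weight: f64,
}

/// Hypothetical, simplified digest (an assumption for illustration only).
#[derive(Clone, Debug)]
pub struct SketchDigest {
    centroids: Vec<Centroid>, // kept sorted by mean
    compression: usize,       // controls the centroid bound
}

impl SketchDigest {
    /// Builds a digest from raw values, each with weight 1.
    pub fn from_values(values: &[f64], compression: usize) -> Self {
        assert!(compression > 0, "compression must be positive");
        let centroids = values
            .iter()
            .map(|&v| Centroid { mean: v, weight: 1.0 })
            .collect();
        let mut digest = Self { centroids, compression };
        digest.compress();
        digest
    }

    /// Merges two digests built independently (e.g. on separate threads).
    pub fn merge(&self, other: &Self) -> Self {
        let mut centroids = self.centroids.clone();
        centroids.extend(other.centroids.iter().cloned());
        let compression = self.compression.max(other.compression);
        let mut digest = Self { centroids, compression };
        digest.compress();
        digest
    }

    pub fn total_weight(&self) -> f64 {
        self.centroids.iter().map(|c| c.weight).sum()
    }

    pub fn centroid_count(&self) -> usize {
        self.centroids.len()
    }

    /// Greedily folds neighbors together while the combined weight stays
    /// within total / compression. Any two adjacent results together exceed
    /// that cap, so fewer than 2 * compression centroids remain.
    fn compress(&mut self) {
        self.centroids.sort_by(|a, b| a.mean.total_cmp(&b.mean));
        let cap = self.total_weight() / self.compression as f64;
        let mut merged: Vec<Centroid> = Vec::new();
        for c in self.centroids.drain(..) {
            let fits = merged
                .last()
                .map_or(false, |last| last.weight + c.weight <= cap);
            if fits {
                let last = merged.last_mut().expect("non-empty when fits is true");
                let w = last.weight + c.weight;
                last.mean += (c.mean - last.mean) * c.weight / w; // weighted mean
                last.weight = w;
            } else {
                merged.push(c);
            }
        }
        self.centroids = merged;
    }

    /// Estimates the q-quantile by interpolating between centroid means,
    /// placing each centroid at the midpoint of its weight range.
    /// Min and max are not tracked, so the extremes are approximate.
    pub fn quantile(&self, q: f64) -> Option<f64> {
        if self.centroids.is_empty() || !(0.0..=1.0).contains(&q) {
            return None;
        }
        let target = q * self.total_weight();
        let mut cumulative = 0.0;
        let mut prev: Option<(f64, f64)> = None; // (center rank, mean)
        for c in &self.centroids {
            let center = cumulative + c.weight / 2.0;
            if target <= center {
                return Some(match prev {
                    None => c.mean,
                    Some((pc, pm)) => pm + (c.mean - pm) * (target - pc) / (center - pc),
                });
            }
            prev = Some((center, c.mean));
            cumulative += c.weight;
        }
        self.centroids.last().map(|c| c.mean)
    }
}
```

Under these assumptions, two workers could each call SketchDigest::from_values on their half of the data and a coordinator could call merge on the results; the merged digest never holds more than about twice the compression parameter in centroids, regardless of input size.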