11b60af737
## What changes were proposed in this pull request? Equi-height histogram is effective in cardinality estimation, and more accurate than basic column stats (min, max, ndv, etc) especially in skew distribution. So we need to support it. For equi-height histogram, all buckets (intervals) have the same height (frequency). In this PR, we use a two-step method to generate an equi-height histogram: 1. use `ApproximatePercentile` to get percentiles `p(0), p(1/n), p(2/n) ... p((n-1)/n), p(1)`; 2. construct range values of buckets, e.g. `[p(0), p(1/n)], [p(1/n), p(2/n)] ... [p((n-1)/n), p(1)]`, and use `ApproxCountDistinctForIntervals` to count ndv in each bucket. Each bucket is of the form: `(lowerBound, higherBound, ndv)`. ## How was this patch tested? Added new test cases and modified some existing test cases. Author: Zhenhua Wang <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #19479 from wzhfy/generate_histogram. |
||
---|---|---|
.. | ||
benchmarks | ||
src | ||
pom.xml |