spark-instrumented-optimizer

History

src	[SPARK-21984][SQL] Join estimation based on equi-height histogram	2017-12-19 21:55:21 +08:00
pom.xml	[SPARK-22607][BUILD] Set large stack size consistently for tests to avoid StackOverflowError	2017-11-26 07:42:44 -06:00

Zhenhua Wang 571aa27554 [SPARK-21984][SQL] Join estimation based on equi-height histogram

## What changes were proposed in this pull request?

Equi-height histograms are among the state-of-the-art statistics for cardinality estimation. They provide better estimation accuracy and handle skewed data well.

This PR improves join estimation by using equi-height histograms. It differs from the basic estimation (based on ndv) in the logic for computing the join cardinality and the new ndv after the join.

The main idea is as follows:
1. find the overlapping ranges between the two histograms of the two join keys;
2. apply the formula `T(A IJ B) = T(A) * T(B) / max(V(A.k1), V(B.k1))` in each overlapping range.
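
The two steps above can be illustrated with a minimal Python sketch. It is not the code from this patch (Spark's implementation is in Scala), and it rests on assumptions that the commit message does not state: `T` is read as a row count, `V` as a number of distinct values, `IJ` as an inner join, and rows and distinct values are assumed to be spread uniformly inside each bin so that a partial overlap can be scaled proportionally. The names `Bin`, `_portion`, and `estimate_inner_join` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bin:
    """One bin of an equi-height histogram (hypothetical structure)."""
    lo: float   # lower bound of the bin
    hi: float   # upper bound of the bin
    ndv: float  # number of distinct values in the bin


def _portion(b: Bin, lo: float, hi: float) -> float:
    """Fraction of bin b covered by [lo, hi], assuming uniform distribution."""
    if b.hi == b.lo:
        # Single-value bin: any overlap covers the whole bin.
        return 1.0
    return (hi - lo) / (b.hi - b.lo)


def estimate_inner_join(bins_a: List[Bin], height_a: float,
                        bins_b: List[Bin], height_b: float) -> float:
    """Estimate inner-join cardinality from two equi-height histograms.

    height_a / height_b are the rows per bin (equal for all bins of one
    histogram, which is what makes it equi-height).
    """
    total = 0.0
    for a in bins_a:
        for b in bins_b:
            # Step 1: find the overlapping range of the two bins.
            lo, hi = max(a.lo, b.lo), min(a.hi, b.hi)
            if lo > hi:
                continue  # no overlap
            # Scale rows and ndv of each bin down to the overlapping part.
            pa, pb = _portion(a, lo, hi), _portion(b, lo, hi)
            rows_a, rows_b = height_a * pa, height_b * pb
            ndv_a, ndv_b = a.ndv * pa, b.ndv * pb
            # Step 2: T(A) * T(B) / max(V(A.k1), V(B.k1)) within the range.
            max_ndv = max(ndv_a, ndv_b)
            if max_ndv > 0:
                total += rows_a * rows_b / max_ndv
    return total
```

For example, `estimate_inner_join([Bin(0, 10, 10)], 100, [Bin(0, 10, 5)], 50)` applies the formula once over the full range `[0, 10]`. Summing per-range estimates, rather than applying the formula once to whole-column statistics, is what lets skewed key ranges contribute their own cardinality.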

## How was this patch tested?
Added new test cases.

Author: Zhenhua Wang <wangzhenhua@huawei.com>

Closes #19594 from wzhfy/join_estimation_histogram.

2017-12-19 21:55:21 +08:00
