spark-instrumented-optimizer

History

Reynold Xin 5d79947369 [SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics

## What changes were proposed in this pull request?
This patch reduces the default estimated number of elements for arrays and maps from 100 to 1. The problem with 100 is that for nested types (e.g. an array of maps), 100 * 100 would be used as the default size. That sounds like a mere overestimate, which does not seem so bad, since it is usually better to overestimate than to underestimate. However, because of how we estimate the output size of Project (new estimated column size / old estimated column size), this overestimate can turn into an underestimate. In this case it is generally safer to assume 1 default element.

## How was this patch tested?
This should be covered by existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #16274 from rxin/SPARK-18853.

2016-12-14 21:22:49 +01:00
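
The arithmetic behind the commit message can be sketched as follows. This is only an illustrative model, not Spark's actual code: the function names, the 8-byte key and value sizes, the 4-byte integer column, and the 1,000,000-byte child size are all assumptions chosen for the example. It models a child plan with one `array<map<long, long>>` column and one integer column, and a Project that keeps only the integer column, scaling the child's size by (new estimated column size / old estimated column size).

```python
# Illustrative model only; all names and byte sizes here are assumptions.

def nested_default_size(default_elements, key_size=8, value_size=8):
    """Default size of an array of maps when every collection is assumed
    to hold `default_elements` entries."""
    map_size = default_elements * (key_size + value_size)
    return default_elements * map_size


def projected_size(child_bytes, old_column_size, new_column_size):
    """Scale the child's estimated size by new / old estimated column size."""
    return child_bytes * new_column_size // old_column_size


INT_SIZE = 4             # assumed default size of the integer column
CHILD_BYTES = 1_000_000  # assumed size estimate of the child plan


def estimate(default_elements):
    """Estimated output size of a Project that keeps only the integer column."""
    old_size = nested_default_size(default_elements) + INT_SIZE
    new_size = INT_SIZE
    return projected_size(CHILD_BYTES, old_size, new_size)


if __name__ == "__main__":
    print("default of 100 elements:", estimate(100))
    print("default of 1 element:   ", estimate(1))
```

Under these assumptions, inflating the nested column's default size makes the denominator of the ratio very large, so the estimate for the projected output shrinks to a few bytes. That is how an overestimate of one column becomes an underestimate of the Project output.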
src	[SPARK-18853][SQL] Project (UnaryNode) is way too aggressive in estimating statistics	2016-12-14 21:22:49 +01:00
pom.xml	[SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOT	2016-12-02 21:09:37 -08:00
