spark-instrumented-optimizer

Juliusz Sompolski 8077bb04f3 [SPARK-23445] ColumnStat refactoring

## What changes were proposed in this pull request?

Refactor ColumnStat to be more flexible.

* Split `ColumnStat` and `CatalogColumnStat` just like `CatalogStatistics` is split from `Statistics`. This detaches how the statistics are stored from how they are processed in the query plan. `CatalogColumnStat` keeps `min` and `max` as `String`, making it not depend on dataType information.
* For `CatalogColumnStat`, parse column names from property names in the metastore (`KEY_VERSION` property), not from metastore schema. This means that `CatalogColumnStat`s can be created for columns even if the schema itself is not stored in the metastore.
* Make all fields optional. `min`, `max` and `histogram` for columns were optional already. Having them all optional is more consistent, and gives flexibility to e.g. drop some of the fields through transformations if they are difficult / impossible to calculate.

The added flexibility will make it possible to have alternative implementations for stats, and separates stats collection from stats and estimation processing in plans.

## How was this patch tested?

Refactored existing tests to work with refactored `ColumnStat` and `CatalogColumnStat`.
New tests added in `StatisticsSuite` checking that backwards / forwards compatibility is not broken.

Author: Juliusz Sompolski <julek@databricks.com>

Closes #20624 from juliuszsompolski/SPARK-23445.

2018-02-26 23:37:31 -08:00
java/org/apache/spark/sql	[SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems	2017-04-10 20:11:56 +01:00
resources	[SPARK-14134][CORE] Change the package name used for shading classes.	2016-04-06 19:33:51 -07:00
scala/org/apache/spark/sql	[SPARK-23445] ColumnStat refactoring	2018-02-26 23:37:31 -08:00