spark-instrumented-optimizer

History

Sameer Agarwal 31345fde82 [SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit ## What changes were proposed in this pull request? In `randomSplit`, It is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized which could result in overlapping splits. To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism. ## How was this patch tested? Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes nested mapTypes. Author: Sameer Agarwal <sameerag@cs.berkeley.edu> Closes #17751 from sameeragarwal/randomsplit2.		2017-04-25 13:05:20 +08:00
..
benchmarks	[SPARK-17335][SQL] Fix ArrayType and MapType CatalogString.	2016-09-03 19:02:20 +02:00
src	[SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit	2017-04-25 13:05:20 +08:00
pom.xml	[SPARK-20453] Bump master branch version to 2.3.0-SNAPSHOT	2017-04-24 21:48:04 -07:00