spark-instrumented-optimizer

Liang-Chi Hsieh 127bc899ae [SPARK-27707][SQL] Prune unnecessary nested fields from Generate

## What changes were proposed in this pull request?

A performance issue with `explode` was found: when a complex field contains a huge array, the whole field is duplicated once for every exploded array element. Given this example:

```scala
val df = spark.sparkContext.parallelize(Seq(("1",
  Array.fill(M)({
    val i = math.random
    (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
  })))).toDF("col", "arr")
  .selectExpr("col", "struct(col, arr) as st")
  .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
```

The explode causes `st` to be duplicated as many times as there are exploded elements.

Benchmark results before the change:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off           52668          53162         699          0.0      877803.4       1.0X
[info] generate big nested struct array wholestage on            47261          49093        1125          0.0      787690.2       1.1X
[info]
```

The query plan:

```
== Physical Plan ==
Project [col#508, st#512.col AS col1#515, arr_col#519]
+- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
   +- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
         +- Scan[obj#534]
```

This patch takes the nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of `Generate`, and replaces those fields with the alias attributes in the projection on top of `Generate`.

Benchmark results after the change:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off             311            331          28          0.2        5188.6       1.0X
[info] generate big nested struct array wholestage on              297            312          15          0.2        4947.3       1.0X
[info]
```

The query plan:

```
== Physical Plan ==
Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
+- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
   +- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#588]
         +- Scan[obj#586]
```

This behavior is controlled by the SQL config `spark.sql.optimizer.expression.nestedPruning.enabled`.

## How was this patch tested?

Added benchmark.

Closes #24637 from viirya/SPARK-27707.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

2019-07-18 23:32:07 -07:00
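The rewrite described in the commit message above (alias the needed nested fields below `Generate`, then refer to the aliases above it) can be sketched on a toy plan model. This is only an illustration under stated assumptions: `Attr`, `GetField`, `Alias`, `Scan`, `Project`, `Generate`, `output`, and `pruneGenerate` are hypothetical names defined here, not Spark's Catalyst classes, and the `_gen_alias_N` numbering is simplified to a list index.

```scala
object NestedPruningSketch {
  // Hypothetical toy expressions (not Catalyst).
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class GetField(struct: Attr, field: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr

  // Hypothetical toy plan nodes. `passThrough` lists the child columns
  // that Generate copies onto every generated row.
  sealed trait Plan
  final case class Scan(columns: Seq[String]) extends Plan
  final case class Project(exprs: Seq[Expr], child: Plan) extends Plan
  final case class Generate(generator: Expr, passThrough: Seq[Attr],
                            outputName: String, child: Plan) extends Plan

  // Column names produced by a plan node.
  def output(plan: Plan): Seq[String] = plan match {
    case Scan(cols) => cols
    case Project(exprs, _) => exprs.map {
      case Attr(n)        => n
      case Alias(_, n)    => n
      case GetField(_, f) => f
    }
    case Generate(_, pass, out, _) => pass.map(_.name) :+ out
  }

  def pruneGenerate(plan: Plan): Plan = plan match {
    case Project(exprs, Generate(gen, pass, out, child)) =>
      // Nested fields read by the projection on top of Generate, e.g. st.col.
      val fields: Seq[GetField] =
        exprs.collect { case Alias(g: GetField, _) => g }.distinct
      val aliasFor: Map[GetField, Attr] =
        fields.zipWithIndex.map { case (g, i) => g -> Attr(s"_gen_alias_$i") }.toMap
      // A struct that is also used whole on top must keep passing through.
      val usedWhole: Set[Attr] = exprs.collect { case a: Attr => a }.toSet
      val roots: Set[Attr] = fields.map(_.struct).toSet
      val prunable: Set[Attr] = roots.diff(usedWhole)
      // Below Generate: keep the child's columns and add one alias per field.
      val kept: Seq[Expr] = output(child).map(n => Attr(n))
      val added: Seq[Expr] = fields.map(g => Alias(g, aliasFor(g).name))
      val newChild = Project(kept ++ added, child)
      // Through Generate: pass the small aliases instead of the big struct.
      val newPass: Seq[Attr] = pass.filterNot(prunable) ++ fields.map(aliasFor)
      // On top of Generate: read the aliases instead of the nested fields.
      val newExprs: Seq[Expr] = exprs.map {
        case Alias(g: GetField, n) => Alias(aliasFor(g), n)
        case other                 => other
      }
      Project(newExprs, Generate(gen, newPass, out, newChild))
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // Mirrors the shape of the example query: st.col AS col1, explode(st.arr).
    val before = Project(
      Seq(Attr("col"), Alias(GetField(Attr("st"), "col"), "col1"), Attr("arr_col")),
      Generate(GetField(Attr("st"), "arr"), Seq(Attr("col"), Attr("st")), "arr_col",
        Scan(Seq("col", "st"))))
    println(pruneGenerate(before))
  }
}
```

In this sketch `st` stays in the output of the lower projection because the generator still reads `st.arr`; only the list of columns that `Generate` copies per output row shrinks, which is the part the benchmark numbers above are sensitive to.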
AggregateBenchmark-results.txt	[SPARK-25476][SPARK-25510][TEST] Refactor AggregateBenchmark and add a new trait to better support Dataset and DataFrame API	2018-10-01 07:32:40 -07:00
BloomFilterBenchmark-results.txt	[SPARK-25589][SQL][TEST] Add BloomFilterBenchmark	2018-10-03 04:14:07 -07:00
BuiltInDataSourceWriteBenchmark-results.txt	[SPARK-25663][SPARK-25661][SQL][TEST] Refactor BuiltInDataSourceWriteBenchmark, DataSourceWriteBenchmark and AvroWriteBenchmark to use main method	2018-10-31 03:03:42 -07:00
ColumnarBatchBenchmark-results.txt	[SPARK-25481][SQL][TEST] Refactor ColumnarBatchBenchmark to use main method	2018-09-26 20:40:10 -07:00
CompressionSchemeBenchmark-results.txt	[SPARK-25478][SQL][TEST] Refactor CompressionSchemeBenchmark to use main method	2018-09-23 20:46:40 -07:00
CSVBenchmark-results.txt	[SPARK-27533][SQL][TEST] Date and timestamp CSV benchmarks	2019-04-23 11:08:02 +09:00
DatasetBenchmark-results.txt	[SPARK-25479][TEST] Refactor DatasetBenchmark to use main method	2018-10-04 11:58:16 -07:00
DataSourceReadBenchmark-results.txt	[SPARK-26584][SQL] Remove `spark.sql.orc.copyBatchToSpark` internal conf	2019-01-10 08:42:23 -08:00
DateTimeBenchmark-results.txt	[SPARK-27438][SQL] Parse strings with timestamps by to_timestamp() in microsecond precision	2019-04-22 19:41:32 +08:00
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt	[SPARK-25484][SQL][TEST] Refactor ExternalAppendOnlyUnsafeRowArrayBenchmark	2019-01-09 09:54:21 -08:00
FilterPushdownBenchmark-results.txt	[SPARK-25438][SQL][TEST] Fix FilterPushdownBenchmark to use the same memory assumption	2018-09-15 17:48:39 -07:00
HashedRelationMetricsBenchmark-results.txt	[SPARK-26337][SQL][TEST] Add benchmark for LongToUnsafeRowMap	2018-12-14 10:50:48 +08:00
InExpressionBenchmark-results.txt	[SPARK-26205][SQL] Optimize InSet Expression for bytes, shorts, ints, dates	2019-03-04 15:40:04 -08:00
JoinBenchmark-results.txt	[SPARK-25664][SQL][TEST] Refactor JoinBenchmark to use main method	2018-10-12 16:08:12 -07:00
JSONBenchmark-results.txt	[SPARK-27535][SQL][TEST] Date and timestamp JSON benchmarks	2019-04-23 11:09:14 +09:00
MiscBenchmark-results.txt	[SPARK-27707][SQL] Prune unnecessary nested fields from Generate	2019-07-18 23:32:07 -07:00
OrcNestedSchemaPruningBenchmark-results.txt	[SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases including GetArrayStructField	2019-06-11 20:12:53 -07:00
OrcV2NestedSchemaPruningBenchmark-results.txt	[SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases including GetArrayStructField	2019-06-11 20:12:53 -07:00
ParquetNestedSchemaPruningBenchmark-results.txt	[SPARK-27701][SQL] Extend NestedColumnAliasing to general nested field cases including GetArrayStructField	2019-06-11 20:12:53 -07:00
PrimitiveArrayBenchmark-results.txt	[SPARK-25487][SQL][TEST] Refactor PrimitiveArrayBenchmark	2018-09-21 15:04:47 +09:00
RangeBenchmark-results.txt	[SPARK-25710][SQL] range should report metrics correctly	2018-10-13 13:55:28 +08:00
SortBenchmark-results.txt	[SPARK-25486][TEST] Refactor SortBenchmark to use main method	2018-09-25 11:13:05 -07:00
UDFBenchmark-results.txt	[SPARK-27684][SQL] Avoid conversion overhead for primitive types	2019-05-30 17:09:19 -07:00
UnsafeArrayDataBenchmark-results.txt	[SPARK-25483][TEST] Refactor UnsafeArrayDataBenchmark to use main method	2018-10-03 04:20:02 -07:00
WideSchemaBenchmark-results.txt	[SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method	2018-10-20 17:31:13 -07:00
WideTableBenchmark-results.txt	[SPARK-25676][SQL][FOLLOWUP] Use 'foreach(_ => ())'	2018-11-08 23:37:14 +08:00