spark-instrumented-optimizer

WeichenXu 5631a96367 [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection

### What changes were proposed in this pull request?

When `Column.isInCollection()` is called with a large collection, it generates an expression with a large number of child expressions. This makes the analyzer and optimizer take a long time to run.

In this PR, the `isInCollection()` function generates an `InSet` expression directly, which avoids generating too many child expressions.

### Why are the changes needed?

`Column.isInCollection()` with a large collection sometimes becomes a bottleneck when running SQL.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Benchmarked manually in spark-shell:

```
def testExplainTime(collectionSize: Int) = {
  val df = spark.range(10).withColumn("id2", col("id") + 1)
  val list = Range(0, collectionSize).toList
  val startTime = System.currentTimeMillis()
  df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
  val elapsedTime = System.currentTimeMillis() - startTime
  println(s"cost time: ${elapsedTime}ms")
}
```

The test was then run with collection sizes 5, 10, 100, 1000, and 10000. Results:

| collection size | explain time (before) | explain time (after) |
| ------ | ------ | ------ |
| 5 | 26ms | 29ms |
| 10 | 30ms | 48ms |
| 100 | 104ms | 50ms |
| 1000 | 1202ms | 58ms |
| 10000 | 10012ms | 523ms |

Closes #25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>

2019-09-12 17:23:08 -07:00
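
To illustrate why the commit above helps, here is a minimal, self-contained Scala sketch of the idea. It is a toy model, not Spark's Catalyst code: the `Expr`, `Attr`, `Lit`, `In`, and `InSet` types and the `nodeCount` helper are hypothetical stand-ins written for this example. The assumption is that analyzer and optimizer rules walk every node of the expression tree, so a predicate that stores each candidate value as a child expression costs more to traverse than one that stores the values as plain data in a set.

```scala
// Toy model of an expression tree; these are NOT Spark's Catalyst classes.
sealed trait Expr {
  def children: Seq[Expr]
  // Number of nodes a tree-walking rule would have to visit.
  def nodeCount: Int = 1 + children.map(_.nodeCount).sum
}

case class Attr(name: String) extends Expr { val children: Seq[Expr] = Nil }
case class Lit(value: Any) extends Expr { val children: Seq[Expr] = Nil }

// Membership test with one child expression per candidate value.
case class In(value: Expr, list: Seq[Expr]) extends Expr {
  val children: Seq[Expr] = value +: list
}

// Membership test with candidate values held as plain data, not as children.
case class InSet(child: Expr, hset: Set[Any]) extends Expr {
  val children: Seq[Expr] = Seq(child)
}

object InVsInSet {
  // Hypothetical helpers that build the two forms of the same predicate.
  def asIn(col: String, values: Seq[Any]): Expr = In(Attr(col), values.map(Lit(_)))
  def asInSet(col: String, values: Seq[Any]): Expr = InSet(Attr(col), values.toSet)

  def main(args: Array[String]): Unit = {
    val values = Range(0, 10000).toList
    println(asIn("id", values).nodeCount)    // 10002: root + column + 10000 literals
    println(asInSet("id", values).nodeCount) // 2: root + column
  }
}
```

In this model the tree size of the `In` form grows linearly with the collection, while the `InSet` form stays constant. That matches the direction of the benchmark figures in the commit message, although the real timings depend on much more than node count.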
..
benchmarks	[SPARK-29065][SQL][TEST] Extend `EXTRACT` benchmark	2019-09-12 21:32:35 +09:00
src	[SPARK-29048] Improve performance on Column.isInCollection() with a large size collection	2019-09-12 17:23:08 -07:00
v1.2.1/src	[SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession	2019-08-19 19:01:56 +08:00
v2.3.5/src	[SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession	2019-08-19 19:01:56 +08:00
pom.xml	[SPARK-27521][SQL] Move data source v2 to catalyst module	2019-06-05 09:55:55 -07:00