spark-instrumented-optimizer

History

angerszhu 6146dc4562 [SPARK-29874][SQL] Optimize Dataset.isEmpty() ### What changes were proposed in this pull request? In origin way to judge if a DataSet is empty by ``` def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan => plan.executeCollect().head.getLong(0) == 0 } ``` will add two shuffles by `limit()`, `groupby() and count()`, then collect all data to driver. In this way we can avoid `oom` when collect data to driver. But it will trigger all partitions calculated and add more shuffle process. We change it to ``` def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan => plan.executeTake(1).isEmpty } ``` After these pr, we will add a column pruning to origin LogicalPlan and use `executeTake()` API. then we won't add more shuffle process and just compute only one partition's data in last stage. In this way we can reduce cost when we call `DataSet.isEmpty()` and won't bring memory issue to driver side. ### Why are the changes needed? Optimize Dataset.isEmpty() ### Does this PR introduce any user-facing change? No ### How was this patch tested? Origin UT Closes #26500 from AngersZhuuuu/SPARK-29874. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>		2019-11-21 18:43:21 +08:00
..
benchmarks	[SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values	2019-11-11 21:53:33 +08:00
src	[SPARK-29874][SQL] Optimize Dataset.isEmpty()	2019-11-21 18:43:21 +08:00
v1.2.1/src	[SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown	2019-10-31 08:25:32 -07:00
v2.3.5/src	[SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown	2019-10-31 08:25:32 -07:00
pom.xml	[SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+	2019-11-15 23:58:15 -08:00