spark-instrumented-optimizer/sql/core
angerszhu 6146dc4562 [SPARK-29874][SQL] Optimize Dataset.isEmpty()
### What changes were proposed in this pull request?
The original way to check whether a Dataset is empty was:
```
  def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
    plan.executeCollect().head.getLong(0) == 0
  }
```
This adds two shuffles, one for `limit()` and one for `groupBy().count()`, and then collects the resulting count to the driver.
It avoids OOM on the driver when collecting, but it forces every partition to be computed and adds extra shuffle stages.

We change it to
```
  def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
    plan.executeTake(1).isEmpty
  }
```
After this PR, we add column pruning to the original LogicalPlan (via `select()`) and use the `executeTake()` API.
This adds no extra shuffle and computes only one partition's data in the last stage.
This reduces the cost of calling `Dataset.isEmpty()` and does not cause memory issues on the driver side.
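
To make the difference concrete, here is a minimal sketch in plain Scala. It is a toy model, not Spark code: `Partitioned`, `isEmptyViaCount`, `isEmptyViaTake` and the `computed` counter are hypothetical names introduced only for illustration. It assumes a dataset can be modeled as a sequence of lazily computed partitions, and it simplifies `executeTake()` to a sequential scan that stops at the first row (the real implementation tries a small number of partitions first and scales up if needed).

```scala
// Toy model only: partitions are thunks that produce an iterator when computed.
object IsEmptySketch {
  final class Partitioned[A](parts: Seq[() => Iterator[A]]) {
    // Hypothetical instrumentation: how many partitions were actually computed.
    var computed: Int = 0

    private def compute(i: Int): Iterator[A] = {
      computed += 1
      parts(i)()
    }

    // Old shape: apply a local limit(1) on every partition, gather the
    // survivors, apply a global limit(1), then count what is left.
    def isEmptyViaCount: Boolean = {
      val localLimits = parts.indices.map(i => compute(i).take(1).toList)
      val globalLimit = localLimits.flatten.take(1)
      globalLimit.size.toLong == 0L
    }

    // New shape: lazily walk the partitions and stop as soon as one row shows up.
    def isEmptyViaTake: Boolean = {
      val firstRow = parts.indices.iterator.flatMap(i => compute(i)).take(1).toList
      firstRow.isEmpty
    }
  }

  // Returns (old result, partitions computed, new result, partitions computed).
  def demo(): (Boolean, Int, Boolean, Int) = {
    val parts: Seq[() => Iterator[Int]] =
      Seq(() => Iterator(1, 2), () => Iterator(3), () => Iterator.empty, () => Iterator(4))
    val oldWay = new Partitioned(parts)
    val r1 = oldWay.isEmptyViaCount
    val newWay = new Partitioned(parts)
    val r2 = newWay.isEmptyViaTake
    (r1, oldWay.computed, r2, newWay.computed)
  }
}
```

In this model both approaches agree on the answer, but the count-based version touches all four partitions while the take-based version touches only the first one. When every partition is empty, both have to look at all of them.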

### Why are the changes needed?
Optimize Dataset.isEmpty()

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #26500 from AngersZhuuuu/SPARK-29874.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-11-21 18:43:21 +08:00
..
benchmarks [SPARK-29822][SQL] Fix cast error when there are white spaces between signs and values 2019-11-11 21:53:33 +08:00
src [SPARK-29874][SQL] Optimize Dataset.isEmpty() 2019-11-21 18:43:21 +08:00
v1.2.1/src [SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown 2019-10-31 08:25:32 -07:00
v2.3.5/src [SPARK-29277][SQL][test-hadoop3.2] Add early DSv2 filter and projection pushdown 2019-10-31 08:25:32 -07:00
pom.xml [SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+ 2019-11-15 23:58:15 -08:00