[SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec

### What changes were proposed in this pull request?

`CoalesceExec` needlessly calls `child.execute` twice when it could call it once and reuse the result. The double call only happens when `numPartitions == 1`, because the first call sits behind the short-circuiting `numPartitions == 1 &&` check.

### Why are the changes needed?

It is more efficient to execute the child plan once rather than twice.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

There are no functional changes. This is just a performance optimization, so the existing tests should be sufficient to catch any regressions.

Closes #32920 from andygrove/coalesce-exec-executes-twice.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Andy Grove 2021-06-15 11:59:21 -07:00 committed by Dongjoon Hyun
parent 8a02f3a413
commit 1012967ace

@@ -724,12 +724,13 @@ case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecN
   }
   protected override def doExecute(): RDD[InternalRow] = {
-    if (numPartitions == 1 && child.execute().getNumPartitions < 1) {
+    val rdd = child.execute()
+    if (numPartitions == 1 && rdd.getNumPartitions < 1) {
       // Make sure we don't output an RDD with 0 partitions, when claiming that we have a
       // `SinglePartition`.
       new CoalesceExec.EmptyRDDWithPartitions(sparkContext, numPartitions)
     } else {
-      child.execute().coalesce(numPartitions, shuffle = false)
+      rdd.coalesce(numPartitions, shuffle = false)
     }
   }
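
The effect of the change can be illustrated without Spark. The following is a minimal sketch, not the Spark implementation: `CountingChild`, `CoalesceSketch`, `before`, and `after` are hypothetical names, a `Vector` stands in for the child's RDD, and `math.min` is an assumed simplification of what `coalesce(numPartitions, shuffle = false)` does to the partition count. It only mirrors the control flow of the old and new `doExecute` to count how often the child is executed.

```scala
// Hypothetical stand-in for a child plan: counts how often execute() runs.
final class CountingChild(partitions: Int) {
  var executeCalls: Int = 0

  // Returns one element per simulated partition (a Vector, not a real RDD).
  def execute(): Vector[Int] = {
    executeCalls += 1
    Vector.fill(partitions)(0)
  }
}

object CoalesceSketch {
  // Control flow of the old code: execute() appears in the condition and
  // again in the else branch. The `&&` short-circuits, so the first call
  // only runs when numPartitions == 1.
  def before(numPartitions: Int, child: CountingChild): Int =
    if (numPartitions == 1 && child.execute().size < 1) {
      numPartitions // stands in for EmptyRDDWithPartitions
    } else {
      math.min(child.execute().size, numPartitions) // assumed coalesce model
    }

  // Control flow of the new code: execute() runs once and the result is reused.
  def after(numPartitions: Int, child: CountingChild): Int = {
    val rdd = child.execute()
    if (numPartitions == 1 && rdd.size < 1) {
      numPartitions
    } else {
      math.min(rdd.size, numPartitions)
    }
  }
}
```

In this sketch, `before(1, child)` on a child with at least one partition leaves `executeCalls` at 2, while `after(1, child)` leaves it at 1. For any other `numPartitions`, both variants execute the child once.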