[SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec

### What changes were proposed in this pull request? `CoalesceExec` needlessly calls `child.execute` twice when it could just call it once and re-use the results. This only happens when `numPartitions == 1`. ### Why are the changes needed? It is more efficient to execute the child plan once rather than twice. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? There are no functional changes. This is just a performance optimization, so the existing tests should be sufficient to catch any regressions. Closes #32920 from andygrove/coalesce-exec-executes-twice. Authored-by: Andy Grove <andygrove73@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-15 11:59:21 -07:00 · 2021-06-15 11:59:21 -07:00 · 1012967ace
parent 8a02f3a413
commit 1012967ace
1 changed files with 3 additions and 2 deletions
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala
@ -724,12 +724,13 @@ case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecN
  }

  protected override def doExecute(): RDD[InternalRow] = {
-    if (numPartitions == 1 && child.execute().getNumPartitions < 1) {
+    val rdd = child.execute()
+    if (numPartitions == 1 && rdd.getNumPartitions < 1) {
      // Make sure we don't output an RDD with 0 partitions, when claiming that we have a
      // `SinglePartition`.
      new CoalesceExec.EmptyRDDWithPartitions(sparkContext, numPartitions)
    } else {
-      child.execute().coalesce(numPartitions, shuffle = false)
+      rdd.coalesce(numPartitions, shuffle = false)
    }
  }