[SPARK-35767][SQL] Avoid executing child plan twice in CoalesceExec
### What changes were proposed in this pull request?

`CoalesceExec` needlessly calls `child.execute` twice when it could just call it once and re-use the results. This only happens when `numPartitions == 1`.

### Why are the changes needed?

It is more efficient to execute the child plan once rather than twice.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

There are no functional changes. This is just a performance optimization, so the existing tests should be sufficient to catch any regressions.

Closes #32920 from andygrove/coalesce-exec-executes-twice.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This commit is contained in:
parent 8a02f3a413
commit 1012967ace
```diff
@@ -724,12 +724,13 @@ case class CoalesceExec(numPartitions: Int, child: SparkPlan) extends UnaryExecN
   }
 
   protected override def doExecute(): RDD[InternalRow] = {
-    if (numPartitions == 1 && child.execute().getNumPartitions < 1) {
+    val rdd = child.execute()
+    if (numPartitions == 1 && rdd.getNumPartitions < 1) {
       // Make sure we don't output an RDD with 0 partitions, when claiming that we have a
       // `SinglePartition`.
       new CoalesceExec.EmptyRDDWithPartitions(sparkContext, numPartitions)
     } else {
-      child.execute().coalesce(numPartitions, shuffle = false)
+      rdd.coalesce(numPartitions, shuffle = false)
     }
   }
```
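The refactoring pattern behind this fix can be sketched outside of Spark in plain Scala: a repeated call to a side-effecting method is hoisted into a local `val` so it runs exactly once. The names below (`execute`, `executeCalls`, `doExecuteBefore`, `doExecuteAfter`) are hypothetical stand-ins for illustration, not Spark APIs.

```scala
// Minimal sketch of the "call once, reuse" refactoring from this PR,
// with no Spark dependency. execute() stands in for child.execute().
object CoalescePattern {
  var executeCalls = 0 // counts how often the "child plan" is executed

  // Stand-in for the expensive, side-effecting child.execute() call.
  def execute(): Seq[Int] = {
    executeCalls += 1
    Seq(1, 2, 3)
  }

  // Before the fix: on the non-empty path, execute() runs twice.
  def doExecuteBefore(numPartitions: Int): Seq[Int] =
    if (numPartitions == 1 && execute().isEmpty) Seq.empty
    else execute()

  // After the fix: execute() runs exactly once, and the result is reused.
  def doExecuteAfter(numPartitions: Int): Seq[Int] = {
    val rdd = execute()
    if (numPartitions == 1 && rdd.isEmpty) Seq.empty
    else rdd
  }
}
```

Counting the calls makes the cost difference concrete: `doExecuteBefore(1)` invokes `execute()` twice on a non-empty input, while `doExecuteAfter(1)` invokes it once.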