08b951b1cb
### What changes were proposed in this pull request? For broadcast hash join and shuffled hash join, whenever the build side hashed relation turns out to be empty. We don't need to execute stream side plan at all, and can return an empty iterator (for inner join and left semi join), because we know for sure that none of stream side rows can be outputted as there's no match. ### Why are the changes needed? A very minor optimization for rare use case, but in case build side turns out to be empty, we can leverage it to short-cut stream side to save CPU and IO. Example broadcast hash join query similar to `JoinBenchmark` with empty hashed relation: ``` def broadcastHashJoinLongKey(): Unit = { val N = 20 << 20 val M = 1 << 16 val dim = broadcast(spark.range(0).selectExpr("id as k", "cast(id as string) as v")) codegenBenchmark("Join w long", N) { val df = spark.range(N).join(dim, (col("id") % M) === col("k")) assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined) df.noop() } } ``` Comparing wall clock time for enabling and disabling this PR (for non-codegen code path). Seeing like 8x improvement. ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Join w long: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ Join PR disabled 637 646 12 32.9 30.4 1.0X Join PR enabled 77 78 2 271.8 3.7 8.3X ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `JoinSuite`. Closes #29484 from c21/empty-relation. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> |
||
---|---|---|
.. | ||
benchmarks | ||
src | ||
v1.2/src | ||
v2.3/src | ||
pom.xml |