620cae098c
### What changes were proposed in this pull request? Added optimizer rule `RemoveRedundantAggregates`. It removes redundant aggregates from a query plan. A redundant aggregate is an aggregate whose only goal is to keep distinct values, while its parent aggregate would ignore duplicate values. The affected part of the query plan for TPCDS q87: Before: ``` == Physical Plan == *(26) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#785] +- *(25) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#724] +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti :- ... ``` After: ``` == Physical Plan == *(26) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#751] +- *(25) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(25) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- Exchange hashpartitioning(c_last_name#61, c_first_name#60, d_date#26, 5), true, [id=#694] +- *(24) HashAggregate(keys=[c_last_name#61, c_first_name#60, d_date#26], functions=[]) +- SortMergeJoin [coalesce(c_last_name#61, ), isnull(c_last_name#61), coalesce(c_first_name#60, ), isnull(c_first_name#60), coalesce(d_date#26, 0), isnull(d_date#26)], [coalesce(c_last_name#221, ), isnull(c_last_name#221), coalesce(c_first_name#220, ), isnull(c_first_name#220), coalesce(d_date#186, 0), isnull(d_date#186)], LeftAnti :- ... ``` ### Why are the changes needed? Performance improvements - few TPCDS queries have these kinds of duplicate aggregates. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Benchmarks (sf=5): OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Linux 5.8.13-arch1-1 Intel(R) Core(TM) i5-6500 CPU 3.20GHz | Query | Before | After | Speedup | | ------| ------- | ------| ------- | | q14a | 44s | 44s | 1x | | q14b | 41s | 41s | 1x | | q38 | 6.5s | 5.9s | 1.1x | | q87 | 7.2s | 6.8s | 1.1x | | q14a-v2.7 | 55s | 53s | 1x | Closes #30018 from tanelk/SPARK-33122. Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org> |
||
---|---|---|
.. | ||
benchmarks | ||
src | ||
pom.xml |