spark-instrumented-optimizer/sql/catalyst
yangjie01 a30bb0cfda [SPARK-32550][SQL][FOLLOWUP] Eliminate negative impact on HyperLogLogSuite
### What changes were proposed in this pull request?
Change to use `dataTypes.foreach` instead of get the element use specified index in `def this(dataTypes: Seq[DataType]) `constructor of `SpecificInternalRow` because the random access performance is unsatisfactory if the input argument not a `IndexSeq`.

This pr followed srowen's  advice.

### Why are the changes needed?
I found that SPARK-32550 had some negative impact on performance, the typical cases is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001, we found the code that is significantly slower is line 41 in `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType)) `

08b951b1cb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (L40-L44)

The size of "hll.aggBufferAttributes" in this case is 209716, the results of comparison before and after spark-32550 merged are as follows, The unit is ns:

  | After   SPARK-32550 createBuffer | After   SPARK-32550 end to end | Before   SPARK-32550 createBuffer | Before   SPARK-32550 end to end
-- | -- | -- | -- | --
rsd 0.001, n   1000 | 52715513243 | 53004810687 | 195807999 | 773977677
rsd 0.001, n   5000 | 51881246165 | 52519358215 | 13689949 | 249974855
rsd 0.001, n   10000 | 52234282788 | 52374639172 | 14199071 | 183452846
rsd 0.001, n   50000 | 55503517122 | 55664035449 | 15219394 | 584477125
rsd 0.001, n   100000 | 51862662845 | 52116774177 | 19662834 | 166483678
rsd 0.001, n   500000 | 51619226715 | 52183189526 | 178048012 | 16681330
rsd 0.001, n   1000000 | 54861366981 | 54976399142 | 226178708 | 18826340
rsd 0.001, n   5000000 | 52023602143 | 52354615149 | 388173579 | 15446409
rsd 0.001, n   10000000 | 53008591660 | 53601392304 | 533454460 | 16033032

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`

**Before**:

```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```

**After**
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```

Closes #29529 from LuciferYang/revert-spark-32550.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 11:13:01 +09:00
..
benchmarks [SPARK-30413][SQL] Avoid WrappedArray roundtrip in GenericArrayData constructor, plus related optimization in ParquetMapConverter 2020-01-19 19:12:19 -08:00
src [SPARK-32550][SQL][FOLLOWUP] Eliminate negative impact on HyperLogLogSuite 2020-08-25 11:13:01 +09:00
pom.xml [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00