Commit graph

4 commits

Author SHA1 Message Date
fitermay 21db4336b0 [SPARK-27070] Fix performance bug in DefaultPartitionCoalescer
When trying to coalesce a UnionRDD of two large FileScanRDDs
(each with a few million partitions) into around 8k partitions
the driver can stall for over an hour.

Profiler shows that over 90% of the time is spent in TimSort
which is invoked by `pickBin`.  This patch replaces sorting with a more
efficient `min` for the purpose of finding the least occupied
PartitionGroup

Closes #23986 from fitermay/SPARK-27070.

Authored-by: fitermay <fiterman@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-14 20:13:18 -05:00
Maxim Gekk bb985586f2 [SPARK-26816][CORE][TEST] Add XORShiftRandom Benchmark
## What changes were proposed in this pull request?

- The benchmark of `XORShiftRandom.nextInt` vis-a-vis `java.util.Random.nextInt` is moved from the `XORShiftRandom` object to `XORShiftRandomBenchmark`.
- Added benchmarks for `nextLong`, `nextDouble` and `nextGaussian` that are used in Spark as well.
- Added a separate benchmark for `XORShiftRandom.hashSeed`.

Closes #23752 from MaxGekk/xorshiftrandom-benchmark.

Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-10 13:52:24 -08:00
Patrick Brown 6cd23482d1 [SPARK-25839][CORE] Implement use of KryoPool in KryoSerializer
## What changes were proposed in this pull request?

* Implement (optional) use of KryoPool in KryoSerializer, an alternative to the existing implementation of caching a Kryo instance inside KryoSerializerInstance
* Add config key & documentation of spark.kryo.pool in order to turn this on
* Add benchmark KryoSerializerBenchmark to compare new and old implementation
* Add results of benchmark

## How was this patch tested?

Added new tests inside KryoSerializerSuite to test the pool implementation as well as added the pool option to the existing regression testing for SPARK-7766

This is my original work and I license the work to the project under the project’s open source license.

Closes #22855 from patrickbrownsync/kryo-pool.

Authored-by: Patrick Brown <patrick.brown@blyncsy.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-10 12:51:24 -06:00
Gengliang Wang b2e3256256 [SPARK-25490][SQL][TEST] Fix OOM of KryoBenchmark due to large 2D array and refactor it to use main method
## What changes were proposed in this pull request?

Before the code changes, I tried to run it with 8G memory:
```
build/sbt -mem 8000  "core/testOnly org.apache.spark.serializer.KryoBenchmark"
```
Still I got got OOM.

This is because the lengths of the arrays are random
669ade3a8e/core/src/test/scala/org/apache/spark/serializer/KryoBenchmark.scala (L90-L91)

And the 2D array is usually large: `10000 * Random.nextInt(0, 10000)`

This PR is to fix it and refactor it to use main method.

The benchmark result is also reason compared to the original one.

## How was this patch tested?

Run with
```
bin/spark-submit --class org.apache.spark.serializer.KryoBenchmark core/target/scala-2.11/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar
```
and
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt  "core/test:runMain org.apache.spark.serializer.KryoBenchmark"

Closes #22663 from gengliangwang/kyroBenchmark.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-10-24 16:56:17 -05:00