ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
fitermay	21db4336b0	[SPARK-27070] Fix performance bug in DefaultPartitionCoalescer When trying to coalesce a UnionRDD of two large FileScanRDDs (each with a few million partitions) into around 8k partitions the driver can stall for over an hour. Profiler shows that over 90% of the time is spent in TimSort which is invoked by `pickBin`. This patch replaces sorting with a more efficient `min` for the purpose of finding the least occupied PartitionGroup Closes #23986 from fitermay/SPARK-27070. Authored-by: fitermay <fiterman@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-14 20:13:18 -05:00
Maxim Gekk	bb985586f2	[SPARK-26816][CORE][TEST] Add XORShiftRandom Benchmark ## What changes were proposed in this pull request? - The benchmark of `XORShiftRandom.nextInt` vis-a-vis `java.util.Random.nextInt` is moved from the `XORShiftRandom` object to `XORShiftRandomBenchmark`. - Added benchmarks for `nextLong`, `nextDouble` and `nextGaussian` that are used in Spark as well. - Added a separate benchmark for `XORShiftRandom.hashSeed`. Closes #23752 from MaxGekk/xorshiftrandom-benchmark. Lead-authored-by: Maxim Gekk <maxim.gekk@databricks.com> Co-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-02-10 13:52:24 -08:00
Patrick Brown	6cd23482d1	[SPARK-25839][CORE] Implement use of KryoPool in KryoSerializer ## What changes were proposed in this pull request? * Implement (optional) use of KryoPool in KryoSerializer, an alternative to the existing implementation of caching a Kryo instance inside KryoSerializerInstance * Add config key & documentation of spark.kryo.pool in order to turn this on * Add benchmark KryoSerializerBenchmark to compare new and old implementation * Add results of benchmark ## How was this patch tested? Added new tests inside KryoSerializerSuite to test the pool implementation as well as added the pool option to the existing regression testing for SPARK-7766 This is my original work and I license the work to the project under the project’s open source license. Closes #22855 from patrickbrownsync/kryo-pool. Authored-by: Patrick Brown <patrick.brown@blyncsy.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-11-10 12:51:24 -06:00
Gengliang Wang	b2e3256256	[SPARK-25490][SQL][TEST] Fix OOM of KryoBenchmark due to large 2D array and refactor it to use main method ## What changes were proposed in this pull request? Before the code changes, I tried to run it with 8G memory: ``` build/sbt -mem 8000 "core/testOnly org.apache.spark.serializer.KryoBenchmark" ``` Still, I got an OOM. This is because the lengths of the arrays are random `669ade3a8e/core/src/test/scala/org/apache/spark/serializer/KryoBenchmark.scala (L90-L91)` And the 2D array is usually large: `10000 * Random.nextInt(0, 10000)` This PR fixes it and refactors the benchmark to use a main method. The benchmark result is also reasonable compared to the original one. ## How was this patch tested? Run with ``` bin/spark-submit --class org.apache.spark.serializer.KryoBenchmark core/target/scala-2.11/spark-core_2.11-3.0.0-SNAPSHOT-tests.jar ``` and ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "core/test:runMain org.apache.spark.serializer.KryoBenchmark" ``` Closes #22663 from gengliangwang/kyroBenchmark. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2018-10-24 16:56:17 -05:00

4 commits
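
The SPARK-27070 entry above describes replacing a sort with a `min` when looking for the least occupied PartitionGroup. The following is a minimal sketch of that idea only, not the code from the patch: `Group`, `PickLeastLoaded`, `leastLoadedBySort` and `leastLoadedByMin` are hypothetical names introduced for illustration, and the real `pickBin` in `DefaultPartitionCoalescer` may differ in detail.

```scala
// Hypothetical stand-in for a partition group; only its current size matters here.
final case class Group(id: Int, numPartitions: Int)

object PickLeastLoaded {
  // Sort-based selection: sorts the whole sequence just to read its first element,
  // which costs O(n log n) on every call.
  def leastLoadedBySort(groups: Seq[Group]): Group =
    groups.sortBy(_.numPartitions).head

  // Min-based selection: a single linear scan, O(n) per call.
  def leastLoadedByMin(groups: Seq[Group]): Group =
    groups.minBy(_.numPartitions)
}
```

Both functions assume a non-empty input and throw on an empty sequence. When the selection runs once per partition being assigned, as the commit message suggests for large unions, the difference between a full sort and a linear scan per call can dominate driver time.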