spark-instrumented-optimizer

History

sujithjay 0bf1a74a77 [SPARK-22465][CORE] Add a safety-check to RDD defaultPartitioner ## What changes were proposed in this pull request? In choosing a Partitioner to use for a cogroup-like operation between a number of RDDs, the default behaviour was if some of the RDDs already have a partitioner, we choose the one amongst them with the maximum number of partitions. This behaviour, in some cases, could hit the 2G limit (SPARK-6235). To illustrate one such scenario, consider two RDDs: rDD1: with smaller data and smaller number of partitions, alongwith a Partitioner. rDD2: with much larger data and a larger number of partitions, without a Partitioner. The cogroup of these two RDDs could hit the 2G limit, as a larger amount of data is shuffled into a smaller number of partitions. This PR introduces a safety-check wherein the Partitioner is chosen only if either of the following conditions are met: 1. if the number of partitions of the RDD associated with the Partitioner is greater than or equal to the max number of upstream partitions; or 2. if the number of partitions of the RDD associated with the Partitioner is less than and within a single order of magnitude of the max number of upstream partitions. ## How was this patch tested? Unit tests in PartitioningSuite and PairRDDFunctionsSuite Author: sujithjay <sujith@logistimo.com> Closes #20002 from sujithjay/SPARK-22465.		2017-12-24 11:14:30 -08:00
..
java	[SPARK-17788][SPARK-21033][SQL] fix the potential OOM in UnsafeExternalSorter and ShuffleExternalSorter	2017-10-30 17:53:06 +01:00
resources	[SPARK-20648][CORE] Port JobsTab and StageTab to the new UI backend.	2017-11-14 10:34:32 -06:00
scala/org/apache	[SPARK-22465][CORE] Add a safety-check to RDD defaultPartitioner	2017-12-24 11:14:30 -08:00