Document that groupByKey will OOM for large keys
This pull request is my own work and I license it under Spark's open-source license. This contribution is an improvement to the documentation. I documented that the maximum number of values per key for groupByKey is limited by available RAM (see [Datablox][datablox link] and [the spark mailing list][list link]). Just saying that better performance is available is not sufficient. Sometimes you need to do a group-by - your operation needs all the items available in order to complete. This warning explains the problem. [datablox link]: http://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html [list link]: http://apache-spark-user-list.1001560.n3.nabble.com/Understanding-RDD-GroupBy-OutOfMemory-Exceptions-tp11427p11466.html Author: Eric Moyer <eric_moyer@yahoo.com> Closes #3936 from RadixSeven/better-group-by-docs and squashes the following commits: 5b6f4e9 [Eric Moyer] groupByKey docs naming updates 238e81b [Eric Moyer] Doc that groupByKey will OOM for large keys
This commit is contained in:
parent
0760787da8
commit
538f221627
|
@ -437,6 +437,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
|
|||
* Note: This operation may be very expensive. If you are grouping in order to perform an
|
||||
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
|
||||
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
|
||||
*
|
||||
* Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
|
||||
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
|
||||
*/
|
||||
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
|
||||
// groupByKey shouldn't use map side combine because map side combine does not
|
||||
|
@ -458,6 +461,9 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
|
|||
* Note: This operation may be very expensive. If you are grouping in order to perform an
|
||||
* aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
|
||||
* or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
|
||||
*
|
||||
* Note: As currently implemented, groupByKey must be able to hold all the key-value pairs for any
|
||||
* key in memory. If a key has too many values, it can result in an [[OutOfMemoryError]].
|
||||
*/
|
||||
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = {
|
||||
groupByKey(new HashPartitioner(numPartitions))
|
||||
|
|
Loading…
Reference in a new issue