[SPARK-20232][PYTHON] Improve combineByKey docs
## What changes were proposed in this pull request?

Improve combineByKey documentation:

* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners

## How was this patch tested?

Doctest.

## Legal

This is my original work and I license the work to the project under the project’s open source license.

Author: David Gingrich <david@textio.com>

Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.
parent fbe4216e1e
commit 8ddf0d2a60
```diff
@@ -1804,17 +1804,31 @@ class RDD(object):
            a one-element list)
         - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
            a list)
-        - C{mergeCombiners}, to combine two C's into a single one.
+        - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
+           the lists)
 
+        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
+        modify and return their first argument instead of creating a new C.
+
         In addition, users can control the partitioning of the output RDD.
 
         .. note:: V and C can be different -- for example, one might group an RDD of type
             (Int, Int) into an RDD of type (Int, List[Int]).
 
-        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
-        >>> def add(a, b): return a + str(b)
-        >>> sorted(x.combineByKey(str, add, add).collect())
-        [('a', '11'), ('b', '1')]
+        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
+        >>> def to_list(a):
+        ...     return [a]
+        ...
+        >>> def append(a, b):
+        ...     a.append(b)
+        ...     return a
+        ...
+        >>> def extend(a, b):
+        ...     a.extend(b)
+        ...     return a
+        ...
+        >>> sorted(x.combineByKey(to_list, append, extend).collect())
+        [('a', [1, 2]), ('b', [1])]
         """
         if numPartitions is None:
             numPartitions = self._defaultReducePartitions()
```
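For readers without a running SparkContext, the three-function contract that the new doctest exercises can be sketched in plain Python. This is a hedged, single-machine analogue, not PySpark code: `combine_partition` and `merge_partitions` are hypothetical helpers that imitate what combineByKey does within and across partitions, and it also shows the patch's point that mergeValue and mergeCombiners may mutate and return their first argument.

```python
def to_list(a):
    return [a]

def append(a, b):
    a.append(b)    # mutates and returns its first argument, as the docs allow
    return a

def extend(a, b):
    a.extend(b)    # likewise mutates the first combiner in place
    return a

def combine_partition(pairs, create_combiner, merge_value):
    """Per-partition pass: turn the first V for a key into a combiner C,
    then merge each later V into that C."""
    combiners = {}
    for key, value in pairs:
        if key in combiners:
            combiners[key] = merge_value(combiners[key], value)
        else:
            combiners[key] = create_combiner(value)
    return combiners

def merge_partitions(left, right, merge_combiners):
    """Cross-partition pass: combine two per-partition maps of C's."""
    merged = dict(left)
    for key, comb in right.items():
        merged[key] = merge_combiners(merged[key], comb) if key in merged else comb
    return merged

# Simulate the doctest's data split across two partitions.
p1 = combine_partition([("a", 1), ("b", 1)], to_list, append)
p2 = combine_partition([("a", 2)], to_list, append)
print(sorted(merge_partitions(p1, p2, extend).items()))
# [('a', [1, 2]), ('b', [1])]
```

The mutate-in-place style of `append` and `extend` is exactly what the added "To avoid memory allocation" note permits: since each combiner is private to its task, reusing the first argument is safe and avoids building a fresh list on every merge.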