[SPARK-20232][PYTHON] Improve combineByKey docs

## What changes were proposed in this pull request?

Improve combineByKey documentation:

* Add note on memory allocation
* Change example code to use different mergeValue and mergeCombiners

## How was this patch tested?

Doctest.

## Legal

This is my original work and I license the work to the project under the project’s open source license.

Author: David Gingrich <david@textio.com>

Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.
This commit is contained in:
David Gingrich 2017-04-13 12:43:28 -07:00 committed by Holden Karau
parent fbe4216e1e
commit 8ddf0d2a60

View file

@ -1804,17 +1804,31 @@ class RDD(object):
a one-element list)
- C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
a list)
- C{mergeCombiners}, to combine two C's into a single one.
- C{mergeCombiners}, to combine two C's into a single one (e.g., merges
the lists)
To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
modify and return their first argument instead of creating a new C.
In addition, users can control the partitioning of the output RDD.
.. note:: V and C can be different -- for example, one might group an RDD of type
(Int, Int) into an RDD of type (Int, List[Int]).
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> def add(a, b): return a + str(b)
>>> sorted(x.combineByKey(str, add, add).collect())
[('a', '11'), ('b', '1')]
>>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
>>> def to_list(a):
... return [a]
...
>>> def append(a, b):
... a.append(b)
... return a
...
>>> def extend(a, b):
... a.extend(b)
... return a
...
>>> sorted(x.combineByKey(to_list, append, extend).collect())
[('a', [1, 2]), ('b', [1])]
"""
if numPartitions is None:
numPartitions = self._defaultReducePartitions()