[SPARK-20232][PYTHON] Improve combineByKey docs

## What changes were proposed in this pull request? Improve combineByKey documentation: * Add note on memory allocation * Change example code to use different mergeValue and mergeCombiners ## How was this patch tested? Doctest. ## Legal This is my original work and I license the work to the project under the project’s open source license. Author: David Gingrich <david@textio.com> Closes #17545 from dgingrich/topic-spark-20232-combinebykey-docs.
2017-04-13 12:43:28 -07:00 · 2017-04-13 12:43:28 -07:00 · 8ddf0d2a60
parent fbe4216e1e
commit 8ddf0d2a60
1 changed files with 19 additions and 5 deletions
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@ -1804,17 +1804,31 @@ class RDD(object):
              a one-element list)
            - C{mergeValue}, to merge a V into a C (e.g., adds it to the end of
              a list)
-            - C{mergeCombiners}, to combine two C's into a single one.
+            - C{mergeCombiners}, to combine two C's into a single one (e.g., merges
+              the lists)
+
+        To avoid memory allocation, both mergeValue and mergeCombiners are allowed to
+        modify and return their first argument instead of creating a new C.

        In addition, users can control the partitioning of the output RDD.

        .. note:: V and C can be different -- for example, one might group an RDD of type
            (Int, Int) into an RDD of type (Int, List[Int]).

-        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
-        >>> def add(a, b): return a + str(b)
-        >>> sorted(x.combineByKey(str, add, add).collect())
-        [('a', '11'), ('b', '1')]
+        >>> x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
+        >>> def to_list(a):
+        ...     return [a]
+        ...
+        >>> def append(a, b):
+        ...     a.append(b)
+        ...     return a
+        ...
+        >>> def extend(a, b):
+        ...     a.extend(b)
+        ...     return a
+        ...
+        >>> sorted(x.combineByKey(to_list, append, extend).collect())
+        [('a', [1, 2]), ('b', [1])]
        """
        if numPartitions is None:
            numPartitions = self._defaultReducePartitions()