[SPARK-1468] Modify the partition function used by partitionBy.

Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes. Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468 Author: Erik Selin <erik.selin@jadedpixel.com> Closes #371 from tyro89/consistent_hashing and squashes the following commits: 201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
2014-06-03 13:31:16 -07:00 · 2014-06-03 13:31:16 -07:00 · 8edc9d0330
parent b1f285359a
commit 8edc9d0330
1 changed files with 4 additions and 1 deletions
--- a/python/pyspark/rdd.py
+++ b/python/pyspark/rdd.py
@ -1062,7 +1062,7 @@ class RDD(object):
        return python_right_outer_join(self, other, numPartitions)

    # TODO: add option to control map-side combining
-    def partitionBy(self, numPartitions, partitionFunc=hash):
+    def partitionBy(self, numPartitions, partitionFunc=None):
        """
        Return a copy of the RDD partitioned using the specified partitioner.

@ -1073,6 +1073,9 @@ class RDD(object):
        """
        if numPartitions is None:
            numPartitions = self.ctx.defaultParallelism
+
+        if partitionFunc is None:
+            partitionFunc = lambda x: 0 if x is None else hash(x)
        # Transferring O(n) objects to Java is too expensive.  Instead, we'll
        # form the hash buckets in Python, transferring O(numPartitions) objects
        # to Java.  Each object is a (splitNumber, [objects]) pair.