[SPARK-1468] Modify the partition function used by partitionBy.
Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes. Associated JIRA at https://issues.apache.org/jira/browse/SPARK-1468 Author: Erik Selin <erik.selin@jadedpixel.com> Closes #371 from tyro89/consistent_hashing and squashes the following commits: 201c301 [Erik Selin] Make partitionBy use a tweaked version of hash as its default partition function since the python hash function does not consistently assign the same value to None across python processes.
This commit is contained in:
parent
b1f285359a
commit
8edc9d0330
|
@ -1062,7 +1062,7 @@ class RDD(object):
|
|||
return python_right_outer_join(self, other, numPartitions)
|
||||
|
||||
# TODO: add option to control map-side combining
|
||||
def partitionBy(self, numPartitions, partitionFunc=hash):
|
||||
def partitionBy(self, numPartitions, partitionFunc=None):
|
||||
"""
|
||||
Return a copy of the RDD partitioned using the specified partitioner.
|
||||
|
||||
|
@ -1073,6 +1073,9 @@ class RDD(object):
|
|||
"""
|
||||
if numPartitions is None:
|
||||
numPartitions = self.ctx.defaultParallelism
|
||||
|
||||
if partitionFunc is None:
|
||||
partitionFunc = lambda x: 0 if x is None else hash(x)
|
||||
# Transferring O(n) objects to Java is too expensive. Instead, we'll
|
||||
# form the hash buckets in Python, transferring O(numPartitions) objects
|
||||
# to Java. Each object is a (splitNumber, [objects]) pair.
|
||||
|
|
Loading…
Reference in a new issue