spark-instrumented-optimizer/python/pyspark
Davies Liu 7f22fa81eb [SPARK-4327] [PySpark] Python API for RDD.randomSplit()
```
pyspark.RDD.randomSplit(self, weights, seed=None)
    Randomly splits this RDD with the provided weights.

    :param weights: weights for the splits; they will be normalized if they don't sum to 1
    :param seed: random seed
    :return: the split RDDs in a list

    >>> rdd = sc.parallelize(range(10), 1)
    >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
    >>> rdd1.collect()
    [3, 6]
    >>> rdd2.collect()
    [0, 5, 7]
    >>> rdd3.collect()
    [1, 2, 4, 8, 9]
```
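The splits are disjoint because of a cumulative-weight trick: the normalized weights carve [0, 1) into adjacent ranges, every element gets exactly one uniform draw from a seeded RNG, and each split keeps the elements whose draw falls in its range. Below is a minimal pure-Python sketch of that idea — the function `random_split` and its local names are illustrative, not from the patch; the actual RDD version applies one seeded range filter per output RDD (via `rddsampler.py`) so the same sequence of draws is replayed for each split.

```
import random

def random_split(items, weights, seed=None):
    # Normalize weights into cumulative range boundaries on [0, 1).
    # float() guards against Python 2 integer division when weights are ints.
    total = float(sum(weights))
    bounds = [0.0]
    for w in weights:
        bounds.append(bounds[-1] + w / total)
    if seed is None:
        seed = random.randint(0, 2 ** 32 - 1)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()  # one draw per element, shared by all ranges
        for i, (lb, ub) in enumerate(zip(bounds, bounds[1:])):
            if lb <= x < ub:  # ranges are adjacent, so exactly one matches
                splits[i].append(item)
                break
    return splits

if __name__ == "__main__":
    parts = random_split(range(10), [0.4, 0.6, 1.0], seed=11)
    print(parts)  # three disjoint lists that together cover range(10)
```

The `float(sum(weights))` normalization matters: with Python 2 integer division, integer weights such as `[1, 2]` would otherwise collapse to zero-width ranges — presumably the issue behind the "fix bug with int in weights" commit below.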

Author: Davies Liu <davies@databricks.com>

Closes #3193 from davies/randomSplit and squashes the following commits:

78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
2014-11-18 16:37:35 -08:00
| Name | Latest commit | Date |
| --- | --- | --- |
| `mllib` | [SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS | 2014-11-18 15:57:33 -08:00 |
| `streaming` | replace awaitTransformation with awaitTermination in scaladoc/javadoc | 2014-10-21 09:37:17 -07:00 |
| `__init__.py` | [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py | 2014-11-13 10:24:54 -08:00 |
| `accumulators.py` | [SPARK-3478] [PySpark] Profile the Python tasks | 2014-09-30 18:24:57 -07:00 |
| `broadcast.py` | [SPARK-3721] [PySpark] broadcast objects larger than 2G | 2014-11-18 16:17:51 -08:00 |
| `cloudpickle.py` | [SPARK-3679] [PySpark] pickle the exact globals of functions | 2014-09-24 13:00:05 -07:00 |
| `conf.py` | [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs | 2014-10-07 18:09:27 -07:00 |
| `context.py` | [SPARK-3721] [PySpark] broadcast objects larger than 2G | 2014-11-18 16:17:51 -08:00 |
| `daemon.py` | [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM | 2014-10-25 01:20:39 -07:00 |
| `files.py` | [SPARK-3309] [PySpark] Put all public API in `__all__` | 2014-09-03 11:49:45 -07:00 |
| `heapq3.py` | [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() | 2014-08-26 16:57:40 -07:00 |
| `java_gateway.py` | [SPARK-4415] [PySpark] JVM should exit after Python exit | 2014-11-14 20:14:33 -08:00 |
| `join.py` | [SPARK-546] Add full outer join to RDD and DStream. | 2014-09-24 20:39:09 -07:00 |
| `rdd.py` | [SPARK-4327] [PySpark] Python API for RDD.randomSplit() | 2014-11-18 16:37:35 -08:00 |
| `rddsampler.py` | [SPARK-4327] [PySpark] Python API for RDD.randomSplit() | 2014-11-18 16:37:35 -08:00 |
| `resultiterable.py` | [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically | 2014-08-06 12:58:24 -07:00 |
| `serializers.py` | [SPARK-3721] [PySpark] broadcast objects larger than 2G | 2014-11-18 16:17:51 -08:00 |
| `shell.py` | [SPARK-3273] [SPARK-3301] We should read the version information from the same place | 2014-09-06 15:08:43 -07:00 |
| `shuffle.py` | [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. | 2014-11-03 23:56:14 -08:00 |
| `sql.py` | [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. | 2014-11-03 23:56:14 -08:00 |
| `statcounter.py` | StatCounter on NumPy arrays [PYSPARK] [SPARK-2012] | 2014-08-01 22:33:25 -07:00 |
| `storagelevel.py` | [SPARK-3417] Use new-style classes in PySpark | 2014-09-08 15:45:36 -07:00 |
| `tests.py` | [SPARK-3721] [PySpark] broadcast objects larger than 2G | 2014-11-18 16:17:51 -08:00 |
| `traceback_utils.py` | [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. | 2014-09-15 19:28:17 -07:00 |
| `worker.py` | [SPARK-3721] [PySpark] broadcast objects larger than 2G | 2014-11-18 16:17:51 -08:00 |