spark-instrumented-optimizer/python/pyspark
Xiangrui Meng 3cca196220 [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample
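The current seed distribution makes the random sequences for partitions i and i+1 identical except for an offset of 1. The session below shows the effect: two generators with the same seed produce the same values once one of them has advanced one extra step.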

~~~
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
~~~
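The offset can also be reproduced as a standalone script. The following is a minimal sketch, not the code from this commit: `offset_stream` and `mixed_stream` are hypothetical helpers, the discarded draws use `random()` rather than `randint` so the shift is exactly one position on any Python version, and XOR-ing the partition index into the seed is only one assumed way to give each partition its own sequence.

```python
import random


def offset_stream(seed, split, n):
    """Hypothetical helper: seed every partition identically, then skip `split` draws."""
    rng = random.Random(seed)
    for _ in range(split):
        rng.random()  # discard one value per preceding partition
    return [rng.random() for _ in range(n)]


def mixed_stream(seed, split, n):
    """Hypothetical alternative: fold the partition index into the seed itself."""
    rng = random.Random(seed ^ split)
    return [rng.random() for _ in range(n)]


s0 = offset_stream(10, 0, 5)
s1 = offset_stream(10, 1, 5)
# Partition 1 replays partition 0's sequence, shifted by one position.
print(s0[1:] == s1[:-1])

m0 = mixed_stream(10, 0, 5)
m1 = mixed_stream(10, 1, 5)
# With a different seed per partition, the shifted sequences should not line up.
print(m0[1:] == m1[:-1])
```

The first comparison holds by construction, because both generators start from the same state. The second is expected to fail, because seeds 10 and 11 start the generator from different states.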

Note: The new tests are not for this bug fix.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:

869ae4b [Xiangrui Meng] move tests tests.py
c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
2014-11-03 12:24:24 -08:00
mllib [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API 2014-10-30 22:25:18 -07:00
streaming replace awaitTransformation with awaitTermination in scaladoc/javadoc 2014-10-21 09:37:17 -07:00
__init__.py [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs 2014-10-07 18:09:27 -07:00
accumulators.py [SPARK-3478] [PySpark] Profile the Python tasks 2014-09-30 18:24:57 -07:00
broadcast.py [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx 2014-09-16 12:51:58 -07:00
cloudpickle.py [SPARK-3679] [PySpark] pickle the exact globals of functions 2014-09-24 13:00:05 -07:00
conf.py [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs 2014-10-07 18:09:27 -07:00
context.py [SPARK-2652] [PySpark] donot use KyroSerializer as default serializer 2014-10-23 23:58:00 -07:00
daemon.py [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM 2014-10-25 01:20:39 -07:00
files.py [SPARK-3309] [PySpark] Put all public API in __all__ 2014-09-03 11:49:45 -07:00
heapq3.py [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() 2014-08-26 16:57:40 -07:00
java_gateway.py [SPARK-3167] Handle special driver configs in Windows 2014-08-26 22:52:16 -07:00
join.py [SPARK-546] Add full outer join to RDD and DStream. 2014-09-24 20:39:09 -07:00
rdd.py [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample 2014-11-03 12:24:24 -08:00
rddsampler.py [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample 2014-11-03 12:24:24 -08:00
resultiterable.py [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically 2014-08-06 12:58:24 -07:00
serializers.py [SPARK-3993] [PySpark] fix bug while reuse worker after take() 2014-10-23 17:20:00 -07:00
shell.py [SPARK-3273][SPARK-3301]We should read the version information from the same place 2014-09-06 15:08:43 -07:00
shuffle.py [SPARK-3786] [PySpark] speedup tests 2014-10-06 14:07:53 -07:00
sql.py [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations 2014-11-01 19:29:14 -07:00
statcounter.py StatCounter on NumPy arrays [PYSPARK][SPARK-2012] 2014-08-01 22:33:25 -07:00
storagelevel.py [SPARK-3417] Use new-style classes in PySpark 2014-09-08 15:45:36 -07:00
tests.py [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample 2014-11-03 12:24:24 -08:00
traceback_utils.py [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. 2014-09-15 19:28:17 -07:00
worker.py [SPARK-3993] [PySpark] fix bug while reuse worker after take() 2014-10-23 17:20:00 -07:00