spark-instrumented-optimizer

Xiangrui Meng 3cca196220 [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample	2014-11-03 12:24:24 -08:00

The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.

~~~
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
~~~

Note: The new tests are not for this bug fix.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:

869ae4b [Xiangrui Meng] move tests tests.py
c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
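The offset problem described in this commit message can be reproduced outside Spark. The following is a minimal sketch, not the code from the patch: `offset_stream` and `mixed_stream` are hypothetical helpers, and XOR-ing the seed with the partition index is an assumed remedy chosen for illustration. The sketch advances the generator with `random()` rather than `randint(0, 1)` so that the alignment does not depend on the Python version.

```python
import random


def offset_stream(seed, split, n=5):
    """Hypothetical helper: every partition shares one seed and simply
    skips `split` draws, so neighbouring partitions see shifted copies
    of the same sequence."""
    rng = random.Random(seed)
    for _ in range(split):
        rng.random()  # discard one draw per preceding partition
    return [rng.random() for _ in range(n)]


def mixed_stream(seed, split, n=5):
    """Hypothetical helper: give each partition its own seed by XOR-ing
    the base seed with the partition index (an assumption, not
    necessarily what the patch does)."""
    rng = random.Random(seed ^ split)
    return [rng.random() for _ in range(n)]


if __name__ == "__main__":
    a = offset_stream(10, 0)
    b = offset_stream(10, 1)
    # Partition 1 replays partition 0's sequence shifted by one draw.
    print(a[1:] == b[:-1])  # True

    c = mixed_stream(10, 0)
    d = mixed_stream(10, 1)
    # With distinct per-partition seeds the shifted comparison no longer holds.
    print(c[1:] == d[:-1])
```

Seeds that differ only in a low bit are still numerically close, so a real sampler may want to mix them further (for example by hashing) before use.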
mllib	[SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API	2014-10-30 22:25:18 -07:00
streaming	replace awaitTransformation with awaitTermination in scaladoc/javadoc	2014-10-21 09:37:17 -07:00
__init__.py	[SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs	2014-10-07 18:09:27 -07:00
accumulators.py	[SPARK-3478] [PySpark] Profile the Python tasks	2014-09-30 18:24:57 -07:00
broadcast.py	[SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx	2014-09-16 12:51:58 -07:00
cloudpickle.py	[SPARK-3679] [PySpark] pickle the exact globals of functions	2014-09-24 13:00:05 -07:00
conf.py	[SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs	2014-10-07 18:09:27 -07:00
context.py	[SPARK-2652] [PySpark] donot use KyroSerializer as default serializer	2014-10-23 23:58:00 -07:00
daemon.py	[SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM	2014-10-25 01:20:39 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
heapq3.py	[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()	2014-08-26 16:57:40 -07:00
java_gateway.py	[SPARK-3167] Handle special driver configs in Windows	2014-08-26 22:52:16 -07:00
join.py	[SPARK-546] Add full outer join to RDD and DStream.	2014-09-24 20:39:09 -07:00
rdd.py	[SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample	2014-11-03 12:24:24 -08:00
rddsampler.py	[SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample	2014-11-03 12:24:24 -08:00
resultiterable.py	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
serializers.py	[SPARK-3993] [PySpark] fix bug while reuse worker after take()	2014-10-23 17:20:00 -07:00
shell.py	[SPARK-3273][SPARK-3301]We should read the version information from the same place	2014-09-06 15:08:43 -07:00
shuffle.py	[SPARK-3786] [PySpark] speedup tests	2014-10-06 14:07:53 -07:00
sql.py	[SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations	2014-11-01 19:29:14 -07:00
statcounter.py	StatCounter on NumPy arrays [PYSPARK][SPARK-2012]	2014-08-01 22:33:25 -07:00
storagelevel.py	[SPARK-3417] Use new-style classes in PySpark	2014-09-08 15:45:36 -07:00
tests.py	[SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample	2014-11-03 12:24:24 -08:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
worker.py	[SPARK-3993] [PySpark] fix bug while reuse worker after take()	2014-10-23 17:20:00 -07:00