spark-instrumented-optimizer

Davies Liu d39f2e9c68 [SPARK-4477] [PySpark] remove numpy from RDDSampler

RDDSampler tries to use numpy to gain better performance for poisson(), but the pure Python implementation of poisson() calls random() only (1 + fraction) * N times, so numpy offers little performance gain.

numpy is not a dependency of pyspark, so relying on it may introduce problems, for example when numpy is installed on the master but not on the slaves, as reported in SPARK-927.

It also complicates the code a lot, so we should remove numpy from RDDSampler.

I also ran a benchmark to verify this:

```
>>> from pyspark.mllib.random import RandomRDDs
>>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
>>> rdd.count()  # cache it
>>> rdd.sample(True, 0.9).count()  # measure this line
```

The results:

| withReplacement | random | numpy.random |
| --------------- | ------ | ------------ |
| True            | 1.5 s  | 1.4 s        |
| False           | 0.6 s  | 0.8 s        |

closes #2313

Note: this patch includes some commits that are not yet mirrored to GitHub; it will be OK after the mirror catches up.

Author: Davies Liu <davies@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3351 from davies/numpy and squashes the following commits:

5c438d7 [Davies Liu] fix comment
c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
98eb31b [Xiangrui Meng] make poisson sampling slightly faster
ee17d78 [Davies Liu] remove = for float
13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
f583023 [Davies Liu] fix tests
51649f5 [Davies Liu] remove numpy in RDDSampler
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()		2014-11-20 16:40:25 -08:00
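The commit message above says that a pure Python poisson() needs only about (1 + fraction) * N calls to random(). The sketch below illustrates why, using Knuth's multiplication method, which is one common way to draw Poisson variates in pure Python. Whether it matches the code in rddsampler.py is an assumption; the names `poisson` and `sample_with_replacement` are hypothetical and chosen only for this illustration.

```python
import math
import random


def poisson(mean, rng):
    """Draw one Poisson(mean) variate with Knuth's multiplication method.

    Each draw calls rng.random() exactly (result + 1) times, so the
    expected number of calls per draw is mean + 1.
    """
    limit = math.exp(-mean)
    count = 0
    product = rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def sample_with_replacement(items, fraction, seed=None):
    """Yield each item k times, where k ~ Poisson(fraction).

    Hypothetical helper for illustration, not the PySpark API.
    """
    rng = random.Random(seed)
    for item in items:
        for _ in range(poisson(fraction, rng)):
            yield item
```

For N items the sampler makes one poisson() draw per item, so the expected total number of random() calls is N * (1 + fraction), which is consistent with the small gap between the `random` and `numpy.random` timings reported in the commit message.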
..
mllib	[SPARK-4439] [MLlib] add python api for random forest	2014-11-20 15:31:28 -08:00
streaming	[DOC][PySpark][Streaming] Fix docstring for sphinx	2014-11-19 14:23:18 -08:00
__init__.py	[SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py	2014-11-13 10:24:54 -08:00
accumulators.py	[SPARK-3478] [PySpark] Profile the Python tasks	2014-09-30 18:24:57 -07:00
broadcast.py	[SPARK-3721] [PySpark] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00
cloudpickle.py	[SPARK-3679] [PySpark] pickle the exact globals of functions	2014-09-24 13:00:05 -07:00
conf.py	[SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs	2014-10-07 18:09:27 -07:00
context.py	[SPARK-3721] [PySpark] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00
daemon.py	[SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM	2014-10-25 01:20:39 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
heapq3.py	[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()	2014-08-26 16:57:40 -07:00
java_gateway.py	[SPARK-4415] [PySpark] JVM should exit after Python exit	2014-11-14 20:14:33 -08:00
join.py	[SPARK-546] Add full outer join to RDD and DStream.	2014-09-24 20:39:09 -07:00
rdd.py	[SPARK-4477] [PySpark] remove numpy from RDDSampler	2014-11-20 16:40:25 -08:00
rddsampler.py	[SPARK-4477] [PySpark] remove numpy from RDDSampler	2014-11-20 16:40:25 -08:00
resultiterable.py	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
serializers.py	[SPARK-3721] [PySpark] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00
shell.py	[SPARK-3273][SPARK-3301]We should read the version information from the same place	2014-09-06 15:08:43 -07:00
shuffle.py	[SPARK-4384] [PySpark] improve sort spilling	2014-11-19 15:45:37 -08:00
sql.py	[SPARK-4228][SQL] SchemaRDD to JSON	2014-11-20 13:44:19 -08:00
statcounter.py	StatCounter on NumPy arrays [PYSPARK][SPARK-2012]	2014-08-01 22:33:25 -07:00
storagelevel.py	[SPARK-3417] Use new-style classes in PySpark	2014-09-08 15:45:36 -07:00
tests.py	[SPARK-3721] [PySpark] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
worker.py	[SPARK-3721] [PySpark] broadcast objects larger than 2G	2014-11-18 16:17:51 -08:00