spark-instrumented-optimizer

Davies Liu 71af030b46 [SPARK-3094] [PySpark] compatitable with PyPy		2014-09-12 18:42:50 -07:00

After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:

```
PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
```

The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:

Job | CPython 2.7 | PyPy 2.3.1 | Speed up
------- | ------------ | ------------- | -------
Word Count | 41s | 15s | 2.7x
Sort | 46s | 44s | 1.05x
Stats | 174s | 3.6s | 48x

Here is the code used for benchmark:

```python
rdd = sc.textFile("text")

def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
       .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()

def sort():
    rdd.sortBy(lambda x:x, 1).count()

def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
```

Author: Davies Liu <davies.liu@gmail.com>

Closes #2144 from davies/pypy and squashes the following commits:

9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
4bc1f04 [Davies Liu] refactor
b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
3ca2351 [Davies Liu] Merge branch 'master' into pypy
fae8b19 [Davies Liu] improve attrgetter, add tests
591f830 [Davies Liu] try to run tests with PyPy in run-tests
c8d62ba [Davies Liu] cleanup
f651fd0 [Davies Liu] fix tests using array with PyPy
1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
3c1dbfe [Davies Liu] Merge branch 'master' into pypy
42fb5fa [Davies Liu] Merge branch 'master' into pypy
cb2d724 [Davies Liu] fix tests
9986692 [Davies Liu] Merge branch 'master' into pypy
25b4ca7 [Davies Liu] support PyPy
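
The "Speed up" column in the benchmark table of the commit message above appears to be the CPython time divided by the PyPy time. Below is a minimal sketch of that arithmetic, using only the timings quoted in the table. The names `timed`, `speedup`, `TIMINGS` and `SPEEDUPS` are hypothetical and not part of the patch. `timed` is one possible way to measure a benchmark function such as `wordcount()`, not necessarily how the author measured it. The sketch assumes Python 3.3 or later for `time.perf_counter`.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once; return (result, elapsed wall-clock seconds).

    Hypothetical helper, not taken from the patch.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def speedup(baseline_s, candidate_s):
    """Ratio of baseline time to candidate time (higher means faster)."""
    return baseline_s / candidate_s


# Timings copied from the benchmark table, in seconds:
# job -> (CPython 2.7, PyPy 2.3.1)
TIMINGS = {
    "Word Count": (41, 15),
    "Sort": (46, 44),
    "Stats": (174, 3.6),
}

SPEEDUPS = {job: speedup(cpy, pypy) for job, (cpy, pypy) in TIMINGS.items()}

if __name__ == "__main__":
    for job, ratio in SPEEDUPS.items():
        print("%s: %.2fx" % (job, ratio))
```

In a PySpark session, `timed(wordcount)` would wrap one of the benchmark functions from the commit message. A single run includes JVM and PyPy JIT warm-up, so repeated runs would give steadier numbers.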
..
mllib	[SPARK-3443][MLLIB] update default values of tree:	2014-09-08 18:59:57 -07:00
__init__.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
accumulators.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
broadcast.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
cloudpickle.py	[SPARK-3094] [PySpark] compatitable with PyPy	2014-09-12 18:42:50 -07:00
conf.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
context.py	[SPARK-3047] [PySpark] add an option to use str in textFileRDD	2014-09-11 11:50:36 -07:00
daemon.py	[SPARK-3094] [PySpark] compatitable with PyPy	2014-09-12 18:42:50 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
heapq3.py	[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()	2014-08-26 16:57:40 -07:00
java_gateway.py	[SPARK-3167] Handle special driver configs in Windows	2014-08-26 22:52:16 -07:00
join.py	[SPARK-2470] PEP8 fixes to PySpark	2014-07-21 22:30:53 -07:00
rdd.py	[PySpark] Add blank line so that Python RDD.top() docstring renders correctly	2014-09-12 09:46:21 -07:00
rddsampler.py	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
resultiterable.py	[SPARK-2627] [PySpark] have the build enforce PEP 8 automatically	2014-08-06 12:58:24 -07:00
serializers.py	[SPARK-3094] [PySpark] compatitable with PyPy	2014-09-12 18:42:50 -07:00
shell.py	[SPARK-3273][SPARK-3301]We should read the version information from the same place	2014-09-06 15:08:43 -07:00
shuffle.py	[SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()	2014-08-26 16:57:40 -07:00
sql.py	[SPARK-3417] Use new-style classes in PySpark	2014-09-08 15:45:36 -07:00
statcounter.py	StatCounter on NumPy arrays [PYSPARK][SPARK-2012]	2014-08-01 22:33:25 -07:00
storagelevel.py	[SPARK-3417] Use new-style classes in PySpark	2014-09-08 15:45:36 -07:00
tests.py	[SPARK-3094] [PySpark] compatitable with PyPy	2014-09-12 18:42:50 -07:00
worker.py	[SPARK-3114] [PySpark] Fix Python UDFs in Spark SQL.	2014-08-18 20:42:19 -07:00