spark-instrumented-optimizer/python/pyspark
Andrew Or 4b8ec6fcfd [SPARK-1808] Route bin/pyspark through Spark submit
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, `bin/pyspark` needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
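
For reference, the settings in question are ordinary property entries in `conf/spark-defaults.conf`, for example (property names are real Spark properties; values are illustrative only):

```
spark.master            spark://master:7077
spark.executor.memory   2g
spark.eventLog.enabled  true
```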

**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (e.g. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user-facing Spark scripts consistent.

**Details.** `bin/pyspark` inherently handles two cases: (1) running Python applications and (2) running the Python shell. For (1), Spark submit already handles running Python applications, so when `bin/pyspark` is given a Python file, we can simply pass the file directly to Spark submit and let it handle the rest.
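
For example (the application file name here is hypothetical), an invocation like the following is simply forwarded to Spark submit:

```
# handled roughly like: bin/spark-submit my_app.py
bin/pyspark my_app.py
```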

For case (2), `bin/pyspark` starts a Python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this; all we need to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the PySpark shell as a special case.
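
A minimal sketch of this launch path (illustrative only, not the actual `java_gateway.py` code; the helper name, argument order, and environment handling are assumptions):

```python
import os
import shlex
from subprocess import PIPE, Popen

def launch_gateway_via_submit():
    # bin/pyspark forwards its command-line options through an environment
    # variable; default to "" so shlex does not read from stdin when it is unset.
    submit_args = shlex.split(os.environ.get("PYSPARK_SUBMIT_ARGS", ""))
    spark_home = os.environ["SPARK_HOME"]  # assumed to be set by bin/pyspark
    # Launch the JVM through bin/spark-submit, which treats the PySpark shell
    # as a special case, instead of the old spark-class path.
    command = [os.path.join(spark_home, "bin", "spark-submit")] + submit_args + ["pyspark-shell"]
    return Popen(command, stdin=PIPE, stdout=PIPE)
```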

This has been tested locally (OS X and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except it now takes in Spark submit arguments as well.
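
For example (master URL illustrative), the IPython path goes through the same mechanism, so submit-style options can be passed on the command line:

```
IPYTHON=1 bin/pyspark --master local[4]
```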

Author: Andrew Or <andrewor14@gmail.com>

Closes #799 from andrewor14/pyspark-submit and squashes the following commits:

bf37e36 [Andrew Or] Minor changes
01066fa [Andrew Or] bin/pyspark for Windows
c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
1866f85 [Andrew Or] Windows is not cooperating
456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
b7ba0d8 [Andrew Or] Address a few comments (minor)
06eb138 [Andrew Or] Use shlex instead of writing our own parser
05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
6fba412 [Andrew Or] Deal with quotes + address various comments
fe4c8a7 [Andrew Or] Update --help for bin/pyspark
afe47bf [Andrew Or] Fix spark shell
f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a371d26 [Andrew Or] Route bin/pyspark through Spark submit
2014-05-16 22:34:38 -07:00
mllib [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark 2014-05-07 16:01:11 -07:00
__init__.py SPARK-1004. PySpark on YARN 2014-04-29 23:24:34 -07:00
accumulators.py Add custom serializer support to PySpark. 2013-11-10 16:45:38 -08:00
broadcast.py Fix some Python docs and make sure to unset SPARK_TESTING in Python 2013-12-29 20:15:07 -05:00
cloudpickle.py Rename top-level 'pyspark' directory to 'python' 2013-01-01 15:05:00 -08:00
conf.py [FIX] do not load defaults when testing SparkConf in pyspark 2014-05-14 14:57:17 -07:00
context.py SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions 2014-05-07 09:48:31 -07:00
daemon.py SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions 2014-05-07 09:48:31 -07:00
files.py Initial work to rename package to org.apache.spark 2013-09-01 14:13:13 -07:00
java_gateway.py [SPARK-1808] Route bin/pyspark through Spark submit 2014-05-16 22:34:38 -07:00
join.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
rdd.py Documentation: Encourage use of reduceByKey instead of groupByKey. 2014-05-14 22:24:04 -07:00
rddsampler.py SPARK-1438 RDD.sample() make seed param optional 2014-04-24 17:27:16 -07:00
resultiterable.py Spark 1271: Co-Group and Group-By should pass Iterable[X] 2014-04-08 18:15:59 -07:00
serializers.py SPARK-1421. Make MLlib work on Python 2.6 2014-04-05 20:52:05 -07:00
shell.py [SPARK-1808] Route bin/pyspark through Spark submit 2014-05-16 22:34:38 -07:00
sql.py [SQL] Make it possible to create Java/Python SQLContexts from an existing Scala SQLContext. 2014-05-13 21:23:51 -07:00
statcounter.py Spark 1246 add min max to stat counter 2014-03-18 00:45:47 -07:00
storagelevel.py SPARK-1305: Support persisting RDD's directly to Tachyon 2014-04-04 20:38:20 -07:00
tests.py [SPARK-1549] Add Python support to spark-submit 2014-05-06 15:12:35 -07:00
worker.py Add Python includes to path before depickling broadcast values 2014-05-10 13:02:13 -07:00