This is a hot fix for the hot fix in fb499be1ac. The changes in that commit did not actually cause Python's `doctest` module to be loaded for the following tests:
- pyspark/broadcast.py
- pyspark/accumulators.py
- pyspark/serializers.py
(@pwendell I might have told you the wrong thing)
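To illustrate why loading `doctest` matters, here is a minimal sketch, not the actual test runner. The module name `example_module` and the function `double` are hypothetical stand-ins for a file such as pyspark/broadcast.py. Executing a module on its own checks none of its docstring examples; they only run when `doctest` is invoked explicitly.

```python
import doctest
import types

# Hypothetical stand-in for a module whose only tests are doctests.
source = '''
def double(x):
    """
    >>> double(2)
    4
    """
    return x * 2
'''

# Executing the module defines double() but does not check its docstring.
module = types.ModuleType("example_module")
exec(source, module.__dict__)

# The docstring examples are only exercised once doctest is asked to run them.
results = doctest.testmod(module)
print(results.attempted, results.failed)
```

If the test script never reaches a step like `doctest.testmod`, a broken example in a docstring passes silently.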
Author: Andrew Or <andrewor14@gmail.com>
Closes #1053 from andrewor14/python-test-fix and squashes the following commits:
d2e5401 [Andrew Or] Explain why these tests are handled differently
0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked
Author: Patrick Wendell <pwendell@gmail.com>
Closes #1036 from pwendell/jenkins-test and squashes the following commits:
9c99856 [Patrick Wendell] Better output during tests
71e7b74 [Patrick Wendell] Removing incorrect python path
74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
Fixed a couple of misleading comments in bin/pyspark and bin/spark-class. The comments make it seem like the script is looking for the Scala installation when in fact it is looking for Spark.
Author: Sumedh Mungee <smungee@gmail.com>
Closes #843 from smungee/spark-1250-fix-comments and squashes the following commits:
26870f3 [Sumedh Mungee] [SPARK-1250] Fixed misleading comments in bin/pyspark and bin/spark-class
Author: Neville Li <neville@spotify.com>
Closes #812 from nevillelyh/neville/v1.0 and squashes the following commits:
0dc33ed [Neville Li] Fix spark-submit path in pyspark
becec64 [Neville Li] Fix spark-submit path in spark-shell
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, `bin/pyspark` needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user-facing Spark scripts consistent.
**Details.** `bin/pyspark` inherently handles two cases: (1) running Python applications and (2) running the Python shell. For (1), Spark submit already handles running Python applications. For cases in which `bin/pyspark` is given a Python file, we can simply pass the file directly to Spark submit and let it handle the rest.
For case (2), `bin/pyspark` starts a Python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we need to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.
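The commits below mention parsing `PYSPARK_SUBMIT_ARGS` with `shlex` and guarding against the variable being unset. The following is a rough sketch of that idea, not the code from this patch. The helper name `parse_submit_args` and the sample argument string are assumptions made for illustration. The guard matters because `shlex.split(None)` has historically read from standard input, which can block.

```python
import os
import shlex


def parse_submit_args(env=None):
    """Split PYSPARK_SUBMIT_ARGS into an argument list (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = env.get("PYSPARK_SUBMIT_ARGS")
    # Treat an unset or empty variable as "no arguments" instead of
    # handing None to shlex.split, which may wait on stdin.
    if not raw:
        return []
    # shlex honours shell-style quoting, including escaped quotes
    # inside a double-quoted app name.
    return shlex.split(raw)


sample_env = {
    "PYSPARK_SUBMIT_ARGS": r'--master local[2] --name "my \"quoted\" app"'
}
args = parse_submit_args(sample_env)
print(args)
print(parse_submit_args({}))
```

Using `shlex` rather than a hand-written parser keeps quoted values such as an app name with spaces as a single argument.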
This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.
Author: Andrew Or <andrewor14@gmail.com>
Closes #799 from andrewor14/pyspark-submit and squashes the following commits:
bf37e36 [Andrew Or] Minor changes
01066fa [Andrew Or] bin/pyspark for Windows
c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
1866f85 [Andrew Or] Windows is not cooperating
456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
b7ba0d8 [Andrew Or] Address a few comments (minor)
06eb138 [Andrew Or] Use shlex instead of writing our own parser
05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
6fba412 [Andrew Or] Deal with quotes + address various comments
fe4c8a7 [Andrew Or] Update --help for bin/pyspark
afe47bf [Andrew Or] Fix spark shell
f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a371d26 [Andrew Or] Route bin/pyspark through Spark submit
Gives a nicely formatted message to the user when `run-example` is run to
tell them to use `spark-submit`.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #704 from pwendell/examples and squashes the following commits:
1996ee8 [Patrick Wendell] Feedback from Andrew
3eb7803 [Patrick Wendell] Suggestions from TD
2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo
Author: Sandy Ryza <sandy@cloudera.com>
Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:
89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
5165a02 [Sandy Ryza] Fix docs
fd0df79 [Sandy Ryza] PySpark on YARN
This is based on @dianacarroll's previous pull request https://github.com/apache/spark/pull/227, and @joshrosen's comments on https://github.com/apache/spark/pull/38. Since we do want to allow passing arguments to IPython, this does the following:
* It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see https://github.com/ipython/ipython/pull/5226, but no released version has that fix.)
* If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
* The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes https://github.com/ipython/ipython/pull/5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
@joshrosen you should probably take the final call on this.
Author: Diana Carroll <dcarroll@cloudera.com>
Closes #294 from mateiz/spark-1134 and squashes the following commits:
747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
See comments on Pull Request https://github.com/apache/spark/pull/38
(I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
Author: Diana Carroll <dcarroll@cloudera.com>
Closes #227 from dianacarroll/spark-1134 and squashes the following commits:
ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
Various Spark scripts load spark-env.sh, so it can be sourced more than once during a single launch. This can cause growth of any variables that may be appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS), and it makes the precedence order for options specified in spark-env.sh less clear.
One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to ../conf/ as usual.
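The idempotency idea can be modelled in a few lines. The real change is a shell script; this is only a Python sketch of the guard pattern, and the names `load_spark_env`, `spark_env_sh`, `ENV_LOADED`, and the `/extra/jars` path are all hypothetical.

```python
def load_spark_env(env, env_script):
    """Apply env_script to env at most once, using a guard flag."""
    # Hypothetical guard variable: once set, later loads are skipped.
    if env.get("ENV_LOADED") == "1":
        return env
    env["ENV_LOADED"] = "1"
    env_script(env)
    return env


def spark_env_sh(env):
    # Stand-in for a spark-env.sh that appends to a variable.
    env["SPARK_CLASSPATH"] = env.get("SPARK_CLASSPATH", "") + ":/extra/jars"


env = {"SPARK_CLASSPATH": "/base"}
load_spark_env(env, spark_env_sh)
load_spark_env(env, spark_env_sh)  # second load is a no-op
print(env["SPARK_CLASSPATH"])
```

Without the guard, each load would append `:/extra/jars` again, which is the variable growth described above.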
Author: Aaron Davidson <aaron@databricks.com>
Closes #184 from aarondav/idem and squashes the following commits:
e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.
I tested this using the three commands in the PySpark documentation page:
1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark
There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
gloms them together as a single argument passed to `exec` which
seemed to cause ipython to fail (it instead expects them as
multiple arguments).
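The effect of the quotes can be sketched by comparing the two argument vectors. This is an illustration in Python rather than the launch script itself; `str.split()` only approximates the shell's word splitting of an unquoted variable (it ignores globbing and a custom `IFS`).

```python
ipython_opts = "notebook --pylab inline"

# Roughly what `exec ipython "$IPYTHON_OPTS"` produces:
# the whole string arrives as one argument.
quoted_argv = ["ipython", ipython_opts]

# Roughly what `exec ipython $IPYTHON_OPTS` produces:
# the string is split on whitespace into separate arguments.
unquoted_argv = ["ipython"] + ipython_opts.split()

print(quoted_argv)
print(unquoted_argv)
```

In the quoted form, IPython would receive a single argument containing spaces instead of a subcommand followed by its options.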