...
Tested! TBH, it isn't a great idea to have a directory with spaces in its name, because Emacs doesn't like it, then Hadoop doesn't like it, and so on...
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits:
d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #2270 from sarutak/SPARK-3399 and squashes the following commits:
7613be6 [Kousuke Saruta] Modified pyspark script to ignore environment variables YARN_CONF_DIR and HADOOP_CONF_DIR while testing
In `SparkSubmitDriverBootstrapper`, we wait for the parent process to send us an `EOF` before finishing the application. This is applicable for the PySpark shell because we terminate the application the same way. However, if we run a Python application, for instance, the JVM never actually exits unless it receives a manual EOF from the user. This is causing a few tests to time out.
We only need to do this for the PySpark shell because Spark submit runs as a python subprocess only in this case. Thus, the normal Spark shell doesn't need to go through this case even though it is also a REPL.
Thanks davies for reporting this.
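The condition described above can be sketched in Python as follows. This is only an illustration of the idea, not the bootstrapper's actual code: the function name `wait_for_parent_eof` and the `is_pyspark_shell` flag are hypothetical, and an in-memory stream stands in for the parent's pipe.

```python
import io

def wait_for_parent_eof(stream, is_pyspark_shell):
    """Return True if we blocked until EOF, False if we skipped the wait."""
    if not is_pyspark_shell:
        # A plain Python application never sends EOF, so waiting would hang.
        return False
    while stream.read(1024):
        pass  # discard input until read() returns an empty string (EOF)
    return True

# io.StringIO stands in for the parent's pipe; it reports EOF once drained.
shell_waited = wait_for_parent_eof(io.StringIO("quit\n"), is_pyspark_shell=True)
app_waited = wait_for_parent_eof(io.StringIO(""), is_pyspark_shell=False)
```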
Author: Andrew Or <andrewor14@gmail.com>
Closes #2170 from andrewor14/bootstrap-hotfix and squashes the following commits:
42963f5 [Andrew Or] Do not wait for EOF unless this is the pyspark shell
Although you can make pyspark use ipython with `IPYTHON=1`, and also change the python executable with `PYSPARK_PYTHON=...`, you can't use both at the same time because it hardcodes the default ipython script.
This makes it use the `PYSPARK_PYTHON` variable if present and fall back to default python, similarly to how the default python executable is handled.
So you can use a custom ipython like so:
`PYSPARK_PYTHON=./anaconda/bin/ipython IPYTHON_OPTS="notebook" pyspark`
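The fallback order can be sketched roughly as below. This is a Python illustration of the selection logic only, not the shell code in `bin/pyspark`: `pick_interpreter` is a hypothetical helper, and treating a non-empty `IPYTHON_OPTS` as implying IPython is an assumption based on the usage shown above.

```python
def pick_interpreter(env):
    """Return the executable a launcher might start (illustrative only)."""
    # Assumption: IPython is requested via IPYTHON=1 or a non-empty IPYTHON_OPTS.
    use_ipython = env.get("IPYTHON") == "1" or bool(env.get("IPYTHON_OPTS"))
    default = "ipython" if use_ipython else "python"
    # PYSPARK_PYTHON wins when set; otherwise fall back to the default.
    return env.get("PYSPARK_PYTHON") or default

custom = pick_interpreter({
    "PYSPARK_PYTHON": "./anaconda/bin/ipython",
    "IPYTHON_OPTS": "notebook",
})
fallback = pick_interpreter({"IPYTHON": "1"})
```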
Author: Rob O'Dwyer <odwyerrob@gmail.com>
Closes #2167 from robbles/patch-1 and squashes the following commits:
d98e8a9 [Rob O'Dwyer] Allow using custom ipython executable with pyspark
As sryza reported, spark-shell doesn't accept any flags.
The root cause is incorrect usage of spark-submit in spark-shell, which came to the surface with #1801.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1715, Closes #1864, and Closes #1861
Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:
47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
Author: Josh Rosen <joshrosen@apache.org>
Closes #1626 from JoshRosen/SPARK-2305 and squashes the following commits:
03fb283 [Josh Rosen] Update Py4J to version 0.8.2.1.
Trivial fix.
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #1050 from ScrapCodes/SPARK-2109/pyspark-script-bug and squashes the following commits:
77072b9 [Prashant Sharma] Changed echos to redirect to STDERR.
13f48a0 [Prashant Sharma] [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work.
This is a hot fix for the hot fix in fb499be1ac. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests:
- pyspark/broadcast.py
- pyspark/accumulators.py
- pyspark/serializers.py
(@pwendell I might have told you the wrong thing)
Author: Andrew Or <andrewor14@gmail.com>
Closes #1053 from andrewor14/python-test-fix and squashes the following commits:
d2e5401 [Andrew Or] Explain why these tests are handled differently
0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked
Author: Patrick Wendell <pwendell@gmail.com>
Closes #1036 from pwendell/jenkins-test and squashes the following commits:
9c99856 [Patrick Wendell] Better output during tests
71e7b74 [Patrick Wendell] Removing incorrect python path
74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
Fixed a couple of misleading comments in bin/pyspark and bin/spark-class. The comments make it seem like the script is looking for the Scala installation when in fact it is looking for Spark.
Author: Sumedh Mungee <smungee@gmail.com>
Closes #843 from smungee/spark-1250-fix-comments and squashes the following commits:
26870f3 [Sumedh Mungee] [SPARK-1250] Fixed misleading comments in bin/pyspark and bin/spark-class
Author: Neville Li <neville@spotify.com>
Closes #812 from nevillelyh/neville/v1.0 and squashes the following commits:
0dc33ed [Neville Li] Fix spark-submit path in pyspark
becec64 [Neville Li] Fix spark-submit path in spark-shell
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, it needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user facing Spark scripts consistent.
**Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications. For cases in which `bin/pyspark` is given a python file, we can simply pass the file directly to Spark submit and let it handle the rest.
For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.
This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.
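The two cases can be sketched as a small dispatch function. This is a rough Python illustration under stated assumptions, not the actual script: `plan_launch` is a hypothetical helper, detecting an application by a `.py` suffix is an assumption, and so is carrying the shell's arguments in the `PYSPARK_SUBMIT_ARGS` environment variable (the name appears in the squashed commits; its exact handling here is guessed).

```python
import shlex

def plan_launch(argv, env):
    """Decide how a pyspark-like launcher would proceed (illustrative only)."""
    if argv and argv[0].endswith(".py"):
        # Case (1): a Python application; hand everything to spark-submit.
        return ("application", ["bin/spark-submit"] + list(argv))
    # Case (2): the shell. Default to "" so shlex never receives None, which
    # older Python versions treated as "read from standard input".
    raw_args = env.get("PYSPARK_SUBMIT_ARGS") or ""
    return ("shell", ["bin/spark-submit"] + shlex.split(raw_args))

app_plan = plan_launch(["app.py", "--verbose"], {})
shell_plan = plan_launch([], {"PYSPARK_SUBMIT_ARGS": '--name "my app" --master local'})
```

Using `shlex.split` keeps a quoted value such as `"my app"` together as one argument instead of breaking it at the space.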
Author: Andrew Or <andrewor14@gmail.com>
Closes #799 from andrewor14/pyspark-submit and squashes the following commits:
bf37e36 [Andrew Or] Minor changes
01066fa [Andrew Or] bin/pyspark for Windows
c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
1866f85 [Andrew Or] Windows is not cooperating
456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
b7ba0d8 [Andrew Or] Address a few comments (minor)
06eb138 [Andrew Or] Use shlex instead of writing our own parser
05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
6fba412 [Andrew Or] Deal with quotes + address various comments
fe4c8a7 [Andrew Or] Update --help for bin/pyspark
afe47bf [Andrew Or] Fix spark shell
f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a371d26 [Andrew Or] Route bin/pyspark through Spark submit
Gives a nicely formatted message to the user when `run-example` is run to
tell them to use `spark-submit`.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #704 from pwendell/examples and squashes the following commits:
1996ee8 [Patrick Wendell] Feedback form Andrew
3eb7803 [Patrick Wendell] Suggestions from TD
2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo
Author: Sandy Ryza <sandy@cloudera.com>
Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:
89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
5165a02 [Sandy Ryza] Fix docs
fd0df79 [Sandy Ryza] PySpark on YARN
This is based on @dianacarroll's previous pull request https://github.com/apache/spark/pull/227, and @joshrosen's comments on https://github.com/apache/spark/pull/38. Since we do want to allow passing arguments to IPython, this does the following:
* It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see https://github.com/ipython/ipython/pull/5226, but no released version has that fix.)
* If you run `pyspark` with `IPYTHON=1`, it passes your command-line arguments to it. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook`.
* The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes https://github.com/ipython/ipython/pull/5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
@joshrosen you should probably take the final call on this.
Author: Diana Carroll <dcarroll@cloudera.com>
Closes #294 from mateiz/spark-1134 and squashes the following commits:
747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
See comments on Pull Request https://github.com/apache/spark/pull/38
(I couldn't figure out how to modify an existing pull request, so I'm hoping I can withdraw that one and replace it with this one.)
Author: Diana Carroll <dcarroll@cloudera.com>
Closes #227 from dianacarroll/spark-1134 and squashes the following commits:
ffe47f2 [Diana Carroll] [spark-1134] remove ipythonopts from ipython command
b673bf7 [Diana Carroll] Merge branch 'master' of github.com:apache/spark
0309cf9 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
Various spark scripts load spark-env.sh. This can cause growth of any variables that may be appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS) and it makes the precedence order for options specified in spark-env.sh less clear.
One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to ../conf/ as usual.
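The idempotency idea can be sketched in Python with a dictionary standing in for the process environment. This is an illustration only, not the contents of load-spark-env.sh: the guard name `SPARK_ENV_LOADED`, the helper names, and the sample paths are all assumptions made for the example.

```python
import posixpath

def conf_dir(env, script_dir):
    """Prefer SPARK_CONF_DIR; otherwise use ../conf relative to the script."""
    return env.get("SPARK_CONF_DIR") or posixpath.normpath(
        posixpath.join(script_dir, "..", "conf"))

def load_spark_env(env, spark_env_sh):
    """Apply spark_env_sh to env at most once (illustrative only)."""
    if env.get("SPARK_ENV_LOADED"):
        return env  # already loaded, so appended variables cannot grow
    env["SPARK_ENV_LOADED"] = "1"
    spark_env_sh(env)
    return env

def sample_spark_env_sh(env):
    # Stand-in for a user's spark-env.sh that appends to a variable.
    env["SPARK_CLASSPATH"] = env.get("SPARK_CLASSPATH", "") + ":/extra/jars"

env = {}
load_spark_env(env, sample_spark_env_sh)
load_spark_env(env, sample_spark_env_sh)  # second call is a no-op
default_conf = conf_dir(env, "/opt/spark/bin")
```

Because the environment file is applied once and early, anything set afterwards (for example from the command line) is not overwritten by a later reload.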
Author: Aaron Davidson <aaron@databricks.com>
Closes #184 from aarondav/idem and squashes the following commits:
e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.
I tested this using the three commands in the PySpark documentation page:
1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark
There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
gloms them together as a single argument passed to `exec` which
seemed to cause ipython to fail (it instead expects them as
multiple arguments).
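To make the quoting issue concrete, here is a small Python sketch of the two argument vectors. It only mimics the shell's behaviour: `str.split()` approximates whitespace word splitting of an unquoted expansion, and the `ipython` program name is used purely for illustration.

```python
ipython_opts = "notebook --pylab inline"

# Quoted expansion ("$IPYTHON_OPTS"): one argument that contains spaces.
quoted_argv = ["ipython", ipython_opts]

# Unquoted expansion ($IPYTHON_OPTS): split on whitespace into separate arguments.
unquoted_argv = ["ipython"] + ipython_opts.split()

print(quoted_argv)
print(unquoted_argv)
```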