#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import atexit
import os
import sys
import select
import signal
import shlex
import socket
import platform
from subprocess import Popen, PIPE

if sys.version >= '3':
    xrange = range

from py4j.java_gateway import java_import, JavaGateway, GatewayClient
from py4j.java_collections import ListConverter

from pyspark.serializers import read_int


# Patch ListConverter, or it will convert bytearray into a Java ArrayList
def can_convert_list(self, obj):
    return isinstance(obj, (list, tuple, xrange))

ListConverter.can_convert = can_convert_list

def launch_gateway():
    if "PYSPARK_GATEWAY_PORT" in os.environ:
        gateway_port = int(os.environ["PYSPARK_GATEWAY_PORT"])
    else:
        SPARK_HOME = os.environ["SPARK_HOME"]
        # Launch the Py4j gateway using Spark's run command so that we pick up the
        # proper classpath and settings from spark-env.sh
        on_windows = platform.system() == "Windows"
        script = "./bin/spark-submit.cmd" if on_windows else "./bin/spark-submit"
        submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
        if os.environ.get("SPARK_TESTING"):
            submit_args = ' '.join([
                "--conf spark.ui.enabled=false",
                submit_args
            ])
        command = [os.path.join(SPARK_HOME, script)] + shlex.split(submit_args)

        # Start a socket that will be used by PythonGatewayServer to communicate its port to us
        callback_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        callback_socket.bind(('127.0.0.1', 0))
        callback_socket.listen(1)
        callback_host, callback_port = callback_socket.getsockname()
        env = dict(os.environ)
        env['_PYSPARK_DRIVER_CALLBACK_HOST'] = callback_host
        env['_PYSPARK_DRIVER_CALLBACK_PORT'] = str(callback_port)

        # Launch the Java gateway.
        # We open a pipe to stdin so that the Java gateway can die when the pipe is broken
        if not on_windows:
            # Don't send ctrl-c / SIGINT to the Java gateway:
            def preexec_func():
                signal.signal(signal.SIGINT, signal.SIG_IGN)
            proc = Popen(command, stdin=PIPE, preexec_fn=preexec_func, env=env)
        else:
            # preexec_fn not supported on Windows
            proc = Popen(command, stdin=PIPE, env=env)

        gateway_port = None
        # We use select() here in order to avoid blocking indefinitely if the subprocess dies
        # before connecting
        while gateway_port is None and proc.poll() is None:
            timeout = 1  # (seconds)
            readable, _, _ = select.select([callback_socket], [], [], timeout)
            if callback_socket in readable:
                gateway_connection = callback_socket.accept()[0]
                # Determine which ephemeral port the server started on:
                gateway_port = read_int(gateway_connection.makefile(mode="rb"))
                gateway_connection.close()
                callback_socket.close()
        if gateway_port is None:
            raise Exception("Java gateway process exited before sending the driver its port number")

        # On Windows, ensure the Java child processes do not linger after Python has exited.
        # On UNIX-based systems, the child process can kill itself on broken pipe (i.e. when
        # the parent process' stdin sends an EOF). On Windows, however, this is not possible
        # because java.lang.Process reads directly from the parent process' stdin, contending
        # with any opportunity to read an EOF from the parent. Note that this is only best
        # effort and will not take effect if the Python process is violently terminated.
        if on_windows:
            # On Windows, the child process here is "spark-submit.cmd", not the JVM itself
            # (because the UNIX "exec" command is not available). This means we cannot simply
            # call proc.kill(), which kills only the "spark-submit.cmd" process but not the
            # JVMs. Instead, we use "taskkill" with the tree-kill option "/t" to terminate all
            # child processes in the tree (http://technet.microsoft.com/en-us/library/bb491009.aspx)
            def killChild():
                Popen(["cmd", "/c", "taskkill", "/f", "/t", "/pid", str(proc.pid)])
            atexit.register(killChild)

    # Connect to the gateway
    gateway = JavaGateway(GatewayClient(port=gateway_port), auto_convert=True)

    # Import the classes used by PySpark
    java_import(gateway.jvm, "org.apache.spark.SparkConf")
    java_import(gateway.jvm, "org.apache.spark.api.java.*")
    java_import(gateway.jvm, "org.apache.spark.api.python.*")
    java_import(gateway.jvm, "org.apache.spark.mllib.api.python.*")
    # TODO(davies): move into sql
    java_import(gateway.jvm, "org.apache.spark.sql.*")
    java_import(gateway.jvm, "org.apache.spark.sql.hive.*")
    java_import(gateway.jvm, "scala.Tuple2")

    return gateway