spark-instrumented-optimizer/python/pyspark/shell.py

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""
An interactive shell.

This file is designed to be launched as a PYTHONSTARTUP script.
"""

import sys
if sys.version_info[0] != 2:
    print("Error: Default Python used is Python%s" % sys.version_info.major)
    print("\tSet env variable PYSPARK_PYTHON to Python2 binary and re-run it.")
    sys.exit(1)


import atexit
import os
import platform

import py4j

import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SQLContext, HiveContext
from pyspark.storagelevel import StorageLevel

# this is the deprecated equivalent of ADD_JARS
add_files = None
if os.environ.get("ADD_FILES") is not None:
    add_files = os.environ.get("ADD_FILES").split(',')

if os.environ.get("SPARK_EXECUTOR_URI"):
    SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])

sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
atexit.register(lambda: sc.stop())

try:
    # Try to access HiveConf, it will raise exception if Hive is not added
    sc._jvm.org.apache.hadoop.hive.conf.HiveConf()
    sqlCtx = HiveContext(sc)
except py4j.protocol.Py4JError:
    sqlCtx = SQLContext(sc)

print("""Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version %s
      /_/
""" % sc.version)
print("Using Python version %s (%s, %s)" % (
    platform.python_version(),
    platform.python_build()[0],
    platform.python_build()[1]))
print("SparkContext available as sc, %s available as sqlCtx." % sqlCtx.__class__.__name__)

if add_files is not None:
    print("Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead")
    print("Adding files: [%s]" % ", ".join(add_files))

# The ./bin/pyspark script stores the old PYTHONSTARTUP value in OLD_PYTHONSTARTUP,
# which allows us to execute the user's PYTHONSTARTUP file:
_pythonstartup = os.environ.get('OLD_PYTHONSTARTUP')
if _pythonstartup and os.path.isfile(_pythonstartup):
    execfile(_pythonstartup)
Add Apache license headers and LICENSE and NOTICE files 2013-07-16 20:21:33 -04:00			`#`
			`# Licensed to the Apache Software Foundation (ASF) under one or more`
			`# contributor license agreements. See the NOTICE file distributed with`
			`# this work for additional information regarding copyright ownership.`
			`# The ASF licenses this file to You under the Apache License, Version 2.0`
			`# (the "License"); you may not use this file except in compliance with`
			`# the License. You may obtain a copy of the License at`
			`#`
			`# http://www.apache.org/licenses/LICENSE-2.0`
			`#`
			`# Unless required by applicable law or agreed to in writing, software`
			`# distributed under the License is distributed on an "AS IS" BASIS,`
			`# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.`
			`# See the License for the specific language governing permissions and`
			`# limitations under the License.`
			`#`

Add PySpark README and run scripts. 2012-10-19 20:16:41 -04:00			`"""`
			`An interactive shell.`

Added accumulators to PySpark 2013-01-20 04:57:44 -05:00			`This file is designed to be launched as a PYTHONSTARTUP script.`
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00			`"""`
[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00
			`import sys`
Fixed broken pyspark shell. Author: Reynold Xin <rxin@apache.org> Closes #444 from rxin/pyspark and squashes the following commits: fc11356 [Reynold Xin] Made the PySpark shell version checking compatible with Python 2.6. 571830b [Reynold Xin] Fixed broken pyspark shell. 2014-04-18 13:10:13 -04:00			`if sys.version_info[0] != 2:`
[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00			`print("Error: Default Python used is Python%s" % sys.version_info.major)`
			`print("\tSet env variable PYSPARK_PYTHON to Python2 binary and re-run it.")`
			`sys.exit(1)`


[SPARK-2435] Add shutdown hook to pyspark Author: Matthew Farrellee <matt@redhat.com> Closes #2183 from mattf/SPARK-2435 and squashes the following commits: ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark 2014-09-03 22:37:37 -04:00			`import atexit`
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00			`import os`
Add banner to PySpark and make wordcount output nicer 2013-09-01 01:38:32 -04:00			`import platform`
[SPARK-5872] [SQL] create a sqlCtx in pyspark shell The sqlCtx will be HiveContext if hive is built in assembly jar, or SQLContext if not. It also skip the Hive tests in pyspark.sql.tests if no hive is available. Author: Davies Liu <davies@databricks.com> Closes #4659 from davies/sqlctx and squashes the following commits: 0e6629a [Davies Liu] sqlCtx in pyspark 2015-02-17 18:44:37 -05:00
			`import py4j`

Make module help available in python shell. Also, adds a line in doc explaining how to use. 2013-01-30 18:04:06 -05:00			`import pyspark`
Remove reflection, hard-code StorageLevels The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python 2013-09-07 12:28:39 -04:00			`from pyspark.context import SparkContext`
[SPARK-5872] [SQL] create a sqlCtx in pyspark shell The sqlCtx will be HiveContext if hive is built in assembly jar, or SQLContext if not. It also skip the Hive tests in pyspark.sql.tests if no hive is available. Author: Davies Liu <davies@databricks.com> Closes #4659 from davies/sqlctx and squashes the following commits: 0e6629a [Davies Liu] sqlCtx in pyspark 2015-02-17 18:44:37 -05:00			`from pyspark.sql import SQLContext, HiveContext`
Export StorageLevel and refactor 2013-09-07 17:41:31 -04:00			`from pyspark.storagelevel import StorageLevel`
Add PySpark README and run scripts. 2012-10-19 20:16:41 -04:00
[SPARK-3340] Deprecate ADD_JARS and ADD_FILES I created a patch that disables the environment variables. Thereby scala or python shell log a warning message to notify user about the deprecation with the following message: scala: "ADD_JARS environment variable is deprecated, use --jar spark submit argument instead" python: "Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead" Is it what is expected or the code associated with the variables should be just completely removed? Should it be somewhere documented? Author: azagrebin <azagrebin@gmail.com> Closes #4616 from azagrebin/master and squashes the following commits: bab1aa9 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: minor readability issue 0643895 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: add warning messages 42f0107 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES 2015-02-16 21:06:19 -05:00			`# this is the deprecated equivalent of ADD_JARS`
			`add_files = None`
			`if os.environ.get("ADD_FILES") is not None:`
			`add_files = os.environ.get("ADD_FILES").split(',')`
Add PySpark README and run scripts. 2012-10-19 20:16:41 -04:00
Set spark.executor.uri from environment variable (needed by Mesos) The Mesos backend uses this property when setting up a slave process. It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find any analogous for pyspark. Author: Ivan Wick <ivanwick+github@gmail.com> This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com> Closes #311 from ivanwick/master and squashes the following commits: da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos) 2014-04-10 20:49:30 -04:00			`if os.environ.get("SPARK_EXECUTOR_URI"):`
			`SparkContext.setSystemProperty("spark.executor.uri", os.environ["SPARK_EXECUTOR_URI"])`

[SPARK-1808] Route bin/pyspark through Spark submit Problem. For `bin/pyspark`, there is currently no other way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, it needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`. Solution. Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user facing Spark scripts consistent. Details. `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications. For cases in which `bin/pyspark` is given a python file, we can simply call pass the file directly to Spark submit and let it handle the rest. For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case. This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too. Author: Andrew Or <andrewor14@gmail.com> Closes #799 from andrewor14/pyspark-submit and squashes the following commits: bf37e36 [Andrew Or] Minor changes 01066fa [Andrew Or] bin/pyspark for Windows c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes) 1866f85 [Andrew Or] Windows is not cooperating 456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set 7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit b7ba0d8 [Andrew Or] Address a few comments (minor) 06eb138 [Andrew Or] Use shlex instead of writing our own parser 05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly 6fba412 [Andrew Or] Deal with quotes + address various comments fe4c8a7 [Andrew Or] Update --help for bin/pyspark afe47bf [Andrew Or] Fix spark shell f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit a371d26 [Andrew Or] Route bin/pyspark through Spark submit 2014-05-17 01:34:38 -04:00			`sc = SparkContext(appName="PySparkShell", pyFiles=add_files)`
[SPARK-2435] Add shutdown hook to pyspark Author: Matthew Farrellee <matt@redhat.com> Closes #2183 from mattf/SPARK-2435 and squashes the following commits: ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark 2014-09-03 22:37:37 -04:00			`atexit.register(lambda: sc.stop())`
Add banner to PySpark and make wordcount output nicer 2013-09-01 01:38:32 -04:00
[SPARK-5872] [SQL] create a sqlCtx in pyspark shell The sqlCtx will be HiveContext if hive is built in assembly jar, or SQLContext if not. It also skip the Hive tests in pyspark.sql.tests if no hive is available. Author: Davies Liu <davies@databricks.com> Closes #4659 from davies/sqlctx and squashes the following commits: 0e6629a [Davies Liu] sqlCtx in pyspark 2015-02-17 18:44:37 -05:00			`try:`
			`# Try to access HiveConf, it will raise exception if Hive is not added`
			`sc._jvm.org.apache.hadoop.hive.conf.HiveConf()`
			`sqlCtx = HiveContext(sc)`
			`except py4j.protocol.Py4JError:`
			`sqlCtx = SQLContext(sc)`

[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00			`print("""Welcome to`
Add banner to PySpark and make wordcount output nicer 2013-09-01 01:38:32 -04:00			`____ __`
			`/ __/__ ___ _____/ /__`
			_\ \/ _ \/ _ `/ __/ '_/
[SPARK-3273][SPARK-3301]We should read the version information from the same place Author: GuoQiang Li <witgo@qq.com> Closes #2175 from witgo/SPARK-3273 and squashes the following commits: cf9c65a [GuoQiang Li] We should read the version information from the same place 2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct 2014-09-06 18:08:43 -04:00			`/__ / .__/\_,_/_/ /_/\_\ version %s`
Add banner to PySpark and make wordcount output nicer 2013-09-01 01:38:32 -04:00			`/_/`
[SPARK-3273][SPARK-3301]We should read the version information from the same place Author: GuoQiang Li <witgo@qq.com> Closes #2175 from witgo/SPARK-3273 and squashes the following commits: cf9c65a [GuoQiang Li] We should read the version information from the same place 2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct 2014-09-06 18:08:43 -04:00			`""" % sc.version)`
[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00			`print("Using Python version %s (%s, %s)" % (`
Add banner to PySpark and make wordcount output nicer 2013-09-01 01:38:32 -04:00			`platform.python_version(),`
			`platform.python_build()[0],`
[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00			`platform.python_build()[1]))`
[SPARK-5872] [SQL] create a sqlCtx in pyspark shell The sqlCtx will be HiveContext if hive is built in assembly jar, or SQLContext if not. It also skip the Hive tests in pyspark.sql.tests if no hive is available. Author: Davies Liu <davies@databricks.com> Closes #4659 from davies/sqlctx and squashes the following commits: 0e6629a [Davies Liu] sqlCtx in pyspark 2015-02-17 18:44:37 -05:00			`print("SparkContext available as sc, %s available as sqlCtx." % sqlCtx.__class__.__name__)`
Add PySpark README and run scripts. 2012-10-19 20:16:41 -04:00
follow pep8 None should be compared using is or is not http://legacy.python.org/dev/peps/pep-0008/ ## Programming Recommendations - Comparisons to singletons like None should always be done with is or is not, never the equality operators. Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local> Closes #1422 from giwa/apache_master and squashes the following commits: 7b361f3 [Ken Takagiwa] follow pep8 None should be checked using is or is not 2014-07-16 00:34:05 -04:00			`if add_files is not None:`
[SPARK-3340] Deprecate ADD_JARS and ADD_FILES I created a patch that disables the environment variables. Thereby scala or python shell log a warning message to notify user about the deprecation with the following message: scala: "ADD_JARS environment variable is deprecated, use --jar spark submit argument instead" python: "Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead" Is it what is expected or the code associated with the variables should be just completely removed? Should it be somewhere documented? Author: azagrebin <azagrebin@gmail.com> Closes #4616 from azagrebin/master and squashes the following commits: bab1aa9 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: minor readability issue 0643895 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES: add warning messages 42f0107 [azagrebin] [SPARK-3340] Deprecate ADD_JARS and ADD_FILES 2015-02-16 21:06:19 -05:00			`print("Warning: ADD_FILES environment variable is deprecated, use --py-files argument instead")`
[python alternative] pyspark require Python2, failing if system default is Py3 from shell.py Python alternative for https://github.com/apache/spark/pull/392; managed from shell.py Author: AbhishekKr <abhikumar163@gmail.com> Closes #399 from abhishekkr/pyspark_shell and squashes the following commits: 134bdc9 [AbhishekKr] pyspark require Python2, failing if system default is Py3 from shell.py 2014-04-16 22:05:40 -04:00			`print("Adding files: [%s]" % ", ".join(add_files))`
Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark Now ADD_FILES uses a comma as file name separator. 2013-08-12 21:00:35 -04:00
pyspark -> bin/pyspark 2014-01-02 08:20:12 -05:00			`# The ./bin/pyspark script stores the old PYTHONSTARTUP value in OLD_PYTHONSTARTUP,`
Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide. 2013-01-02 00:25:49 -05:00			`# which allows us to execute the user's PYTHONSTARTUP file:`
			`_pythonstartup = os.environ.get('OLD_PYTHONSTARTUP')`
			`if _pythonstartup and os.path.isfile(_pythonstartup):`
Added accumulators to PySpark 2013-01-20 04:57:44 -05:00			`execfile(_pythonstartup)`