..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements. See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership. The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License. You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied. See the License for the
    specific language governing permissions and limitations
    under the License.

============
Installation
============

PySpark is included in the official releases of Spark available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
For Python users, PySpark also provides ``pip`` installation from PyPI. This is usually for local usage or for use as
a client to connect to a cluster, instead of setting up a cluster itself.

This page includes instructions for installing PySpark by using pip, Conda, downloading manually,
and building from source.

Python Version Supported
------------------------

Python 3.6 and above.

Using PyPI
----------

PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_ is as follows:

.. code-block:: bash

    pip install pyspark
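
If the installation succeeded, a quick sanity check is to import PySpark and print its version. This is only a sketch and assumes ``python`` refers to the same interpreter that ``pip`` installed into:

.. code-block:: bash

    python -c "import pyspark; print(pyspark.__version__)"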

If you want to install extra dependencies for a specific component, you can install it as below:

.. code-block:: bash

    pip install pyspark[sql]
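
Extras can also be combined in a single command. The following is only an illustration and assumes your PySpark release defines both the ``sql`` and ``ml`` extras:

.. code-block:: bash

    pip install "pyspark[sql,ml]"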

For PySpark with/without a specific Hadoop version, you can install it by using the ``PYSPARK_HADOOP_VERSION`` environment variable as below:

.. code-block:: bash

    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically
downloads a different version and uses it in PySpark. Downloading it can take a while depending on
the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manually choose the mirror for faster downloading.

.. code-block:: bash

    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install pyspark

It is recommended to use the ``-v`` option with ``pip`` to track the installation and download status.

.. code-block:: bash

    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v

Supported values in ``PYSPARK_HADOOP_VERSION`` are:

- ``without``: Spark pre-built with user-provided Apache Hadoop (see the sketch after this list)
- ``2.7``: Spark pre-built for Apache Hadoop 2.7
- ``3.2``: Spark pre-built for Apache Hadoop 3.2 and later (default)
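
If you choose ``without`` (the "Hadoop free" variant), Spark must be pointed at your own Hadoop installation at runtime. One common way to do this, sketched here under the assumption that the ``hadoop`` command is on your ``PATH``, is to export ``SPARK_DIST_CLASSPATH``:

.. code-block:: bash

    # Tell the Hadoop-free Spark build where to find Hadoop's jars (illustrative sketch).
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)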

Note that this way of installing PySpark with or without a specific Hadoop version is experimental. It can change or be removed between minor releases.

Using Conda
-----------

Conda is an open-source package management and environment management system which is a part of
the `Anaconda <https://docs.continuum.io/anaconda/>`_ distribution. It is both cross-platform and
language agnostic. In practice, Conda can replace both `pip <https://pip.pypa.io/en/latest/>`_ and
`virtualenv <https://virtualenv.pypa.io/en/latest/>`_.

Create a new virtual environment from your terminal as shown below:

.. code-block:: bash

    conda create -n pyspark_env
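
You can also pin the Python interpreter when creating the environment; the version below is only an example within the supported range:

.. code-block:: bash

    conda create -n pyspark_env python=3.8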

After the virtual environment is created, it should be visible under the list of Conda environments,
which can be seen using the following command:

.. code-block:: bash

    conda env list

Now activate the newly created environment with the following command:

.. code-block:: bash

    conda activate pyspark_env

You can install PySpark in the newly created environment by following the `Using PyPI <#using-pypi>`_
instructions above, for example as below. This installs PySpark under the new virtual environment
``pyspark_env`` created above.

.. code-block:: bash

    pip install pyspark

Alternatively, you can install PySpark from Conda itself as below:

.. code-block:: bash

    conda install pyspark

However, note that `PySpark at Conda <https://anaconda.org/conda-forge/pyspark>`_ is not necessarily
synced with the PySpark release cycle because it is maintained separately by the community.
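
If the ``pyspark`` package is not found in your configured channels, you can request the ``conda-forge`` channel explicitly; the command below is just an example invocation:

.. code-block:: bash

    conda install -c conda-forge pyspark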

Manually Downloading
--------------------

PySpark is included in the distributions available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
You can download the distribution you want from the site. After that, uncompress the tar file into the directory where you want
to install Spark, for example, as below:

.. code-block:: bash

    tar xzvf spark-3.0.0-bin-hadoop2.7.tgz

Ensure the ``SPARK_HOME`` environment variable points to the directory where the tar file has been extracted.
Update the ``PYTHONPATH`` environment variable so that it can find the PySpark and Py4J libraries under ``SPARK_HOME/python/lib``.
One example of doing this is shown below:

.. code-block:: bash

    cd spark-3.0.0-bin-hadoop2.7
    export SPARK_HOME=`pwd`
    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
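
As a quick sanity check, sketched here under the assumption that the exports above were run in the current shell, you can confirm that Python picks PySpark up from the extracted distribution:

.. code-block:: bash

    # Import PySpark from the zips added to PYTHONPATH and print its version.
    python -c "import pyspark; print(pyspark.__version__)"

    # Or launch the interactive PySpark shell bundled with the distribution.
    "$SPARK_HOME"/bin/pyspark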

Installing from Source
----------------------

To install PySpark from source, refer to |building_spark|_.

Dependencies
------------

============= ========================= ======================================
Package       Minimum supported version Note
============= ========================= ======================================
`pandas`      0.23.2                    Optional for Spark SQL
`NumPy`       1.7                       Required for MLlib DataFrame-based API
`pyarrow`     1.0.0                     Optional for Spark SQL
`Py4J`        0.10.9.2                  Required
`pandas`      0.23.2                    Required for pandas API on Spark
`pyarrow`     1.0.0                     Required for pandas API on Spark
`NumPy`       1.14 (<1.20.0)            Required for pandas API on Spark
============= ========================= ======================================
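
If you want to install the optional dependencies up front, one possible command, shown only as an illustration of the version bounds in the table above, is:

.. code-block:: bash

    pip install "pandas>=0.23.2" "pyarrow>=1.0.0" "numpy>=1.14,<1.20.0"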

Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
If using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features and refer
to |downloading|_.
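
How you pass this JVM flag depends on your deployment; one possible way, sketched below, is to set it through the driver and executor extra Java options when launching PySpark:

.. code-block:: bash

    pyspark \
        --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
        --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true"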

Note for AArch64 (ARM64) users: PyArrow is required by PySpark SQL, but PyArrow support for AArch64
was introduced in PyArrow 4.0.0. If PySpark installation fails on AArch64 due to PyArrow
installation errors, you can install PyArrow >= 4.0.0 as below:

.. code-block:: bash

    pip install "pyarrow>=4.0.0" --prefer-binary