..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..  http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

============
Installation
============

PySpark is included in the official releases of Spark available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
For Python users, PySpark also provides ``pip`` installation from PyPI. This is usually for local usage or as
a client to connect to a cluster instead of setting up a cluster itself.

This page includes instructions for installing PySpark using pip, Conda, downloading manually,
and building from source.

Python Version Supported
------------------------

Python 3.6 and above.

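As a quick sanity check before installing, a guard such as the following (a minimal sketch, not part of PySpark itself) fails fast on interpreters older than the supported minimum:

```python
import sys

# PySpark supports Python 3.6 and above; refuse to continue on older interpreters.
if sys.version_info < (3, 6):
    raise RuntimeError(
        "PySpark requires Python 3.6 or above; found %d.%d"
        % (sys.version_info[0], sys.version_info[1])
    )
print("Python version OK: %d.%d" % (sys.version_info[0], sys.version_info[1]))
```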
Using PyPI
----------

PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_ is as follows:

.. code-block:: bash

    pip install pyspark

If you want to install extra dependencies for a specific component, you can install them as below:

.. code-block:: bash

    pip install pyspark[sql]

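The ``sql`` extra pulls in optional SQL dependencies such as pandas and pyarrow (see the Dependencies section below). A minimal sketch for checking whether those optional packages are importable before relying on them (the helper name here is hypothetical, not part of PySpark):

```python
import importlib.util

def missing_optional(packages):
    """Return the names in `packages` that cannot be imported."""
    return [name for name in packages if importlib.util.find_spec(name) is None]

# Hypothetical usage: warn before using pandas/pyarrow-backed SQL features.
missing = missing_optional(["pandas", "pyarrow"])
if missing:
    print("Missing optional packages: %s. Try: pip install pyspark[sql]"
          % ", ".join(missing))
```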
Using Conda
-----------

Conda is an open-source package management and environment management system which is a part of
the `Anaconda <https://docs.continuum.io/anaconda/>`_ distribution. It is both cross-platform and
language agnostic. In practice, Conda can replace both `pip <https://pip.pypa.io/en/latest/>`_ and
`virtualenv <https://virtualenv.pypa.io/en/latest/>`_.

Create a new virtual environment from your terminal as shown below:

.. code-block:: bash

    conda create -n pyspark_env

After the virtual environment is created, it should be visible in the list of Conda environments,
which can be seen using the following command:

.. code-block:: bash

    conda env list

Now activate the newly created environment with the following command:

.. code-block:: bash

    conda activate pyspark_env

You can install PySpark in the newly created environment as described in `Using PyPI <#using-pypi>`_,
for example as below. It will install PySpark under the new virtual environment
``pyspark_env`` created above.

.. code-block:: bash

    pip install pyspark

Alternatively, you can install PySpark from Conda itself as below:

.. code-block:: bash

    conda install pyspark

However, note that `PySpark at Conda <https://anaconda.org/conda-forge/pyspark>`_ is not necessarily
synced with the PySpark release cycle because it is maintained by the community separately.

Manually Downloading
--------------------

PySpark is included in the distributions available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
You can download the distribution you want from the site. After that, uncompress the tar file into the directory where you want
to install Spark, for example, as below:

.. code-block:: bash

    tar xzvf spark-3.0.0-bin-hadoop2.7.tgz

Ensure the ``SPARK_HOME`` environment variable points to the directory where the tar file has been extracted.
Update the ``PYTHONPATH`` environment variable such that it can find the PySpark and Py4J under ``SPARK_HOME/python/lib``.
One example of doing this is shown below:

.. code-block:: bash

    cd spark-3.0.0-bin-hadoop2.7
    export SPARK_HOME=`pwd`
    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH

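The shell export above collects every zip archive under ``SPARK_HOME/python/lib`` and joins them with ``:``. The same path construction can be sketched in Python (the helper name is hypothetical, for illustration only):

```python
import glob
import os

def build_pythonpath(spark_home, existing=""):
    """Join the zip archives under SPARK_HOME/python/lib with ':',
    prepending them to any existing PYTHONPATH value."""
    zips = sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))
    return ":".join(zips + ([existing] if existing else []))

# Hypothetical usage, mirroring the shell snippet above:
# os.environ["PYTHONPATH"] = build_pythonpath(os.environ["SPARK_HOME"],
#                                             os.environ.get("PYTHONPATH", ""))
```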
Installing from Source
----------------------

To install PySpark from source, refer to |building_spark|_.

Dependencies
------------

============= ========================= ================
Package       Minimum supported version Note
============= ========================= ================
`pandas`      0.23.2                    Optional for SQL
`NumPy`       1.7                       Required for ML
`pyarrow`     0.15.1                    Optional for SQL
`Py4J`        0.10.9                    Required
============= ========================= ================

Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
If using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features and refer
to |downloading|_.
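A minimal sketch for comparing an installed version string against the minimums in the table above (a plain numeric comparison; pre-release suffixes such as ``rc1`` are not handled):

```python
def meets_minimum(installed, minimum):
    """Compare dotted version strings numerically,
    e.g. meets_minimum('0.25.3', '0.23.2') for pandas."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(installed) >= as_tuple(minimum)

# Hypothetical usage against the minimums listed above:
print(meets_minimum("0.25.3", "0.23.2"))  # pandas new enough: True
print(meets_minimum("0.14.0", "0.15.1"))  # pyarrow too old: False
```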