13fd272cd3
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `R`
* `common`
* `dev`
* `mllib`
* `external`
* `project`
* `streaming`
* `resource-managers`
* `python`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc.
### How was this patch tested?
No testing was performed.
Closes #30402 from jsoref/spelling-R_common_dev_mlib_external_project_streaming_resource-managers_python.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements.  See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership.  The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied.  See the License for the
   specific language governing permissions and limitations
   under the License.

============
Installation
============

PySpark is included in the official releases of Spark available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
For Python users, PySpark also provides ``pip`` installation from PyPI. This is usually for local usage or as
a client to connect to a cluster instead of setting up a cluster itself.

This page includes instructions for installing PySpark by using pip, Conda, downloading manually,
and building from source.

Python Version Supported
------------------------

Python 3.6 and above.
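
To confirm that your interpreter meets this requirement, you can check its version, for example:

.. code-block:: bash

    python --version
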
Using PyPI
----------

PySpark installation using `PyPI <https://pypi.org/project/pyspark/>`_ is as follows:

.. code-block:: bash

    pip install pyspark

If you want to install extra dependencies for a specific component, you can install them as below:

.. code-block:: bash

    pip install pyspark[sql]

For PySpark with or without a specific Hadoop version, you can install it by using the ``HADOOP_VERSION`` environment variable as below:

.. code-block:: bash

    HADOOP_VERSION=2.7 pip install pyspark

The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify a different version of Hadoop, the pip installation automatically
downloads that version and uses it in PySpark. Downloading it can take a while depending on
the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manually choose the mirror for faster downloading.

.. code-block:: bash

    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install pyspark

It is recommended to use the ``-v`` option in ``pip`` to track the installation and download status.

.. code-block:: bash

    HADOOP_VERSION=2.7 pip install pyspark -v

Supported values in ``HADOOP_VERSION`` are:

- ``without``: Spark pre-built with user-provided Apache Hadoop
- ``2.7``: Spark pre-built for Apache Hadoop 2.7
- ``3.2``: Spark pre-built for Apache Hadoop 3.2 and later (default)
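
For example, to install the variant pre-built with user-provided Hadoop, pass the ``without`` value listed above:

.. code-block:: bash

    HADOOP_VERSION=without pip install pyspark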

Note that this way of installing PySpark with/without a specific Hadoop version is experimental. It can change or be removed between minor releases.

Using Conda
-----------

Conda is an open-source package management and environment management system which is a part of
the `Anaconda <https://docs.continuum.io/anaconda/>`_ distribution. It is both cross-platform and
language agnostic. In practice, Conda can replace both `pip <https://pip.pypa.io/en/latest/>`_ and
`virtualenv <https://virtualenv.pypa.io/en/latest/>`_.

Create a new virtual environment from your terminal as shown below:

.. code-block:: bash

    conda create -n pyspark_env

After the virtual environment is created, it should be visible in the list of Conda environments,
which can be seen using the following command:

.. code-block:: bash

    conda env list

Now activate the newly created environment with the following command:

.. code-block:: bash

    conda activate pyspark_env

You can now install PySpark by `Using PyPI <#using-pypi>`_ in the newly created
environment, for example as below. It will install PySpark under the virtual environment
``pyspark_env`` created above.

.. code-block:: bash

    pip install pyspark

Alternatively, you can install PySpark from Conda itself as below:

.. code-block:: bash

    conda install pyspark

However, note that `PySpark at Conda <https://anaconda.org/conda-forge/pyspark>`_ is not necessarily
synced with the PySpark release cycle because it is maintained by the community separately.
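
If the package cannot be found in your default channels, it may need to be installed from the
``conda-forge`` channel, which hosts the package linked above (a suggestion, not an official step):

.. code-block:: bash

    conda install -c conda-forge pyspark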

Manually Downloading
--------------------

PySpark is included in the distributions available at the `Apache Spark website <https://spark.apache.org/downloads.html>`_.
You can download the distribution you want from the site. After that, uncompress the tar file into the directory where you want
to install Spark, for example, as below:

.. code-block:: bash

    tar xzvf spark-3.0.0-bin-hadoop2.7.tgz

Ensure the ``SPARK_HOME`` environment variable points to the directory where the tar file has been extracted.
Update the ``PYTHONPATH`` environment variable such that it can find the PySpark and Py4J libraries under ``SPARK_HOME/python/lib``.
One example of doing this is shown below:

.. code-block:: bash

    cd spark-3.0.0-bin-hadoop2.7
    export SPARK_HOME=`pwd`
    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
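
As a quick sanity check (not an official step), you can then verify that the interpreter picks up PySpark:

.. code-block:: bash

    python -c "import pyspark; print(pyspark.__version__)"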

Installing from Source
----------------------

To install PySpark from source, refer to |building_spark|_.


Dependencies
------------
============= ========================= ================
Package       Minimum supported version Note
============= ========================= ================
`pandas`      0.23.2                    Optional for SQL
`NumPy`       1.7                       Required for ML
`pyarrow`     1.0.0                     Optional for SQL
`Py4J`        0.10.9                    Required
============= ========================= ================

Note that PySpark requires Java 8 or later with ``JAVA_HOME`` properly set.
If using JDK 11, set ``-Dio.netty.tryReflectionSetAccessible=true`` for Arrow related features and refer
to |downloading|_.
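
As a sketch of one possible way to do this (not prescribed by this page), the JDK 11 flag can be passed
through the driver's JVM options when submitting an application; ``app.py`` below is a placeholder:

.. code-block:: bash

    # Passes the Netty flag to the driver JVM; app.py is a hypothetical application.
    spark-submit --driver-java-options "-Dio.netty.tryReflectionSetAccessible=true" app.py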