============
Installation
============

Koalas requires PySpark, so please make sure your PySpark installation is available.
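
A quick way to check that PySpark is available is to print its version from a terminal
(a simple sanity check, assuming the ``python`` on your ``PATH`` is the interpreter you will use)::

    python -c "import pyspark; print(pyspark.__version__)"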

To install Koalas, you can use:

- `Conda <https://anaconda.org/conda-forge/koalas>`__
- `PyPI <https://pypi.org/project/koalas>`__
- `Installation from source <../development/ps_contributing.rst#environment-setup>`__

To install PySpark, you can use:

- `Installation with the official release channel <https://spark.apache.org/downloads.html>`__
- `Conda <https://anaconda.org/conda-forge/pyspark>`__
- `PyPI <https://pypi.org/project/pyspark>`__
- `Installation from source <https://github.com/apache/spark#building-spark>`__

Python version support
----------------------

Officially Python 3.5 to 3.8.

.. note::
    Koalas support for Python 3.5 is deprecated and will be dropped in a future release.
    At that point, existing Python 3.5 workflows that use Koalas will continue to work without
    modification, but Python 3.5 users will no longer get access to the latest Koalas features
    and bugfixes. We recommend that you upgrade to Python 3.6 or newer.
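
You can check which Python version your environment currently uses from a terminal::

    python --version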

Installing Koalas
-----------------

Installing with Conda
~~~~~~~~~~~~~~~~~~~~~

First you will need `Conda <http://conda.pydata.org/docs/>`__ to be installed.
After that, you should create a new conda environment. A conda environment is similar to a
virtualenv in that it allows you to specify a specific version of Python and set of libraries.
Run the following command from a terminal window::

    conda create --name koalas-dev-env

This will create a minimal environment with only Python installed in it.
To put yourself inside this environment, run::

    conda activate koalas-dev-env

The final step required is to install Koalas. This can be done with the
following command::

    conda install -c conda-forge koalas

To install a specific Koalas version::

    conda install -c conda-forge koalas=1.3.0
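
To verify the installation, a simple import check like the following should work
(a sketch, assuming the environment above is active and the install succeeded)::

    python -c "import databricks.koalas as ks; print(ks.__version__)"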

Installing from PyPI
~~~~~~~~~~~~~~~~~~~~

Koalas can be installed via pip from
`PyPI <https://pypi.org/project/koalas>`__::

    pip install koalas
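
As with conda, pip can pin a specific version; for example, to install the 1.3.0 release::

    pip install koalas==1.3.0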

Installing from source
~~~~~~~~~~~~~~~~~~~~~~

See the `Contribution Guide <../development/ps_contributing.rst#environment-setup>`__ for complete instructions.

Installing PySpark
------------------

Installing with the official release channel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can install PySpark by downloading a release from `the official release channel <https://spark.apache.org/downloads.html>`__.
Once you download the release, untar it first as below::

    tar xzvf spark-2.4.4-bin-hadoop2.7.tgz

After that, make sure to set the ``SPARK_HOME`` environment variable to point to the directory you untarred::

    cd spark-2.4.4-bin-hadoop2.7
    export SPARK_HOME=`pwd`

Also, make sure your ``PYTHONPATH`` can find the PySpark and Py4J libraries under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
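
To confirm that Python picks up the PySpark under ``$SPARK_HOME`` rather than another
installation, you can print the module location (a quick check, assuming the exports above
are in effect in the current shell)::

    python -c "import pyspark; print(pyspark.__file__)"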

Installing with Conda
~~~~~~~~~~~~~~~~~~~~~

PySpark can be installed via `Conda <https://anaconda.org/conda-forge/pyspark>`__::

    conda install -c conda-forge pyspark

Installing with PyPI
~~~~~~~~~~~~~~~~~~~~

PySpark can be installed via pip from `PyPI <https://pypi.org/project/pyspark>`__::

    pip install pyspark
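
If you need a particular Spark version, pip can pin it as well; for example, to match the
2.4.4 release used above::

    pip install pyspark==2.4.4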

Installing from source
~~~~~~~~~~~~~~~~~~~~~~

To install PySpark from source, refer to `Building Spark <https://github.com/apache/spark#building-spark>`__.

Likewise, make sure you set the ``SPARK_HOME`` environment variable to the git-cloned directory, and that your
``PYTHONPATH`` environment variable can find the PySpark and Py4J libraries under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH

Dependencies
------------

============= ================
Package       Required version
============= ================
`pandas`      >=0.23.2
`pyspark`     >=2.4.0
`pyarrow`     >=0.10
`numpy`       >=1.14
============= ================
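
Installing Koalas with conda or pip normally pulls these in for you. If you manage them
yourself with pip, the minimum versions above can be expressed directly (a convenience
sketch)::

    pip install "pandas>=0.23.2" "pyspark>=2.4.0" "pyarrow>=0.10" "numpy>=1.14"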

Optional dependencies
~~~~~~~~~~~~~~~~~~~~~

============= ================
Package       Required version
============= ================
`mlflow`      >=1.0
`plotly`      >=4.8
`matplotlib`  >=3.0.0,<3.3.0
============= ================
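
These are not installed with Koalas by default. To add them with pip, the same pinning
style applies (a sketch; install only the ones you need)::

    pip install "mlflow>=1.0" "plotly>=4.8" "matplotlib>=3.0.0,<3.3.0"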