[SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section

### What changes were proposed in this pull request?

This PR revises the installation page to describe `pip install pyspark[pandas_on_spark]`, and removes the separate pandas-on-Spark installation and videos/blog posts pages.

### Why are the changes needed?

The pandas-on-Spark installation instructions are merged into the PySpark installation page. As for the videos/blog posts, the project is now named pandas API on Spark, so the old Koalas blog posts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

No for end users, because the docs have not been released yet.

### How was this patch tested?

I manually built the docs and checked the output.

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
4 changed files with 5 additions and 283 deletions


@@ -25,18 +25,12 @@ There are more guides shared with other languages such as
`Quick Start <https://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.
.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on Spark.
.. toctree::
    :maxdepth: 2

    install
    quickstart

For pandas API on Spark:

.. toctree::
    :maxdepth: 2

    ps_install
    ps_10mins
    ps_videos_blogs


@@ -46,7 +46,10 @@ If you want to install extra dependencies for a specific component, you can inst
.. code-block:: bash

    # Spark SQL
    pip install pyspark[sql]

    # pandas API on Spark
    pip install pyspark[pandas_on_spark]
For PySpark with/without a specific Hadoop version, you can install it by using the ``PYSPARK_HADOOP_VERSION`` environment variable as below:
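A minimal sketch of such an invocation (the ``2.7`` value here is an illustrative assumption; check the release documentation for the values it actually supports):

.. code-block:: bash

    # Install PySpark built against a specific Hadoop version.
    # "2.7" is an illustrative value, not necessarily supported by every release.
    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark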


@@ -1,145 +0,0 @@
============
Installation
============
Pandas API on Spark requires PySpark, so please make sure PySpark is available.
To install pandas API on Spark, you can use:
- `Conda <https://anaconda.org/conda-forge/koalas>`__
- `PyPI <https://pypi.org/project/koalas>`__
- `Installation from source <../development/ps_contributing.rst#environment-setup>`__
To install PySpark, you can use:
- `Installation with the official release channel <https://spark.apache.org/downloads.html>`__
- `Conda <https://anaconda.org/conda-forge/pyspark>`__
- `PyPI <https://pypi.org/project/pyspark>`__
- `Installation from source <https://github.com/apache/spark#building-spark>`__
Python version support
----------------------
Officially Python 3.5 to 3.8.
.. note::
    Python 3.5 support is deprecated and will be dropped in a future release.
    At that point, existing Python 3.5 workflows that use pandas API on Spark will continue to work without
    modification, but Python 3.5 users will no longer get access to the latest pandas-on-Spark features
    and bugfixes. We recommend that you upgrade to Python 3.6 or newer.
Installing pandas API on Spark
-------------------------------
Installing with Conda
~~~~~~~~~~~~~~~~~~~~~~
First you will need `Conda <http://conda.pydata.org/docs/>`__ to be installed.
After that, we should create a new conda environment. A conda environment is similar to a
virtualenv in that it allows you to specify a specific version of Python and a set of libraries.
Run the following commands from a terminal window::

    conda create --name koalas-dev-env

This will create a minimal environment with only Python installed in it.
To put yourself inside this environment, run::

    conda activate koalas-dev-env

The final step required is to install pandas API on Spark. This can be done with the
following command::

    conda install -c conda-forge koalas

To install a specific version of pandas API on Spark::

    conda install -c conda-forge koalas=1.3.0
Installing from PyPI
~~~~~~~~~~~~~~~~~~~~
Pandas API on Spark can be installed via pip from
`PyPI <https://pypi.org/project/koalas>`__::

    pip install koalas
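Either way, you can sanity-check the result from a terminal (a minimal sketch, assuming the ``databricks.koalas`` import path that Koalas releases ship under)::

    python -c "import databricks.koalas as ks; print(ks.__version__)"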
Installing from source
~~~~~~~~~~~~~~~~~~~~~~
See the `Contribution Guide <../development/ps_contributing.rst#environment-setup>`__ for complete instructions.
Installing PySpark
------------------
Installing with the official release channel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
You can install PySpark by downloading a release in `the official release channel <https://spark.apache.org/downloads.html>`__.
Once you download the release, un-tar it first as below::

    tar xzvf spark-2.4.4-bin-hadoop2.7.tgz

After that, make sure to set the ``SPARK_HOME`` environment variable to the directory you untar-ed::

    cd spark-2.4.4-bin-hadoop2.7
    export SPARK_HOME=`pwd`

Also, make sure your ``PYTHONPATH`` can find PySpark and Py4J under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
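For clarity, the one-liner above simply joins every ``.zip`` under ``$SPARK_HOME/python/lib`` with ``:``. Written out by hand it is roughly equivalent to the following (a sketch; the exact Py4J zip name depends on the release you downloaded)::

    export PYTHONPATH="$SPARK_HOME/python/lib/pyspark.zip:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH"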
Installing with Conda
~~~~~~~~~~~~~~~~~~~~~~
PySpark can be installed via `Conda <https://anaconda.org/conda-forge/pyspark>`__::

    conda install -c conda-forge pyspark
Installing with PyPI
~~~~~~~~~~~~~~~~~~~~~~
PySpark can be installed via pip from `PyPI <https://pypi.org/project/pyspark>`__::

    pip install pyspark
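A quick way to confirm which version was installed (a minimal sketch)::

    python -c "import pyspark; print(pyspark.__version__)"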
Installing from source
~~~~~~~~~~~~~~~~~~~~~~
To install PySpark from source, refer to `Building Spark <https://github.com/apache/spark#building-spark>`__.
Likewise, make sure you set the ``SPARK_HOME`` environment variable to the git-cloned directory, and that your
``PYTHONPATH`` environment variable can find PySpark and Py4J under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
Dependencies
------------
============= ================
Package       Required version
============= ================
`pandas`      >=0.23.2
`pyspark`     >=2.4.0
`pyarrow`     >=0.10
`numpy`       >=1.14
============= ================
Optional dependencies
~~~~~~~~~~~~~~~~~~~~~
============= ================
Package       Required version
============= ================
`mlflow`      >=1.0
`plotly`      >=4.8
`matplotlib`  >=3.0.0,<3.3.0
============= ================
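If you want the optional dependencies pinned to these ranges, one way to install them in a single command (a sketch; the quotes keep the shell from interpreting ``>`` and ``<``)::

    pip install "mlflow>=1.0" "plotly>=4.8" "matplotlib>=3.0.0,<3.3.0"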


@@ -1,130 +0,0 @@
======================
Koalas Talks and Blogs
======================
Blog Posts
----------
- `Interoperability between Koalas and Apache Spark (Aug 11, 2020) <https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html>`_
- `Introducing Koalas 1.0 (Jun 24, 2020) <https://databricks.com/blog/2020/06/24/introducing-koalas-1-0.html>`_
- `10 Minutes from pandas to Koalas on Apache Spark (Mar 31, 2020) <https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html>`_
- `Guest Blog: How Virgin Hyperloop One Reduced Processing Time from Hours to Minutes with Koalas (Aug 22, 2019) <https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html>`_
- `Koalas: Easy Transition from pandas to Apache Spark (Apr 24, 2019) <https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html>`_
Data + AI Summit 2020 EUROPE (Nov 18-19, 2020)
----------------------------------------------
Project Zen: Making Spark Pythonic
==================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/-vJLTEOdLvA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Koalas: Interoperability Between Koalas and Apache Spark
========================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/eI0Wh2Epo0Q" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Spark + AI Summit 2020 (Jun 24, 2020)
-------------------------------------
Introducing Apache Spark 3.0: A retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come.
====================================================================================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/OLJKIogf2nU?start=555" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Koalas: Making an Easy Transition from Pandas to Apache Spark
=============================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/G_-9VbyHcx8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Koalas: Pandas on Apache Spark
==============================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/iUpBSHoqzLM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Webinar @ Databricks (Mar 27, 2020)
-----------------------------------
Reducing Time-To-Insight for Virgin Hyperloop's Data
====================================================
.. raw:: html

    <iframe width="560" height="315" src="https://player.vimeo.com/video/397032070" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>
PyData New York 2019 (Nov 4, 2019)
----------------------------------
Pandas vs Koalas: The Ultimate Showdown
=======================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/xcGEQUURAuk?start=1470" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Spark + AI Summit Europe 2019 (Oct 16, 2019)
--------------------------------------------
New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas
=======================================================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/scM_WQMhB3A?start=1470" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Koalas: Making an Easy Transition from Pandas to Apache Spark
=============================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Wfj2Vuse7as" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Koalas: Pandas on Apache Spark
==============================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/NpAMbzerAp0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
PyBay 2019 (Aug 17, 2019)
-------------------------
Koalas Easy Transition from pandas to Apache Spark
==================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/cMDLoGkidEE" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
Spark + AI Summit 2019 (Apr 24, 2019)
-------------------------------------
Official Announcement of Koalas Open Source Project
===================================================
.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Shzb15DZ9Qg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>