[SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section
### What changes were proposed in this pull request?

This PR revises the installation page to describe `pip install pyspark[pandas_on_spark]`, and removes the separate pandas-on-Spark installation and videos/blog posts pages.

### Why are the changes needed?

The pandas-on-Spark installation guide is merged into the PySpark installation page. As for the videos/blog posts, the project is now named pandas API on Spark, so the old Koalas blog posts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

No for end users, because these docs have not been released yet.

### How was this patch tested?

I manually built the docs and checked the output.

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This commit is contained in:
parent ce53b7199d
commit 27046582e4
@@ -25,18 +25,12 @@ There are more guides shared with other languages such as
`Quick Start <https://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.

.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on Spark.

.. toctree::
   :maxdepth: 2

   install
   quickstart

For pandas API on Spark:

.. toctree::
   :maxdepth: 2

   ps_install
   ps_10mins
   ps_videos_blogs

@@ -46,7 +46,10 @@ If you want to install extra dependencies for a specific component, you can install it as below:
.. code-block:: bash

   # Spark SQL
   pip install pyspark[sql]
   # pandas API on Spark
   pip install pyspark[pandas_on_spark]

For PySpark with/without a specific Hadoop version, you can install it by using the ``PYSPARK_HADOOP_VERSION`` environment variable as below:
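The command that the sentence above refers to falls outside this hunk. As a minimal sketch, assuming values such as ``3.2`` or ``without`` are supported by the release being installed, it can look like:

.. code-block:: bash

   # Illustrative only: select the Hadoop profile PySpark is built against.
   PYSPARK_HADOOP_VERSION=3.2 pip install pyspark

   # Illustrative only: install PySpark without a bundled Hadoop.
   PYSPARK_HADOOP_VERSION=without pip install pyspark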
@@ -1,145 +0,0 @@
============
Installation
============

Pandas API on Spark requires PySpark so please make sure your PySpark is available.

To install pandas API on Spark, you can use:

- `Conda <https://anaconda.org/conda-forge/koalas>`__
- `PyPI <https://pypi.org/project/koalas>`__
- `Installation from source <../development/ps_contributing.rst#environment-setup>`__

To install PySpark, you can use:

- `Installation with the official release channel <https://spark.apache.org/downloads.html>`__
- `Conda <https://anaconda.org/conda-forge/pyspark>`__
- `PyPI <https://pypi.org/project/pyspark>`__
- `Installation from source <https://github.com/apache/spark#building-spark>`__


Python version support
----------------------

Officially Python 3.5 to 3.8.

.. note::
    Python 3.5 support is deprecated and will be dropped in a future release.
    At that point, existing Python 3.5 workflows that use pandas API on Spark will continue to work without
    modification, but Python 3.5 users will no longer get access to the latest pandas-on-Spark features
    and bugfixes. We recommend that you upgrade to Python 3.6 or newer.

Installing pandas API on Spark
------------------------------

Installing with Conda
~~~~~~~~~~~~~~~~~~~~~

First you will need `Conda <http://conda.pydata.org/docs/>`__ to be installed.
After that, we should create a new conda environment. A conda environment is similar to a
virtualenv that allows you to specify a specific version of Python and set of libraries.
Run the following commands from a terminal window::

    conda create --name koalas-dev-env

This will create a minimal environment with only Python installed in it.
To put yourself inside this environment run::

    conda activate koalas-dev-env

The final step required is to install pandas API on Spark. This can be done with the
following command::

    conda install -c conda-forge koalas

To install a specific version of pandas API on Spark::

    conda install -c conda-forge koalas=1.3.0


Installing from PyPI
~~~~~~~~~~~~~~~~~~~~

Pandas API on Spark can be installed via pip from
`PyPI <https://pypi.org/project/koalas>`__::

    pip install koalas


Installing from source
~~~~~~~~~~~~~~~~~~~~~~

See the `Contribution Guide <../development/ps_contributing.rst#environment-setup>`__ for complete instructions.


Installing PySpark
------------------

Installing with the official release channel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can install PySpark by downloading a release in `the official release channel <https://spark.apache.org/downloads.html>`__.
Once you download the release, un-tar it first as below::

    tar xzvf spark-2.4.4-bin-hadoop2.7.tgz

After that, make sure to set the ``SPARK_HOME`` environment variable to indicate the directory you untar-ed::

    cd spark-2.4.4-bin-hadoop2.7
    export SPARK_HOME=`pwd`

Also, make sure your ``PYTHONPATH`` can find the PySpark and Py4J under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH


Installing with Conda
~~~~~~~~~~~~~~~~~~~~~

PySpark can be installed via `Conda <https://anaconda.org/conda-forge/pyspark>`__::

    conda install -c conda-forge pyspark


Installing with PyPI
~~~~~~~~~~~~~~~~~~~~

PySpark can be installed via pip from `PyPI <https://pypi.org/project/pyspark>`__::

    pip install pyspark


Installing from source
~~~~~~~~~~~~~~~~~~~~~~

To install PySpark from source, refer to `Building Spark <https://github.com/apache/spark#building-spark>`__.

Likewise, make sure you set the ``SPARK_HOME`` environment variable to the git-cloned directory, and that your
``PYTHONPATH`` environment variable can find the PySpark and Py4J under ``$SPARK_HOME/python/lib``::

    export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH


Dependencies
------------

============== ================
Package        Required version
============== ================
`pandas`       >=0.23.2
`pyspark`      >=2.4.0
`pyarrow`      >=0.10
`numpy`        >=1.14
============== ================


Optional dependencies
~~~~~~~~~~~~~~~~~~~~~

============== ================
Package        Required version
============== ================
`mlflow`       >=1.0
`plotly`       >=4.8
`matplotlib`   >=3.0.0,<3.3.0
============== ================
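For reference, a minimal sketch of pulling in the required packages from the first table via pip, assuming the same minimum version pins, could be:

.. code-block:: bash

   # Illustrative only: install the required dependencies at or above the listed minimum versions.
   pip install "pandas>=0.23.2" "pyspark>=2.4.0" "pyarrow>=0.10" "numpy>=1.14"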
@@ -1,130 +0,0 @@
======================
Koalas Talks and Blogs
======================

Blog Posts
----------

- `Interoperability between Koalas and Apache Spark (Aug 11, 2020) <https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html>`_
- `Introducing Koalas 1.0 (Jun 24, 2020) <https://databricks.com/blog/2020/06/24/introducing-koalas-1-0.html>`_
- `10 Minutes from pandas to Koalas on Apache Spark (Mar 31, 2020) <https://databricks.com/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html>`_
- `Guest Blog: How Virgin Hyperloop One Reduced Processing Time from Hours to Minutes with Koalas (Aug 22, 2019) <https://databricks.com/blog/2019/08/22/guest-blog-how-virgin-hyperloop-one-reduced-processing-time-from-hours-to-minutes-with-koalas.html>`_
- `Koalas: Easy Transition from pandas to Apache Spark (Apr 24, 2019) <https://databricks.com/blog/2019/04/24/koalas-easy-transition-from-pandas-to-apache-spark.html>`_


Data + AI Summit 2020 EUROPE (Nov 18-19, 2020)
----------------------------------------------

Project Zen: Making Spark Pythonic
==================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/-vJLTEOdLvA" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Koalas: Interoperability Between Koalas and Apache Spark
========================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/eI0Wh2Epo0Q" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


Spark + AI Summit 2020 (Jun 24, 2020)
-------------------------------------

Introducing Apache Spark 3.0: A retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come.
=====================================================================================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/OLJKIogf2nU?start=555" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Koalas: Making an Easy Transition from Pandas to Apache Spark
=============================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/G_-9VbyHcx8" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Koalas: Pandas on Apache Spark
==============================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/iUpBSHoqzLM" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


Webinar @ Databricks (Mar 27, 2020)
-----------------------------------

Reducing Time-To-Insight for Virgin Hyperloop's Data
====================================================

.. raw:: html

    <iframe width="560" height="315" src="https://player.vimeo.com/video/397032070" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>


PyData New York 2019 (Nov 4, 2019)
----------------------------------

Pandas vs Koalas: The Ultimate Showdown
=======================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/xcGEQUURAuk?start=1470" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


Spark + AI Summit Europe 2019 (Oct 16, 2019)
--------------------------------------------

New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas
=======================================================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/scM_WQMhB3A?start=1470" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Koalas: Making an Easy Transition from Pandas to Apache Spark
=============================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Wfj2Vuse7as" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

Koalas: Pandas on Apache Spark
==============================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/NpAMbzerAp0" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


PyBay 2019 (Aug 17, 2019)
-------------------------

Koalas Easy Transition from pandas to Apache Spark
==================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/cMDLoGkidEE?v=xcGEQUURAuk?start=1470" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>


Spark + AI Summit 2019 (Apr 24, 2019)
-------------------------------------

Official Announcement of Koalas Open Source Project
===================================================

.. raw:: html

    <iframe width="560" height="315" src="https://www.youtube.com/embed/Shzb15DZ9Qg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>