[SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark

### What changes were proposed in this pull request?

This PR proposes to:
- fix the Binder integration of pandas API on Spark, and merge it with the existing PySpark one.
- update the quickstart of pandas API on Spark, and make it work.

The notebooks can be easily reviewed here:

https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb

Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

- To show working quickstart examples to end users.
- To allow users to easily try the examples out without any installation.

### Does this PR introduce _any_ user-facing change?

No to end users, because the existing quickstart of pandas API on Spark has not been released yet.

### How was this patch tested?

I manually tested it by uploading a built Spark distribution to Binder. See 3bc15310a0

Closes #33041 from HyukjinKwon/SPARK-35588-2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Hyukjin Kwon 2021-06-24 10:17:22 +09:00
parent 1cf18a2277
commit be9089731a
7 changed files with 14507 additions and 14481 deletions

binder/postBuild

@@ -21,4 +21,4 @@
 # Jupyter notebook.
 VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
-pip install "pyspark[sql,ml,mllib]<=$VERSION"
+pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark]<=$VERSION"

python/docs/source/conf.py

@@ -75,7 +75,11 @@ numpydoc_show_class_members = False
 # These are defined here to allow link substitutions dynamically.
 rst_epilog = """
 .. |binder| replace:: Live Notebook
-.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
+.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_df| replace:: Live Notebook: DataFrame
+.. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_ps| replace:: Live Notebook: pandas API on Spark
+.. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
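A hedged sketch of how the `{0}` placeholder in these links gets filled: Sphinx appends `rst_epilog` to every page, so conf.py can format the Binder URLs once per docs build. The `release_tag` value below is illustrative, not the exact variable conf.py uses.

```python
# Illustrative only: conf.py formats rst_epilog with the Git tag or branch
# of the docs build, so every page's Binder links point at the right ref.
release_tag = "master"  # assumption; conf.py derives the real value

rst_epilog = """
.. |binder_ps| replace:: Live Notebook: pandas API on Spark
.. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
""".format(release_tag)

print(rst_epilog)
```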

python/docs/source/getting_started/index.rst

@@ -25,12 +25,16 @@ There are more guides shared with other languages such as
 `Quick Start <https://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
 at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.
 
-.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on Spark.
+There are live notebooks where you can try PySpark out without any other step:
+
+* |binder_df|_
+* |binder_ps|_
+
+The list below is the contents of this quickstart page:
 
 .. toctree::
    :maxdepth: 2
 
    install
-   quickstart
-   ps_10mins
+   quickstart_df
+   quickstart_ps

python/docs/source/getting_started/install.rst

@@ -49,7 +49,7 @@ If you want to install extra dependencies for a specific component, you can inst
 # Spark SQL
 pip install pyspark[sql]
 # pandas API on Spark
-pip install pyspark[pandas_on_spark]
+pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.
 
 For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
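A hedged smoke test for the extras above: after `pip install pyspark[pandas_on_spark] plotly`, the pandas API on Spark should import and compute on Spark, with plotly available as the plotting backend. The data in this sketch is just illustrative.

```python
# Quick check that the pandas_on_spark extra works end to end (Spark 3.2+).
import pyspark.pandas as ps

psdf = ps.DataFrame({"x": [1, 2, 3]})
print(psdf.sum())      # computed by Spark, not local pandas
psdf["x"].plot.hist()  # uses the plotly backend installed above
```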

File diff suppressed because one or more lines are too long

python/docs/source/getting_started/{quickstart.ipynb → quickstart_df.ipynb}

@@ -4,10 +4,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Quickstart\n",
+    "# Quickstart: DataFrame\n",
     "\n",
     "This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated. They are implemented on top of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s. When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not immediately compute the transformation but plans how to compute later. When [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
-    "This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+    "This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself in 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html).\n",
     "\n",
     "There is also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
     "\n",
@@ -1167,7 +1167,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.7.8"
+   "version": "3.7.10"
   },
  "name": "quickstart",
  "notebookId": 1927513300154480

File diff suppressed because one or more lines are too long