[SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?
This PR proposes to:
- fix the Binder integration of the pandas API on Spark, and merge it with the existing PySpark one.
- update the quickstart of the pandas API on Spark, and make it work.
The notebooks can be easily reviewed here:
https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html
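The Binder link above carries the notebook path in its `filepath` query parameter with the slashes percent-encoded. As an illustration, the same URL can be recomposed with the standard library:

```python
from urllib.parse import quote

# Recompose the Binder review link above: the notebook path goes into
# the "filepath" query parameter with "/" percent-encoded.
repo_ref = "HyukjinKwon/spark/SPARK-35588-3"
notebook = "python/docs/source/getting_started/quickstart_ps.ipynb"
url = (
    "https://mybinder.org/v2/gh/"
    f"{repo_ref}?filepath={quote(notebook, safe='')}"
)
print(url)
```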
### Why are the changes needed?
- To show working quickstart examples to end users.
- To let users try out the examples easily, without any installation.
### Does this PR introduce _any_ user-facing change?
No. The existing quickstart of the pandas API on Spark has not been released yet, so end users are not affected.
### How was this patch tested?
I manually tested it by uploading a built Spark distribution to Binder. See 3bc15310a0
Closes #33041 from HyukjinKwon/SPARK-35588-2.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This commit is contained in:
parent: 1cf18a2277
commit: be9089731a
In the Binder setup script, the pandas-API-on-Spark extra and plotly are now installed for the live notebooks:

@@ -21,4 +21,4 @@
 # Jupyter notebook.
 VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
-pip install "pyspark[sql,ml,mllib]<=$VERSION"
+pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark]<=$VERSION"
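The shell snippet above derives the version pin by exec-ing `python/pyspark/version.py` and printing `__version__`. A minimal Python sketch of the same idea (the version string below is a stand-in, not the real file contents):

```python
# Stand-in for the contents of python/pyspark/version.py.
version_py = '__version__ = "3.2.0.dev0"\n'

# Exec the file contents and read __version__, as the shell one-liner does.
namespace = {}
exec(version_py, namespace)
version = namespace["__version__"]

# Build the same requirement string that is passed to pip.
pin = f"pyspark[sql,ml,mllib,pandas_on_spark]<={version}"
print(pin)  # → pyspark[sql,ml,mllib,pandas_on_spark]<=3.2.0.dev0
```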
In the Sphinx `conf.py`, link substitutions are added for the two live notebooks:

@@ -75,7 +75,11 @@ numpydoc_show_class_members = False
 # These are defined here to allow link substitutions dynamically.
 rst_epilog = """
 .. |binder| replace:: Live Notebook
-.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
+.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_df| replace:: Live Notebook: DataFrame
+.. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_ps| replace:: Live Notebook: pandas API on Spark
+.. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
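The `{0}` placeholder in `rst_epilog` is later filled in with a Git branch or tag so the Binder links point at the right ref. A minimal sketch of that substitution (the ref `master` and the trimmed-down epilog are illustrative):

```python
# Sketch: conf.py defines rst_epilog with a "{0}" placeholder that is
# formatted with the Git branch/tag at build time. Trimmed to one
# substitution pair for illustration.
rst_epilog_template = (
    ".. |binder_ps| replace:: Live Notebook: pandas API on Spark\n"
    ".. _binder_ps: https://mybinder.org/v2/gh/apache/spark/"
    "{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb\n"
)

# "master" is an illustrative ref; the real build substitutes its own.
rst_epilog = rst_epilog_template.format("master")
print(rst_epilog)
```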
In the Getting Started index page, the TODO is replaced with the live-notebook links, and the toctree entries are renamed:

@@ -25,12 +25,16 @@ There are more guides shared with other languages such as
 `Quick Start <https://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
 at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.
 
-.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on Spark.
+There are live notebooks where you can try PySpark out without any other step:
+
+* |binder_df|_
+* |binder_ps|_
+
+The list below is the contents of this quickstart page:
 
 .. toctree::
    :maxdepth: 2
 
    install
-   quickstart
-   ps_10mins
+   quickstart_df
+   quickstart_ps
In the installation page, the pandas-API-on-Spark extra gains a plotting hint:

@@ -49,7 +49,7 @@ If you want to install extra dependencies for a specific component, you can install it as below:
 # Spark SQL
 pip install pyspark[sql]
 # pandas API on Spark
-pip install pyspark[pandas_on_spark]
+pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.
 
 For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
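In the snippet above, plotly is an optional add-on rather than a hard dependency. As a small, illustrative stdlib check (not part of PySpark's API), code can probe whether such an optional package is importable before using it:

```python
import importlib.util

# Illustrative helper (not part of PySpark): report whether an optional
# dependency such as "plotly" can be imported in this environment.
def has_optional(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

print(has_optional("json"))                        # stdlib module → True
print(has_optional("surely_missing_package_xyz"))  # → False
```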
File diff suppressed because one or more lines are too long

In the quickstart notebook's markdown header cell, the title and the live-notebook pointer are updated:

@@ -4,10 +4,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Quickstart\n",
+"# Quickstart: DataFrame\n",
 "\n",
 "This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated. They are implemented on top of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s. When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not immediately compute the transformation but plans how to compute later. When [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
-"This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+"This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself in 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html).\n",
 "\n",
 "There is also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
 "\n",
In the notebook's kernel metadata, the recorded Python version is bumped:

@@ -1167,7 +1167,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.8"
+"version": "3.7.10"
 },
 "name": "quickstart",
 "notebookId": 1927513300154480
python/docs/source/getting_started/quickstart_ps.ipynb (new file, 14489 lines)
File diff suppressed because one or more lines are too long