[SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?
This PR proposes to:
- fix the Binder integration of the pandas API on Spark, and merge it with the existing PySpark one.
- update the quickstart of the pandas API on Spark, and make it work.
The notebooks can be easily reviewed here:
https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html
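The Binder link above carries the notebook path in its `filepath` query parameter with the slashes percent-encoded. As an illustration, the same URL can be recomposed with the standard library:

```python
from urllib.parse import quote

# Recompose the Binder review link above: the notebook path goes into
# the "filepath" query parameter with "/" percent-encoded.
repo_ref = "HyukjinKwon/spark/SPARK-35588-3"
notebook = "python/docs/source/getting_started/quickstart_ps.ipynb"
url = (
    "https://mybinder.org/v2/gh/"
    f"{repo_ref}?filepath={quote(notebook, safe='')}"
)
print(url)
```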
### Why are the changes needed?
- To show working quickstart examples to end users.
- To let users try out the examples easily, without any installation.
### Does this PR introduce _any_ user-facing change?
No. The existing quickstart of the pandas API on Spark has not been released yet, so end users are not affected.
### How was this patch tested?
I manually tested it by uploading a built Spark distribution to Binder. See 3bc15310a0
Closes #33041 from HyukjinKwon/SPARK-35588-2.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
This commit is contained in:
parent: 1cf18a2277
commit: be9089731a
In the Binder setup script, the pandas-API-on-Spark extra and plotly are now installed for the live notebooks:

@@ -21,4 +21,4 @@
 # Jupyter notebook.
 VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
-pip install "pyspark[sql,ml,mllib]<=$VERSION"
+pip install plotly "pyspark[sql,ml,mllib,pandas_on_spark]<=$VERSION"
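The shell snippet above derives the version pin by exec-ing `python/pyspark/version.py` and printing `__version__`. A minimal Python sketch of the same idea (the version string below is a stand-in, not the real file contents):

```python
# Stand-in for the contents of python/pyspark/version.py.
version_py = '__version__ = "3.2.0.dev0"\n'

# Exec the file contents and read __version__, as the shell one-liner does.
namespace = {}
exec(version_py, namespace)
version = namespace["__version__"]

# Build the same requirement string that is passed to pip.
pin = f"pyspark[sql,ml,mllib,pandas_on_spark]<={version}"
print(pin)  # → pyspark[sql,ml,mllib,pandas_on_spark]<=3.2.0.dev0
```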
In the Sphinx `conf.py`, link substitutions are added for the two live notebooks:

@@ -75,7 +75,11 @@ numpydoc_show_class_members = False
 # These are defined here to allow link substitutions dynamically.
 rst_epilog = """
 .. |binder| replace:: Live Notebook
-.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
+.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_df| replace:: Live Notebook: DataFrame
+.. _binder_df: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_df.ipynb
+.. |binder_ps| replace:: Live Notebook: pandas API on Spark
+.. _binder_ps: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
 .. |examples| replace:: Examples
 .. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
 .. |downloading| replace:: Downloading
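The `{0}` placeholder in `rst_epilog` is later filled in with a Git branch or tag so the Binder links point at the right ref. A minimal sketch of that substitution (the ref `master` and the trimmed-down epilog are illustrative):

```python
# Sketch: conf.py defines rst_epilog with a "{0}" placeholder that is
# formatted with the Git branch/tag at build time. Trimmed to one
# substitution pair for illustration.
rst_epilog_template = (
    ".. |binder_ps| replace:: Live Notebook: pandas API on Spark\n"
    ".. _binder_ps: https://mybinder.org/v2/gh/apache/spark/"
    "{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb\n"
)

# "master" is an illustrative ref; the real build substitutes its own.
rst_epilog = rst_epilog_template.format("master")
print(rst_epilog)
```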
In the Getting Started index page, the TODO is replaced with the live-notebook links, and the toctree entries are renamed:

@@ -25,12 +25,16 @@ There are more guides shared with other languages such as
 `Quick Start <https://spark.apache.org/docs/latest/quick-start.html>`_ in Programming Guides
 at `the Spark documentation <https://spark.apache.org/docs/latest/index.html#where-to-go-from-here>`_.
 
-.. TODO(SPARK-35588): Merge PySpark quickstart and 10 minutes to pandas API on Spark.
+There are live notebooks where you can try PySpark out without any other step:
+
+* |binder_df|_
+* |binder_ps|_
+
+The list below is the contents of this quickstart page:
 
 .. toctree::
    :maxdepth: 2
 
    install
-   quickstart
-   ps_10mins
+   quickstart_df
+   quickstart_ps
In the installation page, the pandas-API-on-Spark extra gains a plotting hint:

@@ -49,7 +49,7 @@ If you want to install extra dependencies for a specific component, you can install it as below:
 # Spark SQL
 pip install pyspark[sql]
 # pandas API on Spark
-pip install pyspark[pandas_on_spark]
+pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.
 
 For PySpark with/without a specific Hadoop version, you can install it by using ``PYSPARK_HADOOP_VERSION`` environment variables as below:
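In the snippet above, plotly is an optional add-on rather than a hard dependency. As a small, illustrative stdlib check (not part of PySpark's API), code can probe whether such an optional package is importable before using it:

```python
import importlib.util

# Illustrative helper (not part of PySpark): report whether an optional
# dependency such as "plotly" can be imported in this environment.
def has_optional(name: str) -> bool:
    return importlib.util.find_spec(name) is not None

print(has_optional("json"))                        # stdlib module → True
print(has_optional("surely_missing_package_xyz"))  # → False
```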
File diff suppressed because one or more lines are too long

In the quickstart notebook's markdown header cell, the title and the live-notebook pointer are updated:

@@ -4,10 +4,10 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"# Quickstart\n",
+"# Quickstart: DataFrame\n",
 "\n",
 "This is a short introduction and quickstart for the PySpark DataFrame API. PySpark DataFrames are lazily evaluated. They are implemented on top of [RDD](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview)s. When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not immediately compute the transformation but plans how to compute later. When [actions](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` are explicitly called, the computation starts.\n",
-"This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself on a live notebook [here](https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).\n",
+"This notebook shows the basic usages of the DataFrame, geared mainly for new users. You can run the latest version of these examples by yourself in 'Live Notebook: DataFrame' at [the quickstart page](https://spark.apache.org/docs/latest/api/python/getting_started/index.html).\n",
 "\n",
 "There is also other useful information in Apache Spark documentation site, see the latest version of [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html), [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html), [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html), [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html) and [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).\n",
 "\n",
In the notebook's kernel metadata, the recorded Python version is bumped:

@@ -1167,7 +1167,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.7.8"
+"version": "3.7.10"
 },
 "name": "quickstart",
 "notebookId": 1927513300154480
python/docs/source/getting_started/quickstart_ps.ipynb (new file, 14489 lines)
File diff suppressed because one or more lines are too long