b54103016a
### What changes were proposed in this pull request?

This PR proposes to:

- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try it here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in the PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on a Docker image: once somebody builds it, other people can reuse the image against a specific commit. Therefore, if we run Binder with images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

I made a simple demo to make it easier to review. Please see:

- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the "Live Notebook" link on the main page won't work until this PR is merged.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address. Another way might be to:

- open [the notebook](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Feel free to change whatever you want; I can apply it as-is, or update it further, when I apply it to this PR.
- download it as a `.ipynb` file:

  ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)

- upload the `.ipynb` file here in a GitHub comment. Then I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:

- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
- https://koalas.readthedocs.io/en/latest/getting_started/10min.html (my own blog post .. :-))

### Why are the changes needed?

To improve PySpark's usability. The current quickstart for Python users is not very friendly.

### Does this PR introduce _any_ user-facing change?

Yes, it adds a documentation page and exposes a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and GitHub Actions builds will test it.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

.. Licensed to the Apache Software Foundation (ASF) under one
   or more contributor license agreements. See the NOTICE file
   distributed with this work for additional information
   regarding copyright ownership. The ASF licenses this file
   to you under the Apache License, Version 2.0 (the
   "License"); you may not use this file except in compliance
   with the License. You may obtain a copy of the License at

.. http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
   software distributed under the License is distributed on an
   "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
   KIND, either express or implied. See the License for the
   specific language governing permissions and limitations
   under the License.

.. PySpark documentation master file

=====================
PySpark Documentation
=====================

|binder|_ | `GitHub <https://github.com/apache/spark>`_ | `Issues <https://issues.apache.org/jira/projects/SPARK/issues>`_ | |examples|_ | `Community <https://spark.apache.org/community.html>`_

PySpark is an interface for Apache Spark in Python. It not only allows you to write
Spark applications using Python APIs, but also provides the PySpark shell for
interactively analyzing your data in a distributed environment. PySpark supports most
of Spark's features such as Spark SQL, DataFrame, Streaming, MLlib
(Machine Learning) and Spark Core.

.. image:: ../../../docs/img/pyspark-components.png
   :alt: PySpark Components
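
As a minimal, illustrative sketch (assuming PySpark is installed, e.g. via
``pip install pyspark``), everything starts from a ``SparkSession``, the entry
point to the Python APIs; the application name below is arbitrary:

.. code-block:: python

    from pyspark.sql import SparkSession

    # Create (or reuse) a SparkSession, the entry point to PySpark.
    spark = SparkSession.builder.appName("demo").getOrCreate()

    # Create a small DataFrame and show it to verify the session works.
    spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"]).show()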

**Spark SQL and DataFrame**

Spark SQL is a Spark module for structured data processing. It provides
a programming abstraction called DataFrame and can also act as a distributed
SQL query engine.
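
As a rough sketch (reusing the ``spark`` session from above; the toy data is
made up for illustration), the same query can be expressed through either the
DataFrame API or SQL:

.. code-block:: python

    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
    df.createOrReplaceTempView("people")

    # Equivalent results via the DataFrame API and via a SQL query.
    df.filter(df.age > 40).show()
    spark.sql("SELECT name, age FROM people WHERE age > 40").show()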

**Streaming**

Running on top of Spark, the streaming feature in Apache Spark enables powerful
interactive and analytical applications across both streaming and historical data,
while inheriting Spark’s ease of use and fault tolerance characteristics.
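
A small Structured Streaming sketch, using the built-in ``rate`` source that
generates synthetic rows (the rate and timeout below are illustrative):

.. code-block:: python

    # Read a streaming DataFrame from the built-in "rate" test source.
    stream = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

    # Print incoming micro-batches to the console for a short demo run.
    query = stream.writeStream.format("console").start()
    query.awaitTermination(10)  # wait up to ~10 seconds
    query.stop()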

**MLlib**

Built on top of Spark, MLlib is a scalable machine learning library that provides
a uniform set of high-level APIs that help users create and tune practical machine
learning pipelines.
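
A minimal pipeline sketch (the toy data and column names are invented for
illustration): assemble feature columns into a vector, fit a classifier, and
apply the fitted pipeline back to the data:

.. code-block:: python

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    # Toy training data: two numeric features and a binary label.
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.0, 1.3, 1.0), (0.0, 1.2, 0.0)],
        ["f1", "f2", "label"])

    # Chain a feature assembler and a logistic regression into one pipeline.
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    model = Pipeline(stages=[assembler, LogisticRegression(maxIter=10)]).fit(train)
    model.transform(train).select("f1", "f2", "prediction").show()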

**Spark Core**

Spark Core is the underlying general execution engine for the Spark platform that all
other functionality is built on top of. It provides an RDD (Resilient Distributed Dataset)
and in-memory computing capabilities.
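
A short sketch of the lower-level RDD API (again reusing the ``spark`` session
from above):

.. code-block:: python

    # SparkContext exposes the RDD API.
    sc = spark.sparkContext

    # Distribute a local range, square each element, and keep even results.
    rdd = sc.parallelize(range(10))
    print(rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect())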

.. toctree::
   :maxdepth: 2
   :hidden:

   getting_started/index
   user_guide/index
   reference/index
   development/index
   migration_guide/index