b54103016a
### What changes were proposed in this pull request?

This PR proposes to:
- add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- reuse this notebook as a quickstart guide in the PySpark documentation.

Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on a Docker image: once somebody builds it, other people can reuse the image against a specific commit. Therefore, if we run Binder with images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks.

I made a simple demo to make it easier to review. Please see:
- [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the "Live Notebook" link on the main page won't work since this PR is not merged yet.
- [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html)

When reviewing the notebook file itself, please give me direct feedback, which I will appreciate and address. Another way might be to:
- open [the notebook](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb).
- edit / change / update the notebook. Please feel free to change whatever you want. I can apply it as is, or update it a bit more when I apply it to this PR.
- download it as an `.ipynb` file:
  ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png)
- upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file, crediting you correctly, of course.
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer).

References:
- https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
- https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
- my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

To improve PySpark's usability. The current documentation does not offer a friendly quickstart for Python users.

### Does this PR introduce _any_ user-facing change?

Yes, it will add a documentation page and expose a live notebook to PySpark users.

### How was this patch tested?

Manually tested, and the GitHub Actions builds will test it.

Closes #29491 from HyukjinKwon/SPARK-32204.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
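
All of the Binder links above follow mybinder.org's standard GitHub launch-URL pattern; a sketch for reference (the `<...>` placeholders are illustrative, only the last URL is taken from this PR):

```sh
# mybinder.org launch URL pattern for a GitHub repo (placeholders in <>):
#   https://mybinder.org/v2/gh/<owner>/<repo>/<ref>?filepath=<notebook-path>
# The demo link in this PR instantiates it as:
#   https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
```
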
---
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Welcome to the Spark documentation!

This readme will walk you through navigating and building the Spark documentation, which is included
here with the Spark source code. You can also find documentation specific to release versions of
Spark at https://spark.apache.org/documentation.html.

Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the
documentation yourself. Why build it yourself? So that you have the docs that correspond to
whichever version of Spark you currently have checked out of revision control.

## Prerequisites

The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Java,
Python, R and SQL.

You need to have [Ruby](https://www.ruby-lang.org/en/documentation/installation/) and
[Python](https://docs.python.org/2/using/unix.html#getting-and-installing-the-latest-version-of-python)
installed. Also install the following libraries:

```sh
$ sudo gem install jekyll jekyll-redirect-from rouge
```

Note: If you are on a system with both Ruby 1.9 and Ruby 2.0, you may need to replace `gem` with `gem2.0`.
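
On such a system the install command above would become, for example (a sketch, assuming the same gems as above):

```sh
# Same gems, but installed via the Ruby 2.0 gem binary
$ sudo gem2.0 install jekyll jekyll-redirect-from rouge
```
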
### R Documentation

If you'd like to generate R documentation, you'll need to [install Pandoc](https://pandoc.org/installing.html)
and install these libraries:

```sh
$ sudo Rscript -e 'install.packages(c("knitr", "devtools", "testthat", "rmarkdown"), repos="https://cloud.r-project.org/")'
$ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="https://cloud.r-project.org/")'
```

Note: Other versions of roxygen2 might work for generating the SparkR documentation, but the `RoxygenNote` field in `$SPARK_HOME/R/pkg/DESCRIPTION` is set to 5.0.1, and that field gets rewritten whenever a mismatched roxygen2 version runs.
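
To check whether your installed roxygen2 matches that field before building (a quick sketch; `packageVersion` is standard R, and the path comes from the note above):

```sh
# Installed roxygen2 version
$ Rscript -e 'packageVersion("roxygen2")'
# Version pinned in the SparkR package metadata
$ grep RoxygenNote "$SPARK_HOME/R/pkg/DESCRIPTION"
```
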
### API Documentation

To generate API docs for any language, you'll need to install these libraries:

<!--
TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
See also https://github.com/sphinx-doc/sphinx/issues/7551.
-->

```sh
$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
```
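
Given the Sphinx 3.1+ indexing issue noted above, you may want to confirm the pinned version took effect (a quick check, not part of the original instructions):

```sh
# Expect a version below 3.1.0
$ python -c 'import sphinx; print(sphinx.__version__)'
```
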
## Generating the Documentation HTML

We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as
the github wiki, as the definitive documentation) to enable the documentation to evolve along with
the source code and be captured by revision control (currently git). This way the code automatically
includes the version of the documentation that is relevant regardless of which version or release
you have checked out or downloaded.

In this directory you will find text files formatted using Markdown, with an ".md" suffix. You can
read those text files directly if you want. Start with `index.md`.

Execute `jekyll build` from the `docs/` directory to compile the site. Compiling the site with
Jekyll will create a directory called `_site` containing `index.html` as well as the rest of the
compiled files.

```sh
$ cd docs
$ jekyll build
```

You can modify the default Jekyll build as follows:

```sh
# Skip generating API docs (which takes a while)
$ SKIP_API=1 jekyll build

# Serve content locally on port 4000
$ jekyll serve --watch

# Build the site with extra features used on the live page
$ PRODUCTION=1 jekyll build
```

## API Docs (Scaladoc, Javadoc, Sphinx, roxygen2, MkDocs)

You can build just the Spark scaladoc and javadoc by running `./build/sbt unidoc` from the `$SPARK_HOME` directory.

Similarly, you can build just the PySpark docs by running `make html` from the
`$SPARK_HOME/python/docs` directory. Documentation is only generated for classes that are listed as
public in `__init__.py`. The SparkR docs can be built by running `$SPARK_HOME/R/create-docs.sh`, and
the SQL docs can be built by running `$SPARK_HOME/sql/create-docs.sh`
after [building Spark](https://github.com/apache/spark#building-spark) first.
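
Taken together, the per-language commands from the paragraph above look like this (a sketch; it assumes Spark itself has already been built, as noted above):

```sh
$ cd "$SPARK_HOME"
$ ./build/sbt unidoc              # Scaladoc and Javadoc only
$ (cd python/docs && make html)   # PySpark docs only (Sphinx)
$ ./R/create-docs.sh              # SparkR docs only (roxygen2)
$ ./sql/create-docs.sh            # SQL docs only (MkDocs)
```
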
When you run `jekyll build` in the `docs` directory, it will also copy over the scaladoc and javadoc for the various
Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a
jekyll plugin to run `./build/sbt unidoc` before building the site, so if you haven't run it (recently) it
may take some time as it generates all of the scaladoc and javadoc using [Unidoc](https://github.com/sbt/sbt-unidoc).
The jekyll plugin also generates the PySpark docs using [Sphinx](http://sphinx-doc.org/), SparkR docs
using [roxygen2](https://cran.r-project.org/web/packages/roxygen2/index.html) and SQL docs
using [MkDocs](https://www.mkdocs.org/).

NOTE: To skip the step of building and copying over the Scala, Java, Python, R and SQL API docs, run `SKIP_API=1
jekyll build`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
to skip a single step of the corresponding language. `SKIP_SCALADOC` indicates skipping both the Scala and Java docs.
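
For example, to rebuild the site with only the Python API docs, you could combine the flags above like so (a sketch using only the documented flags):

```sh
# Skip Scala/Java, R and SQL docs; still builds the PySpark docs and the site
$ SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```
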
### Automatically Rebuilding API Docs

`jekyll serve --watch` will only watch what's in `docs/`, and it won't follow symlinks. That means it won't monitor your API docs under `python/docs` or elsewhere.

To work around this limitation for Python, install [`entr`](http://eradman.com/entrproject/) and run the following in a separate shell:

```sh
cd "$SPARK_HOME/python/docs"
find .. -type f -name '*.py' \
| entr -s 'make html && cp -r _build/html/. ../../docs/api/python'
```

Whenever there is a change to your Python code, `entr` will automatically rebuild the Python API docs and copy them to `docs/`, thus triggering a Jekyll update.