spark-instrumented-optimizer/docs/README.md

Welcome to the Spark documentation!

This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at http://spark.apache.org/documentation.html.

Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that corresponds to whichever version of Spark you currently have checked out of revision control.

## Generating the Documentation HTML

We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as the github wiki, as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded.

In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md.

The markdown code can be compiled to HTML using the 
[Jekyll tool](http://jekyllrb.com).
To use the `jekyll` command, you will need to have Jekyll installed. 
The easiest way to do this is via a Ruby Gem, see the 
[jekyll installation instructions](http://jekyllrb.com/docs/installation).
If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
`_site` containing index.html as well as the rest of the compiled files.

You can modify the default Jekyll build as follows:

    # Skip generating API docs (which takes a while)
    $ SKIP_SCALADOC=1 jekyll build
    # Serve content locally on port 4000
    $ jekyll serve --watch
    # Build the site with extra features used on the live page
    $ PRODUCTION=1 jekyll build

## Pygments

We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, so you will also need to install that (it requires Python) by running `sudo easy_install Pygments`.

To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile phase, use the following sytax:

    {% highlight scala %}
    // Your scala code goes here, you can replace scala with many other
    // supported languages too.
    {% endhighlight %}

## API Docs (Scaladoc and Epydoc)

You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.

Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.

When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc.  The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).

NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
Updated base README to point to documentation site instead of wiki, updated docs/README.md to describe use of Jekyll, and renmaed things to make them more consistent with the lower-case-with-hyphens convention. 2012-09-05 16:24:09 -04:00			`Welcome to the Spark documentation!`

Removed reference to incubation in Spark user docs. Author: Reynold Xin <rxin@apache.org> Closes #2 from rxin/docs and squashes the following commits: 08bbd5f [Reynold Xin] Removed reference to incubation in Spark user docs. 2014-02-28 00:13:22 -05:00			`This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at http://spark.apache.org/documentation.html.`
Updated base README to point to documentation site instead of wiki, updated docs/README.md to describe use of Jekyll, and renmaed things to make them more consistent with the lower-case-with-hyphens convention. 2012-09-05 16:24:09 -04:00
Adds a jekyll plugin (written in Ruby) to the _plugins directory which generates scala doc by calling `sbt/sbt doc`, copies it over to docs, and updates the links from the api webpage to now point to the copied over scaladoc (making the _site directory easy to just copy over to a public website). 2012-09-13 19:52:53 -04:00			`Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that corresponds to whichever version of Spark you currently have checked out of revision control.`
Updated base README to point to documentation site instead of wiki, updated docs/README.md to describe use of Jekyll, and renmaed things to make them more consistent with the lower-case-with-hyphens convention. 2012-09-05 16:24:09 -04:00
Adds syntax highlighting (via pygments), and some style tweaks to make things easier to read. 2012-09-12 22:27:44 -04:00			`## Generating the Documentation HTML`

Adds a jekyll plugin (written in Ruby) to the _plugins directory which generates scala doc by calling `sbt/sbt doc`, copies it over to docs, and updates the links from the api webpage to now point to the copied over scaladoc (making the _site directory easy to just copy over to a public website). 2012-09-13 19:52:53 -04:00			`We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as the github wiki, as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded.`
Updated base README to point to documentation site instead of wiki, updated docs/README.md to describe use of Jekyll, and renmaed things to make them more consistent with the lower-case-with-hyphens convention. 2012-09-05 16:24:09 -04:00
			`In this directory you will find textfiles formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md.`

Add Jekyll tag to isolate "production-only" doc components. Author: Patrick Wendell <pwendell@gmail.com> Closes #56 from pwendell/jekyll-prod and squashes the following commits: 1bdc3a8 [Patrick Wendell] Add Jekyll tag to isolate "production-only" doc components. 2014-03-02 21:19:01 -05:00			`The markdown code can be compiled to HTML using the`
			`[Jekyll tool](http://jekyllrb.com).`
			To use the `jekyll` command, you will need to have Jekyll installed.
			`The easiest way to do this is via a Ruby Gem, see the`
SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. Author: Sean Owen <sowen@cloudera.com> Closes #653 from srowen/SPARK-1727 and squashes the following commits: 6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count 8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output) 99966a9 [Sean Owen] Update issue tracker URL in docs 23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak) 8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs 2014-05-06 23:07:22 -04:00			`[jekyll installation instructions](http://jekyllrb.com/docs/installation).`
			If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
			Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
			`_site` containing index.html as well as the rest of the compiled files.
Add Jekyll tag to isolate "production-only" doc components. Author: Patrick Wendell <pwendell@gmail.com> Closes #56 from pwendell/jekyll-prod and squashes the following commits: 1bdc3a8 [Patrick Wendell] Add Jekyll tag to isolate "production-only" doc components. 2014-03-02 21:19:01 -05:00
			`You can modify the default Jekyll build as follows:`

			`# Skip generating API docs (which takes a while)`
			`$ SKIP_SCALADOC=1 jekyll build`
			`# Serve content locally on port 4000`
			`$ jekyll serve --watch`
			`# Build the site with extra features used on the live page`
			`$ PRODUCTION=1 jekyll build`
Updated base README to point to documentation site instead of wiki, updated docs/README.md to describe use of Jekyll, and renmaed things to make them more consistent with the lower-case-with-hyphens convention. 2012-09-05 16:24:09 -04:00
Adds syntax highlighting (via pygments), and some style tweaks to make things easier to read. 2012-09-12 22:27:44 -04:00			`## Pygments`

Adds a jekyll plugin (written in Ruby) to the _plugins directory which generates scala doc by calling `sbt/sbt doc`, copies it over to docs, and updates the links from the api webpage to now point to the copied over scaladoc (making the _site directory easy to just copy over to a public website). 2012-09-13 19:52:53 -04:00			We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, so you will also need to install that (it requires Python) by running `sudo easy_install Pygments`.
Adds syntax highlighting (via pygments), and some style tweaks to make things easier to read. 2012-09-12 22:27:44 -04:00
			`To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile phase, use the following sytax:`

			`{% highlight scala %}`
			`// Your scala code goes here, you can replace scala with many other`
			`// supported languages too.`
			`{% endhighlight %}`

Add epydoc API documentation for PySpark. 2012-12-27 20:55:33 -05:00			`## API Docs (Scaladoc and Epydoc)`
Adds syntax highlighting (via pygments), and some style tweaks to make things easier to read. 2012-09-12 22:27:44 -04:00
Code review feedback 2014-01-06 01:05:30 -05:00			You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.
Adds a jekyll plugin (written in Ruby) to the _plugins directory which generates scala doc by calling `sbt/sbt doc`, copies it over to docs, and updates the links from the api webpage to now point to the copied over scaladoc (making the _site directory easy to just copy over to a public website). 2012-09-13 19:52:53 -04:00
SPARK-1374: PySpark API for SparkSQL An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries. ``` from pyspark.context import SQLContext sqlCtx = SQLContext(sc) rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}]) srdd = sqlCtx.applySchema(rdd) sqlCtx.registerRDDAsTable(srdd, "table1") srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1") srdd2.collect() ``` The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]``` Author: Ahir Reddy <ahirreddy@gmail.com> Author: Michael Armbrust <michael@databricks.com> Closes #363 from ahirreddy/pysql and squashes the following commits: 0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns 307d6e0 [Ahir Reddy] Style fix 6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies 3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py 29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD f2312c7 [Ahir Reddy] Moved everything into sql.py a19afe4 [Ahir Reddy] Doc fixes 6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL 521ff6d [Ahir Reddy] Trying to get spark to build with hive ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins ded03e7 [Ahir Reddy] Added doc test for HiveContext 22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency e4da06c [Ahir Reddy] Display message if hive is not built into spark 227a0be [Michael Armbrust] Update API links. Fix Hive example. 58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api. Minor fixes. 4285340 [Michael Armbrust] Fix building of Hive API Docs. 38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs. 337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build 40491c9 [Ahir Reddy] PR Changes + Method Visibility 1836944 [Michael Armbrust] Fix comments. e00980f [Michael Armbrust] First draft of python sql programming guide. b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test f98a422 [Ahir Reddy] HiveContexts 79621cf [Ahir Reddy] cleaning up cruft b406ba0 [Ahir Reddy] doctest formatting 20936a5 [Ahir Reddy] Added tests and documentation e4d21b4 [Ahir Reddy] Added pyrolite dependency 79f739d [Ahir Reddy] added more tests 7515ba0 [Ahir Reddy] added more tests :) d26ec5e [Ahir Reddy] added test e9f5b8d [Ahir Reddy] adding tests 906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python 251f99d [Ahir Reddy] for now only allow dictionaries as input 09b9980 [Ahir Reddy] made jrdd explicitly lazy c608947 [Ahir Reddy] SchemaRDD now has all RDD operations 725c91e [Ahir Reddy] awesome row objects 55d1c76 [Ahir Reddy] return row objects 4fe1319 [Ahir Reddy] output dictionaries correctly be079de [Ahir Reddy] returning dictionaries works cd5f79f [Ahir Reddy] Switched to using Scala SQLContext e948bd9 [Ahir Reddy] yippie 4886052 [Ahir Reddy] even better c0fb1c6 [Ahir Reddy] more working 043ca85 [Ahir Reddy] working 5496f9f [Ahir Reddy] doesn't crash b8b904b [Ahir Reddy] Added schema rdd class 67ba875 [Ahir Reddy] java to python, and python to java bcc0f23 [Ahir Reddy] Java to python ab6025d [Ahir Reddy] compiling 2014-04-15 03:07:55 -04:00			Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
Updates README.md with instructions for running jekyll without building scaladoc (i.e. run `SKIP_SCALADOC=1 jekyll`). 2012-10-08 15:15:44 -04:00
SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarly) MLlib docs While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. Author: Sean Owen <sowen@cloudera.com> Closes #653 from srowen/SPARK-1727 and squashes the following commits: 6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count 8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output) 99966a9 [Sean Owen] Update issue tracker URL in docs 23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak) 8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs 2014-05-06 23:07:22 -04:00			When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
Add epydoc API documentation for PySpark. 2012-12-27 20:55:33 -05:00
Use a single setting for disabling API doc build 2013-02-25 16:15:12 -05:00			NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.