Merge pull request #203 from JoshRosen/docs/java-programming-guide
Java Programming Guide
This commit is contained in commit bf891a5c18.
@@ -19,7 +19,7 @@ To write a Bagel application, you will need to add Spark, its dependencies, and
## Programming Model

-Bagel operates on a graph represented as a [distributed dataset]({{HOME_PATH}}programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
+Bagel operates on a graph represented as a [distributed dataset]({{HOME_PATH}}scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.

For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.

@@ -54,7 +54,7 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
# Where to Go from Here

-* [Spark Programming Guide]({{HOME_PATH}}programming-guide.html): how to get started using Spark, and details on the API
+* [Spark Programming Guide]({{HOME_PATH}}scala-programming-guide.html): how to get started using Spark, and details on the API
* [Running Spark on Amazon EC2]({{HOME_PATH}}ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html): instructions on how to deploy to a private cluster
* [Running Spark on YARN]({{HOME_PATH}}running-on-yarn.html): instructions on how to run Spark on top of a YARN cluster

@@ -2,4 +2,172 @@
layout: global
title: Java Programming Guide
---
-TODO: Write Java programming guide!

The Spark Java API
([spark.api.java]({{HOME_PATH}}api/core/index.html#spark.api.java.package)) defines the
[`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) and
[`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
which support the same methods as their Scala counterparts but take Java functions
and return Java data and collection types.

Because the Java API is similar to the Scala API, this programming guide only
covers Java-specific features; the
[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html)
provides a more general introduction to Spark concepts and should be read first.

# Key differences in the Java API

There are a few key differences between the Java and Scala APIs:

* Java does not support anonymous or first-class functions, so functions must
  be implemented by extending the
  [`spark.api.java.function.Function`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function),
  [`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2), etc.
  classes (see the sketch after this list).
* To maintain type safety, the Java API defines specialized Function and RDD
  classes for key-value pairs and doubles.
* RDD methods like `collect` and `countByKey` return Java collection types,
  such as `java.util.List` and `java.util.Map`.
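
As a minimal sketch of the first point (the imports from the Example section
below and an existing `JavaRDD<String>` named `lines` are assumed), a function
can be written as an anonymous class that extends `Function`:

{% highlight java %}
// Illustrative only: compute the length of each line.
// `lines` is an assumed, pre-existing JavaRDD<String>.
JavaRDD<Integer> lineLengths = lines.map(
  new Function<String, Integer>() {
    public Integer call(String s) {
      return s.length();
    }
  }
);
{% endhighlight %}

Because `map` was passed a `Function<String, Integer>`, the result is a
`JavaRDD<Integer>`.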

## RDD Classes

Spark defines additional operations on RDDs of doubles and key-value pairs, such
as `stdev` and `join`.

In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism.

In the Java API, the extra methods are defined in the
[`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD) and
[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
etc. (this achieves the "same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
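
As an illustrative sketch (again assuming the imports from the Example section
and an existing `JavaRDD<String>` named `lines`), mapping with a
`DoubleFunction` yields a `JavaDoubleRDD`, which exposes the extra numeric
operations such as `stdev`:

{% highlight java %}
// Illustrative only: map with a DoubleFunction to obtain a JavaDoubleRDD,
// then use one of its specialized numeric methods.
JavaDoubleRDD lineLengths = lines.map(
  new DoubleFunction<String>() {
    public Double call(String s) {
      return (double) s.length();
    }
  }
);
double stdev = lineLengths.stdev();
{% endhighlight %}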

## Function Classes

The following table lists the function classes used by the Java API. Each
class has a single abstract method, `call()`, that must be implemented.

<table class="table">
<tr><th>Class</th><th>Function Type</th></tr>

<tr><td>Function&lt;T, R&gt;</td><td>T -> R</td></tr>
<tr><td>DoubleFunction&lt;T&gt;</td><td>T -> Double</td></tr>
<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T -> Tuple2&lt;K, V&gt;</td></tr>

<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T -> Iterable&lt;R&gt;</td></tr>
<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T -> Iterable&lt;Double&gt;</td></tr>
<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T -> Iterable&lt;Tuple2&lt;K, V&gt;&gt;</td></tr>

<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 -> R (function of two arguments)</td></tr>
</table>
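
For instance, a hypothetical `DoubleFlatMapFunction<String>` (not part of the
word count example below) implements `call()` to turn one input value into zero
or more doubles, matching the `T -> Iterable<Double>` row above; `java.util`
imports are assumed:

{% highlight java %}
// Illustrative only: parse each whitespace-separated token of a line into a Double.
class ParseNumbers extends DoubleFlatMapFunction<String> {
  public Iterable<Double> call(String line) {
    List<Double> numbers = new ArrayList<Double>();
    for (String token : line.split(" ")) {
      numbers.add(Double.parseDouble(token));
    }
    return numbers;
  }
}
{% endhighlight %}

Passing an instance of this class to `flatMap` should, by analogy with the
`map` overloads described above, produce a `JavaDoubleRDD`.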

# Other Features

The Java API supports other Spark features, including
[accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators),
[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast_variables), and
[caching]({{HOME_PATH}}scala-programming-guide.html#caching).
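
The following sketch shows how two of these features might look from Java; the
`Broadcast` class reference, the method names (`broadcast`, `value`, `cache`),
and the helper data are illustrative assumptions, and the imports from the
Example section below plus `java.util` are assumed:

{% highlight java %}
// Illustrative only: a broadcast variable holding a stop-word list,
// and caching of a filtered RDD. `sc` and `words` are assumed to exist.
final Broadcast<List<String>> stopWords =
  sc.broadcast(Arrays.asList("a", "an", "the"));

JavaRDD<String> interestingWords = words.filter(
  new Function<String, Boolean>() {
    public Boolean call(String w) {
      return !stopWords.value().contains(w);
    }
  }
).cache();
{% endhighlight %}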

# Example

As an example, we will implement word count using the Java API.

{% highlight java %}
import java.util.Arrays;
import spark.api.java.*;
import spark.api.java.function.*;

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);
{% endhighlight %}

The word count program starts by creating a `JavaSparkContext`, which accepts
the same parameters as its Scala counterpart. `JavaSparkContext` supports the
same data loading methods as the regular `SparkContext`; here, `textFile`
loads lines from text files stored in HDFS.

To split the lines into words, we use `flatMap` to split each line on
whitespace. `flatMap` is passed a `FlatMapFunction` that accepts a string and
returns a `java.lang.Iterable` of strings.

Here, the `FlatMapFunction` was created inline; another option is to subclass
`FlatMapFunction` and pass an instance to `flatMap`:

{% highlight java %}
class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}

JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}

Continuing with the word count example, we map each word to a `(word, 1)` pair:

{% highlight java %}
import scala.Tuple2;

JavaPairRDD<String, Integer> ones = words.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);
{% endhighlight %}

Note that `map` was passed a `PairFunction<String, String, Integer>` and
returned a `JavaPairRDD<String, Integer>`.
To finish the word count program, we will use `reduceByKey` to count the
|
||||
occurrences of each word:
|
||||
|
||||
{% highlight java %}
|
||||
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
|
||||
new Function2<Integer, Integer, Integer>() {
|
||||
public Integer call(Integer i1, Integer i2) {
|
||||
return i1 + i2;
|
||||
}
|
||||
}
|
||||
);
|
||||
{% endhighlight %}

Here, `reduceByKey` is passed a `Function2`, which implements a function with
two arguments. The resulting `JavaPairRDD` contains `(word, count)` pairs.
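
As a final, illustrative addition (not part of the original walkthrough), the
counts can be brought back to the driver program with `collect`, which, as
noted in the key differences above, returns a `java.util.List`:

{% highlight java %}
// Illustrative only: collect the (word, count) pairs and print them.
// A java.util.List import is assumed.
List<Tuple2<String, Integer>> output = counts.collect();
for (Tuple2<String, Integer> pair : output) {
  System.out.println(pair._1() + ": " + pair._2());
}
{% endhighlight %}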

In this example, we explicitly showed each intermediate RDD. It is also
possible to chain the RDD transformations, so the word count example could also
be written as:

{% highlight java %}
JavaPairRDD<String, Integer> counts = lines.flatMap(
  ...
).map(
  ...
).reduceByKey(
  ...
);
{% endhighlight %}

There is no performance difference between these approaches; the choice is
a matter of style.

# Where to go from here

Spark includes several sample jobs using the Java API in
`examples/src/main/java`. You can run them by passing the class name to the
`run` script included in Spark -- for example, `./run
spark.examples.JavaWordCount`. Each example program prints usage help when run
without any arguments.