[SPARK-30803][DOCS] Fix the home page link for Scala API document

### What changes were proposed in this pull request?
Update the links to the Scala API documentation.

```
$ git grep "#org.apache.spark.package"
docs/_layouts/global.html:                                <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
docs/index.md:* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
docs/rdd-programming-guide.md:[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/).
```
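
The new layout serves one Scaladoc page per entity under its package path (for example `api/scala/org/apache/spark/SparkConf.html`) instead of resolving everything through `api/scala/index.html#...` fragment anchors, and member anchors written with `@` move behind the page's own `#`. The snippet below is an illustrative sketch of that rewrite rule, not the script used for this change; package links such as `#org.apache.spark.package`, which map to the package's `index.html`, are deliberately left out of this minimal sketch.

```python
import re

# Old-style Scaladoc links:  api/scala/index.html#org.apache.spark.Foo
#                            api/scala/index.html#org.apache.spark.Foo@member(signature)
# New-style Scaladoc links:  api/scala/org/apache/spark/Foo.html
#                            api/scala/org/apache/spark/Foo.html#member(signature)
OLD_LINK = re.compile(r"api/scala/index\.html#([\w.$]+)(?:@(.+))?")

def rewrite(link: str) -> str:
    """Rewrite one old-style Scaladoc URL to the new per-entity page layout."""
    m = OLD_LINK.fullmatch(link)
    if m is None:
        return link  # not an old-style Scaladoc link; leave it alone
    page = m.group(1).replace(".", "/")              # org.apache.spark.Foo -> org/apache/spark/Foo
    member = f"#{m.group(2)}" if m.group(2) else ""  # member anchor moves behind '#'
    return f"api/scala/{page}.html{member}"

# Two examples mirroring changes in this diff:
assert rewrite("api/scala/index.html#org.apache.spark.SparkConf") == \
    "api/scala/org/apache/spark/SparkConf.html"
assert rewrite("api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]") == \
    "api/scala/org/apache/spark/graphx/Graph.html#reverse:Graph[VD,ED]"
```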

### Why are the changes needed?
The home page link to the Scala API documentation is incorrect after the upgrade to 3.0.

### Does this PR introduce any user-facing change?
Documentation UI change only.

### How was this patch tested?
Tested locally; screenshots attached below.
Before:
![image](https://user-images.githubusercontent.com/4833765/74335713-c2385300-4dd7-11ea-95d8-f5a3639d2578.png)
After:
![image](https://user-images.githubusercontent.com/4833765/74335727-cbc1bb00-4dd7-11ea-89d9-4dcc1310e679.png)
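
As an additional local check (hypothetical, not part of this PR), one could scan the docs sources for any remaining old-style anchors after the rewrite:

```python
import pathlib
import re

# Hypothetical check: flag any Markdown/HTML doc source that still uses the
# old fragment-style Scaladoc URLs after the sweep.
STALE = re.compile(r"api/scala/index\.html#org\.apache\.spark")

for path in pathlib.Path("docs").rglob("*"):
    if path.is_file() and path.suffix in {".md", ".html"}:
        for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
            if STALE.search(line):
                print(f"{path}:{lineno}: {line.strip()}")
```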

Closes #27549 from xuanyuanking/scala-doc.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Yuanjian Li authored on 2020-02-16 09:55:03 -06:00; committed by Sean Owen
commit 01cc852982 (parent 0a03e7e679)
59 changed files with 355 additions and 355 deletions

```diff
@@ -82,7 +82,7 @@
 <li class="dropdown">
 <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
 <ul class="dropdown-menu">
-<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
+<li><a href="api/scala/org/apache/spark/index.html">Scala</a></li>
 <li><a href="api/java/index.html">Java</a></li>
 <li><a href="api/python/index.html">Python</a></li>
 <li><a href="api/R/index.html">R</a></li>
```

```diff
@@ -24,7 +24,7 @@ license: |
 Spark provides three locations to configure the system:
 * [Spark properties](#spark-properties) control most application parameters and can be set by using
-a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
+a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object, or through Java
 system properties.
 * [Environment variables](#environment-variables) can be used to set per-machine settings, such as
 the IP address, through the `conf/spark-env.sh` script on each node.
@@ -34,7 +34,7 @@ Spark provides three locations to configure the system:
 Spark properties control most application settings and are configured separately for each
 application. These properties can be set directly on a
-[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) passed to your
+[SparkConf](api/scala/org/apache/spark/SparkConf.html) passed to your
 `SparkContext`. `SparkConf` allows you to configure some of the common properties
 (e.g. master URL and application name), as well as arbitrary key-value pairs through the
 `set()` method. For example, we could initialize an application with two threads as follows:
@@ -1326,7 +1326,7 @@ Apart from these, the following properties are also available, and may be useful
 property is useful if you need to register your classes in a custom way, e.g. to specify a custom
 field serializer. Otherwise <code>spark.kryo.classesToRegister</code> is simpler. It should be
 set to classes that extend
-<a href="api/scala/index.html#org.apache.spark.serializer.KryoRegistrator">
+<a href="api/scala/org/apache/spark/serializer/KryoRegistrator.html">
 <code>KryoRegistrator</code></a>.
 See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
 </td>
@@ -1379,7 +1379,7 @@ Apart from these, the following properties are also available, and may be useful
 but is quite slow, so we recommend <a href="tuning.html">using
 <code>org.apache.spark.serializer.KryoSerializer</code> and configuring Kryo serialization</a>
 when speed is necessary. Can be any subclass of
-<a href="api/scala/index.html#org.apache.spark.serializer.Serializer">
+<a href="api/scala/org/apache/spark/serializer/Serializer.html">
 <code>org.apache.spark.Serializer</code></a>.
 </td>
 </tr>
```

```diff
@@ -25,38 +25,38 @@ license: |
 <!-- All the documentation links -->
-[EdgeRDD]: api/scala/index.html#org.apache.spark.graphx.EdgeRDD
+[EdgeRDD]: api/scala/org/apache/spark/graphx/EdgeRDD.html
-[VertexRDD]: api/scala/index.html#org.apache.spark.graphx.VertexRDD
+[VertexRDD]: api/scala/org/apache/spark/graphx/VertexRDD.html
-[Edge]: api/scala/index.html#org.apache.spark.graphx.Edge
+[Edge]: api/scala/org/apache/spark/graphx/Edge.html
-[EdgeTriplet]: api/scala/index.html#org.apache.spark.graphx.EdgeTriplet
+[EdgeTriplet]: api/scala/org/apache/spark/graphx/EdgeTriplet.html
-[Graph]: api/scala/index.html#org.apache.spark.graphx.Graph
+[Graph]: api/scala/org/apache/spark/graphx/Graph$.html
-[GraphOps]: api/scala/index.html#org.apache.spark.graphx.GraphOps
+[GraphOps]: api/scala/org/apache/spark/graphx/GraphOps.html
-[Graph.mapVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
+[Graph.mapVertices]: api/scala/org/apache/spark/graphx/Graph.html#mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
-[Graph.reverse]: api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]
+[Graph.reverse]: api/scala/org/apache/spark/graphx/Graph.html#reverse:Graph[VD,ED]
-[Graph.subgraph]: api/scala/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
+[Graph.subgraph]: api/scala/org/apache/spark/graphx/Graph.html#subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
-[Graph.mask]: api/scala/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
+[Graph.mask]: api/scala/org/apache/spark/graphx/Graph.html#mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
-[Graph.groupEdges]: api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]
+[Graph.groupEdges]: api/scala/org/apache/spark/graphx/Graph.html#groupEdges((ED,ED)⇒ED):Graph[VD,ED]
-[GraphOps.joinVertices]: api/scala/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
+[GraphOps.joinVertices]: api/scala/org/apache/spark/graphx/GraphOps.html#joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
-[Graph.outerJoinVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
+[Graph.outerJoinVertices]: api/scala/org/apache/spark/graphx/Graph.html#outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
-[Graph.aggregateMessages]: api/scala/index.html#org.apache.spark.graphx.Graph@aggregateMessages[A]((EdgeContext[VD,ED,A])⇒Unit,(A,A)⇒A,TripletFields)(ClassTag[A]):VertexRDD[A]
+[Graph.aggregateMessages]: api/scala/org/apache/spark/graphx/Graph.html#aggregateMessages[A]((EdgeContext[VD,ED,A])⇒Unit,(A,A)⇒A,TripletFields)(ClassTag[A]):VertexRDD[A]
-[EdgeContext]: api/scala/index.html#org.apache.spark.graphx.EdgeContext
+[EdgeContext]: api/scala/org/apache/spark/graphx/EdgeContext.html
-[GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
+[GraphOps.collectNeighborIds]: api/scala/org/apache/spark/graphx/GraphOps.html#collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
-[GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
+[GraphOps.collectNeighbors]: api/scala/org/apache/spark/graphx/GraphOps.html#collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
 [RDD Persistence]: rdd-programming-guide.html#rdd-persistence
-[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
+[Graph.cache]: api/scala/org/apache/spark/graphx/Graph.html#cache():Graph[VD,ED]
-[GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
+[GraphOps.pregel]: api/scala/org/apache/spark/graphx/GraphOps.html#pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
-[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$
+[PartitionStrategy]: api/scala/org/apache/spark/graphx/PartitionStrategy$.html
-[GraphLoader.edgeListFile]: api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
+[GraphLoader.edgeListFile]: api/scala/org/apache/spark/graphx/GraphLoader$.html#edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
-[Graph.apply]: api/scala/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.apply]: api/scala/org/apache/spark/graphx/Graph$.html#apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
-[Graph.fromEdgeTuples]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
+[Graph.fromEdgeTuples]: api/scala/org/apache/spark/graphx/Graph$.html#fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
-[Graph.fromEdges]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.fromEdges]: api/scala/org/apache/spark/graphx/Graph$.html#fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
-[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy
+[PartitionStrategy]: api/scala/org/apache/spark/graphx/PartitionStrategy$.html
-[PageRank]: api/scala/index.html#org.apache.spark.graphx.lib.PageRank$
+[PageRank]: api/scala/org/apache/spark/graphx/lib/PageRank$.html
-[ConnectedComponents]: api/scala/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
+[ConnectedComponents]: api/scala/org/apache/spark/graphx/lib/ConnectedComponents$.html
-[TriangleCount]: api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$
+[TriangleCount]: api/scala/org/apache/spark/graphx/lib/TriangleCount$.html
-[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
+[Graph.partitionBy]: api/scala/org/apache/spark/graphx/Graph.html#partitionBy(PartitionStrategy):Graph[VD,ED]
-[EdgeContext.sendToSrc]: api/scala/index.html#org.apache.spark.graphx.EdgeContext@sendToSrc(msg:A):Unit
+[EdgeContext.sendToSrc]: api/scala/org/apache/spark/graphx/EdgeContext.html#sendToSrc(msg:A):Unit
-[EdgeContext.sendToDst]: api/scala/index.html#org.apache.spark.graphx.EdgeContext@sendToDst(msg:A):Unit
+[EdgeContext.sendToDst]: api/scala/org/apache/spark/graphx/EdgeContext.html#sendToDst(msg:A):Unit
 [TripletFields]: api/java/org/apache/spark/graphx/TripletFields.html
 [TripletFields.All]: api/java/org/apache/spark/graphx/TripletFields.html#All
 [TripletFields.None]: api/java/org/apache/spark/graphx/TripletFields.html#None
@@ -74,7 +74,7 @@ license: |
 # Overview
 GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
-GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing a
+GraphX extends the Spark [RDD](api/scala/org/apache/spark/rdd/RDD.html) by introducing a
 new [Graph](#property_graph) abstraction: a directed multigraph with properties
 attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
 operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
@@ -99,7 +99,7 @@ getting started with Spark refer to the [Spark Quick Start Guide](quick-start.ht
 # The Property Graph
-The [property graph](api/scala/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
+The [property graph](api/scala/org/apache/spark/graphx/Graph.html) is a directed multigraph
 with user defined objects attached to each vertex and edge. A directed multigraph is a directed
 graph with potentially multiple parallel edges sharing the same source and destination vertex. The
 ability to support parallel edges simplifies modeling scenarios where there can be multiple
@@ -175,7 +175,7 @@ val userGraph: Graph[(String, String), String]
 There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
 generators and these are discussed in more detail in the section on
 [graph builders](#graph_builders). Probably the most general method is to use the
-[Graph object](api/scala/index.html#org.apache.spark.graphx.Graph$). For example the following
+[Graph object](api/scala/org/apache/spark/graphx/Graph$.html). For example the following
 code constructs a graph from a collection of RDDs:
 {% highlight scala %}
```

```diff
@@ -118,7 +118,7 @@ options for deployment:
 **API Docs:**
-* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
+* [Spark Scala API (Scaladoc)](api/scala/org/apache/spark/index.html)
 * [Spark Java API (Javadoc)](api/java/index.html)
 * [Spark Python API (Sphinx)](api/python/index.html)
 * [Spark R API (Roxygen2)](api/R/index.html)
```

```diff
@@ -55,10 +55,10 @@ other first-order optimizations.
 Quasi-Newton](https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/andrew07scalable.pdf)
 (OWL-QN) is an extension of L-BFGS that can effectively handle L1 and elastic net regularization.
-L-BFGS is used as a solver for [LinearRegression](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression),
+L-BFGS is used as a solver for [LinearRegression](api/scala/org/apache/spark/ml/regression/LinearRegression.html),
-[LogisticRegression](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression),
+[LogisticRegression](api/scala/org/apache/spark/ml/classification/LogisticRegression.html),
-[AFTSurvivalRegression](api/scala/index.html#org.apache.spark.ml.regression.AFTSurvivalRegression)
+[AFTSurvivalRegression](api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html)
-and [MultilayerPerceptronClassifier](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier).
+and [MultilayerPerceptronClassifier](api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html).
 MLlib L-BFGS solver calls the corresponding implementation in [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGS.scala).
@@ -108,4 +108,4 @@ It solves certain optimization problems iteratively through the following proced
 Since it involves solving a weighted least squares (WLS) problem by `WeightedLeastSquares` in each iteration,
 it also requires the number of features to be no more than 4096.
-Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression).
+Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html).
```

```diff
@@ -71,7 +71,7 @@ $\alpha$ and `regParam` corresponds to $\lambda$.
 <div data-lang="scala" markdown="1">
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/classification/LogisticRegression.html).
 {% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
 </div>
@@ -109,12 +109,12 @@ only available on the driver.
 <div data-lang="scala" markdown="1">
-[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
+[`LogisticRegressionTrainingSummary`](api/scala/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
 provides a summary for a
-[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
+[`LogisticRegressionModel`](api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html).
 In the case of binary classification, certain additional metrics are
 available, e.g. ROC curve. The binary summary can be accessed via the
-`binarySummary` method. See [`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
+`binarySummary` method. See [`BinaryLogisticRegressionTrainingSummary`](api/scala/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
 Continuing the earlier example:
@@ -216,7 +216,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
 {% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
@@ -261,7 +261,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
 </div>
@@ -302,7 +302,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/GBTClassifier.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
 </div>
@@ -358,7 +358,7 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
 </div>
@@ -403,7 +403,7 @@ in Spark ML supports binary classification with linear SVM. Internally, it optim
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.LinearSVC) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/LinearSVC.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/LinearSVCExample.scala %}
 </div>
@@ -447,7 +447,7 @@ The example below demonstrates how to load the
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.OneVsRest) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/OneVsRest.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
 </div>
@@ -501,7 +501,7 @@ setting the parameter $\lambda$ (default to $1.0$).
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/NaiveBayes.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/NaiveBayesExample.scala %}
 </div>
@@ -544,7 +544,7 @@ We scale features to be between 0 and 1 to prevent the exploding gradient proble
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.FMClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/FMClassifier.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/FMClassifierExample.scala %}
 </div>
@@ -585,7 +585,7 @@ regression model and extracting model summary statistics.
 <div data-lang="scala" markdown="1">
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/regression/LinearRegression.html).
 {% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
 </div>
@@ -726,7 +726,7 @@ function and extracting model summary statistics.
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
 </div>
@@ -768,7 +768,7 @@ We use a feature transformer to index categorical features, adding metadata to t
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
 {% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %}
 </div>
@@ -810,7 +810,7 @@ We use a feature transformer to index categorical features, adding metadata to t
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/RandomForestRegressorExample.scala %}
 </div>
@@ -851,7 +851,7 @@ be true in general.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GBTRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/GBTRegressor.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeRegressorExample.scala %}
 </div>
@@ -945,7 +945,7 @@ The implementation matches the result from R's survival function
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.AFTSurvivalRegression) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
 </div>
@@ -1025,7 +1025,7 @@ is treated as piecewise linear function. The rules for prediction therefore are:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.ml.regression.IsotonicRegression) for details on the API.
+Refer to the [`IsotonicRegression` Scala docs](api/scala/org/apache/spark/ml/regression/IsotonicRegression.html) for details on the API.
 {% include_example scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala %}
 </div>
@@ -1066,7 +1066,7 @@ We scale features to be between 0 and 1 to prevent the exploding gradient proble
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.FMRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/FMRegressor.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/FMRegressorExample.scala %}
 </div>
```

```diff
@@ -85,7 +85,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
 </div>
@@ -123,7 +123,7 @@ and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel`
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
 </div>
@@ -166,7 +166,7 @@ Bisecting K-means can often be much faster than regular K-means, but it will gen
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
 </div>
@@ -255,7 +255,7 @@ model.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
 </div>
@@ -302,7 +302,7 @@ using truncated power iteration on a normalized pair-wise similarity matrix of t
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
 {% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
 </div>
```

```diff
@@ -115,7 +115,7 @@ explicit (`implicitPrefs` is `false`).
 We evaluate the recommendation model by measuring the root-mean-square error of
 rating prediction.
-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.ml.recommendation.ALS)
+Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/ml/recommendation/ALS.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
```

```diff
@@ -42,7 +42,7 @@ The schema of the `image` column is:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+[`ImageDataSource`](api/scala/org/apache/spark/ml/source/image/ImageDataSource.html)
 implements a Spark SQL data source API for loading image data as a DataFrame.
 {% highlight scala %}
@@ -133,7 +133,7 @@ The schemas of the columns are:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`LibSVMDataSource`](api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource)
+[`LibSVMDataSource`](api/scala/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
 implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
 {% highlight scala %}
```

```diff
@@ -96,8 +96,8 @@ when using text as features. Our feature vectors could then be passed to a lear
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and
+Refer to the [HashingTF Scala docs](api/scala/org/apache/spark/ml/feature/HashingTF.html) and
-the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API.
+the [IDF Scala docs](api/scala/org/apache/spark/ml/feature/IDF.html) for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}
 </div>
@@ -135,7 +135,7 @@ In the following code segment, we start with a set of documents, each of which i
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Word2Vec Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec)
+Refer to the [Word2Vec Scala docs](api/scala/org/apache/spark/ml/feature/Word2Vec.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}
@@ -200,8 +200,8 @@ Each vector represents the token counts of the document over the vocabulary.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
+Refer to the [CountVectorizer Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
-and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel)
+and the [CountVectorizerModel Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}
@@ -286,7 +286,7 @@ The resulting feature vectors could then be passed to a learning algorithm.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [FeatureHasher Scala docs](api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher)
+Refer to the [FeatureHasher Scala docs](api/scala/org/apache/spark/ml/feature/FeatureHasher.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %}
@@ -313,9 +313,9 @@ for more details on the API.
 ## Tokenizer
-[Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.
+[Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/org/apache/spark/ml/feature/Tokenizer.html) class provides this functionality. The example below shows how to split sentences into sequences of words.
-[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more
+[RegexTokenizer](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html) allows more
 advanced tokenization based on regular expression (regex) matching.
 By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as delimiters to split the input text.
 Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes
@@ -326,8 +326,8 @@ for more details on the API.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Tokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)
+Refer to the [Tokenizer Scala docs](api/scala/org/apache/spark/ml/feature/Tokenizer.html)
-and the [RegexTokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
+and the [RegexTokenizer Scala docs](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}
@@ -395,7 +395,7 @@ filtered out.
 <div data-lang="scala" markdown="1">
-Refer to the [StopWordsRemover Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
+Refer to the [StopWordsRemover Scala docs](api/scala/org/apache/spark/ml/feature/StopWordsRemover.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}
@@ -430,7 +430,7 @@ An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (t
 <div data-lang="scala" markdown="1">
-Refer to the [NGram Scala docs](api/scala/index.html#org.apache.spark.ml.feature.NGram)
+Refer to the [NGram Scala docs](api/scala/org/apache/spark/ml/feature/NGram.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/NGramExample.scala %}
@@ -468,7 +468,7 @@ for `inputCol`.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Binarizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Binarizer)
+Refer to the [Binarizer Scala docs](api/scala/org/apache/spark/ml/feature/Binarizer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/BinarizerExample.scala %}
@@ -493,14 +493,14 @@ for more details on the API.
 ## PCA
-[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/index.html#org.apache.spark.ml.feature.PCA) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
+[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/org/apache/spark/ml/feature/PCA.html) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
 **Examples**
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [PCA Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PCA)
+Refer to the [PCA Scala docs](api/scala/org/apache/spark/ml/feature/PCA.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/PCAExample.scala %}
@@ -525,14 +525,14 @@ for more details on the API.
 ## PolynomialExpansion
-[Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
+[Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
 **Examples**
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [PolynomialExpansion Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion)
+Refer to the [PolynomialExpansion Scala docs](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala %}
@@ -561,7 +561,7 @@ The [Discrete Cosine
 Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
 transforms a length $N$ real-valued sequence in the time domain into
 another length $N$ real-valued sequence in the frequency domain. A
-[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+[DCT](api/scala/org/apache/spark/ml/feature/DCT.html) class
 provides this functionality, implementing the
 [DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
 and scaling the result by $1/\sqrt{2}$ such that the representing matrix
@@ -574,7 +574,7 @@ $0$th DCT coefficient and _not_ the $N/2$th).
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [DCT Scala docs](api/scala/index.html#org.apache.spark.ml.feature.DCT)
+Refer to the [DCT Scala docs](api/scala/org/apache/spark/ml/feature/DCT.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/DCTExample.scala %}
@@ -704,7 +704,7 @@ Notice that the rows containing "d" or "e" are mapped to index "3.0"
 <div data-lang="scala" markdown="1">
-Refer to the [StringIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer)
+Refer to the [StringIndexer Scala docs](api/scala/org/apache/spark/ml/feature/StringIndexer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/StringIndexerExample.scala %}
@@ -770,7 +770,7 @@ labels (they will be inferred from the columns' metadata):
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [IndexToString Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IndexToString)
+Refer to the [IndexToString Scala docs](api/scala/org/apache/spark/ml/feature/IndexToString.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/IndexToStringExample.scala %}
@@ -809,7 +809,7 @@ for more details on the API.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) for more details on the API.
+Refer to the [OneHotEncoder Scala docs](api/scala/org/apache/spark/ml/feature/OneHotEncoder.html) for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
 </div>
@@ -835,7 +835,7 @@ Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.f
 `VectorIndexer` helps index categorical features in datasets of `Vector`s.
 It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
-1. Take an input column of type [Vector](api/scala/index.html#org.apache.spark.ml.linalg.Vector) and a parameter `maxCategories`.
+1. Take an input column of type [Vector](api/scala/org/apache/spark/ml/linalg/Vector.html) and a parameter `maxCategories`.
 2. Decide which features should be categorical based on the number of distinct values, where features with at most `maxCategories` are declared categorical.
 3. Compute 0-based category indices for each categorical feature.
 4. Index categorical features and transform original feature values to indices.
@@ -849,7 +849,7 @@ In the example below, we read in a dataset of labeled points and then use `Vecto
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [VectorIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer)
+Refer to the [VectorIndexer Scala docs](api/scala/org/apache/spark/ml/feature/VectorIndexer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/VectorIndexerExample.scala %}
@@ -910,7 +910,7 @@ then `interactedCol` as the output column contains:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Interaction Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Interaction)
+Refer to the [Interaction Scala docs](api/scala/org/apache/spark/ml/feature/Interaction.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/InteractionExample.scala %}
@@ -944,7 +944,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Normalizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Normalizer)
+Refer to the [Normalizer Scala docs](api/scala/org/apache/spark/ml/feature/Normalizer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/NormalizerExample.scala %}
@@ -986,7 +986,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [StandardScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler)
+Refer to the [StandardScaler Scala docs](api/scala/org/apache/spark/ml/feature/StandardScaler.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/StandardScalerExample.scala %}
@@ -1030,7 +1030,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [RobustScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RobustScaler)
+Refer to the [RobustScaler Scala docs](api/scala/org/apache/spark/ml/feature/RobustScaler.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/RobustScalerExample.scala %}
@@ -1078,8 +1078,8 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [MinMaxScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)
+Refer to the [MinMaxScaler Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScaler.html)
-and the [MinMaxScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel)
+and the [MinMaxScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala %}
@@ -1121,8 +1121,8 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [MaxAbsScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MaxAbsScaler)
+Refer to the [MaxAbsScaler Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html)
-and the [MaxAbsScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MaxAbsScalerModel)
+and the [MaxAbsScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala %}
@@ -1157,7 +1157,7 @@ Note that if you have no idea of the upper and lower bounds of the targeted colu
 Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
-More details can be found in the API docs for [Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).
+More details can be found in the API docs for [Bucketizer](api/scala/org/apache/spark/ml/feature/Bucketizer.html).
 **Examples**
@@ -1166,7 +1166,7 @@ The following example demonstrates how to bucketize a column of `Double`s into a
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [Bucketizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer)
+Refer to the [Bucketizer Scala docs](api/scala/org/apache/spark/ml/feature/Bucketizer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/BucketizerExample.scala %}
@@ -1216,7 +1216,7 @@ This example below demonstrates how to transform vectors using a transforming ve
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [ElementwiseProduct Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct)
+Refer to the [ElementwiseProduct Scala docs](api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/ElementwiseProductExample.scala %}
@@ -1276,7 +1276,7 @@ This is the output of the `SQLTransformer` with statement `"SELECT *, (v1 + v2)
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [SQLTransformer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.SQLTransformer)
+Refer to the [SQLTransformer Scala docs](api/scala/org/apache/spark/ml/feature/SQLTransformer.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/SQLTransformerExample.scala %}
@@ -1336,7 +1336,7 @@ output column to `features`, after transformation we should get the following Da
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [VectorAssembler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler)
+Refer to the [VectorAssembler Scala docs](api/scala/org/apache/spark/ml/feature/VectorAssembler.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/VectorAssemblerExample.scala %}
@@ -1387,7 +1387,7 @@ to avoid this kind of inconsistent state.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
+Refer to the [VectorSizeHint Scala docs](api/scala/org/apache/spark/ml/feature/VectorSizeHint.html)
 for more details on the API.
 {% include_example scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala %}
@@ -1426,7 +1426,7 @@ NaN values, they will be handled specially and placed into their own bucket, for
 are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
 Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
-[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
+[approxQuantile](api/scala/org/apache/spark/sql/DataFrameStatFunctions.html) for a
 detailed description). The precision of the approximation can be controlled with the
 `relativeError` parameter. When set to zero, exact quantiles are calculated
```
(**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds (**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds
@ -1470,7 +1470,7 @@ a categorical one. Given `numBuckets = 3`, we should get the following DataFrame
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [QuantileDiscretizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.QuantileDiscretizer) Refer to the [QuantileDiscretizer Scala docs](api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala %} {% include_example scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala %}
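A minimal sketch tying together the `relativeError` and NaN-handling parameters described above; the input DataFrame `df` with a numeric column `hour` is assumed, and the parameter values are illustrative:

{% highlight scala %}
import org.apache.spark.ml.feature.QuantileDiscretizer

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("result")
  .setNumBuckets(3)
  .setRelativeError(0.01)    // precision of the approximate quantile algorithm
  .setHandleInvalid("keep")  // place NaN values into their own extra bucket

val result = discretizer.fit(df).transform(df)
{% endhighlight %}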
@ -1539,7 +1539,7 @@ the relevant column.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [Imputer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Imputer) Refer to the [Imputer Scala docs](api/scala/org/apache/spark/ml/feature/Imputer.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %} {% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
@ -1620,7 +1620,7 @@ Suppose also that we have potential input attributes for the `userFeatures`, i.e
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [VectorSlicer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer) Refer to the [VectorSlicer Scala docs](api/scala/org/apache/spark/ml/feature/VectorSlicer.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/VectorSlicerExample.scala %} {% include_example scala/org/apache/spark/examples/ml/VectorSlicerExample.scala %}
@ -1706,7 +1706,7 @@ id | country | hour | clicked | features | label
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [RFormula Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RFormula) Refer to the [RFormula Scala docs](api/scala/org/apache/spark/ml/feature/RFormula.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/RFormulaExample.scala %} {% include_example scala/org/apache/spark/examples/ml/RFormulaExample.scala %}
@ -1770,7 +1770,7 @@ id | features | clicked | selectedFeatures
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [ChiSqSelector Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ChiSqSelector) Refer to the [ChiSqSelector Scala docs](api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala %} {% include_example scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala %}
@ -1856,7 +1856,7 @@ Bucketed Random Projection accepts arbitrary vectors as input features, and supp
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [BucketedRandomProjectionLSH Scala docs](api/scala/index.html#org.apache.spark.ml.feature.BucketedRandomProjectionLSH) Refer to the [BucketedRandomProjectionLSH Scala docs](api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %} {% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
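As a rough sketch of the typical calls around `BucketedRandomProjectionLSH`: the DataFrames `dfA` and `dfB` (each with a `features` vector column) and the query vector `key` are assumed to exist, and the thresholds are arbitrary:

{% highlight scala %}
import org.apache.spark.ml.feature.BucketedRandomProjectionLSH

val brp = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = brp.fit(dfA)

// Approximate similarity join on Euclidean distance, keeping pairs closer than 1.5.
model.approxSimilarityJoin(dfA, dfB, 1.5, "EuclideanDistance").show()

// Approximate nearest-neighbor search: the 2 rows of dfA closest to `key`.
model.approxNearestNeighbors(dfA, key, 2).show()
{% endhighlight %}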
@ -1897,7 +1897,7 @@ The input sets for MinHash are represented as binary vectors, where the vector i
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [MinHashLSH Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH) Refer to the [MinHashLSH Scala docs](api/scala/org/apache/spark/ml/feature/MinHashLSH.html)
for more details on the API. for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %} {% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
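Analogously, a sketch for `MinHashLSH`, again assuming hypothetical DataFrames `dfA` and `dfB` whose `features` column holds sparse binary vectors:

{% highlight scala %}
import org.apache.spark.ml.feature.MinHashLSH

val mh = new MinHashLSH()
  .setNumHashTables(5)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = mh.fit(dfA)

// Approximate similarity join on Jaccard distance, keeping pairs with distance below 0.6.
model.approxSimilarityJoin(dfA, dfB, 0.6, "JaccardDistance").show()
{% endhighlight %}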
View file
@ -75,7 +75,7 @@ The `FPGrowthModel` provides:
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.FPGrowth) for more details. Refer to the [Scala API docs](api/scala/org/apache/spark/ml/fpm/FPGrowth.html) for more details.
{% include_example scala/org/apache/spark/examples/ml/FPGrowthExample.scala %} {% include_example scala/org/apache/spark/examples/ml/FPGrowthExample.scala %}
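A condensed sketch of the calls involved, assuming an active SparkSession named `spark`; the transactions are illustrative:

{% highlight scala %}
import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._  // assumes an active SparkSession named `spark`

val dataset = Seq("1 2 5", "1 2 3 5", "1 2")
  .map(_.split(" "))
  .toDF("items")

val model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(dataset)

model.freqItemsets.show()        // frequent itemsets
model.associationRules.show()    // association rules above the confidence threshold
model.transform(dataset).show()  // per-row predictions based on the rules
{% endhighlight %}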
</div> </div>
@ -128,7 +128,7 @@ pattern mining problem.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.PrefixSpan) for more details. Refer to the [Scala API docs](api/scala/org/apache/spark/ml/fpm/PrefixSpan.html) for more details.
{% include_example scala/org/apache/spark/examples/ml/PrefixSpanExample.scala %} {% include_example scala/org/apache/spark/examples/ml/PrefixSpanExample.scala %}
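A corresponding sketch for `PrefixSpan`, assuming an active SparkSession named `spark`; the column name `sequence` matches the default sequence column and the data is illustrative:

{% highlight scala %}
import org.apache.spark.ml.fpm.PrefixSpan
import spark.implicits._  // assumes an active SparkSession named `spark`

val sequences = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6))
).toDF("sequence")

val patterns = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .findFrequentSequentialPatterns(sequences)

patterns.show()
{% endhighlight %}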
</div> </div>
View file
@ -187,7 +187,7 @@ val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
{% endhighlight %} {% endhighlight %}
Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail. Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for further detail.
</div> </div>
<div data-lang="java" markdown="1"> <div data-lang="java" markdown="1">
@ -341,9 +341,9 @@ In the `spark.ml` package, there exists one breaking API change and one behavior
In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs: In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
* Gradient-Boosted Trees * Gradient-Boosted Trees
* *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs. * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/org/apache/spark/mllib/tree/loss/Loss.html) method was changed. This is only an issue for users who wrote their own losses for GBTs.
* *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters. * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.html) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm. * *(Breaking change)* The return value of [`LDA.run`](api/scala/org/apache/spark/mllib/clustering/LDA.html) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
In the `spark.ml` package, several major API changes occurred, including: In the `spark.ml` package, several major API changes occurred, including:
@ -359,12 +359,12 @@ changes for future releases.
In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental. In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed. * *(Breaking change)* In [`ALS`](api/scala/org/apache/spark/mllib/recommendation/ALS.html), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`. * *(Breaking change)* [`StandardScalerModel`](api/scala/org/apache/spark/mllib/feature/StandardScalerModel.html) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes: * *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html) remains an Experimental component. In it, there were two changes:
* The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods. * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
* Variable `model` is no longer public. * Variable `model` is no longer public.
* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes: * *(Breaking change)* [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) remains an Experimental component. In it and its associated classes, there were several changes:
* In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.) * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
* In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training. * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
* `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use. * `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
@ -373,31 +373,31 @@ In the `spark.mllib` package, there were several breaking changes. The first ch
In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here: In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
* The old [SchemaRDD](https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame. * The old [SchemaRDD](https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/org/apache/spark/sql/DataFrame.html) with a somewhat modified API. All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`. * In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details. * Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
Other changes were in `LogisticRegression`: Other changes were in `LogisticRegression`:
* The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future). * The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future. * In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html). The option to use an intercept will be added in the future.
## Upgrading from MLlib 1.1 to 1.2 ## Upgrading from MLlib 1.1 to 1.2
The only API changes in MLlib v1.2 are in The only API changes in MLlib v1.2 are in
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree), [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
which continues to be an experimental API in MLlib 1.2: which continues to be an experimental API in MLlib 1.2:
1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number 1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
of classes. In MLlib v1.1, this argument was called `numClasses` in Python and of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
`numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`. `numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
This `numClasses` parameter is specified either via This `numClasses` parameter is specified either via
[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy) [`Strategy`](api/scala/org/apache/spark/mllib/tree/configuration/Strategy.html)
or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) or via [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html)
static `trainClassifier` and `trainRegressor` methods. static `trainClassifier` and `trainRegressor` methods.
2. *(Breaking change)* The API for 2. *(Breaking change)* The API for
[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed. [`Node`](api/scala/org/apache/spark/mllib/tree/model/Node.html) has changed.
This should generally not affect user code, unless the user manually constructs decision trees This should generally not affect user code, unless the user manually constructs decision trees
(instead of using the `trainClassifier` or `trainRegressor` methods). (instead of using the `trainClassifier` or `trainRegressor` methods).
The tree `Node` now includes more information, including the probability of the predicted label The tree `Node` now includes more information, including the probability of the predicted label
@ -411,7 +411,7 @@ Examples in the Spark distribution and examples in the
## Upgrading from MLlib 1.0 to 1.1 ## Upgrading from MLlib 1.0 to 1.1
The only API changes in MLlib v1.1 are in The only API changes in MLlib v1.1 are in
[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree), [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
which continues to be an experimental API in MLlib 1.1: which continues to be an experimental API in MLlib 1.1:
1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match 1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
@ -421,12 +421,12 @@ and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes. In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes. In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
This depth is specified by the `maxDepth` parameter in This depth is specified by the `maxDepth` parameter in
[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy) [`Strategy`](api/scala/org/apache/spark/mllib/tree/configuration/Strategy.html)
or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) or via [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html)
static `trainClassifier` and `trainRegressor` methods. static `trainClassifier` and `trainRegressor` methods.
2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor` 2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree), methods to build a [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
rather than using the old parameter class `Strategy`. These new training methods explicitly rather than using the old parameter class `Strategy`. These new training methods explicitly
separate classification and regression, and they replace specialized parameter types with separate classification and regression, and they replace specialized parameter types with
simple `String` types. simple `String` types.
View file
@ -238,7 +238,7 @@ notes, then it should be treated as a bug to be fixed.
This section gives code examples illustrating the functionality discussed above. This section gives code examples illustrating the functionality discussed above.
For more info, please refer to the API documentation For more info, please refer to the API documentation
([Scala](api/scala/index.html#org.apache.spark.ml.package), ([Scala](api/scala/org/apache/spark/ml/package.html),
[Java](api/java/org/apache/spark/ml/package-summary.html), [Java](api/java/org/apache/spark/ml/package-summary.html),
and [Python](api/python/pyspark.ml.html)). and [Python](api/python/pyspark.ml.html)).
@ -250,9 +250,9 @@ This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.Estimator), Refer to the [`Estimator` Scala docs](api/scala/org/apache/spark/ml/Estimator.html),
the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and the [`Transformer` Scala docs](api/scala/org/apache/spark/ml/Transformer.html) and
the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API. the [`Params` Scala docs](api/scala/org/apache/spark/ml/param/Params.html) for details on the API.
{% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %} {% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
</div> </div>
@ -285,7 +285,7 @@ This example follows the simple text document `Pipeline` illustrated in the figu
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`Pipeline` Scala docs](api/scala/index.html#org.apache.spark.ml.Pipeline) for details on the API. Refer to the [`Pipeline` Scala docs](api/scala/org/apache/spark/ml/Pipeline.html) for details on the API.
{% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %} {% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
</div> </div>
View file
@ -50,7 +50,7 @@ correlation methods are currently Pearson's and Spearman's correlation.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$) [`Correlation`](api/scala/org/apache/spark/ml/stat/Correlation$.html)
computes the correlation matrix for the input Dataset of Vectors using the specified method. computes the correlation matrix for the input Dataset of Vectors using the specified method.
The output will be a DataFrame that contains the correlation matrix of the column of vectors. The output will be a DataFrame that contains the correlation matrix of the column of vectors.
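As a brief sketch (assuming an active SparkSession named `spark` and illustrative data), `Correlation.corr` returns a one-row DataFrame whose single cell is the correlation `Matrix`:

{% highlight scala %}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(
  Vectors.dense(1.0, 0.5, -1.0),
  Vectors.dense(2.0, 1.0, -2.0),
  Vectors.dense(4.0, 2.5, -4.5)
).map(Tuple1.apply).toDF("features")

val Row(pearson: Matrix) = Correlation.corr(df, "features").head
val Row(spearman: Matrix) = Correlation.corr(df, "features", "spearman").head
{% endhighlight %}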
@ -87,7 +87,7 @@ the Chi-squared statistic is computed. All label and feature values must be cate
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`ChiSquareTest` Scala docs](api/scala/index.html#org.apache.spark.ml.stat.ChiSquareTest$) for details on the API. Refer to the [`ChiSquareTest` Scala docs](api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html) for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %} {% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %}
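In outline (assuming an active SparkSession named `spark`; the values are illustrative and treated as categories), the test returns p-values, degrees of freedom and test statistics per feature:

{% highlight scala %}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (0.0, Vectors.dense(3.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0))
).toDF("label", "features")

val result = ChiSquareTest.test(df, "features", "label")
result.show(truncate = false)  // pValues, degreesOfFreedom, statistics
{% endhighlight %}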
</div> </div>
@ -114,7 +114,7 @@ as well as the total count.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The following example demonstrates using [`Summarizer`](api/scala/index.html#org.apache.spark.ml.stat.Summarizer$) The following example demonstrates using [`Summarizer`](api/scala/org/apache/spark/ml/stat/Summarizer$.html)
to compute the mean and variance for a vector column of the input dataframe, with and without a weight column. to compute the mean and variance for a vector column of the input dataframe, with and without a weight column.
{% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %} {% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %}
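In rough outline (the DataFrame `df` with a vector column `features` and a numeric column `weight` is assumed; both names are placeholders):

{% highlight scala %}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// Weighted mean and variance of the vector column, returned as a struct column.
df.select(
    Summarizer.metrics("mean", "variance")
      .summary(col("features"), col("weight")).as("summary"))
  .select("summary.mean", "summary.variance")
  .show(truncate = false)

// Without weights, the shortcut functions can be used directly.
df.select(Summarizer.mean(col("features")), Summarizer.variance(col("features")))
  .show(truncate = false)
{% endhighlight %}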
@ -133,4 +133,4 @@ Refer to the [`Summarizer` Python docs](api/python/index.html#pyspark.ml.stat.Su
{% include_example python/ml/summarizer_example.py %} {% include_example python/ml/summarizer_example.py %}
</div> </div>
</div> </div>
View file
@ -49,12 +49,12 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet
An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*. An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately. Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
MLlib supports model selection using tools such as [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) and [`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit). MLlib supports model selection using tools such as [`CrossValidator`](api/scala/org/apache/spark/ml/tuning/CrossValidator.html) and [`TrainValidationSplit`](api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html).
These tools require the following items: These tools require the following items:
* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm or `Pipeline` to tune * [`Estimator`](api/scala/org/apache/spark/ml/Estimator.html): algorithm or `Pipeline` to tune
* Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over * Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
* [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): metric to measure how well a fitted `Model` does on held-out test data * [`Evaluator`](api/scala/org/apache/spark/ml/evaluation/Evaluator.html): metric to measure how well a fitted `Model` does on held-out test data
At a high level, these model selection tools work as follows: At a high level, these model selection tools work as follows:
@ -63,13 +63,13 @@ At a high level, these model selection tools work as follows:
* For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`. * For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
* They select the `Model` produced by the best-performing set of parameters. * They select the `Model` produced by the best-performing set of parameters.
The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator) The `Evaluator` can be a [`RegressionEvaluator`](api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html)
for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator) for regression problems, a [`BinaryClassificationEvaluator`](api/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html)
for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator) for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html)
for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName` for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
method in each of these evaluators. method in each of these evaluators.
To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility. To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` with a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`. By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` with a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters. The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.
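To make the pieces concrete, a hedged sketch of wiring them together; the estimator `lr` (a `LogisticRegression`), the `pipeline`, and `trainingData` are assumed to exist already, and the grid values and `parallelism` setting are illustrative:

{% highlight scala %}
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Parameter grid: 2 x 3 = 6 ParamMaps to evaluate.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)                           // the Estimator (or Pipeline) to tune
  .setEvaluator(new BinaryClassificationEvaluator)  // metric on held-out folds
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(2)                                // evaluate parameter settings in parallel

val cvModel = cv.fit(trainingData)                  // trainingData is assumed to exist
{% endhighlight %}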
@ -93,7 +93,7 @@ However, it is also a well-established method for choosing parameters which is m
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API. Refer to the [`CrossValidator` Scala docs](api/scala/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %} {% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
</div> </div>
@ -133,7 +133,7 @@ Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`TrainValidationSplit` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit) for details on the API. Refer to the [`TrainValidationSplit` Scala docs](api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html) for details on the API.
{% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %} {% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
</div> </div>
View file
@ -55,12 +55,12 @@ initialization via k-means\|\|.
The following code snippets can be executed in `spark-shell`. The following code snippets can be executed in `spark-shell`.
In the following example after loading and parsing data, we use the In the following example after loading and parsing data, we use the
[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data [`KMeans`](api/scala/org/apache/spark/mllib/clustering/KMeans.html) object to cluster the data
into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
optimal *k* is usually one where there is an "elbow" in the WSSSE graph. optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API. Refer to the [`KMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeans.html) and [`KMeansModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeansModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}
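The gist of that flow, as a sketch rather than the full bundled example (the data path and parameter values are placeholders, and `sc` is the shell's SparkContext):

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric features into Vectors.
val parsedData = sc.textFile("data/mllib/kmeans_data.txt")
  .map(s => Vectors.dense(s.split(' ').map(_.toDouble)))
  .cache()

val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Within Set Sum of Squared Errors: decreases as k grows; look for the "elbow".
val WSSSE = clusters.computeCost(parsedData)
println(s"Within Set Sum of Squared Errors = $WSSSE")
{% endhighlight %}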
</div> </div>
@ -111,11 +111,11 @@ has the following parameters:
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
In the following example after loading and parsing data, we use a In the following example after loading and parsing data, we use a
[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture) [GaussianMixture](api/scala/org/apache/spark/mllib/clustering/GaussianMixture.html)
object to cluster the data into two clusters. The number of desired clusters is passed object to cluster the data into two clusters. The number of desired clusters is passed
to the algorithm. We then output the parameters of the mixture model. to the algorithm. We then output the parameters of the mixture model.
Refer to the [`GaussianMixture` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixtureModel) for details on the API. Refer to the [`GaussianMixture` Scala docs](api/scala/org/apache/spark/mllib/clustering/GaussianMixture.html) and [`GaussianMixtureModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/GaussianMixtureExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/GaussianMixtureExample.scala %}
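In outline (assuming `parsedData: RDD[Vector]` has already been loaded, as in the k-means snippet above):

{% highlight scala %}
import org.apache.spark.mllib.clustering.GaussianMixture

val gmm = new GaussianMixture().setK(2).run(parsedData)

// Print the learned weight, mean and covariance of each Gaussian component.
for (i <- 0 until gmm.k) {
  println(s"weight=${gmm.weights(i)}\nmu=${gmm.gaussians(i).mu}\nsigma=\n${gmm.gaussians(i).sigma}\n")
}
{% endhighlight %}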
</div> </div>
@ -172,15 +172,15 @@ In the following, we show code snippets to demonstrate how to use PIC in `spark.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering) [`PowerIterationClustering`](api/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
implements the PIC algorithm. implements the PIC algorithm.
It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
affinity matrix. affinity matrix.
Calling `PowerIterationClustering.run` returns a Calling `PowerIterationClustering.run` returns a
[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel), [`PowerIterationClusteringModel`](api/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html),
which contains the computed clustering assignments. which contains the computed clustering assignments.
Refer to the [`PowerIterationClustering` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel) for details on the API. Refer to the [`PowerIterationClustering` Scala docs](api/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.html) and [`PowerIterationClusteringModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala %}
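A compact sketch of that flow; `similarities: RDD[(Long, Long, Double)]` is a hypothetical affinity matrix assumed to exist, and k and the iteration count are illustrative:

{% highlight scala %}
import org.apache.spark.mllib.clustering.PowerIterationClustering

val pic = new PowerIterationClustering()
  .setK(2)
  .setMaxIterations(10)

val model = pic.run(similarities)

model.assignments.foreach { a =>
  println(s"${a.id} -> ${a.cluster}")
}
{% endhighlight %}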
</div> </div>
@ -278,9 +278,9 @@ separately.
**Expectation Maximization** **Expectation Maximization**
Implemented in Implemented in
[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer) [`EMLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/EMLDAOptimizer.html)
and and
[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel). [`DistributedLDAModel`](api/scala/org/apache/spark/mllib/clustering/DistributedLDAModel.html).
For the parameters provided to `LDA`: For the parameters provided to `LDA`:
@ -350,13 +350,13 @@ perplexity of the provided `documents` given the inferred topics.
**Examples** **Examples**
In the following example, we load word count vectors representing a corpus of documents. In the following example, we load word count vectors representing a corpus of documents.
We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) We then use [LDA](api/scala/org/apache/spark/mllib/clustering/LDA.html)
to infer three topics from the documents. The number of desired clusters is passed to infer three topics from the documents. The number of desired clusters is passed
to the algorithm. We then output the topics, represented as probability distributions over words. to the algorithm. We then output the topics, represented as probability distributions over words.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`LDA` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) and [`DistributedLDAModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel) for details on the API. Refer to the [`LDA` Scala docs](api/scala/org/apache/spark/mllib/clustering/LDA.html) and [`DistributedLDAModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/DistributedLDAModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala %}
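A skeleton of that usage, assuming `corpus: RDD[(Long, Vector)]` of document IDs and word-count vectors is already prepared:

{% highlight scala %}
import org.apache.spark.mllib.clustering.LDA

// Infer 3 topics with the default optimizer.
val ldaModel = new LDA().setK(3).run(corpus)

// Topics as a vocabSize x k matrix: column j is the word distribution of topic j.
val topics = ldaModel.topicsMatrix
{% endhighlight %}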
</div> </div>
@ -398,7 +398,7 @@ The implementation in MLlib has the following parameters:
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API. Refer to the [`BisectingKMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
</div> </div>
@ -451,7 +451,7 @@ This example shows how to estimate clusters on streaming data.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`StreamingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) for details on the API. Refer to the [`StreamingKMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/StreamingKMeans.html) for details on the API.
Also refer to the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for details on StreamingContext. Also refer to the [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for details on StreamingContext.
{% include_example scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala %}
View file
@ -76,11 +76,11 @@ best parameter learned from a sampled subset to the full dataset and expect simi
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
In the following example, we load rating data. Each row consists of a user, a product and a rating. In the following example, we load rating data. Each row consists of a user, a product and a rating.
We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$) We use the default [ALS.train()](api/scala/org/apache/spark/mllib/recommendation/ALS$.html)
method which assumes ratings are explicit. We evaluate the method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction. recommendation model by measuring the Mean Squared Error of rating prediction.
Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API. Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}
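The shape of that workflow, sketched with a placeholder data path and illustrative rank and iteration values (`sc` is the shell's SparkContext):

{% highlight scala %}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Each line: "userId,productId,rating"
val ratings = sc.textFile("data/mllib/als/test.data").map(_.split(',') match {
  case Array(user, product, rate) => Rating(user.toInt, product.toInt, rate.toDouble)
})

val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)

// Predict ratings for the observed (user, product) pairs to compute the MSE.
val predictions = model.predict(ratings.map(r => (r.user, r.product)))
{% endhighlight %}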
View file
@ -42,13 +42,13 @@ of the vector.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The base class of local vectors is The base class of local vectors is
[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two [`Vector`](api/scala/org/apache/spark/mllib/linalg/Vector.html), and we provide two
implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and implementations: [`DenseVector`](api/scala/org/apache/spark/mllib/linalg/DenseVector.html) and
[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend [`SparseVector`](api/scala/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
using the factory methods implemented in using the factory methods implemented in
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors. [`Vectors`](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) to create local vectors.
Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API. Refer to the [`Vector` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors} import org.apache.spark.mllib.linalg.{Vector, Vectors}
@ -138,9 +138,9 @@ For multiclass classification, labels should be class indices starting from zero
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
A labeled point is represented by the case class A labeled point is represented by the case class
[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint). [`LabeledPoint`](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html).
Refer to the [`LabeledPoint` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) for details on the API. Refer to the [`LabeledPoint` Scala docs](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.linalg.Vectors
@ -211,10 +211,10 @@ After loading, the feature indices are converted to zero-based.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training [`MLUtils.loadLibSVMFile`](api/scala/org/apache/spark/mllib/util/MLUtils$.html) reads training
examples stored in LIBSVM format. examples stored in LIBSVM format.
Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API. Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.regression.LabeledPoint
@ -272,14 +272,14 @@ is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the m
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The base class of local matrices is The base class of local matrices is
[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide two [`Matrix`](api/scala/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
implementations: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix), implementations: [`DenseMatrix`](api/scala/org/apache/spark/mllib/linalg/DenseMatrix.html),
and [`SparseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseMatrix). and [`SparseMatrix`](api/scala/org/apache/spark/mllib/linalg/SparseMatrix.html).
We recommend using the factory methods implemented We recommend using the factory methods implemented
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local in [`Matrices`](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order. matrices. Remember, local matrices in MLlib are stored in column-major order.
Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API. Refer to the [`Matrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices} import org.apache.spark.mllib.linalg.{Matrix, Matrices}
@ -377,12 +377,12 @@ limited by the integer range but it should be much smaller in practice.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be A [`RowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
created from an `RDD[Vector]` instance. Then we can compute its column summary statistics and decompositions. created from an `RDD[Vector]` instance. Then we can compute its column summary statistics and decompositions.
[QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix. [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.
For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html). For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html).
Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API. Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.linalg.Vector
@ -463,13 +463,13 @@ vector.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
An An
[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix) [`IndexedRowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
can be created from an `RDD[IndexedRow]` instance, where can be created from an `RDD[IndexedRow]` instance, where
[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a [`IndexedRow`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
its row indices. its row indices.
Refer to the [`IndexedRowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix) for details on the API. Refer to the [`IndexedRowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix} import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}
@ -568,14 +568,14 @@ dimensions of the matrix are huge and the matrix is very sparse.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
A A
[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix) [`CoordinateMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
can be created from an `RDD[MatrixEntry]` instance, where can be created from an `RDD[MatrixEntry]` instance, where
[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a [`MatrixEntry`](api/scala/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix` wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
with sparse rows by calling `toIndexedRowMatrix`. Other computations for with sparse rows by calling `toIndexedRowMatrix`. Other computations for
`CoordinateMatrix` are not currently supported. `CoordinateMatrix` are not currently supported.
Refer to the [`CoordinateMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix) for details on the API. Refer to the [`CoordinateMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry} import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
@ -678,12 +678,12 @@ the sub-matrix at the given index with size `rowsPerBlock` x `colsPerBlock`.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
A [`BlockMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix) can be A [`BlockMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`. most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
`toBlockMatrix` creates blocks of size 1024 x 1024 by default. `toBlockMatrix` creates blocks of size 1024 x 1024 by default.
Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`. Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.
Refer to the [`BlockMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix) for details on the API. Refer to the [`BlockMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry} import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}
View file
@ -151,7 +151,7 @@ When tuning these parameters, be careful to validate on held-out test data to av
* **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit. * **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees. * **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) since those are often trained deeper than individual trees.
* **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain). * **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).
@@ -167,13 +167,13 @@ These parameters may be tuned. Be careful to validate on held-out test data whe
* The default value is conservatively chosen to be 256 MiB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`. * The default value is conservatively chosen to be 256 MiB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
* *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per feature). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics. * *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per feature). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics.
* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint. * **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`GradientBoostedTrees`](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
* **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter. * **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.
### Caching and checkpointing ### Caching and checkpointing
MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large. MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) when `numTrees` is set to be large.
* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration. * **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
* This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration). * This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
@@ -207,7 +207,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API. Refer to the [`DecisionTree` Scala docs](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala %}
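As a hedged sketch (not the bundled example) of wiring the tuning and caching parameters discussed above through a `Strategy` object — the values are purely illustrative and `trainingData: RDD[LabeledPoint]` is assumed to be already loaded:
{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Strategy

// Start from the default classification strategy and override selected parameters
val strategy = Strategy.defaultStrategy("Classification")
strategy.maxDepth = 10
strategy.minInstancesPerNode = 5
strategy.minInfoGain = 0.001
strategy.maxMemoryInMB = 512
strategy.subsamplingRate = 0.8
strategy.useNodeIdCache = true
strategy.checkpointInterval = 10   // checkpointing also needs sc.setCheckpointDir(...)

val model = DecisionTree.train(trainingData, strategy)
{% endhighlight %}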
</div> </div>
@@ -238,7 +238,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API. Refer to the [`DecisionTree` Scala docs](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeRegressionExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/DecisionTreeRegressionExample.scala %}
</div> </div>

View file

@@ -77,7 +77,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`SingularValueDecomposition` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition) for details on the API. Refer to the [`SingularValueDecomposition` Scala docs](api/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/SVDExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/SVDExample.scala %}
@@ -117,14 +117,14 @@ the rotation matrix are called principal components. PCA is used widely in dimen
The following code demonstrates how to compute principal components on a `RowMatrix` The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space. and use them to project the vectors into a low-dimensional space.
Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API. Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala %}
The following code demonstrates how to compute principal components on source vectors The following code demonstrates how to compute principal components on source vectors
and use them to project the vectors into a low-dimensional space while keeping associated labels: and use them to project the vectors into a low-dimensional space while keeping associated labels:
Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API. Refer to the [`PCA` Scala docs](api/scala/org/apache/spark/mllib/feature/PCA.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala %}

View file

@@ -24,7 +24,7 @@ license: |
An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning) An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
is a learning algorithm which creates a model composed of a set of other base models. is a learning algorithm which creates a model composed of a set of other base models.
`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$). `spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`RandomForest`](api/scala/org/apache/spark/mllib/tree/RandomForest$.html).
Both use [decision trees](mllib-decision-tree.html) as their base models. Both use [decision trees](mllib-decision-tree.html) as their base models.
## Gradient-Boosted Trees vs. Random Forests ## Gradient-Boosted Trees vs. Random Forests
@@ -111,7 +111,7 @@ The test error is calculated to measure the algorithm accuracy.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API. Refer to the [`RandomForest` Scala docs](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`RandomForestModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
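The essential call looks roughly as follows — a minimal sketch assuming `trainingData: RDD[LabeledPoint]` is already loaded, with illustrative parameter values:
{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()   // all features treated as continuous
val numTrees = 20
val featureSubsetStrategy = "auto"              // let the algorithm choose
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val seed = 12345

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
{% endhighlight %}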
</div> </div>
@@ -142,7 +142,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API. Refer to the [`RandomForest` Scala docs](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`RandomForestModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
</div> </div>
@@ -252,7 +252,7 @@ The test error is calculated to measure the algorithm accuracy.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API. Refer to the [`GradientBoostedTrees` Scala docs](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala %}
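A minimal sketch of configuring `BoostingStrategy` for the classification case above (values illustrative, `trainingData: RDD[LabeledPoint]` assumed):
{% highlight scala %}
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy

val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 10          // number of trees in the ensemble
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 3   // shallow trees, as recommended above

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)
{% endhighlight %}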
</div> </div>
@@ -283,7 +283,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API. Refer to the [`GradientBoostedTrees` Scala docs](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala %}
</div> </div>

View file

@@ -117,7 +117,7 @@ The following code snippets illustrate how to load a sample dataset, train a bin
data, and evaluate the performance of the algorithm by several binary evaluation metrics. data, and evaluate the performance of the algorithm by several binary evaluation metrics.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`BinaryClassificationMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics) for details on the API. Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`BinaryClassificationMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala %}
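The evaluation step, sketched under the assumption that `model` has already been trained (e.g. with `LogisticRegressionWithLBFGS`) and `test: RDD[LabeledPoint]` is a held-out split:
{% highlight scala %}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Clear the default threshold so predict() returns raw scores instead of 0/1 labels
model.clearThreshold()

val scoreAndLabels = test.map { point =>
  (model.predict(point.features), point.label)
}

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println(s"Area under ROC = ${metrics.areaUnderROC()}")
println(s"Area under PR  = ${metrics.areaUnderPR()}")
{% endhighlight %}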
@@ -243,7 +243,7 @@ The following code snippets illustrate how to load a sample dataset, train a mul
the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics. the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`MulticlassMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics) for details on the API. Refer to the [`MulticlassMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/MulticlassMetricsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/MulticlassMetricsExample.scala %}
@@ -393,7 +393,7 @@ True classes:
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`MultilabelMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MultilabelMetrics) for details on the API. Refer to the [`MultilabelMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/MultiLabelMetricsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/MultiLabelMetricsExample.scala %}
@@ -521,7 +521,7 @@ expanded world of non-positive weights are "the same as never having interacted
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`RegressionMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RankingMetrics) for details on the API. Refer to the [`RegressionMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RankingMetricsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/RankingMetricsExample.scala %}

View file

@@ -69,12 +69,12 @@ We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
TF and IDF are implemented in [HashingTF](api/scala/index.html#org.apache.spark.mllib.feature.HashingTF) TF and IDF are implemented in [HashingTF](api/scala/org/apache/spark/mllib/feature/HashingTF.html)
and [IDF](api/scala/index.html#org.apache.spark.mllib.feature.IDF). and [IDF](api/scala/org/apache/spark/mllib/feature/IDF.html).
`HashingTF` takes an `RDD[Iterable[_]]` as the input. `HashingTF` takes an `RDD[Iterable[_]]` as the input.
Each record could be an iterable of strings or other types. Each record could be an iterable of strings or other types.
Refer to the [`HashingTF` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.HashingTF) for details on the API. Refer to the [`HashingTF` Scala docs](api/scala/org/apache/spark/mllib/feature/HashingTF.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}
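A minimal sketch of the two-pass TF-IDF flow described above; the input path is hypothetical:
{% highlight scala %}
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Each document is a sequence of terms (hypothetical input file)
val documents: RDD[Seq[String]] = sc.textFile("data/docs.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)

// IDF needs two passes: fit() computes the IDF vector, transform() rescales the term frequencies
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
{% endhighlight %}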
</div> </div>
@@ -135,7 +135,7 @@ Here we assume the extracted file is `text8` and in same directory as you run th
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`Word2Vec` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Word2Vec) for details on the API. Refer to the [`Word2Vec` Scala docs](api/scala/org/apache/spark/mllib/feature/Word2Vec.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/Word2VecExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/Word2VecExample.scala %}
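The core of that example, sketched against the assumed `text8` corpus (the query word is only an illustration):
{% highlight scala %}
import org.apache.spark.mllib.feature.Word2Vec

// Each line of the corpus becomes a sequence of words
val input = sc.textFile("text8").map(line => line.split(" ").toSeq)

val word2vec = new Word2Vec()
val model = word2vec.fit(input)

// Find the 5 words closest to the query in the learned vector space
val synonyms = model.findSynonyms("china", 5)
synonyms.foreach { case (word, cosineSimilarity) => println(s"$word $cosineSimilarity") }
{% endhighlight %}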
</div> </div>
@@ -159,19 +159,19 @@ against features with very large variances exerting an overly large influence du
### Model Fitting ### Model Fitting
[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the [`StandardScaler`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) has the
following parameters in the constructor: following parameters in the constructor:
* `withMean` False by default. Centers the data with mean before scaling. It will build a dense * `withMean` False by default. Centers the data with mean before scaling. It will build a dense
output, so take care when applying to sparse input. output, so take care when applying to sparse input.
* `withStd` True by default. Scales the data to unit standard deviation. * `withStd` True by default. Scales the data to unit standard deviation.
We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) method in We provide a [`fit`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) method in
`StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then `StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
return a model which can transform the input dataset into unit standard deviation and/or zero mean features return a model which can transform the input dataset into unit standard deviation and/or zero mean features
depending how we configure the `StandardScaler`. depending how we configure the `StandardScaler`.
This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) This model implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
which can apply the standardization on a `Vector` to produce a transformed `Vector` or on which can apply the standardization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`. an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
@@ -185,7 +185,7 @@ so that the new features have unit standard deviation and/or zero mean.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`StandardScaler` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) for details on the API. Refer to the [`StandardScaler` Scala docs](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/StandardScalerExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/StandardScalerExample.scala %}
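A minimal sketch of the fit/transform flow, using the sample LIBSVM data shipped with Spark:
{% highlight scala %}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// withStd only: safe for sparse input; withMean = true would densify the vectors
val scaler = new StandardScaler(withMean = false, withStd = true).fit(data.map(_.features))
val scaledData = data.map(p => (p.label, scaler.transform(p.features)))
{% endhighlight %}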
</div> </div>
@@ -203,12 +203,12 @@ Normalizer scales individual samples to have unit $L^p$ norm. This is a common o
classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors
is the cosine similarity of the vectors. is the cosine similarity of the vectors.
[`Normalizer`](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) has the following [`Normalizer`](api/scala/org/apache/spark/mllib/feature/Normalizer.html) has the following
parameter in the constructor: parameter in the constructor:
* `p` Normalization in $L^p$ space, $p = 2$ by default. * `p` Normalization in $L^p$ space, $p = 2$ by default.
`Normalizer` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) `Normalizer` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
which can apply the normalization on a `Vector` to produce a transformed `Vector` or on which can apply the normalization on a `Vector` to produce a transformed `Vector` or on
an `RDD[Vector]` to produce a transformed `RDD[Vector]`. an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
@@ -221,7 +221,7 @@ with $L^2$ norm, and $L^\infty$ norm.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`Normalizer` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) for details on the API. Refer to the [`Normalizer` Scala docs](api/scala/org/apache/spark/mllib/feature/Normalizer.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/NormalizerExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/NormalizerExample.scala %}
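Sketched on the same sample data, normalizing each feature vector (paths and norms as in the bundled example):
{% highlight scala %}
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val normalizer1 = new Normalizer()                           // L^2 norm by default
val normalizerInf = new Normalizer(Double.PositiveInfinity)  // L^infinity norm

val l2Data = data.map(p => (p.label, normalizer1.transform(p.features)))
val lInfData = data.map(p => (p.label, normalizerInf.transform(p.features)))
{% endhighlight %}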
</div> </div>
@@ -239,7 +239,7 @@ Refer to the [`Normalizer` Python docs](api/python/pyspark.mllib.html#pyspark.ml
features for use in model construction. It reduces the size of the feature space, which can improve features for use in model construction. It reduces the size of the feature space, which can improve
both speed and statistical learning behavior. both speed and statistical learning behavior.
[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements [`ChiSqSelector`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) implements
Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
[Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`: features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
@@ -257,7 +257,7 @@ The number of features to select can be tuned using a held-out validation set.
### Model Fitting ### Model Fitting
The [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method takes The [`fit`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) method takes
an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space. returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to
@@ -272,7 +272,7 @@ The following example shows the basic use of ChiSqSelector. The data set used ha
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`ChiSqSelector` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) Refer to the [`ChiSqSelector` Scala docs](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html)
for details on the API. for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/ChiSqSelectorExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/ChiSqSelectorExample.scala %}
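The fit/transform flow in outline, assuming `data: RDD[LabeledPoint]` whose features have already been binned into categorical values:
{% highlight scala %}
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// Keep the 50 features most predictive of the label (numTopFeatures method)
val selector = new ChiSqSelector(50)
val transformer = selector.fit(data)

val filteredData = data.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}
{% endhighlight %}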
@@ -312,11 +312,11 @@ v_N
\end{pmatrix} \end{pmatrix}
\]` \]`
[`ElementwiseProduct`](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) has the following parameter in the constructor: [`ElementwiseProduct`](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) has the following parameter in the constructor:
* `scalingVec`: the transforming vector. * `scalingVec`: the transforming vector.
`ElementwiseProduct` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`. `ElementwiseProduct` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
### Example ### Example
@@ -325,7 +325,7 @@ This example below demonstrates how to transform vectors using a transforming ve
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`ElementwiseProduct` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) for details on the API. Refer to the [`ElementwiseProduct` Scala docs](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala %}
</div> </div>

View file

@@ -54,18 +54,18 @@ We refer users to the papers for more details.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the [`FPGrowth`](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
FP-growth algorithm. FP-growth algorithm.
It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type. It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
Calling `FPGrowth.run` with transactions returns an Calling `FPGrowth.run` with transactions returns an
[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel) [`FPGrowthModel`](api/scala/org/apache/spark/mllib/fpm/FPGrowthModel.html)
that stores the frequent itemsets with their frequencies. The following that stores the frequent itemsets with their frequencies. The following
example illustrates how to mine frequent itemsets and association rules example illustrates how to mine frequent itemsets and association rules
(see [Association (see [Association
Rules](mllib-frequent-pattern-mining.html#association-rules) for Rules](mllib-frequent-pattern-mining.html#association-rules) for
details) from `transactions`. details) from `transactions`.
Refer to the [`FPGrowth` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) for details on the API. Refer to the [`FPGrowth` Scala docs](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/SimpleFPGrowth.scala %} {% include_example scala/org/apache/spark/examples/mllib/SimpleFPGrowth.scala %}
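A minimal sketch of mining itemsets and rules from a few made-up transactions:
{% highlight scala %}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

val transactions: RDD[Array[String]] = sc.parallelize(Seq(
  Array("a", "b", "c"),
  Array("a", "b", "d"),
  Array("b", "e")))

val fpg = new FPGrowth().setMinSupport(0.5).setNumPartitions(2)
val model = fpg.run(transactions)

// Frequent itemsets with their frequencies
model.freqItemsets.collect().foreach { itemset =>
  println(s"${itemset.items.mkString("[", ",", "]")}: ${itemset.freq}")
}

// Association rules with a minimum confidence of 0.8
model.generateAssociationRules(0.8).collect().foreach { rule =>
  println(s"${rule.antecedent.mkString(",")} => ${rule.consequent.mkString(",")}: ${rule.confidence}")
}
{% endhighlight %}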
@@ -111,7 +111,7 @@ Refer to the [`FPGrowth` Python docs](api/python/pyspark.mllib.html#pyspark.mlli
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[AssociationRules](api/scala/index.html#org.apache.spark.mllib.fpm.AssociationRules) [AssociationRules](api/scala/org/apache/spark/mllib/fpm/AssociationRules.html)
implements a parallel rule generation algorithm for constructing rules implements a parallel rule generation algorithm for constructing rules
that have a single item as the consequent. that have a single item as the consequent.
@@ -168,13 +168,13 @@ The following example illustrates PrefixSpan running on the sequences
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the [`PrefixSpan`](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
PrefixSpan algorithm. PrefixSpan algorithm.
Calling `PrefixSpan.run` returns a Calling `PrefixSpan.run` returns a
[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) [`PrefixSpanModel`](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
that stores the frequent sequences with their frequencies. that stores the frequent sequences with their frequencies.
Refer to the [`PrefixSpan` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) and [`PrefixSpanModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) for details on the API. Refer to the [`PrefixSpan` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/PrefixSpanExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/PrefixSpanExample.scala %}

View file

@@ -74,7 +74,7 @@ i.e. 4710.28,500.00. The data are split to training and testing set.
Model is created using the training set and a mean squared error is calculated from the predicted Model is created using the training set and a mean squared error is calculated from the predicted
labels and real labels in the test set. labels and real labels in the test set.
Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel) for details on the API. Refer to the [`IsotonicRegression` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala %}
</div> </div>

View file

@@ -184,7 +184,7 @@ training algorithm on this training data using a static method in the algorithm
object, and make predictions with the resulting model to compute the training object, and make predictions with the resulting model to compute the training
error. error.
Refer to the [`SVMWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) and [`SVMModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMModel) for details on the API. Refer to the [`SVMWithSGD` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %}
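The essential calls, sketched on the sample LIBSVM data (split ratios and iteration count are illustrative):
{% highlight scala %}
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the threshold so predict() returns raw margins for ROC evaluation
model.clearThreshold()
val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val auROC = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()
{% endhighlight %}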
@@ -305,11 +305,11 @@ We recommend L-BFGS over mini-batch gradient descent for faster convergence.
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The following code illustrates how to load a sample multiclass dataset, split it into train and The following code illustrates how to load a sample multiclass dataset, split it into train and
test, and use test, and use
[LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
to fit a logistic regression model. to fit a logistic regression model.
Then the model is evaluated against the test dataset and saved to disk. Then the model is evaluated against the test dataset and saved to disk.
Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`LogisticRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel) for details on the API. Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`LogisticRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/LogisticRegressionWithLBFGSExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/LogisticRegressionWithLBFGSExample.scala %}
@@ -438,8 +438,8 @@ regularization parameter (`regParam`) along with various parameters associated w
gradient descent (`stepSize`, `numIterations`, `miniBatchFraction`). For each of them, we support gradient descent (`stepSize`, `numIterations`, `miniBatchFraction`). For each of them, we support
all three possible regularizations (none, L1 or L2). all three possible regularizations (none, L1 or L2).
For Logistic Regression, [L-BFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) For Logistic Regression, [L-BFGS](api/scala/org/apache/spark/mllib/optimization/LBFGS.html)
version is implemented under [LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS), and this version is implemented under [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html), and this
version supports both binary and multinomial Logistic Regression while SGD version only supports version supports both binary and multinomial Logistic Regression while SGD version only supports
binary Logistic Regression. However, L-BFGS version doesn't support L1 regularization but SGD one binary Logistic Regression. However, L-BFGS version doesn't support L1 regularization but SGD one
supports L1 regularization. When L1 regularization is not required, L-BFGS version is strongly supports L1 regularization. When L1 regularization is not required, L-BFGS version is strongly
@@ -448,10 +448,10 @@ inverse Hessian matrix using quasi-Newton method.
Algorithms are all implemented in Scala: Algorithms are all implemented in Scala:
* [SVMWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) * [SVMWithSGD](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html)
* [LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) * [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
* [LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD) * [LogisticRegressionWithSGD](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html)
* [LinearRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) * [LinearRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html)
* [RidgeRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD) * [RidgeRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/RidgeRegressionWithSGD.html)
* [LassoWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD) * [LassoWithSGD](api/scala/org/apache/spark/mllib/regression/LassoWithSGD.html)

View file

@@ -46,14 +46,14 @@ sparsity. Since the training data is only used once, it is not necessary to cach
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements [NaiveBayes](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) implements
multinomial naive Bayes. It takes an RDD of multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional [LabeledPoint](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) and an optional
smoothing parameter `lambda` as input, an optional model type parameter (default is "multinomial"), and outputs a smoothing parameter `lambda` as input, an optional model type parameter (default is "multinomial"), and outputs a
[NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which [NaiveBayesModel](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
can be used for evaluation and prediction. can be used for evaluation and prediction.
Refer to the [`NaiveBayes` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel) for details on the API. Refer to the [`NaiveBayes` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) and [`NaiveBayesModel` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
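A minimal sketch, assuming `training` and `test` are `RDD[LabeledPoint]` splits prepared beforehand:
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = predictionAndLabel.filter(x => x._1 == x._2).count().toDouble / test.count()
{% endhighlight %}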
</div> </div>

View file

@@ -111,12 +111,12 @@ As an alternative to just use the subgradient `$R'(\wv)$` of the regularizer in
direction, an improved update for some cases can be obtained by using the proximal operator direction, an improved update for some cases can be obtained by using the proximal operator
instead. instead.
For the L1-regularizer, the proximal operator is given by soft thresholding, as implemented in For the L1-regularizer, the proximal operator is given by soft thresholding, as implemented in
[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater). [L1Updater](api/scala/org/apache/spark/mllib/optimization/L1Updater.html).
### Update schemes for distributed SGD ### Update schemes for distributed SGD
The SGD implementation in The SGD implementation in
[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses [GradientDescent](api/scala/org/apache/spark/mllib/optimization/GradientDescent.html) uses
a simple (distributed) sampling of the data examples. a simple (distributed) sampling of the data examples.
We recall that the loss part of the optimization problem `$\eqref{eq:regPrimal}$` is We recall that the loss part of the optimization problem `$\eqref{eq:regPrimal}$` is
`$\frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)$`, and therefore `$\frac1n \sum_{i=1}^n L'_{\wv,i}$` would `$\frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)$`, and therefore `$\frac1n \sum_{i=1}^n L'_{\wv,i}$` would
@@ -169,7 +169,7 @@ are developed, see the
section for example. section for example.
The SGD class The SGD class
[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) [GradientDescent](api/scala/org/apache/spark/mllib/optimization/GradientDescent.html)
sets the following parameters: sets the following parameters:
* `Gradient` is a class that computes the stochastic gradient of the function * `Gradient` is a class that computes the stochastic gradient of the function
@@ -195,15 +195,15 @@ each iteration, to compute the gradient direction.
L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
function, and updater into optimizer yourself instead of using the training APIs like function, and updater into optimizer yourself instead of using the training APIs like
[LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD). [LogisticRegressionWithSGD](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html).
See the example below. It will be addressed in the next release. See the example below. It will be addressed in the next release.
The L1 regularization by using The L1 regularization by using
[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater) will not work since the [L1Updater](api/scala/org/apache/spark/mllib/optimization/L1Updater.html) will not work since the
soft-thresholding logic in L1Updater is designed for gradient descent. See the developer's note. soft-thresholding logic in L1Updater is designed for gradient descent. See the developer's note.
The L-BFGS method The L-BFGS method
[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) [LBFGS.runLBFGS](api/scala/org/apache/spark/mllib/optimization/LBFGS.html)
has the following parameters: has the following parameters:
* `Gradient` is a class that computes the gradient of the objective function * `Gradient` is a class that computes the gradient of the objective function
@@ -233,7 +233,7 @@ L-BFGS optimizer.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
Refer to the [`LBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) and [`SquaredL2Updater` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.SquaredL2Updater) for details on the API. Refer to the [`LBFGS` Scala docs](api/scala/org/apache/spark/mllib/optimization/LBFGS.html) and [`SquaredL2Updater` Scala docs](api/scala/org/apache/spark/mllib/optimization/SquaredL2Updater.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/LBFGSExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/LBFGSExample.scala %}
</div> </div>

View file

@@ -62,7 +62,7 @@ To export a supported `model` (see table above) to PMML, simply call `model.toPM
As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats. As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats.
Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API. Refer to the [`KMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeans.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.
Here is a complete example of building a KMeansModel and printing it out in PMML format: Here is a complete example of building a KMeansModel and printing it out in PMML format:
{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}

View file

@@ -48,12 +48,12 @@ available in `Statistics`.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of [`colStats()`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) returns an instance of
[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary), [`MultivariateStatisticalSummary`](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
total count. total count.
Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary) for details on the API. Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
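Sketched with a few made-up observation vectors:
{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)))

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean)        // column-wise mean
println(summary.variance)    // column-wise variance
println(summary.numNonzeros) // number of nonzeros per column
{% endhighlight %}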
</div> </div>
@@ -91,11 +91,11 @@ correlation methods are currently Pearson's and Spearman's correlation.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to [`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively. an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API. Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
</div> </div>
@@ -137,7 +137,7 @@ python.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to [`sampleByKeyExact()`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) allows users to
sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
@@ -181,7 +181,7 @@ independence tests.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to [`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
run Pearson's chi-squared tests. The following example demonstrates how to run and interpret run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
hypothesis tests. hypothesis tests.
@@ -221,11 +221,11 @@ message.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to [`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests. and interpret the hypothesis tests.
Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API. Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
</div> </div>
@@ -269,7 +269,7 @@ all prior batches.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest) [`StreamingTest`](api/scala/org/apache/spark/mllib/stat/test/StreamingTest.html)
provides streaming hypothesis testing. provides streaming hypothesis testing.
{% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
@@ -292,12 +292,12 @@ uniform, standard normal, or Poisson.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory [`RandomRDDs`](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) provides factory
methods to generate random double RDDs or vector RDDs. methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`. distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API. Refer to the [`RandomRDDs` Scala docs](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) for details on the API.
{% highlight scala %} {% highlight scala %}
import org.apache.spark.SparkContext import org.apache.spark.SparkContext
@@ -370,11 +370,11 @@ mean of PDFs of normal distributions centered around each of the samples.
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods [`KernelDensity`](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) provides methods
to compute kernel density estimates from an RDD of samples. The following example demonstrates how to compute kernel density estimates from an RDD of samples. The following example demonstrates how
to do so. to do so.
Refer to the [`KernelDensity` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) for details on the API. Refer to the [`KernelDensity` Scala docs](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %} {% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %}
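A minimal sketch with a made-up sample and bandwidth:
{% highlight scala %}
import org.apache.spark.mllib.stat.KernelDensity

val sample = sc.parallelize(Seq(1.0, 1.5, 2.0, 2.2, 3.0))

val kd = new KernelDensity()
  .setSample(sample)
  .setBandwidth(3.0)

// Densities of the estimated distribution at the given evaluation points
val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
{% endhighlight %}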
</div> </div>

View file

@@ -57,7 +57,7 @@ scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string] textFile: org.apache.spark.sql.Dataset[String] = [value: string]
{% endhighlight %} {% endhighlight %}
You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/index.html#org.apache.spark.sql.Dataset)_. You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/org/apache/spark/sql/Dataset.html)_.
{% highlight scala %} {% highlight scala %}
scala> textFile.count() // Number of items in this Dataset scala> textFile.count() // Number of items in this Dataset

View file

@@ -149,8 +149,8 @@ $ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/pytho
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The first thing a Spark program must do is to create a [SparkContext](api/scala/index.html#org.apache.spark.SparkContext) object, which tells Spark The first thing a Spark program must do is to create a [SparkContext](api/scala/org/apache/spark/SparkContext.html) object, which tells Spark
how to access a cluster. To create a `SparkContext` you first need to build a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object how to access a cluster. To create a `SparkContext` you first need to build a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object
that contains information about your application. that contains information about your application.
Only one SparkContext should be active per JVM. You must `stop()` the active SparkContext before creating a new one. Only one SparkContext should be active per JVM. You must `stop()` the active SparkContext before creating a new one.
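For example (the application name and master URL below are placeholders):
{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")       // shown in the cluster UI
  .setMaster("local[2]")     // or a cluster URL such as spark://host:7077
val sc = new SparkContext(conf)
{% endhighlight %}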
@@ -500,7 +500,7 @@ then this approach should work well for such cases.
If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to
transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler. transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.
A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided A [Converter](api/scala/org/apache/spark/api/python/Converter.html) trait is provided
for this. Simply extend this trait and implement your transformation code in the ```convert``` for this. Simply extend this trait and implement your transformation code in the ```convert```
method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark
classpath. classpath.
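As a hedged sketch (assuming the `Converter[T, U]` trait with a single `convert` method; the record type handled here is hypothetical and depends on your `InputFormat`):

{% highlight scala %}
import java.nio.charset.StandardCharsets

import org.apache.spark.api.python.Converter

// Hypothetical converter: turn a custom binary record into a UTF-8 string
// that Pyrolite's pickler can handle for the Python side.
class BytesToStringConverter extends Converter[Any, String] {
  override def convert(obj: Any): String = obj match {
    case bytes: Array[Byte] => new String(bytes, StandardCharsets.UTF_8)
    case other              => String.valueOf(other)
  }
}
{% endhighlight %}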
@ -856,7 +856,7 @@ by a key.
In Scala, these operations are automatically available on RDDs containing In Scala, these operations are automatically available on RDDs containing
[Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) objects [Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) objects
(the built-in tuples in the language, created by simply writing `(a, b)`). The key-value pair operations are available in the (the built-in tuples in the language, created by simply writing `(a, b)`). The key-value pair operations are available in the
[PairRDDFunctions](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) class, [PairRDDFunctions](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) class,
which automatically wraps around an RDD of tuples. which automatically wraps around an RDD of tuples.
For example, the following code uses the `reduceByKey` operation on key-value pairs to count how For example, the following code uses the `reduceByKey` operation on key-value pairs to count how
@ -946,12 +946,12 @@ We could also use `counts.sortByKey()`, for example, to sort the pairs alphabeti
The following table lists some of the common transformations supported by Spark. Refer to the The following table lists some of the common transformations supported by Spark. Refer to the
RDD API doc RDD API doc
([Scala](api/scala/index.html#org.apache.spark.rdd.RDD), ([Scala](api/scala/org/apache/spark/rdd/RDD.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html), [Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
[Python](api/python/pyspark.html#pyspark.RDD), [Python](api/python/pyspark.html#pyspark.RDD),
[R](api/R/index.html)) [R](api/R/index.html))
and pair RDD functions doc and pair RDD functions doc
([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions), ([Scala](api/scala/org/apache/spark/rdd/PairRDDFunctions.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)) [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
for details. for details.
@ -1060,13 +1060,13 @@ for details.
The following table lists some of the common actions supported by Spark. Refer to the The following table lists some of the common actions supported by Spark. Refer to the
RDD API doc RDD API doc
([Scala](api/scala/index.html#org.apache.spark.rdd.RDD), ([Scala](api/scala/org/apache/spark/rdd/RDD.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html), [Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
[Python](api/python/pyspark.html#pyspark.RDD), [Python](api/python/pyspark.html#pyspark.RDD),
[R](api/R/index.html)) [R](api/R/index.html))
and pair RDD functions doc and pair RDD functions doc
([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions), ([Scala](api/scala/org/apache/spark/rdd/PairRDDFunctions.html),
[Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)) [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
for details. for details.
@ -1208,7 +1208,7 @@ In addition, each persisted RDD can be stored using a different *storage level*,
to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space),
replicate it across nodes. replicate it across nodes.
These levels are set by passing a These levels are set by passing a
`StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel), `StorageLevel` object ([Scala](api/scala/org/apache/spark/storage/StorageLevel.html),
[Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html), [Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html),
[Python](api/python/pyspark.html#pyspark.StorageLevel)) [Python](api/python/pyspark.html#pyspark.StorageLevel))
to `persist()`. The `cache()` method is a shorthand for using the default storage level, to `persist()`. The `cache()` method is a shorthand for using the default storage level,
@ -1404,11 +1404,11 @@ res2: Long = 10
{% endhighlight %} {% endhighlight %}
While this code used the built-in support for accumulators of type Long, programmers can also While this code used the built-in support for accumulators of type Long, programmers can also
create their own types by subclassing [AccumulatorV2](api/scala/index.html#org.apache.spark.util.AccumulatorV2). create their own types by subclassing [AccumulatorV2](api/scala/org/apache/spark/util/AccumulatorV2.html).
The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting
the accumulator to zero, `add` for adding another value into the accumulator, the accumulator to zero, `add` for adding another value into the accumulator,
`merge` for merging another same-type accumulator into this one. Other methods that must be overridden `merge` for merging another same-type accumulator into this one. Other methods that must be overridden
are contained in the [API documentation](api/scala/index.html#org.apache.spark.util.AccumulatorV2). For example, supposing we had a `MyVector` class are contained in the [API documentation](api/scala/org/apache/spark/util/AccumulatorV2.html). For example, supposing we had a `MyVector` class
representing mathematical vectors, we could write: representing mathematical vectors, we could write:
{% highlight scala %} {% highlight scala %}
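// Illustrative sketch only -- `MyVector` itself is hypothetical, so this variant
// sums fixed-length Double vectors element-wise instead.
import org.apache.spark.util.AccumulatorV2

class VectorAccumulator(size: Int) extends AccumulatorV2[Array[Double], Array[Double]] {
  private var vec = Array.fill(size)(0.0)

  override def isZero: Boolean = vec.forall(_ == 0.0)

  override def copy(): VectorAccumulator = {
    val acc = new VectorAccumulator(size)
    acc.vec = vec.clone()
    acc
  }

  override def reset(): Unit = { vec = Array.fill(size)(0.0) }

  override def add(v: Array[Double]): Unit = {
    require(v.length == size, "dimension mismatch")
    var i = 0
    while (i < size) { vec(i) += v(i); i += 1 }
  }

  override def merge(other: AccumulatorV2[Array[Double], Array[Double]]): Unit = add(other.value)

  override def value: Array[Double] = vec.clone()
}

// Register it so it appears in the web UI (assuming an existing SparkContext `sc`):
// sc.register(new VectorAccumulator(3), "vectorAccumulator")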
@ -1457,11 +1457,11 @@ accum.value();
{% endhighlight %} {% endhighlight %}
While this code used the built-in support for accumulators of type Long, programmers can also While this code used the built-in support for accumulators of type Long, programmers can also
create their own types by subclassing [AccumulatorV2](api/scala/index.html#org.apache.spark.util.AccumulatorV2). create their own types by subclassing [AccumulatorV2](api/scala/org/apache/spark/util/AccumulatorV2.html).
The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting
the accumulator to zero, `add` for adding another value into the accumulator, the accumulator to zero, `add` for adding another value into the accumulator,
`merge` for merging another same-type accumulator into this one. Other methods that must be overridden `merge` for merging another same-type accumulator into this one. Other methods that must be overridden
are contained in the [API documentation](api/scala/index.html#org.apache.spark.util.AccumulatorV2). For example, supposing we had a `MyVector` class are contained in the [API documentation](api/scala/org/apache/spark/util/AccumulatorV2.html). For example, supposing we had a `MyVector` class
representing mathematical vectors, we could write: representing mathematical vectors, we could write:
{% highlight java %} {% highlight java %}
@ -1620,4 +1620,4 @@ For help on deploying, the [cluster mode overview](cluster-overview.html) descri
in distributed operation and supported cluster managers. in distributed operation and supported cluster managers.
Finally, full API documentation is available in Finally, full API documentation is available in
[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/). [Scala](api/scala/org/apache/spark/), [Java](api/java/), [Python](api/python/) and [R](api/R/).
@ -118,4 +118,4 @@ To load all files recursively, you can use:
<div data-lang="r" markdown="1"> <div data-lang="r" markdown="1">
{% include_example recursive_file_lookup r/RSparkSQLExample.R %} {% include_example recursive_file_lookup r/RSparkSQLExample.R %}
</div> </div>
</div> </div>
@ -23,7 +23,7 @@ license: |
{:toc} {:toc}
Spark SQL also includes a data source that can read data from other databases using JDBC. This Spark SQL also includes a data source that can read data from other databases using JDBC. This
functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD). functionality should be preferred over using [JdbcRDD](api/scala/org/apache/spark/rdd/JdbcRDD.html).
This is because the results are returned This is because the results are returned
as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources. as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
The JDBC data source is also easier to use from Java or Python as it does not require the user to The JDBC data source is also easier to use from Java or Python as it does not require the user to
@ -93,4 +93,4 @@ SELECT * FROM jsonTable
</div> </div>
</div> </div>
@ -27,7 +27,7 @@ license: |
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`: The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
{% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} {% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
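For reference, a minimal builder sketch (the application name and config key are placeholders):

{% highlight scala %}
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")                 // placeholder name
  .config("spark.some.config.option", "some-value")   // optional extra configuration
  .getOrCreate()

// For implicit conversions such as converting RDDs and Seqs to DataFrames
import spark.implicits._
{% endhighlight %}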
</div> </div>
@ -104,7 +104,7 @@ As an example, the following creates a DataFrame based on the content of a JSON
## Untyped Dataset Operations (aka DataFrame Operations) ## Untyped Dataset Operations (aka DataFrame Operations)
DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html). DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html).
As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets. As mentioned above, in Spark 2.0, DataFrames are just Datasets of `Row`s in the Scala and Java APIs. These operations are also referred to as "untyped transformations", in contrast to the "typed transformations" that come with strongly typed Scala/Java Datasets.
@ -114,9 +114,9 @@ Here we include some basic examples of structured data processing using Datasets
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
{% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} {% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset). For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/org/apache/spark/sql/Dataset.html).
In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$). In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/org/apache/spark/sql/functions$.html).
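A few typical untyped operations, sketched under the assumption that `spark` is an existing `SparkSession` and `df` is a DataFrame with `name` and `age` columns:

{% highlight scala %}
import spark.implicits._   // enables the $"colName" column syntax

df.printSchema()                         // print the schema in a tree format
df.select($"name", $"age" + 1).show()    // select a column and derive a new one
df.filter($"age" > 21).show()            // keep only rows with age > 21
df.groupBy("age").count().show()         // count rows per age
{% endhighlight %}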
</div> </div>
<div data-lang="java" markdown="1"> <div data-lang="java" markdown="1">
@ -222,7 +222,7 @@ SELECT * FROM global_temp.temp_view
## Creating Datasets ## Creating Datasets
Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use
a specialized [Encoder](api/scala/index.html#org.apache.spark.sql.Encoder) to serialize the objects a specialized [Encoder](api/scala/org/apache/spark/sql/Encoder.html) to serialize the objects
for processing or transmitting over the network. While both encoders and standard serialization are for processing or transmitting over the network. While both encoders and standard serialization are
responsible for turning an object into bytes, encoders are code generated dynamically and use a format responsible for turning an object into bytes, encoders are code generated dynamically and use a format
that allows Spark to perform many operations like filtering, sorting and hashing without deserializing that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
@ -351,16 +351,16 @@ For example:
## Aggregations ## Aggregations
The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common
aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc. aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and [Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and
[Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets. [Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets.
Moreover, users are not limited to the predefined aggregate functions and can create their own. Moreover, users are not limited to the predefined aggregate functions and can create their own.
### Type-Safe User-Defined Aggregate Functions ### Type-Safe User-Defined Aggregate Functions
User-defined aggregations for strongly typed Datasets revolve around the [Aggregator](api/scala/index.html#org.apache.spark.sql.expressions.Aggregator) abstract class. User-defined aggregations for strongly typed Datasets revolve around the [Aggregator](api/scala/org/apache/spark/sql/expressions/Aggregator.html) abstract class.
For example, a type-safe user-defined average can look like: For example, a type-safe user-defined average can look like:
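Roughly along these lines (a sketch; `Employee`, `Average`, and `MyAverage` are illustrative names):

{% highlight scala %}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation; should satisfy b + zero = b
  def zero: Average = Average(0L, 0L)
  // Fold another Employee into the running buffer
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate buffers
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the final buffer into the output value
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Encoders for the intermediate and output types
  def bufferEncoder: Encoder[Average] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
{% endhighlight %}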
<div class="codetabs"> <div class="codetabs">
@ -737,11 +737,11 @@ and writing data out (`DataFrame.write`),
and deprecated the old APIs (e.g., `SQLContext.parquetFile`, `SQLContext.jsonFile`). and deprecated the old APIs (e.g., `SQLContext.parquetFile`, `SQLContext.jsonFile`).
See the API docs for `SQLContext.read` ( See the API docs for `SQLContext.read` (
<a href="api/scala/index.html#org.apache.spark.sql.SQLContext@read:DataFrameReader">Scala</a>, <a href="api/scala/org/apache/spark/sql/SQLContext.html#read:DataFrameReader">Scala</a>,
<a href="api/java/org/apache/spark/sql/SQLContext.html#read()">Java</a>, <a href="api/java/org/apache/spark/sql/SQLContext.html#read()">Java</a>,
<a href="api/python/pyspark.sql.html#pyspark.sql.SQLContext.read">Python</a> <a href="api/python/pyspark.sql.html#pyspark.sql.SQLContext.read">Python</a>
) and `DataFrame.write` ( ) and `DataFrame.write` (
<a href="api/scala/index.html#org.apache.spark.sql.DataFrame@write:DataFrameWriter">Scala</a>, <a href="api/scala/org/apache/spark/sql/DataFrame.html#write:DataFrameWriter">Scala</a>,
<a href="api/java/org/apache/spark/sql/Dataset.html#write()">Java</a>, <a href="api/java/org/apache/spark/sql/Dataset.html#write()">Java</a>,
<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrame.write">Python</a> <a href="api/python/pyspark.sql.html#pyspark.sql.DataFrame.write">Python</a>
) more information. ) more information.
@ -61,7 +61,7 @@ In Scala and Java, a DataFrame is represented by a Dataset of `Row`s.
In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`. In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`.
While, in [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`. While, in [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`.
[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset [scala-datasets]: api/scala/org/apache/spark/sql/Dataset.html
[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html [java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames. Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.
@ -106,4 +106,4 @@ ANALYZE TABLE table_identifier [ partition_spec ]
max_col_len 13 max_col_len 13
histogram NULL histogram NULL
{% endhighlight %} {% endhighlight %}
@ -51,4 +51,4 @@ REFRESH "hdfs://path/to/table";
- [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html) - [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html)
- [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html) - [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html)
- [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html) - [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html)
- [REFRESH TABLE](sql-ref-syntax-aux-refresh-table.html) - [REFRESH TABLE](sql-ref-syntax-aux-refresh-table.html)
@ -55,4 +55,4 @@ REFRESH TABLE tempDB.view1;
### Related Statements ### Related Statements
- [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html) - [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html)
- [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html) - [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html)
- [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html) - [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html)
@ -22,4 +22,4 @@ license: |
* [ADD FILE](sql-ref-syntax-aux-resource-mgmt-add-file.html) * [ADD FILE](sql-ref-syntax-aux-resource-mgmt-add-file.html)
* [ADD JAR](sql-ref-syntax-aux-resource-mgmt-add-jar.html) * [ADD JAR](sql-ref-syntax-aux-resource-mgmt-add-jar.html)
* [LIST FILE](sql-ref-syntax-aux-resource-mgmt-list-file.html) * [LIST FILE](sql-ref-syntax-aux-resource-mgmt-list-file.html)
* [LIST JAR](sql-ref-syntax-aux-resource-mgmt-list-jar.html) * [LIST JAR](sql-ref-syntax-aux-resource-mgmt-list-jar.html)
@ -104,4 +104,4 @@ SHOW TABLES LIKE 'sam*|suj';
- [CREATE TABLE](sql-ref-syntax-ddl-create-table.html) - [CREATE TABLE](sql-ref-syntax-ddl-create-table.html)
- [DROP TABLE](sql-ref-syntax-ddl-drop-table.html) - [DROP TABLE](sql-ref-syntax-ddl-drop-table.html)
- [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html) - [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
- [DROP DATABASE](sql-ref-syntax-ddl-drop-database.html) - [DROP DATABASE](sql-ref-syntax-ddl-drop-database.html)
@ -25,4 +25,4 @@ license: |
* [SHOW TABLES](sql-ref-syntax-aux-show-tables.html) * [SHOW TABLES](sql-ref-syntax-aux-show-tables.html)
* [SHOW TBLPROPERTIES](sql-ref-syntax-aux-show-tblproperties.html) * [SHOW TBLPROPERTIES](sql-ref-syntax-aux-show-tblproperties.html)
* [SHOW PARTITIONS](sql-ref-syntax-aux-show-partitions.html) * [SHOW PARTITIONS](sql-ref-syntax-aux-show-partitions.html)
* [SHOW CREATE TABLE](sql-ref-syntax-aux-show-create-table.html) * [SHOW CREATE TABLE](sql-ref-syntax-aux-show-create-table.html)
@ -77,4 +77,4 @@ DROP DATABASE IF EXISTS inventory_db CASCADE;
### Related statements ### Related statements
- [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html) - [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
- [DESCRIBE DATABASE](sql-ref-syntax-aux-describe-database.html) - [DESCRIBE DATABASE](sql-ref-syntax-aux-describe-database.html)
- [SHOW DATABASES](sql-ref-syntax-aux-show-databases.html) - [SHOW DATABASES](sql-ref-syntax-aux-show-databases.html)
@ -102,4 +102,4 @@ DROP TEMPORARY FUNCTION IF EXISTS test_avg;
### Related statements ### Related statements
- [CREATE FUNCTION](sql-ref-syntax-ddl-create-function.html) - [CREATE FUNCTION](sql-ref-syntax-ddl-create-function.html)
- [DESCRIBE FUNCTION](sql-ref-syntax-aux-describe-function.html) - [DESCRIBE FUNCTION](sql-ref-syntax-aux-describe-function.html)
- [SHOW FUNCTION](sql-ref-syntax-aux-show-functions.html) - [SHOW FUNCTION](sql-ref-syntax-aux-show-functions.html)
@ -84,4 +84,4 @@ INSERT OVERWRITE [ LOCAL ] DIRECTORY directory_path
### Related Statements ### Related Statements
* [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html) * [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html)
* [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html) * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html)
* [INSERT OVERWRITE DIRECTORY statement](sql-ref-syntax-dml-insert-overwrite-directory.html) * [INSERT OVERWRITE DIRECTORY statement](sql-ref-syntax-dml-insert-overwrite-directory.html)
@ -82,4 +82,4 @@ INSERT OVERWRITE DIRECTORY
### Related Statements ### Related Statements
* [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html) * [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html)
* [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html) * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html)
* [INSERT OVERWRITE DIRECTORY with Hive format statement](sql-ref-syntax-dml-insert-overwrite-directory-hive.html) * [INSERT OVERWRITE DIRECTORY with Hive format statement](sql-ref-syntax-dml-insert-overwrite-directory-hive.html)
@ -22,4 +22,4 @@ license: |
Data Manipulation Statements are used to add, change, or delete data. Spark SQL supports the following Data Manipulation Statements: Data Manipulation Statements are used to add, change, or delete data. Spark SQL supports the following Data Manipulation Statements:
- [INSERT](sql-ref-syntax-dml-insert.html) - [INSERT](sql-ref-syntax-dml-insert.html)
- [LOAD](sql-ref-syntax-dml-load.html) - [LOAD](sql-ref-syntax-dml-load.html)
@ -96,4 +96,4 @@ SELECT age, name FROM person CLUSTER BY age;
- [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html) - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
- [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
- [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html) - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
- [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
@ -91,4 +91,4 @@ SELECT age, name FROM person DISTRIBUTE BY age;
- [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html) - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
- [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
- [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html) - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
- [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
@ -184,4 +184,4 @@ SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
- [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html) - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
- [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html) - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
- [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html) - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
- [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
@ -143,4 +143,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression [ , ... ] }
- [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html) - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
- [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html) - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
- [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html) - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
- [LIMIT Clause](sql-ref-syntax-qry-select-limit.html) - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
@ -28,7 +28,7 @@ in Scala or Java.
## Implementing a Custom Receiver ## Implementing a Custom Receiver
This starts with implementing a **Receiver** This starts with implementing a **Receiver**
([Scala doc](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver), ([Scala doc](api/scala/org/apache/spark/streaming/receiver/Receiver.html),
[Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)). [Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)).
A custom receiver must extend this abstract class by implementing two methods A custom receiver must extend this abstract class by implementing two methods
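For illustration, a minimal sketch of such a receiver (the two methods being `onStart()` and `onStop()`), assuming a plain TCP text source; host and port are placeholders:

{% highlight scala %}
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class CustomReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    // Start a thread that connects to the source and feeds data to Spark via store(...)
    new Thread("Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = {
    // Nothing to do here: the receiving thread checks isStopped() and exits on its own
  }

  private def receive(): Unit = {
    try {
      val socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped && line != null) {
        store(line)               // hand each received line to Spark
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      restart("Trying to connect again")
    } catch {
      case t: Throwable => restart("Error receiving data", t)
    }
  }
}
{% endhighlight %}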
@ -23,4 +23,4 @@ replicated commit log service. Please read the [Kafka documentation](https://ka
thoroughly before starting an integration using Spark. thoroughly before starting an integration using Spark.
At the moment, Spark requires Kafka 0.10 and higher. See At the moment, Spark requires Kafka 0.10 and higher. See
<a href="streaming-kafka-0-10-integration.html">Kafka 0.10 integration documentation</a> for details. <a href="streaming-kafka-0-10-integration.html">Kafka 0.10 integration documentation</a> for details.
@ -59,7 +59,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
.storageLevel(StorageLevel.MEMORY_AND_DISK_2) .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
.build() .build()
See the [API docs](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisInputDStream) See the [API docs](api/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.html)
and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala). Refer to the [Running the Example](#running-the-example) subsection for instructions on how to run the example. and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala). Refer to the [Running the Example](#running-the-example) subsection for instructions on how to run the example.
</div> </div>
@ -57,7 +57,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or
which represents a continuous stream of data. DStreams can be created either from input data which represents a continuous stream of data. DStreams can be created either from input data
streams from sources such as Kafka, and Kinesis, or by applying high-level streams from sources such as Kafka, and Kinesis, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of operations on other DStreams. Internally, a DStream is represented as a sequence of
[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD). [RDDs](api/scala/org/apache/spark/rdd/RDD.html).
This guide shows you how to start writing Spark Streaming programs with DStreams. You can This guide shows you how to start writing Spark Streaming programs with DStreams. You can
write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2),
@ -80,7 +80,7 @@ do is as follows.
<div data-lang="scala" markdown="1" > <div data-lang="scala" markdown="1" >
First, we import the names of the Spark Streaming classes and some implicit First, we import the names of the Spark Streaming classes and some implicit
conversions from StreamingContext into our environment in order to add useful methods to conversions from StreamingContext into our environment in order to add useful methods to
other classes we need (like DStream). [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the other classes we need (like DStream). [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) is the
main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second. main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second.
{% highlight scala %} {% highlight scala %}
@ -185,7 +185,7 @@ JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).itera
generating multiple new records from each record in the source DStream. In this case, generating multiple new records from each record in the source DStream. In this case,
each line will be split into multiple words and the stream of words is represented as the each line will be split into multiple words and the stream of words is represented as the
`words` DStream. Note that we defined the transformation using a `words` DStream. Note that we defined the transformation using a
[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object. [FlatMapFunction](api/scala/org/apache/spark/api/java/function/FlatMapFunction.html) object.
As we will discover along the way, there are a number of such convenience classes in the Java API As we will discover along the way, there are a number of such convenience classes in the Java API
that help define DStream transformations. that help define DStream transformations.
@ -201,9 +201,9 @@ wordCounts.print();
{% endhighlight %} {% endhighlight %}
The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word, The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction) 1)` pairs, using a [PairFunction](api/scala/org/apache/spark/api/java/function/PairFunction.html)
object. Then, it is reduced to get the frequency of words in each batch of data, object. Then, it is reduced to get the frequency of words in each batch of data,
using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object. using a [Function2](api/scala/org/apache/spark/api/java/function/Function2.html) object.
Finally, `wordCounts.print()` will print a few of the counts generated every second. Finally, `wordCounts.print()` will print a few of the counts generated every second.
Note that when these lines are executed, Spark Streaming only sets up the computation it Note that when these lines are executed, Spark Streaming only sets up the computation it
@ -435,7 +435,7 @@ To initialize a Spark Streaming program, a **StreamingContext** object has to be
<div class="codetabs"> <div class="codetabs">
<div data-lang="scala" markdown="1"> <div data-lang="scala" markdown="1">
A [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object can be created from a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object. A [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) object can be created from a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object.
{% highlight scala %} {% highlight scala %}
import org.apache.spark._ import org.apache.spark._
@ -451,7 +451,7 @@ or a special __"local[\*]"__ string to run in local mode. In practice, when runn
you will not want to hardcode `master` in the program, you will not want to hardcode `master` in the program,
but rather [launch the application with `spark-submit`](submitting-applications.html) and but rather [launch the application with `spark-submit`](submitting-applications.html) and
receive it there. However, for local testing and unit tests, you can pass "local[\*]" to run Spark Streaming receive it there. However, for local testing and unit tests, you can pass "local[\*]" to run Spark Streaming
in-process (detects the number of cores in the local system). Note that this internally creates a [SparkContext](api/scala/index.html#org.apache.spark.SparkContext) (starting point of all Spark functionality) which can be accessed as `ssc.sparkContext`. in-process (detects the number of cores in the local system). Note that this internally creates a [SparkContext](api/scala/org/apache/spark/SparkContext.html) (starting point of all Spark functionality) which can be accessed as `ssc.sparkContext`.
The batch interval must be set based on the latency requirements of your application The batch interval must be set based on the latency requirements of your application
and available cluster resources. See the [Performance Tuning](#setting-the-right-batch-interval) and available cluster resources. See the [Performance Tuning](#setting-the-right-batch-interval)
@ -584,7 +584,7 @@ Input DStreams are DStreams representing the stream of input data received from
sources. In the [quick example](#a-quick-example), `lines` was an input DStream as it represented sources. In the [quick example](#a-quick-example), `lines` was an input DStream as it represented
the stream of data received from the netcat server. Every input DStream the stream of data received from the netcat server. Every input DStream
(except file stream, discussed later in this section) is associated with a **Receiver** (except file stream, discussed later in this section) is associated with a **Receiver**
([Scala doc](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver), ([Scala doc](api/scala/org/apache/spark/streaming/receiver/Receiver.html),
[Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)) object which receives the [Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)) object which receives the
data from a source and stores it in Spark's memory for processing. data from a source and stores it in Spark's memory for processing.
@ -739,7 +739,7 @@ DStreams can be created with data streams received through custom receivers. See
For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream. For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.
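A hedged testing sketch (assuming an existing `StreamingContext` named `ssc`):

{% highlight scala %}
import scala.collection.mutable

import org.apache.spark.rdd.RDD

val rddQueue = new mutable.Queue[RDD[Int]]()
val inputStream = ssc.queueStream(rddQueue)
inputStream.map(x => (x % 10, 1)).reduceByKey(_ + _).print()
ssc.start()

// Each RDD pushed into the queue is treated as one batch of the DStream
rddQueue += ssc.sparkContext.makeRDD(1 to 100, 2)
{% endhighlight %}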
For more details on streams from sockets and files, see the API documentations of the relevant functions in For more details on streams from sockets and files, see the API documentations of the relevant functions in
[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) for
Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html) Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) for Python. for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) for Python.
@ -1219,8 +1219,8 @@ joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
In fact, you can also dynamically change the dataset you want to join against. The function provided to `transform` is evaluated every batch interval and therefore will use the current dataset that `dataset` reference points to. In fact, you can also dynamically change the dataset you want to join against. The function provided to `transform` is evaluated every batch interval and therefore will use the current dataset that `dataset` reference points to.
The complete list of DStream transformations is available in the API documentation. For the Scala API, The complete list of DStream transformations is available in the API documentation. For the Scala API,
see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream) see [DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). and [PairDStreamFunctions](api/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.html).
For the Java API, see [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html) For the Java API, see [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html)
and [JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html). and [JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html).
For the Python API, see [DStream](api/python/pyspark.streaming.html#pyspark.streaming.DStream). For the Python API, see [DStream](api/python/pyspark.streaming.html#pyspark.streaming.DStream).
@ -2067,7 +2067,7 @@ for prime time, the old one be can be brought down. Note that this can be done f
sending the data to two destinations (i.e., the earlier and upgraded applications). sending the data to two destinations (i.e., the earlier and upgraded applications).
- The existing application is shutdown gracefully (see - The existing application is shutdown gracefully (see
[`StreamingContext.stop(...)`](api/scala/index.html#org.apache.spark.streaming.StreamingContext) [`StreamingContext.stop(...)`](api/scala/org/apache/spark/streaming/StreamingContext.html)
or [`JavaStreamingContext.stop(...)`](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html) or [`JavaStreamingContext.stop(...)`](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
for graceful shutdown options) which ensure data that has been received is completely for graceful shutdown options) which ensure data that has been received is completely
processed before shutdown. Then the processed before shutdown. Then the
@ -2104,7 +2104,7 @@ In that case, consider
[reducing](#reducing-the-batch-processing-times) the batch processing time. [reducing](#reducing-the-batch-processing-times) the batch processing time.
The progress of a Spark Streaming program can also be monitored using the The progress of a Spark Streaming program can also be monitored using the
[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) interface, [StreamingListener](api/scala/org/apache/spark/streaming/scheduler/StreamingListener.html) interface,
which allows you to get receiver status and processing times. Note that this is a developer API which allows you to get receiver status and processing times. Note that this is a developer API
and it is likely to be improved upon (i.e., more information reported) in the future. and it is likely to be improved upon (i.e., more information reported) in the future.
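A small sketch of such a listener (treat it as illustrative; only the batch-completion callback is overridden here):

{% highlight scala %}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs how long each completed batch took to process.
class BatchLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    println(s"Batch ${info.batchTime}: processing delay = ${info.processingDelay.getOrElse(-1L)} ms")
  }
}

// Register on an existing StreamingContext `ssc` (assumed):
// ssc.addStreamingListener(new BatchLogger)
{% endhighlight %}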
@ -2197,7 +2197,7 @@ computation is not high enough. For example, for distributed reduce operations l
and `reduceByKeyAndWindow`, the default number of parallel tasks is controlled by and `reduceByKeyAndWindow`, the default number of parallel tasks is controlled by
the `spark.default.parallelism` [configuration property](configuration.html#spark-properties). You the `spark.default.parallelism` [configuration property](configuration.html#spark-properties). You
can pass the level of parallelism as an argument (see can pass the level of parallelism as an argument (see
[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions) [`PairDStreamFunctions`](api/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.html)
documentation), or set the `spark.default.parallelism` documentation), or set the `spark.default.parallelism`
[configuration property](configuration.html#spark-properties) to change the default. [configuration property](configuration.html#spark-properties) to change the default.
@ -2205,9 +2205,9 @@ documentation), or set the `spark.default.parallelism`
{:.no_toc} {:.no_toc}
The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized. The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized.
* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format. * **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/org/apache/spark/storage/StorageLevel$.html). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
* **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of [StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$), persisted RDDs generated by streaming computations are persisted with [StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) (i.e. serialized) by default to minimize GC overheads. * **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of [StorageLevel.MEMORY_ONLY](api/scala/org/apache/spark/storage/StorageLevel$.html), persisted RDDs generated by streaming computations are persisted with [StorageLevel.MEMORY_ONLY_SER](api/scala/org/apache/spark/storage/StorageLevel$.html) (i.e. serialized) by default to minimize GC overheads.
In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the [Spark Tuning Guide](tuning.html#data-serialization) for more details. For Kryo, consider registering custom classes, and disabling object reference tracking (see Kryo-related configurations in the [Configuration Guide](configuration.html#compression-and-serialization)). In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the [Spark Tuning Guide](tuning.html#data-serialization) for more details. For Kryo, consider registering custom classes, and disabling object reference tracking (see Kryo-related configurations in the [Configuration Guide](configuration.html#compression-and-serialization)).
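For example, a configuration sketch (the record class is hypothetical):

{% highlight scala %}
import org.apache.spark.SparkConf

// Hypothetical record class whose instances flow through the streaming job
case class SensorReading(id: String, value: Double)

val conf = new SparkConf()
  .setAppName("StreamingWithKryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.referenceTracking", "false")   // only safe if object graphs contain no cycles
  .registerKryoClasses(Array(classOf[SensorReading]))
{% endhighlight %}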
@ -2247,7 +2247,7 @@ A good approach to figure out the right batch size for your application is to te
conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system
is able to keep up with the data rate, you can check the value of the end-to-end delay experienced is able to keep up with the data rate, you can check the value of the end-to-end delay experienced
by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the
[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) [StreamingListener](api/scala/org/apache/spark/streaming/scheduler/StreamingListener.html)
interface). interface).
If the delay is maintained to be comparable to the batch size, then the system is stable. Otherwise, If the delay is maintained to be comparable to the batch size, then the system is stable. Otherwise,
if the delay is continuously increasing, it means that the system is unable to keep up and it if the delay is continuously increasing, it means that the system is unable to keep up and it
@ -2485,10 +2485,10 @@ additional effort may be necessary to achieve exactly-once semantics. There are
* Third-party DStream data sources can be found in [Third Party Projects](https://spark.apache.org/third-party-projects.html) * Third-party DStream data sources can be found in [Third Party Projects](https://spark.apache.org/third-party-projects.html)
* API documentation * API documentation
- Scala docs - Scala docs
* [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) and * [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) and
[DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream) [DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
* [KafkaUtils](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$), * [KafkaUtils](api/scala/org/apache/spark/streaming/kafka/KafkaUtils$.html),
[KinesisUtils](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisInputDStream), [KinesisUtils](api/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.html),
- Java docs - Java docs
* [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html), * [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html),
[JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html) and [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html) and
@ -498,13 +498,13 @@ to track the read position in the stream. The engine uses checkpointing and writ
# API using Datasets and DataFrames # API using Datasets and DataFrames
Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point `SparkSession` Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point `SparkSession`
([Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs) ([Scala](api/scala/org/apache/spark/sql/SparkSession.html)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs)
to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the
[DataFrame/Dataset Programming Guide](sql-programming-guide.html). [DataFrame/Dataset Programming Guide](sql-programming-guide.html).
## Creating streaming DataFrames and streaming Datasets ## Creating streaming DataFrames and streaming Datasets
Streaming DataFrames can be created through the `DataStreamReader` interface Streaming DataFrames can be created through the `DataStreamReader` interface
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs) ([Scala](api/scala/org/apache/spark/sql/streaming/DataStreamReader.html)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs)
returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with the `read.stream()` method. Similar to the read interface for creating static DataFrame, you can specify the details of the source data format, schema, options, etc. returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with the `read.stream()` method. Similar to the read interface for creating static DataFrame, you can specify the details of the source data format, schema, options, etc.
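For instance, a sketch of a file-based streaming DataFrame (the directory path is a placeholder; by default, file sources need an explicit schema):

{% highlight scala %}
import org.apache.spark.sql.types._

val userSchema = new StructType().add("name", "string").add("age", "integer")

val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)
  .csv("/path/to/directory")   // reads CSV files as they appear in the directory
{% endhighlight %}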
#### Input Sources #### Input Sources
@ -557,7 +557,7 @@ Here are the details of all the sources in Spark.
NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up. NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up.
<br/><br/> <br/><br/>
For file-format-specific options, see the related methods in <code>DataStreamReader</code> For file-format-specific options, see the related methods in <code>DataStreamReader</code>
(<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a (<a href="api/scala/org/apache/spark/sql/streaming/DataStreamReader.html">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a
href="api/R/read.stream.html">R</a>). href="api/R/read.stream.html">R</a>).
E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>. E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
<br/><br/> <br/><br/>
@ -1622,7 +1622,7 @@ However, as a side effect, data from the slower streams will be aggressively dro
this configuration judiciously. this configuration judiciously.
### Arbitrary Stateful Operations ### Arbitrary Stateful Operations
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)). Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/org/apache/spark/sql/streaming/GroupState.html)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).
Though Spark cannot check and force it, the state function should be implemented with respect to the semantics of the output mode. For example, in Update mode Spark doesn't expect that the state function will emit rows which are older than current watermark plus allowed late record delay, whereas in Append mode the state function can emit these rows. Though Spark cannot check and force it, the state function should be implemented with respect to the semantics of the output mode. For example, in Update mode Spark doesn't expect that the state function will emit rows which are older than current watermark plus allowed late record delay, whereas in Append mode the state function can emit these rows.
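As a rough sessionization sketch (the `Event`/`SessionInfo`/`SessionUpdate` types and the ten-minute timeout are illustrative; `events` is assumed to be a streaming `Dataset[Event]`):

{% highlight scala %}
import java.sql.Timestamp

import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(sessionId: String, timestamp: Timestamp)
case class SessionInfo(numEvents: Long)
case class SessionUpdate(sessionId: String, numEvents: Long, expired: Boolean)

def sessionize(events: Dataset[Event]): Dataset[SessionUpdate] = {
  import events.sparkSession.implicits._
  events
    .groupByKey(_.sessionId)
    .mapGroupsWithState[SessionInfo, SessionUpdate](GroupStateTimeout.ProcessingTimeTimeout) {
      (sessionId: String, eventsInBatch: Iterator[Event], state: GroupState[SessionInfo]) =>
        if (state.hasTimedOut) {
          // No new data for this key; emit a final update and drop the state
          val finished = SessionUpdate(sessionId, state.get.numEvents, expired = true)
          state.remove()
          finished
        } else {
          val current = state.getOption.getOrElse(SessionInfo(0L))
          val updated = SessionInfo(current.numEvents + eventsInBatch.size)
          state.update(updated)
          state.setTimeoutDuration("10 minutes")
          SessionUpdate(sessionId, updated.numEvents, expired = false)
        }
    }
}
{% endhighlight %}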
@@ -1679,7 +1679,7 @@ end-to-end exactly once per query. Ensuring end-to-end exactly once for the last
## Starting Streaming Queries
Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
([Scala](api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
returned through `Dataset.writeStream()`. You will have to specify one or more of the following in this interface.
- *Details of the output sink:* Data format, location, etc.
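A minimal, hedged sketch of starting such a query (the `wordCounts` streaming DataFrame, the trigger interval, and the checkpoint path below are assumptions for illustration):

{% highlight scala %}
import org.apache.spark.sql.streaming.Trigger

// `wordCounts` is assumed to be a streaming DataFrame with an aggregation defined earlier.
val query = wordCounts.writeStream
  .outputMode("complete")                                       // output mode
  .format("console")                                            // output sink
  .trigger(Trigger.ProcessingTime("10 seconds"))                // trigger interval
  .option("checkpointLocation", "/tmp/checkpoints/wordcounts")  // checkpoint location
  .queryName("wordcounts")                                      // query name (optional)
  .start()

query.awaitTermination()
{% endhighlight %}

Until `start()` is called, nothing is executed; the returned `StreamingQuery` handle is what you use to manage and monitor the query afterwards.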
@@ -1863,7 +1863,7 @@ Here are the details of all the sinks in Spark.
<code>path</code>: path to the output directory, must be specified.
<br/><br/>
For file-format-specific options, see the related methods in DataFrameWriter
(<a href="api/scala/org/apache/spark/sql/DataFrameWriter.html">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
href="api/R/write.stream.html">R</a>).
E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
</td>
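As a hedged illustration of the file sink options above (the `events` dataset and the paths are made up):

{% highlight scala %}
// `events` is assumed to be a streaming Dataset/DataFrame defined earlier.
val fileQuery = events.writeStream
  .format("parquet")                                        // or "orc", "json", "csv", ...
  .option("path", "/data/output/events")                    // required: output directory
  .option("checkpointLocation", "/data/checkpoints/events")
  .outputMode("append")                                     // the file sink supports Append mode
  .start()
{% endhighlight %}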
@@ -2175,7 +2175,7 @@ Since Spark 2.4, `foreach` is available in Scala, Java and Python.
<div class="codetabs">
<div data-lang="scala" markdown="1">
In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/org/apache/spark/sql/ForeachWriter.html)).
{% highlight scala %}
streamingDatasetOfString.writeStream.foreach(
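  // Sketch of how this call is typically completed (ForeachWriter is org.apache.spark.sql.ForeachWriter);
  // the method bodies are placeholders for opening, writing to, and closing a connection.
  new ForeachWriter[String] {

    def open(partitionId: Long, epochId: Long): Boolean = {
      // Open the connection; return false to skip processing this partition.
      true
    }

    def process(record: String): Unit = {
      // Write the string to the connection.
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection.
    }
  }
).start()
{% endhighlight %}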
@@ -2564,7 +2564,7 @@ lastProgress(query) # the most recent progress update of this streaming qu
</div>
You can start any number of queries in a single SparkSession. They will all be running concurrently, sharing the cluster resources. You can use `sparkSession.streams()` to get the `StreamingQueryManager`
([Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager) docs)
that can be used to manage the currently active queries.
<div class="codetabs">
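A brief Scala sketch of the manager API (assuming `spark` is the active `SparkSession` and `queryId` is the UUID of a previously started query):

{% highlight scala %}
val manager = spark.streams          // the StreamingQueryManager

manager.active.foreach { q =>        // list currently active queries
  println(s"name=${q.name}, id=${q.id}")
}
val someQuery = manager.get(queryId) // look a query up by its unique id
manager.awaitAnyTermination()        // block until any one of the queries terminates
{% endhighlight %}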
@@ -2624,7 +2624,7 @@ There are multiple ways to monitor active streaming queries. You can either push
You can directly get the current status and metrics of an active query using
`streamingQuery.lastProgress()` and `streamingQuery.status()`.
`lastProgress()` returns a `StreamingQueryProgress` object
in [Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
and a dictionary with the same fields in Python. It has all the information about
the progress made in the last trigger of the stream - what data was processed,
@@ -2632,7 +2632,7 @@ what were the processing rates, latencies, etc. There is also
`streamingQuery.recentProgress` which returns an array of the last few progresses.
In addition, `streamingQuery.status()` returns a `StreamingQueryStatus` object
in [Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
and a dictionary with the same fields in Python. It gives information about
what the query is immediately doing - is a trigger active, is data being processed, etc.
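A short Scala sketch of these calls (assuming `query` is a `StreamingQuery` returned by `start()`):

{% highlight scala %}
println(query.lastProgress)                 // progress of the most recent trigger
query.recentProgress.foreach { progress =>  // the last few progress updates
  println(progress.numInputRows)
}
println(query.status)                       // what the query is doing right now
{% endhighlight %}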
@@ -2853,7 +2853,7 @@ Will print something like the following.
You can also asynchronously monitor all queries associated with a
`SparkSession` by attaching a `StreamingQueryListener`
([Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryListener.html)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html) docs).
Once you attach your custom `StreamingQueryListener` object with
`sparkSession.streams.addListener()`, you will get callbacks when a query is started and
stopped and when there is progress made in an active query. Here is an example,
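sketched here in Scala (assuming `spark` is the active `SparkSession`; the printed fields are illustrative):

{% highlight scala %}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

spark.streams.addListener(new StreamingQueryListener() {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {
    println(s"Query started: ${event.id}")
  }
  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    println(s"Query made progress: ${event.progress}")
  }
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {
    println(s"Query terminated: ${event.id}")
  }
})
{% endhighlight %}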

View file

@@ -260,7 +260,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
(though you can control it through optional parameters to `SparkContext.textFile`, etc), and for
distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses the largest
parent RDD's number of partitions. You can pass the level of parallelism as a second argument
(see the [`spark.PairRDDFunctions`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) documentation),
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
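A hedged sketch of both approaches (the application name, input path, and the numbers chosen are illustrative only):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Set a cluster-wide default used when no explicit level is passed.
val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "200")
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs://host/path/to/input")
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Or pass the level of parallelism explicitly as the second argument.
val counts = pairs.reduceByKey(_ + _, 100)
{% endhighlight %}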