[SPARK-30803][DOCS] Fix the home page link for Scala API document

### What changes were proposed in this pull request?

Change the links to the Scala API documentation. The affected occurrences:

```
$ git grep "#org.apache.spark.package"
docs/_layouts/global.html: <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
docs/index.md:* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
docs/rdd-programming-guide.md:[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/).
```

### Why are the changes needed?

The home page link for the Scala API documentation is incorrect after the upgrade to 3.0.

### Does this PR introduce any user-facing change?

Documentation UI change only.

### How was this patch tested?

Local test; screenshots attached below.

Before:

![image](https://user-images.githubusercontent.com/4833765/74335713-c2385300-4dd7-11ea-95d8-f5a3639d2578.png)

After:

![image](https://user-images.githubusercontent.com/4833765/74335727-cbc1bb00-4dd7-11ea-89d9-4dcc1310e679.png)

Closes #27549 from xuanyuanking/scala-doc.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Parent: 0a03e7e679
Commit: 01cc852982
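The mechanical renaming applied throughout the diff below can be sketched as a small helper. This is a hypothetical illustration (not part of the PR): pre-3.0 Scaladoc links used a `#fully.qualified.Name` fragment on `index.html`, while Scaladoc in 3.0 serves one HTML page per entity, with member selectors (after `@`) becoming `#` fragments on that page, and `...package` mapping to the package's `index.html`.

```python
import re

def rewrite_scaladoc_link(link: str) -> str:
    """Rewrite a pre-3.0 Scaladoc fragment link to the 3.0 path style.

    Old: api/scala/index.html#org.apache.spark.SparkConf
    New: api/scala/org/apache/spark/SparkConf.html
    """
    m = re.match(r"api/scala/index\.html#([\w.$]+)(?:@(.*))?$", link)
    if m is None:
        return link  # not an old-style Scaladoc link; leave untouched
    fqcn, member = m.group(1), m.group(2)
    if fqcn.endswith(".package"):
        # Package objects map to the package directory's index page.
        return "api/scala/" + fqcn[:-len(".package")].replace(".", "/") + "/index.html"
    new = "api/scala/" + fqcn.replace(".", "/") + ".html"
    if member:
        new += "#" + member  # member selector becomes a fragment on the new page
    return new

print(rewrite_scaladoc_link("api/scala/index.html#org.apache.spark.SparkConf"))
# api/scala/org/apache/spark/SparkConf.html
```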
```diff
@@ -82,7 +82,7 @@
 <li class="dropdown">
   <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
   <ul class="dropdown-menu">
-    <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
+    <li><a href="api/scala/org/apache/spark/index.html">Scala</a></li>
     <li><a href="api/java/index.html">Java</a></li>
     <li><a href="api/python/index.html">Python</a></li>
     <li><a href="api/R/index.html">R</a></li>
```
```diff
@@ -24,7 +24,7 @@ license: |
 Spark provides three locations to configure the system:
 
 * [Spark properties](#spark-properties) control most application parameters and can be set by using
-  a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object, or through Java
+  a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object, or through Java
   system properties.
 * [Environment variables](#environment-variables) can be used to set per-machine settings, such as
   the IP address, through the `conf/spark-env.sh` script on each node.
```
```diff
@@ -34,7 +34,7 @@ Spark provides three locations to configure the system:
 
 Spark properties control most application settings and are configured separately for each
 application. These properties can be set directly on a
-[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) passed to your
+[SparkConf](api/scala/org/apache/spark/SparkConf.html) passed to your
 `SparkContext`. `SparkConf` allows you to configure some of the common properties
 (e.g. master URL and application name), as well as arbitrary key-value pairs through the
 `set()` method. For example, we could initialize an application with two threads as follows:
```
```diff
@@ -1326,7 +1326,7 @@ Apart from these, the following properties are also available, and may be useful
 property is useful if you need to register your classes in a custom way, e.g. to specify a custom
 field serializer. Otherwise <code>spark.kryo.classesToRegister</code> is simpler. It should be
 set to classes that extend
-<a href="api/scala/index.html#org.apache.spark.serializer.KryoRegistrator">
+<a href="api/scala/org/apache/spark/serializer/KryoRegistrator.html">
 <code>KryoRegistrator</code></a>.
 See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
 </td>
```
```diff
@@ -1379,7 +1379,7 @@ Apart from these, the following properties are also available, and may be useful
 but is quite slow, so we recommend <a href="tuning.html">using
 <code>org.apache.spark.serializer.KryoSerializer</code> and configuring Kryo serialization</a>
 when speed is necessary. Can be any subclass of
-<a href="api/scala/index.html#org.apache.spark.serializer.Serializer">
+<a href="api/scala/org/apache/spark/serializer/Serializer.html">
 <code>org.apache.spark.Serializer</code></a>.
 </td>
 </tr>
```
```diff
@@ -25,38 +25,38 @@ license: |
 
 <!-- All the documentation links -->
 
-[EdgeRDD]: api/scala/index.html#org.apache.spark.graphx.EdgeRDD
+[EdgeRDD]: api/scala/org/apache/spark/graphx/EdgeRDD.html
-[VertexRDD]: api/scala/index.html#org.apache.spark.graphx.VertexRDD
+[VertexRDD]: api/scala/org/apache/spark/graphx/VertexRDD.html
-[Edge]: api/scala/index.html#org.apache.spark.graphx.Edge
+[Edge]: api/scala/org/apache/spark/graphx/Edge.html
-[EdgeTriplet]: api/scala/index.html#org.apache.spark.graphx.EdgeTriplet
+[EdgeTriplet]: api/scala/org/apache/spark/graphx/EdgeTriplet.html
-[Graph]: api/scala/index.html#org.apache.spark.graphx.Graph
+[Graph]: api/scala/org/apache/spark/graphx/Graph$.html
-[GraphOps]: api/scala/index.html#org.apache.spark.graphx.GraphOps
+[GraphOps]: api/scala/org/apache/spark/graphx/GraphOps.html
-[Graph.mapVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
+[Graph.mapVertices]: api/scala/org/apache/spark/graphx/Graph.html#mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
-[Graph.reverse]: api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]
+[Graph.reverse]: api/scala/org/apache/spark/graphx/Graph.html#reverse:Graph[VD,ED]
-[Graph.subgraph]: api/scala/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
+[Graph.subgraph]: api/scala/org/apache/spark/graphx/Graph.html#subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
-[Graph.mask]: api/scala/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
+[Graph.mask]: api/scala/org/apache/spark/graphx/Graph.html#mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
-[Graph.groupEdges]: api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]
+[Graph.groupEdges]: api/scala/org/apache/spark/graphx/Graph.html#groupEdges((ED,ED)⇒ED):Graph[VD,ED]
-[GraphOps.joinVertices]: api/scala/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
+[GraphOps.joinVertices]: api/scala/org/apache/spark/graphx/GraphOps.html#joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
-[Graph.outerJoinVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
+[Graph.outerJoinVertices]: api/scala/org/apache/spark/graphx/Graph.html#outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
-[Graph.aggregateMessages]: api/scala/index.html#org.apache.spark.graphx.Graph@aggregateMessages[A]((EdgeContext[VD,ED,A])⇒Unit,(A,A)⇒A,TripletFields)(ClassTag[A]):VertexRDD[A]
+[Graph.aggregateMessages]: api/scala/org/apache/spark/graphx/Graph.html#aggregateMessages[A]((EdgeContext[VD,ED,A])⇒Unit,(A,A)⇒A,TripletFields)(ClassTag[A]):VertexRDD[A]
-[EdgeContext]: api/scala/index.html#org.apache.spark.graphx.EdgeContext
+[EdgeContext]: api/scala/org/apache/spark/graphx/EdgeContext.html
-[GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
+[GraphOps.collectNeighborIds]: api/scala/org/apache/spark/graphx/GraphOps.html#collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
-[GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
+[GraphOps.collectNeighbors]: api/scala/org/apache/spark/graphx/GraphOps.html#collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
 [RDD Persistence]: rdd-programming-guide.html#rdd-persistence
-[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
+[Graph.cache]: api/scala/org/apache/spark/graphx/Graph.html#cache():Graph[VD,ED]
-[GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
+[GraphOps.pregel]: api/scala/org/apache/spark/graphx/GraphOps.html#pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
-[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$
+[PartitionStrategy]: api/scala/org/apache/spark/graphx/PartitionStrategy$.html
-[GraphLoader.edgeListFile]: api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
+[GraphLoader.edgeListFile]: api/scala/org/apache/spark/graphx/GraphLoader$.html#edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
-[Graph.apply]: api/scala/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.apply]: api/scala/org/apache/spark/graphx/Graph$.html#apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
-[Graph.fromEdgeTuples]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
+[Graph.fromEdgeTuples]: api/scala/org/apache/spark/graphx/Graph$.html#fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
-[Graph.fromEdges]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.fromEdges]: api/scala/org/apache/spark/graphx/Graph$.html#fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
-[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy
+[PartitionStrategy]: api/scala/org/apache/spark/graphx/PartitionStrategy$.html
-[PageRank]: api/scala/index.html#org.apache.spark.graphx.lib.PageRank$
+[PageRank]: api/scala/org/apache/spark/graphx/lib/PageRank$.html
-[ConnectedComponents]: api/scala/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
+[ConnectedComponents]: api/scala/org/apache/spark/graphx/lib/ConnectedComponents$.html
-[TriangleCount]: api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$
+[TriangleCount]: api/scala/org/apache/spark/graphx/lib/TriangleCount$.html
-[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
+[Graph.partitionBy]: api/scala/org/apache/spark/graphx/Graph.html#partitionBy(PartitionStrategy):Graph[VD,ED]
-[EdgeContext.sendToSrc]: api/scala/index.html#org.apache.spark.graphx.EdgeContext@sendToSrc(msg:A):Unit
+[EdgeContext.sendToSrc]: api/scala/org/apache/spark/graphx/EdgeContext.html#sendToSrc(msg:A):Unit
-[EdgeContext.sendToDst]: api/scala/index.html#org.apache.spark.graphx.EdgeContext@sendToDst(msg:A):Unit
+[EdgeContext.sendToDst]: api/scala/org/apache/spark/graphx/EdgeContext.html#sendToDst(msg:A):Unit
 [TripletFields]: api/java/org/apache/spark/graphx/TripletFields.html
 [TripletFields.All]: api/java/org/apache/spark/graphx/TripletFields.html#All
 [TripletFields.None]: api/java/org/apache/spark/graphx/TripletFields.html#None
```
```diff
@@ -74,7 +74,7 @@ license: |
 # Overview
 
 GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
-GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing a
+GraphX extends the Spark [RDD](api/scala/org/apache/spark/rdd/RDD.html) by introducing a
 new [Graph](#property_graph) abstraction: a directed multigraph with properties
 attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
 operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
```
```diff
@@ -99,7 +99,7 @@ getting started with Spark refer to the [Spark Quick Start Guide](quick-start.ht
 
 # The Property Graph
 
-The [property graph](api/scala/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
+The [property graph](api/scala/org/apache/spark/graphx/Graph.html) is a directed multigraph
 with user defined objects attached to each vertex and edge. A directed multigraph is a directed
 graph with potentially multiple parallel edges sharing the same source and destination vertex. The
 ability to support parallel edges simplifies modeling scenarios where there can be multiple
```
```diff
@@ -175,7 +175,7 @@ val userGraph: Graph[(String, String), String]
 There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
 generators and these are discussed in more detail in the section on
 [graph builders](#graph_builders). Probably the most general method is to use the
-[Graph object](api/scala/index.html#org.apache.spark.graphx.Graph$). For example the following
+[Graph object](api/scala/org/apache/spark/graphx/Graph$.html). For example the following
 code constructs a graph from a collection of RDDs:
 
 {% highlight scala %}
```
```diff
@@ -118,7 +118,7 @@ options for deployment:
 
 **API Docs:**
 
-* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
+* [Spark Scala API (Scaladoc)](api/scala/org/apache/spark/index.html)
 * [Spark Java API (Javadoc)](api/java/index.html)
 * [Spark Python API (Sphinx)](api/python/index.html)
 * [Spark R API (Roxygen2)](api/R/index.html)
```
```diff
@@ -55,10 +55,10 @@ other first-order optimizations.
 Quasi-Newton](https://www.microsoft.com/en-us/research/wp-content/uploads/2007/01/andrew07scalable.pdf)
 (OWL-QN) is an extension of L-BFGS that can effectively handle L1 and elastic net regularization.
 
-L-BFGS is used as a solver for [LinearRegression](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression),
+L-BFGS is used as a solver for [LinearRegression](api/scala/org/apache/spark/ml/regression/LinearRegression.html),
-[LogisticRegression](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression),
+[LogisticRegression](api/scala/org/apache/spark/ml/classification/LogisticRegression.html),
-[AFTSurvivalRegression](api/scala/index.html#org.apache.spark.ml.regression.AFTSurvivalRegression)
+[AFTSurvivalRegression](api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html)
-and [MultilayerPerceptronClassifier](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier).
+and [MultilayerPerceptronClassifier](api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html).
 
 MLlib L-BFGS solver calls the corresponding implementation in [breeze](https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/LBFGS.scala).
 
```
```diff
@@ -108,4 +108,4 @@ It solves certain optimization problems iteratively through the following proced
 
 Since it involves solving a weighted least squares (WLS) problem by `WeightedLeastSquares` in each iteration,
 it also requires the number of features to be no more than 4096.
-Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression).
+Currently IRLS is used as the default solver of [GeneralizedLinearRegression](api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html).
```
```diff
@@ -71,7 +71,7 @@ $\alpha$ and `regParam` corresponds to $\lambda$.
 
 <div data-lang="scala" markdown="1">
 
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/classification/LogisticRegression.html).
 
 {% include_example scala/org/apache/spark/examples/ml/LogisticRegressionWithElasticNetExample.scala %}
 </div>
```
```diff
@@ -109,12 +109,12 @@ only available on the driver.
 
 <div data-lang="scala" markdown="1">
 
-[`LogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionTrainingSummary)
+[`LogisticRegressionTrainingSummary`](api/scala/org/apache/spark/ml/classification/LogisticRegressionTrainingSummary.html)
 provides a summary for a
-[`LogisticRegressionModel`](api/scala/index.html#org.apache.spark.ml.classification.LogisticRegressionModel).
+[`LogisticRegressionModel`](api/scala/org/apache/spark/ml/classification/LogisticRegressionModel.html).
 In the case of binary classification, certain additional metrics are
 available, e.g. ROC curve. The binary summary can be accessed via the
-`binarySummary` method. See [`BinaryLogisticRegressionTrainingSummary`](api/scala/index.html#org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary).
+`binarySummary` method. See [`BinaryLogisticRegressionTrainingSummary`](api/scala/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
 
 Continuing the earlier example:
 
```
```diff
@@ -216,7 +216,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
 
 {% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
 
```
```diff
@@ -261,7 +261,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.RandomForestClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/RandomForestClassifier.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/RandomForestClassifierExample.scala %}
 </div>
```
```diff
@@ -302,7 +302,7 @@ We use two feature transformers to prepare the data; these help index categories
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.GBTClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/GBTClassifier.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeClassifierExample.scala %}
 </div>
```
```diff
@@ -358,7 +358,7 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.MultilayerPerceptronClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/MultilayerPerceptronClassifier.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/MultilayerPerceptronClassifierExample.scala %}
 </div>
```
```diff
@@ -403,7 +403,7 @@ in Spark ML supports binary classification with linear SVM. Internally, it optim
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.LinearSVC) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/LinearSVC.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/LinearSVCExample.scala %}
 </div>
```
```diff
@@ -447,7 +447,7 @@ The example below demonstrates how to load the
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.OneVsRest) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/OneVsRest.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/OneVsRestExample.scala %}
 </div>
```
```diff
@@ -501,7 +501,7 @@ setting the parameter $\lambda$ (default to $1.0$).
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.NaiveBayes) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/NaiveBayes.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/NaiveBayesExample.scala %}
 </div>
```
```diff
@@ -544,7 +544,7 @@ We scale features to be between 0 and 1 to prevent the exploding gradient proble
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.classification.FMClassifier) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/classification/FMClassifier.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/FMClassifierExample.scala %}
 </div>
```
```diff
@@ -585,7 +585,7 @@ regression model and extracting model summary statistics.
 
 <div data-lang="scala" markdown="1">
 
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.LinearRegression).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/regression/LinearRegression.html).
 
 {% include_example scala/org/apache/spark/examples/ml/LinearRegressionWithElasticNetExample.scala %}
 </div>
```
```diff
@@ -726,7 +726,7 @@ function and extracting model summary statistics.
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GeneralizedLinearRegression) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/GeneralizedLinearRegressionExample.scala %}
 </div>
```
```diff
@@ -768,7 +768,7 @@ We use a feature transformer to index categorical features, adding metadata to t
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
+More details on parameters can be found in the [Scala API documentation](api/scala/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
 
 {% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %}
 </div>
```
```diff
@@ -810,7 +810,7 @@ We use a feature transformer to index categorical features, adding metadata to t
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.RandomForestRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/RandomForestRegressor.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/RandomForestRegressorExample.scala %}
 </div>
```
```diff
@@ -851,7 +851,7 @@ be true in general.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.GBTRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/GBTRegressor.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/GradientBoostedTreeRegressorExample.scala %}
 </div>
```
@ -945,7 +945,7 @@ The implementation matches the result from R's survival function
|
||||||
|
|
||||||
<div data-lang="scala" markdown="1">
|
<div data-lang="scala" markdown="1">
|
||||||
|
|
||||||
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.AFTSurvivalRegression) for more details.
|
Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.html) for more details.
|
||||||
|
|
||||||
{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
|
{% include_example scala/org/apache/spark/examples/ml/AFTSurvivalRegressionExample.scala %}
|
||||||
</div>
|
</div>
|
||||||
@@ -1025,7 +1025,7 @@ is treated as piecewise linear function. The rules for prediction therefore are:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.ml.regression.IsotonicRegression) for details on the API.
+Refer to the [`IsotonicRegression` Scala docs](api/scala/org/apache/spark/ml/regression/IsotonicRegression.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/IsotonicRegressionExample.scala %}
 </div>
@@ -1066,7 +1066,7 @@ We scale features to be between 0 and 1 to prevent the exploding gradient proble
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.regression.FMRegressor) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/regression/FMRegressor.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/FMRegressorExample.scala %}
 </div>
@@ -85,7 +85,7 @@ called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/KMeans.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
 </div>
@@ -123,7 +123,7 @@ and generates a `LDAModel` as the base model. Expert users may cast a `LDAModel`
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/LDA.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
 </div>
@@ -166,7 +166,7 @@ Bisecting K-means can often be much faster than regular K-means, but it will gen
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}
 </div>
@@ -255,7 +255,7 @@ model.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}
 </div>
@@ -302,7 +302,7 @@ using truncated power iteration on a normalized pair-wise similarity matrix of t
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.PowerIterationClustering) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/clustering/PowerIterationClustering.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/PowerIterationClusteringExample.scala %}
 </div>
@@ -115,7 +115,7 @@ explicit (`implicitPrefs` is `false`).
 We evaluate the recommendation model by measuring the root-mean-square error of
 rating prediction.
 
-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.ml.recommendation.ALS)
+Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/ml/recommendation/ALS.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
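An aside on the context of this hunk: the root-mean-square error it mentions can be sketched in a few lines of plain Python (illustrative only, independent of Spark's `RegressionEvaluator`):

```python
import math

def rmse(ratings, predictions):
    """Root-mean-square error between true and predicted ratings."""
    assert len(ratings) == len(predictions)
    squared_errors = [(r - p) ** 2 for r, p in zip(ratings, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([4.0, 3.0, 5.0], [4.0, 3.0, 5.0]))  # perfect predictions -> 0.0
```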
@@ -42,7 +42,7 @@ The schema of the `image` column is:
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`ImageDataSource`](api/scala/index.html#org.apache.spark.ml.source.image.ImageDataSource)
+[`ImageDataSource`](api/scala/org/apache/spark/ml/source/image/ImageDataSource.html)
 implements a Spark SQL data source API for loading image data as a DataFrame.
 
 {% highlight scala %}
@@ -133,7 +133,7 @@ The schemas of the columns are:
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`LibSVMDataSource`](api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource)
+[`LibSVMDataSource`](api/scala/org/apache/spark/ml/source/libsvm/LibSVMDataSource.html)
 implements a Spark SQL data source API for loading `LIBSVM` data as a DataFrame.
 
 {% highlight scala %}
@@ -96,8 +96,8 @@ when using text as features. Our feature vectors could then be passed to a lear
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and
-the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API.
+Refer to the [HashingTF Scala docs](api/scala/org/apache/spark/ml/feature/HashingTF.html) and
+the [IDF Scala docs](api/scala/org/apache/spark/ml/feature/IDF.html) for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}
 </div>
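For context on the `HashingTF`/`IDF` pair this hunk links to: the underlying ideas (the hashing trick for term frequencies, and smoothed inverse document frequency) can be sketched in plain Python. This is a conceptual sketch, not Spark's implementation; the smoothing formula `log((n + 1) / (df + 1))` matches common practice but is an assumption here:

```python
import math
from collections import Counter

def hashing_tf(tokens, num_features=16):
    """Map tokens into a fixed-size count vector via hashing (the 'hashing trick')."""
    vec = [0.0] * num_features
    for tok in tokens:
        vec[hash(tok) % num_features] += 1.0
    return vec

def idf(docs_tokens):
    """Smoothed inverse document frequency per term: log((n + 1) / (df + 1))."""
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    return {t: math.log((n + 1) / (df[t] + 1)) for t in df}
```

A term appearing in every document gets IDF 0, so ubiquitous terms carry no weight.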
@@ -135,7 +135,7 @@ In the following code segment, we start with a set of documents, each of which i
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Word2Vec Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Word2Vec)
+Refer to the [Word2Vec Scala docs](api/scala/org/apache/spark/ml/feature/Word2Vec.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}
@@ -200,8 +200,8 @@ Each vector represents the token counts of the document over the vocabulary.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
-and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel)
+Refer to the [CountVectorizer Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizer.html)
+and the [CountVectorizerModel Scala docs](api/scala/org/apache/spark/ml/feature/CountVectorizerModel.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}
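The hunk context ("Each vector represents the token counts of the document over the vocabulary") can be sketched without Spark. Hypothetical helper names; the frequency-ordered vocabulary mirrors the usual behavior but is an assumption:

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size=None):
    """Build a vocabulary ordered by corpus-wide term frequency."""
    counts = Counter(t for doc in docs for t in doc)
    terms = [t for t, _ in counts.most_common(vocab_size)]
    return {t: i for i, t in enumerate(terms)}

def count_vectorize(doc, vocab):
    """Token counts of one document over the fitted vocabulary."""
    vec = [0] * len(vocab)
    for t in doc:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec
```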
@@ -286,7 +286,7 @@ The resulting feature vectors could then be passed to a learning algorithm.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [FeatureHasher Scala docs](api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher)
+Refer to the [FeatureHasher Scala docs](api/scala/org/apache/spark/ml/feature/FeatureHasher.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %}
@@ -313,9 +313,9 @@ for more details on the API.
 
 ## Tokenizer
 
-[Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer) class provides this functionality. The example below shows how to split sentences into sequences of words.
+[Tokenization](http://en.wikipedia.org/wiki/Lexical_analysis#Tokenization) is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple [Tokenizer](api/scala/org/apache/spark/ml/feature/Tokenizer.html) class provides this functionality. The example below shows how to split sentences into sequences of words.
 
-[RegexTokenizer](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer) allows more
+[RegexTokenizer](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html) allows more
 advanced tokenization based on regular expression (regex) matching.
 By default, the parameter "pattern" (regex, default: `"\\s+"`) is used as delimiters to split the input text.
 Alternatively, users can set parameter "gaps" to false indicating the regex "pattern" denotes
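The `Tokenizer`/`RegexTokenizer` behavior described in this hunk's context (split on a delimiter pattern, or with `gaps` false treat the pattern as matching the tokens themselves) can be sketched in plain Python — an illustration, not Spark's implementation:

```python
import re

def simple_tokenize(text):
    """Lowercase and split on whitespace, like a basic Tokenizer."""
    return text.lower().split()

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """gaps=True: pattern is the delimiter; gaps=False: pattern matches the tokens."""
    if gaps:
        return [t for t in re.split(pattern, text) if t]
    return re.findall(pattern, text)
```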
@@ -326,8 +326,8 @@ for more details on the API.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Tokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)
-and the [RegexTokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
+Refer to the [Tokenizer Scala docs](api/scala/org/apache/spark/ml/feature/Tokenizer.html)
+and the [RegexTokenizer Scala docs](api/scala/org/apache/spark/ml/feature/RegexTokenizer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}
@@ -395,7 +395,7 @@ filtered out.
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [StopWordsRemover Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
+Refer to the [StopWordsRemover Scala docs](api/scala/org/apache/spark/ml/feature/StopWordsRemover.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}
@@ -430,7 +430,7 @@ An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (t
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [NGram Scala docs](api/scala/index.html#org.apache.spark.ml.feature.NGram)
+Refer to the [NGram Scala docs](api/scala/org/apache/spark/ml/feature/NGram.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/NGramExample.scala %}
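The n-gram construction mentioned in this hunk's context (a sequence of $n$ consecutive tokens) is easy to sketch outside Spark:

```python
def ngrams(tokens, n=2):
    """Consecutive n-grams of a token sequence, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

A sequence shorter than `n` yields no n-grams, matching the natural definition.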
@@ -468,7 +468,7 @@ for `inputCol`.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Binarizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Binarizer)
+Refer to the [Binarizer Scala docs](api/scala/org/apache/spark/ml/feature/Binarizer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/BinarizerExample.scala %}
@@ -493,14 +493,14 @@ for more details on the API.
 
 ## PCA
 
-[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/index.html#org.apache.spark.ml.feature.PCA) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
+[PCA](http://en.wikipedia.org/wiki/Principal_component_analysis) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A [PCA](api/scala/org/apache/spark/ml/feature/PCA.html) class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
 
 **Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [PCA Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PCA)
+Refer to the [PCA Scala docs](api/scala/org/apache/spark/ml/feature/PCA.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/PCAExample.scala %}
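The PCA idea in this hunk's context — find the directions of maximal variance in centered data — can be sketched with power iteration on the scatter matrix. This is a minimal, Spark-free illustration of the leading component only; Spark's `PCA` uses a different (full SVD-based) computation:

```python
def top_principal_component(points, iters=100):
    """Leading principal component via power iteration on the scatter matrix X^T X."""
    d = len(points[0])
    mean = [sum(p[j] for p in points) / len(points) for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    v = [1.0] * d
    for _ in range(iters):
        w = [0.0] * d
        for row in centered:
            s = sum(row[j] * v[j] for j in range(d))  # row . v
            for j in range(d):
                w[j] += s * row[j]                    # accumulate (X^T X) v
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```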
@@ -525,14 +525,14 @@ for more details on the API.
 
 ## PolynomialExpansion
 
-[Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
+[Polynomial expansion](http://en.wikipedia.org/wiki/Polynomial_expansion) is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A [PolynomialExpansion](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html) class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
 
 **Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [PolynomialExpansion Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion)
+Refer to the [PolynomialExpansion Scala docs](api/scala/org/apache/spark/ml/feature/PolynomialExpansion.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala %}
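The "n-degree combination of original dimensions" described in this hunk's context can be sketched with `itertools`. The output ordering here is an assumption and differs from Spark's:

```python
from itertools import combinations_with_replacement

def polynomial_expand(features, degree=3):
    """All monomials of the input features up to the given degree (no constant term)."""
    expanded = []
    for d in range(1, degree + 1):
        for combo in combinations_with_replacement(range(len(features)), d):
            prod = 1.0
            for i in combo:
                prod *= features[i]
            expanded.append(prod)
    return expanded
```

For two features `(x, y)` and degree 2 this yields `x, y, x^2, xy, y^2`.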
@@ -561,7 +561,7 @@ The [Discrete Cosine
 Transform](https://en.wikipedia.org/wiki/Discrete_cosine_transform)
 transforms a length $N$ real-valued sequence in the time domain into
 another length $N$ real-valued sequence in the frequency domain. A
-[DCT](api/scala/index.html#org.apache.spark.ml.feature.DCT) class
+[DCT](api/scala/org/apache/spark/ml/feature/DCT.html) class
 provides this functionality, implementing the
 [DCT-II](https://en.wikipedia.org/wiki/Discrete_cosine_transform#DCT-II)
 and scaling the result by $1/\sqrt{2}$ such that the representing matrix
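The DCT-II with the scaling described in this hunk's context (the $k=0$ term scaled by $1/\sqrt{2}$ so the transform matrix is orthogonal) can be written directly from the definition — a naive $O(N^2)$ sketch, not Spark's FFT-based implementation:

```python
import math

def dct2(xs):
    """Orthonormal DCT-II: k=0 scaled by 1/sqrt(2) on top of the sqrt(2/N) factor."""
    n = len(xs)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(xs))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out
```

Orthonormality means energy is preserved: the sum of squares of the coefficients equals that of the input.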
@@ -574,7 +574,7 @@ $0$th DCT coefficient and _not_ the $N/2$th).
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [DCT Scala docs](api/scala/index.html#org.apache.spark.ml.feature.DCT)
+Refer to the [DCT Scala docs](api/scala/org/apache/spark/ml/feature/DCT.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/DCTExample.scala %}
@@ -704,7 +704,7 @@ Notice that the rows containing "d" or "e" are mapped to index "3.0"
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [StringIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer)
+Refer to the [StringIndexer Scala docs](api/scala/org/apache/spark/ml/feature/StringIndexer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/StringIndexerExample.scala %}
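The frequency-ordered label indexing behind this hunk's context (the most frequent label gets index 0.0) can be sketched as follows; the alphabetical tie-breaking is an assumption of this sketch:

```python
from collections import Counter

def fit_string_indexer(values):
    """Label indices ordered by descending frequency; most frequent label gets 0.0."""
    freq = Counter(values)
    ordered = sorted(freq, key=lambda v: (-freq[v], v))  # ties broken alphabetically
    return {v: float(i) for i, v in enumerate(ordered)}
```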
@@ -770,7 +770,7 @@ labels (they will be inferred from the columns' metadata):
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [IndexToString Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IndexToString)
+Refer to the [IndexToString Scala docs](api/scala/org/apache/spark/ml/feature/IndexToString.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/IndexToStringExample.scala %}
@@ -809,7 +809,7 @@ for more details on the API.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) for more details on the API.
+Refer to the [OneHotEncoder Scala docs](api/scala/org/apache/spark/ml/feature/OneHotEncoder.html) for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
 </div>
@@ -835,7 +835,7 @@ Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.f
 `VectorIndexer` helps index categorical features in datasets of `Vector`s.
 It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:
 
-1. Take an input column of type [Vector](api/scala/index.html#org.apache.spark.ml.linalg.Vector) and a parameter `maxCategories`.
+1. Take an input column of type [Vector](api/scala/org/apache/spark/ml/linalg/Vector.html) and a parameter `maxCategories`.
 2. Decide which features should be categorical based on the number of distinct values, where features with at most `maxCategories` are declared categorical.
 3. Compute 0-based category indices for each categorical feature.
 4. Index categorical features and transform original feature values to indices.
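The four `VectorIndexer` steps listed in this hunk can be sketched in plain Python — an illustration of the logic under the stated `maxCategories` rule, not Spark's implementation (the sorted-value index assignment here is an assumption):

```python
def fit_vector_indexer(vectors, max_categories=4):
    """Per column: declare it categorical if it has <= max_categories distinct
    values, and build a value -> 0-based index map for each categorical column."""
    n_cols = len(vectors[0])
    maps = {}
    for j in range(n_cols):
        distinct = sorted({v[j] for v in vectors})
        if len(distinct) <= max_categories:
            maps[j] = {val: float(i) for i, val in enumerate(distinct)}
    return maps

def transform_vectors(vectors, maps):
    """Replace categorical values by their indices; leave continuous columns as-is."""
    return [[maps[j][x] if j in maps else x for j, x in enumerate(v)]
            for v in vectors]
```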
@@ -849,7 +849,7 @@ In the example below, we read in a dataset of labeled points and then use `Vecto
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [VectorIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer)
+Refer to the [VectorIndexer Scala docs](api/scala/org/apache/spark/ml/feature/VectorIndexer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/VectorIndexerExample.scala %}
@@ -910,7 +910,7 @@ then `interactedCol` as the output column contains:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Interaction Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Interaction)
+Refer to the [Interaction Scala docs](api/scala/org/apache/spark/ml/feature/Interaction.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/InteractionExample.scala %}
@@ -944,7 +944,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Normalizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Normalizer)
+Refer to the [Normalizer Scala docs](api/scala/org/apache/spark/ml/feature/Normalizer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/NormalizerExample.scala %}
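The unit p-norm scaling that `Normalizer` (linked in this hunk) performs is a one-liner in plain Python — an illustrative sketch, not Spark's code:

```python
def normalize(vec, p=2.0):
    """Scale a vector to unit p-norm; return it unchanged if the norm is zero."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    return [x / norm for x in vec] if norm > 0 else list(vec)
```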
@@ -986,7 +986,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [StandardScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler)
+Refer to the [StandardScaler Scala docs](api/scala/org/apache/spark/ml/feature/StandardScaler.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/StandardScalerExample.scala %}
@@ -1030,7 +1030,7 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [RobustScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RobustScaler)
+Refer to the [RobustScaler Scala docs](api/scala/org/apache/spark/ml/feature/RobustScaler.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/RobustScalerExample.scala %}
@@ -1078,8 +1078,8 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [MinMaxScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)
-and the [MinMaxScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel)
+Refer to the [MinMaxScaler Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScaler.html)
+and the [MinMaxScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MinMaxScalerModel.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala %}
@@ -1121,8 +1121,8 @@ The following example demonstrates how to load a dataset in libsvm format and th
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [MaxAbsScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MaxAbsScaler)
-and the [MaxAbsScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MaxAbsScalerModel)
+Refer to the [MaxAbsScaler Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScaler.html)
+and the [MaxAbsScalerModel Scala docs](api/scala/org/apache/spark/ml/feature/MaxAbsScalerModel.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala %}
@@ -1157,7 +1157,7 @@ Note that if you have no idea of the upper and lower bounds of the targeted colu
 
 Note also that the splits that you provided have to be in strictly increasing order, i.e. `s0 < s1 < s2 < ... < sn`.
 
-More details can be found in the API docs for [Bucketizer](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer).
+More details can be found in the API docs for [Bucketizer](api/scala/org/apache/spark/ml/feature/Bucketizer.html).
 
 **Examples**
 
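The bucketing rule implied by this hunk's context — strictly increasing splits `s0 < s1 < ... < sn` defining buckets `[s_i, s_{i+1})`, with the last bucket also including its upper bound — can be sketched with `bisect`. Illustrative only, not Spark's implementation:

```python
import bisect

def bucketize(value, splits):
    """Map a value to the index of its bucket [splits[i], splits[i+1])."""
    if not splits[0] <= value <= splits[-1]:
        raise ValueError("value outside the bucket range")
    # bisect_right finds the first split greater than value;
    # min() folds the upper bound into the last bucket
    return min(bisect.bisect_right(splits, value) - 1, len(splits) - 2)
```

Using `-inf`/`+inf` as the outermost splits covers all real values, as the surrounding guide suggests.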
@@ -1166,7 +1166,7 @@ The following example demonstrates how to bucketize a column of `Double`s into a
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Bucketizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer)
+Refer to the [Bucketizer Scala docs](api/scala/org/apache/spark/ml/feature/Bucketizer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/BucketizerExample.scala %}
@@ -1216,7 +1216,7 @@ This example below demonstrates how to transform vectors using a transforming ve
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [ElementwiseProduct Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct)
+Refer to the [ElementwiseProduct Scala docs](api/scala/org/apache/spark/ml/feature/ElementwiseProduct.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/ElementwiseProductExample.scala %}
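The "transforming vector" operation in this hunk's context is just a Hadamard (element-wise) product — a trivial sketch outside Spark:

```python
def elementwise_product(vec, scaling_vec):
    """Hadamard (element-wise) product of an input vector with a fixed scaling vector."""
    assert len(vec) == len(scaling_vec)
    return [a * b for a, b in zip(vec, scaling_vec)]
```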
@@ -1276,7 +1276,7 @@ This is the output of the `SQLTransformer` with statement `"SELECT *, (v1 + v2)
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [SQLTransformer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.SQLTransformer)
+Refer to the [SQLTransformer Scala docs](api/scala/org/apache/spark/ml/feature/SQLTransformer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/SQLTransformerExample.scala %}
@@ -1336,7 +1336,7 @@ output column to `features`, after transformation we should get the following Da
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [VectorAssembler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler)
+Refer to the [VectorAssembler Scala docs](api/scala/org/apache/spark/ml/feature/VectorAssembler.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/VectorAssemblerExample.scala %}
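`VectorAssembler` concatenates a list of numeric and vector-valued columns into one flat feature vector. A plain-Python sketch of that concatenation (the row values below are illustrative only):

```python
def assemble(*cols):
    # Scalars contribute one entry; vector-valued columns are spliced in flat.
    out = []
    for c in cols:
        out.extend(c if isinstance(c, list) else [c])
    return out

# e.g. hour=18.0, mobile=1.0, userFeatures=[0.0, 10.0, 0.5]
print(assemble(18.0, 1.0, [0.0, 10.0, 0.5]))  # [18.0, 1.0, 0.0, 10.0, 0.5]
```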
@@ -1387,7 +1387,7 @@ to avoid this kind of inconsistent state.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)
+Refer to the [VectorSizeHint Scala docs](api/scala/org/apache/spark/ml/feature/VectorSizeHint.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala %}
@@ -1426,7 +1426,7 @@ NaN values, they will be handled specially and placed into their own bucket, for
 are used, then non-NaN data will be put into buckets[0-3], but NaNs will be counted in a special bucket[4].
 
 Algorithm: The bin ranges are chosen using an approximate algorithm (see the documentation for
-[approxQuantile](api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions) for a
+[approxQuantile](api/scala/org/apache/spark/sql/DataFrameStatFunctions.html) for a
 detailed description). The precision of the approximation can be controlled with the
 `relativeError` parameter. When set to zero, exact quantiles are calculated
 (**Note:** Computing exact quantiles is an expensive operation). The lower and upper bin bounds
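The `relativeError` guarantee mentioned in this hunk can be made concrete: an approximate quantile for probability `p` with relative error `e` is any value whose rank in the sorted data lies within `[(p - e) * n, (p + e) * n]`. A brute-force plain-Python sketch of that acceptance criterion (not Spark's Greenwald-Khanna-style implementation):

```python
import math

def exact_quantile(data, p):
    # Value at rank ceil(p * n) in the sorted data, clamped to valid indices.
    s = sorted(data)
    idx = max(0, min(len(s) - 1, math.ceil(p * len(s)) - 1))
    return s[idx]

def within_relative_error(data, p, e, x):
    # x is an acceptable approximate p-quantile if its rank lies in
    # [(p - e) * n, (p + e) * n].
    s = sorted(data)
    rank = sum(1 for v in s if v <= x)
    n = len(s)
    return math.floor((p - e) * n) <= rank <= math.ceil((p + e) * n)

data = [float(i) for i in range(1, 101)]
print(exact_quantile(data, 0.5))                      # 50.0
print(within_relative_error(data, 0.5, 0.05, 48.0))   # True: rank 48 is within [45, 55]
```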
@@ -1470,7 +1470,7 @@ a categorical one. Given `numBuckets = 3`, we should get the following DataFrame
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [QuantileDiscretizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.QuantileDiscretizer)
+Refer to the [QuantileDiscretizer Scala docs](api/scala/org/apache/spark/ml/feature/QuantileDiscretizer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/QuantileDiscretizerExample.scala %}
@@ -1539,7 +1539,7 @@ the relevant column.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [Imputer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Imputer)
+Refer to the [Imputer Scala docs](api/scala/org/apache/spark/ml/feature/Imputer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/ImputerExample.scala %}
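`Imputer`'s default strategy replaces missing values in a column with that column's mean, computed over the non-missing entries. A plain-Python sketch of the idea (not Spark's DataFrame-based API):

```python
def impute_mean(col, missing=None):
    # Mean over the non-missing entries, substituted for each missing one.
    present = [v for v in col if v is not missing]
    mean = sum(present) / len(present)
    return [mean if v is missing else v for v in col]

print(impute_mean([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```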
@@ -1620,7 +1620,7 @@ Suppose also that we have potential input attributes for the `userFeatures`, i.e
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [VectorSlicer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer)
+Refer to the [VectorSlicer Scala docs](api/scala/org/apache/spark/ml/feature/VectorSlicer.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/VectorSlicerExample.scala %}
@@ -1706,7 +1706,7 @@ id | country | hour | clicked | features | label
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [RFormula Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RFormula)
+Refer to the [RFormula Scala docs](api/scala/org/apache/spark/ml/feature/RFormula.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/RFormulaExample.scala %}
@@ -1770,7 +1770,7 @@ id | features | clicked | selectedFeatures
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [ChiSqSelector Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ChiSqSelector)
+Refer to the [ChiSqSelector Scala docs](api/scala/org/apache/spark/ml/feature/ChiSqSelector.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/ChiSqSelectorExample.scala %}
@@ -1856,7 +1856,7 @@ Bucketed Random Projection accepts arbitrary vectors as input features, and supp
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [BucketedRandomProjectionLSH Scala docs](api/scala/index.html#org.apache.spark.ml.feature.BucketedRandomProjectionLSH)
+Refer to the [BucketedRandomProjectionLSH Scala docs](api/scala/org/apache/spark/ml/feature/BucketedRandomProjectionLSH.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/BucketedRandomProjectionLSHExample.scala %}
@@ -1897,7 +1897,7 @@ The input sets for MinHash are represented as binary vectors, where the vector i
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [MinHashLSH Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinHashLSH)
+Refer to the [MinHashLSH Scala docs](api/scala/org/apache/spark/ml/feature/MinHashLSH.html)
 for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/MinHashLSHExample.scala %}
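MinHashLSH approximates Jaccard distance over such sets of active indices. A toy plain-Python sketch of the underlying idea (the affine hash family here is illustrative, not Spark's implementation):

```python
def jaccard(a, b):
    # Jaccard similarity of two sets of active indices: |A ∩ B| / |A ∪ B|.
    return len(a & b) / len(a | b)

def min_hash(s, coeff_a, coeff_b, prime=2038074743):
    # One MinHash signature entry: minimum of a random affine "permutation"
    # of the set's elements; equal entries are likely when Jaccard is high.
    return min((coeff_a * i + coeff_b) % prime for i in s)

a, b = {0, 1, 2}, {0, 1, 3}
print(jaccard(a, b))  # 2 shared of 4 total -> 0.5
```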
@@ -75,7 +75,7 @@ The `FPGrowthModel` provides:
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.FPGrowth) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/fpm/FPGrowth.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/FPGrowthExample.scala %}
 </div>
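The frequent-itemset mining that `FPGrowth` performs can be illustrated by brute-force support counting in plain Python. This enumerates every candidate itemset rather than building an FP-tree, so it is for intuition only:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    # Keep every candidate itemset whose fraction of containing
    # transactions reaches min_support (brute force, exponential).
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            support = sum(1 for t in transactions if set(cand) <= t) / len(transactions)
            if support >= min_support:
                result[cand] = support
    return result

txns = [{"a", "b"}, {"a", "b", "c"}, {"a"}]
print(frequent_itemsets(txns, 0.6))  # {"a"}, {"b"}, and {"a","b"} are frequent
```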
@@ -128,7 +128,7 @@ pattern mining problem.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.fpm.PrefixSpan) for more details.
+Refer to the [Scala API docs](api/scala/org/apache/spark/ml/fpm/PrefixSpan.html) for more details.
 
 {% include_example scala/org/apache/spark/examples/ml/PrefixSpanExample.scala %}
 </div>
@@ -187,7 +187,7 @@ val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
 val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML
 {% endhighlight %}
 
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail.
+Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for further detail.
 </div>
 
 <div data-lang="java" markdown="1">
@@ -341,9 +341,9 @@ In the `spark.ml` package, there exists one breaking API change and one behavior
 In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
 
 * Gradient-Boosted Trees
-  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issues for users who wrote their own losses for GBTs.
+  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/org/apache/spark/mllib/tree/loss/Loss.html) method was changed. This is only an issues for users who wrote their own losses for GBTs.
-  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
+  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/org/apache/spark/mllib/tree/configuration/BoostingStrategy.html) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
-* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
+* *(Breaking change)* The return value of [`LDA.run`](api/scala/org/apache/spark/mllib/clustering/LDA.html) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
 
 In the `spark.ml` package, several major API changes occurred, including:
 
@@ -359,12 +359,12 @@ changes for future releases.
 
 In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
 
-* *(Breaking change)* In [`ALS`](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
+* *(Breaking change)* In [`ALS`](api/scala/org/apache/spark/mllib/recommendation/ALS.html), the extraneous method `solveLeastSquares` has been removed. The `DeveloperApi` method `analyzeBlocks` was also removed.
-* *(Breaking change)* [`StandardScalerModel`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScalerModel) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
+* *(Breaking change)* [`StandardScalerModel`](api/scala/org/apache/spark/mllib/feature/StandardScalerModel.html) remains an Alpha component. In it, the `variance` method has been replaced with the `std` method. To compute the column variance values returned by the original `variance` method, simply square the standard deviation values returned by `std`.
-* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/index.html#org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD) remains an Experimental component. In it, there were two changes:
+* *(Breaking change)* [`StreamingLinearRegressionWithSGD`](api/scala/org/apache/spark/mllib/regression/StreamingLinearRegressionWithSGD.html) remains an Experimental component. In it, there were two changes:
   * The constructor taking arguments was removed in favor of a builder pattern using the default constructor plus parameter setter methods.
   * Variable `model` is no longer public.
-* *(Breaking change)* [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) remains an Experimental component. In it and its associated classes, there were several changes:
+* *(Breaking change)* [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) remains an Experimental component. In it and its associated classes, there were several changes:
   * In `DecisionTree`, the deprecated class method `train` has been removed. (The object/static `train` methods remain.)
   * In `Strategy`, the `checkpointDir` parameter has been removed. Checkpointing is still supported, but the checkpoint directory must be set before calling tree and tree ensemble training.
 * `PythonMLlibAPI` (the interface between Scala/Java and Python for MLlib) was a public API but is now private, declared `private[python]`. This was never meant for external use.
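The `StandardScalerModel` note in the hunk above says the old `variance` values can be recovered by squaring the standard deviations now returned by `std`. A one-line plain-Python illustration with hypothetical values:

```python
def variance_from_std(std):
    # The old per-column `variance` values are just the squares of `std`.
    return [s * s for s in std]

print(variance_from_std([1.5, 2.0]))  # [2.25, 4.0]
```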
@@ -373,31 +373,31 @@ In the `spark.mllib` package, there were several breaking changes. The first ch
 
 In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
 
-* The old [SchemaRDD](https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame.
+* The old [SchemaRDD](https://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/org/apache/spark/sql/DataFrame.html) with a somewhat modified API. All algorithms in `spark.ml` which used to use SchemaRDD now use DataFrame.
 * In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
 * Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
 
 Other changes were in `LogisticRegression`:
 
 * The `scoreCol` output column (with default value "score") was renamed to be `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
-* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
+* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html). The option to use an intercept will be added in the future.
 
 ## Upgrading from MLlib 1.1 to 1.2
 
 The only API changes in MLlib v1.2 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+[`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
 which continues to be an experimental API in MLlib 1.2:
 
 1. *(Breaking change)* The Scala API for classification takes a named argument specifying the number
 of classes. In MLlib v1.1, this argument was called `numClasses` in Python and
 `numClassesForClassification` in Scala. In MLlib v1.2, the names are both set to `numClasses`.
 This `numClasses` parameter is specified either via
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+[`Strategy`](api/scala/org/apache/spark/mllib/tree/configuration/Strategy.html)
-or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+or via [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html)
 static `trainClassifier` and `trainRegressor` methods.
 
 2. *(Breaking change)* The API for
-[`Node`](api/scala/index.html#org.apache.spark.mllib.tree.model.Node) has changed.
+[`Node`](api/scala/org/apache/spark/mllib/tree/model/Node.html) has changed.
 This should generally not affect user code, unless the user manually constructs decision trees
 (instead of using the `trainClassifier` or `trainRegressor` methods).
 The tree `Node` now includes more information, including the probability of the predicted label
@@ -411,7 +411,7 @@ Examples in the Spark distribution and examples in the
 ## Upgrading from MLlib 1.0 to 1.1
 
 The only API changes in MLlib v1.1 are in
-[`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+[`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
 which continues to be an experimental API in MLlib 1.1:
 
 1. *(Breaking change)* The meaning of tree depth has been changed by 1 in order to match
@@ -421,12 +421,12 @@ and in [rpart](http://cran.r-project.org/web/packages/rpart/index.html).
 In MLlib v1.0, a depth-1 tree had 1 leaf node, and a depth-2 tree had 1 root node and 2 leaf nodes.
 In MLlib v1.1, a depth-0 tree has 1 leaf node, and a depth-1 tree has 1 root node and 2 leaf nodes.
 This depth is specified by the `maxDepth` parameter in
-[`Strategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.Strategy)
+[`Strategy`](api/scala/org/apache/spark/mllib/tree/configuration/Strategy.html)
-or via [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
+or via [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html)
 static `trainClassifier` and `trainRegressor` methods.
 
 2. *(Non-breaking change)* We recommend using the newly added `trainClassifier` and `trainRegressor`
-methods to build a [`DecisionTree`](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree),
+methods to build a [`DecisionTree`](api/scala/org/apache/spark/mllib/tree/DecisionTree.html),
 rather than using the old parameter class `Strategy`. These new training methods explicitly
 separate classification and regression, and they replace specialized parameter types with
 simple `String` types.
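The tree-depth convention change covered in this hunk is an off-by-one shift: under the MLlib 1.1 (scikit-learn/rpart) convention a depth-0 tree is a single leaf. Full-binary-tree arithmetic in plain Python makes the shift concrete:

```python
def leaf_count(max_depth):
    # MLlib 1.1 convention: a depth-d full binary tree has 2^d leaves.
    return 2 ** max_depth

def node_count(max_depth):
    # ...and 2^(d+1) - 1 nodes in total.
    return 2 ** (max_depth + 1) - 1

print(leaf_count(0))  # 1: a single leaf (this tree was called depth-1 in MLlib 1.0)
print(node_count(1))  # 3: one root plus two leaves (called depth-2 in MLlib 1.0)
```

Code upgrading from MLlib 1.0 would therefore pass `maxDepth - 1` to get the same trees.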
@@ -238,7 +238,7 @@ notes, then it should be treated as a bug to be fixed.
 
 This section gives code examples illustrating the functionality discussed above.
 For more info, please refer to the API documentation
-([Scala](api/scala/index.html#org.apache.spark.ml.package),
+([Scala](api/scala/org/apache/spark/ml/package.html),
 [Java](api/java/org/apache/spark/ml/package-summary.html),
 and [Python](api/python/pyspark.ml.html)).
 
@@ -250,9 +250,9 @@ This example covers the concepts of `Estimator`, `Transformer`, and `Param`.
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [`Estimator` Scala docs](api/scala/index.html#org.apache.spark.ml.Estimator),
+Refer to the [`Estimator` Scala docs](api/scala/org/apache/spark/ml/Estimator.html),
-the [`Transformer` Scala docs](api/scala/index.html#org.apache.spark.ml.Transformer) and
+the [`Transformer` Scala docs](api/scala/org/apache/spark/ml/Transformer.html) and
-the [`Params` Scala docs](api/scala/index.html#org.apache.spark.ml.param.Params) for details on the API.
+the [`Params` Scala docs](api/scala/org/apache/spark/ml/param/Params.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/EstimatorTransformerParamExample.scala %}
 </div>
@@ -285,7 +285,7 @@ This example follows the simple text document `Pipeline` illustrated in the figu
 
 <div data-lang="scala" markdown="1">
 
-Refer to the [`Pipeline` Scala docs](api/scala/index.html#org.apache.spark.ml.Pipeline) for details on the API.
+Refer to the [`Pipeline` Scala docs](api/scala/org/apache/spark/ml/Pipeline.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/PipelineExample.scala %}
 </div>
@@ -50,7 +50,7 @@ correlation methods are currently Pearson's and Spearman's correlation.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`Correlation`](api/scala/index.html#org.apache.spark.ml.stat.Correlation$)
+[`Correlation`](api/scala/org/apache/spark/ml/stat/Correlation$.html)
 computes the correlation matrix for the input Dataset of Vectors using the specified method.
 The output will be a DataFrame that contains the correlation matrix of the column of vectors.
 
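For the default Pearson method, the entry `Correlation` computes for each pair of columns is the Pearson coefficient. A plain-Python sketch for two plain lists (Spark itself operates on a Dataset of Vectors):

```python
import math

def pearson(x, y):
    # Covariance of x and y divided by the product of their standard deviations.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0 (perfectly correlated)
```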
@@ -87,7 +87,7 @@ the Chi-squared statistic is computed. All label and feature values must be cate
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`ChiSquareTest` Scala docs](api/scala/index.html#org.apache.spark.ml.stat.ChiSquareTest$) for details on the API.
+Refer to the [`ChiSquareTest` Scala docs](api/scala/org/apache/spark/ml/stat/ChiSquareTest$.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/ml/ChiSquareTestExample.scala %}
 </div>
@@ -114,7 +114,7 @@ as well as the total count.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-The following example demonstrates using [`Summarizer`](api/scala/index.html#org.apache.spark.ml.stat.Summarizer$)
+The following example demonstrates using [`Summarizer`](api/scala/org/apache/spark/ml/stat/Summarizer$.html)
 to compute the mean and variance for a vector column of the input dataframe, with and without a weight column.
 
 {% include_example scala/org/apache/spark/examples/ml/SummarizerExample.scala %}
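The weighted mean and variance that `Summarizer` computes can be sketched in plain Python for a single numeric column (population-style normalization below; Spark's exact normalization may differ):

```python
def weighted_mean(xs, ws):
    # Sum of value-times-weight over the total weight.
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

def weighted_variance(xs, ws):
    # Weighted squared deviations from the weighted mean, over the total weight.
    m = weighted_mean(xs, ws)
    return sum(w * (x - m) ** 2 for x, w in zip(xs, ws)) / sum(ws)

xs = [1.0, 2.0, 3.0]
print(weighted_mean(xs, [1.0, 1.0, 1.0]))  # 2.0 (the unweighted case)
print(weighted_mean(xs, [0.0, 0.0, 1.0]))  # 3.0 (all weight on the last row)
```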
@@ -133,4 +133,4 @@ Refer to the [`Summarizer` Python docs](api/python/index.html#pyspark.ml.stat.Su
 {% include_example python/ml/summarizer_example.py %}
 </div>
 
 </div>
@@ -49,12 +49,12 @@ Built-in Cross-Validation and other tooling allow users to optimize hyperparamet
 An important task in ML is *model selection*, or using data to find the best model or parameters for a given task. This is also called *tuning*.
 Tuning may be done for individual `Estimator`s such as `LogisticRegression`, or for entire `Pipeline`s which include multiple algorithms, featurization, and other steps. Users can tune an entire `Pipeline` at once, rather than tuning each element in the `Pipeline` separately.
 
-MLlib supports model selection using tools such as [`CrossValidator`](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) and [`TrainValidationSplit`](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit).
+MLlib supports model selection using tools such as [`CrossValidator`](api/scala/org/apache/spark/ml/tuning/CrossValidator.html) and [`TrainValidationSplit`](api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html).
 These tools require the following items:
 
-* [`Estimator`](api/scala/index.html#org.apache.spark.ml.Estimator): algorithm or `Pipeline` to tune
+* [`Estimator`](api/scala/org/apache/spark/ml/Estimator.html): algorithm or `Pipeline` to tune
 * Set of `ParamMap`s: parameters to choose from, sometimes called a "parameter grid" to search over
-* [`Evaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.Evaluator): metric to measure how well a fitted `Model` does on held-out test data
+* [`Evaluator`](api/scala/org/apache/spark/ml/evaluation/Evaluator.html): metric to measure how well a fitted `Model` does on held-out test data
 
 At a high level, these model selection tools work as follows:
 
|
@@ -63,13 +63,13 @@ At a high level, these model selection tools work as follows:
 * For each `ParamMap`, they fit the `Estimator` using those parameters, get the fitted `Model`, and evaluate the `Model`'s performance using the `Evaluator`.
 * They select the `Model` produced by the best-performing set of parameters.

-The `Evaluator` can be a [`RegressionEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.RegressionEvaluator)
-for regression problems, a [`BinaryClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator)
-for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/index.html#org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator)
+The `Evaluator` can be a [`RegressionEvaluator`](api/scala/org/apache/spark/ml/evaluation/RegressionEvaluator.html)
+for regression problems, a [`BinaryClassificationEvaluator`](api/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.html)
+for binary data, or a [`MulticlassClassificationEvaluator`](api/scala/org/apache/spark/ml/evaluation/MulticlassClassificationEvaluator.html)
 for multiclass problems. The default metric used to choose the best `ParamMap` can be overridden by the `setMetricName`
 method in each of these evaluators.

-To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/index.html#org.apache.spark.ml.tuning.ParamGridBuilder) utility.
+To help construct the parameter grid, users can use the [`ParamGridBuilder`](api/scala/org/apache/spark/ml/tuning/ParamGridBuilder.html) utility.
 By default, sets of parameters from the parameter grid are evaluated in serial. Parameter evaluation can be done in parallel by setting `parallelism` with a value of 2 or more (a value of 1 will be serial) before running model selection with `CrossValidator` or `TrainValidationSplit`.
 The value of `parallelism` should be chosen carefully to maximize parallelism without exceeding cluster resources, and larger values may not always lead to improved performance. Generally speaking, a value up to 10 should be sufficient for most clusters.

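The parameter grid that `ParamGridBuilder` assembles is just the cross product of the candidate values supplied for each parameter. A minimal, dependency-free sketch of that expansion (the `Grid`/`expand` names are illustrative, not Spark API):

```scala
// Sketch: expand a "parameter grid" into every combination of candidate
// values, the way ParamGridBuilder builds its Array[ParamMap].
// Illustrative names only; not the Spark implementation.
object Grid {
  def expand(grid: Seq[(String, Seq[Double])]): Seq[Map[String, Double]] =
    grid.foldLeft(Seq(Map.empty[String, Double])) {
      case (acc, (name, values)) =>
        for (m <- acc; v <- values) yield m + (name -> v)
    }
}

val combos = Grid.expand(Seq(
  "regParam"        -> Seq(0.1, 0.01),
  "elasticNetParam" -> Seq(0.0, 0.5, 1.0)
))
// 2 x 3 candidate values -> 6 parameter maps to fit and evaluate
println(combos.size)
```

This is also why grids grow quickly: each added parameter multiplies the number of models to train, which is where the `parallelism` setting above pays off.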
@@ -93,7 +93,7 @@ However, it is also a well-established method for choosing parameters which is m

 <div data-lang="scala" markdown="1">

-Refer to the [`CrossValidator` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.CrossValidator) for details on the API.
+Refer to the [`CrossValidator` Scala docs](api/scala/org/apache/spark/ml/tuning/CrossValidator.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala %}
 </div>

@@ -133,7 +133,7 @@ Like `CrossValidator`, `TrainValidationSplit` finally fits the `Estimator` using

 <div data-lang="scala" markdown="1">

-Refer to the [`TrainValidationSplit` Scala docs](api/scala/index.html#org.apache.spark.ml.tuning.TrainValidationSplit) for details on the API.
+Refer to the [`TrainValidationSplit` Scala docs](api/scala/org/apache/spark/ml/tuning/TrainValidationSplit.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/ml/ModelSelectionViaTrainValidationSplitExample.scala %}
 </div>

@@ -55,12 +55,12 @@ initialization via k-means||.
 The following code snippets can be executed in `spark-shell`.

 In the following example after loading and parsing data, we use the
-[`KMeans`](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) object to cluster the data
+[`KMeans`](api/scala/org/apache/spark/mllib/clustering/KMeans.html) object to cluster the data
 into two clusters. The number of desired clusters is passed to the algorithm. We then compute Within
 Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact, the
 optimal *k* is usually one where there is an "elbow" in the WSSSE graph.

-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeans.html) and [`KMeansModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeansModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/KMeansExample.scala %}
 </div>

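The WSSSE the k-means guide text refers to is simply the sum, over all points, of the squared Euclidean distance to the nearest cluster center. A plain-Scala sketch of that quantity (illustrative only; MLlib computes the same thing over an RDD via `KMeansModel.computeCost`):

```scala
// Sketch: WSSSE = sum over points of the squared distance to the
// closest center. Pure Scala, no Spark dependency.
def squaredDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def wssse(points: Seq[Array[Double]], centers: Seq[Array[Double]]): Double =
  points.map(p => centers.map(c => squaredDist(p, c)).min).sum

val centers = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
val points  = Seq(Array(0.0, 1.0), Array(9.0, 10.0))
println(wssse(points, centers))  // 1.0 + 1.0 = 2.0
```

Plotting this value against *k* produces the curve whose "elbow" the text suggests using to pick *k*.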
@@ -111,11 +111,11 @@ has the following parameters:
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 In the following example after loading and parsing data, we use a
-[GaussianMixture](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture)
+[GaussianMixture](api/scala/org/apache/spark/mllib/clustering/GaussianMixture.html)
 object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.

-Refer to the [`GaussianMixture` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixtureModel) for details on the API.
+Refer to the [`GaussianMixture` Scala docs](api/scala/org/apache/spark/mllib/clustering/GaussianMixture.html) and [`GaussianMixtureModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/GaussianMixtureModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/GaussianMixtureExample.scala %}
 </div>

@@ -172,15 +172,15 @@ In the following, we show code snippets to demonstrate how to use PIC in `spark.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
+[`PowerIterationClustering`](api/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
 implements the PIC algorithm.
 It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
 affinity matrix.
 Calling `PowerIterationClustering.run` returns a
-[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
+[`PowerIterationClusteringModel`](api/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html),
 which contains the computed clustering assignments.

-Refer to the [`PowerIterationClustering` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel) for details on the API.
+Refer to the [`PowerIterationClustering` Scala docs](api/scala/org/apache/spark/mllib/clustering/PowerIterationClustering.html) and [`PowerIterationClusteringModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala %}
 </div>

@@ -278,9 +278,9 @@ separately.
 **Expectation Maximization**

 Implemented in
-[`EMLDAOptimizer`](api/scala/index.html#org.apache.spark.mllib.clustering.EMLDAOptimizer)
+[`EMLDAOptimizer`](api/scala/org/apache/spark/mllib/clustering/EMLDAOptimizer.html)
 and
-[`DistributedLDAModel`](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel).
+[`DistributedLDAModel`](api/scala/org/apache/spark/mllib/clustering/DistributedLDAModel.html).

 For the parameters provided to `LDA`:

@@ -350,13 +350,13 @@ perplexity of the provided `documents` given the inferred topics.
 **Examples**

 In the following example, we load word count vectors representing a corpus of documents.
-We then use [LDA](api/scala/index.html#org.apache.spark.mllib.clustering.LDA)
+We then use [LDA](api/scala/org/apache/spark/mllib/clustering/LDA.html)
 to infer three topics from the documents. The number of desired clusters is passed
 to the algorithm. We then output the topics, represented as probability distributions over words.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`LDA` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) and [`DistributedLDAModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel) for details on the API.
+Refer to the [`LDA` Scala docs](api/scala/org/apache/spark/mllib/clustering/LDA.html) and [`DistributedLDAModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/DistributedLDAModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/LatentDirichletAllocationExample.scala %}
 </div>

@@ -398,7 +398,7 @@ The implementation in MLlib has the following parameters:

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API.
+Refer to the [`BisectingKMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Scala docs](api/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
 </div>

@@ -451,7 +451,7 @@ This example shows how to estimate clusters on streaming data.
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`StreamingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) for details on the API.
+Refer to the [`StreamingKMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/StreamingKMeans.html) for details on the API.
 And Refer to [Spark Streaming Programming Guide](streaming-programming-guide.html#initializing) for details on StreamingContext.

 {% include_example scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala %}

@@ -76,11 +76,11 @@ best parameter learned from a sampled subset to the full dataset and expect simi

 <div data-lang="scala" markdown="1">
 In the following example, we load rating data. Each row consists of a user, a product and a rating.
-We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
+We use the default [ALS.train()](api/scala/org/apache/spark/mllib/recommendation/ALS$.html)
 method which assumes ratings are explicit. We evaluate the
 recommendation model by measuring the Mean Squared Error of rating prediction.

-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API.
+Refer to the [`ALS` Scala docs](api/scala/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}

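The Mean Squared Error that the ALS example reports is the average of the squared differences between actual and predicted ratings. A dependency-free sketch of the metric (the example itself computes it over joined RDDs of predictions and ratings):

```scala
// Sketch: Mean Squared Error over (actual, predicted) rating pairs --
// the evaluation metric used in the ALS recommendation example.
def mse(pairs: Seq[(Double, Double)]): Double =
  pairs.map { case (r, p) => val e = r - p; e * e }.sum / pairs.size

// Hypothetical (actual, predicted) pairs for illustration:
val ratesAndPreds = Seq((5.0, 4.5), (3.0, 3.5), (1.0, 1.0))
println(mse(ratesAndPreds))  // (0.25 + 0.25 + 0.0) / 3
```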
@@ -42,13 +42,13 @@ of the vector.
 <div data-lang="scala" markdown="1">

 The base class of local vectors is
-[`Vector`](api/scala/index.html#org.apache.spark.mllib.linalg.Vector), and we provide two
-implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseVector) and
-[`SparseVector`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseVector). We recommend
+[`Vector`](api/scala/org/apache/spark/mllib/linalg/Vector.html), and we provide two
+implementations: [`DenseVector`](api/scala/org/apache/spark/mllib/linalg/DenseVector.html) and
+[`SparseVector`](api/scala/org/apache/spark/mllib/linalg/SparseVector.html). We recommend
 using the factory methods implemented in
-[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
+[`Vectors`](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) to create local vectors.

-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
+Refer to the [`Vector` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

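The dense and sparse implementations store the same mathematical vector; a sparse vector is a `(size, indices, values)` triple. A small illustrative sketch (not the `mllib.linalg` classes) showing that expanding the sparse form reproduces the dense array, e.g. for the vector `(1.0, 0.0, 3.0)`:

```scala
// Sketch: expand a sparse (size, indices, values) triple into its dense
// equivalent. Illustrative only; Vectors.dense / Vectors.sparse build
// the real MLlib types.
def sparseToDense(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  val dense = Array.fill(size)(0.0)
  indices.zip(values).foreach { case (i, v) => dense(i) = v }
  dense
}

val dense  = Array(1.0, 0.0, 3.0)
val sparse = sparseToDense(3, Array(0, 2), Array(1.0, 3.0))
println(sparse.sameElements(dense))  // true
```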
@@ -138,9 +138,9 @@ For multiclass classification, labels should be class indices starting from zero
 <div data-lang="scala" markdown="1">

 A labeled point is represented by the case class
-[`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint).
+[`LabeledPoint`](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html).

-Refer to the [`LabeledPoint` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) for details on the API.
+Refer to the [`LabeledPoint` Scala docs](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.Vectors

@@ -211,10 +211,10 @@ After loading, the feature indices are converted to zero-based.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
+[`MLUtils.loadLibSVMFile`](api/scala/org/apache/spark/mllib/util/MLUtils$.html) reads training
 examples stored in LIBSVM format.

-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API.
+Refer to the [`MLUtils` Scala docs](api/scala/org/apache/spark/mllib/util/MLUtils$.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.regression.LabeledPoint

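A LIBSVM line has the shape `label index1:value1 index2:value2 ...` with one-based feature indices, and the hunk's context notes that the indices are converted to zero-based after loading. A minimal single-line parser sketch (illustrative; not the `MLUtils` implementation):

```scala
// Sketch: parse one LIBSVM-format line into a label plus
// (zero-based index, value) pairs, mirroring the conversion
// MLUtils.loadLibSVMFile performs per line.
def parseLibSVMLine(line: String): (Double, Seq[(Int, Double)]) = {
  val tokens = line.trim.split("\\s+")
  val label = tokens.head.toDouble
  val features = tokens.tail.toSeq.map { t =>
    val Array(i, v) = t.split(":")
    (i.toInt - 1, v.toDouble)  // LIBSVM indices are one-based
  }
  (label, features)
}

val (label, feats) = parseLibSVMLine("1 1:0.5 3:2.0")
println(label)  // 1.0
println(feats)  // List((0,0.5), (2,2.0))
```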
@@ -272,14 +272,14 @@ is stored in a one-dimensional array `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]` with the m
 <div data-lang="scala" markdown="1">

 The base class of local matrices is
-[`Matrix`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix), and we provide two
-implementations: [`DenseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.DenseMatrix),
-and [`SparseMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.SparseMatrix).
+[`Matrix`](api/scala/org/apache/spark/mllib/linalg/Matrix.html), and we provide two
+implementations: [`DenseMatrix`](api/scala/org/apache/spark/mllib/linalg/DenseMatrix.html),
+and [`SparseMatrix`](api/scala/org/apache/spark/mllib/linalg/SparseMatrix.html).
 We recommend using the factory methods implemented
-in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
+in [`Matrices`](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) to create local
 matrices. Remember, local matrices in MLlib are stored in column-major order.

-Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API.
+Refer to the [`Matrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Scala docs](api/scala/org/apache/spark/mllib/linalg/Matrices$.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.{Matrix, Matrices}

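Column-major order means entry (i, j) of an m-by-n matrix sits at position i + j * m in the values array. A quick sketch checking this against the hunk's own 3x2 example, whose values array is `[1.0, 3.0, 5.0, 2.0, 4.0, 6.0]`:

```scala
// Sketch: look up entry (i, j) of a column-major m-by-n dense matrix.
// Pure Scala; MLlib's DenseMatrix uses the same layout.
def entry(values: Array[Double], numRows: Int, i: Int, j: Int): Double =
  values(i + j * numRows)

val values = Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0)  // 3 rows, 2 columns
println(entry(values, 3, 0, 1))  // 2.0  (row 0, column 1)
println(entry(values, 3, 2, 0))  // 5.0  (row 2, column 0)
```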
@@ -377,12 +377,12 @@ limited by the integer range but it should be much smaller in practice.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-A [`RowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) can be
+A [`RowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be
 created from an `RDD[Vector]` instance. Then we can compute its column summary statistics and decompositions.
 [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix.
 For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html).

-Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API.
+Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.Vector

@@ -463,13 +463,13 @@ vector.
 <div data-lang="scala" markdown="1">

 An
-[`IndexedRowMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix)
+[`IndexedRowMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html)
 can be created from an `RDD[IndexedRow]` instance, where
-[`IndexedRow`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRow) is a
+[`IndexedRow`](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRow.html) is a
 wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping
 its row indices.

-Refer to the [`IndexedRowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix) for details on the API.
+Refer to the [`IndexedRowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix}

@@ -568,14 +568,14 @@ dimensions of the matrix are huge and the matrix is very sparse.
 <div data-lang="scala" markdown="1">

 A
-[`CoordinateMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix)
+[`CoordinateMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html)
 can be created from an `RDD[MatrixEntry]` instance, where
-[`MatrixEntry`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.MatrixEntry) is a
+[`MatrixEntry`](api/scala/org/apache/spark/mllib/linalg/distributed/MatrixEntry.html) is a
 wrapper over `(Long, Long, Double)`. A `CoordinateMatrix` can be converted to an `IndexedRowMatrix`
 with sparse rows by calling `toIndexedRowMatrix`. Other computations for
 `CoordinateMatrix` are not currently supported.

-Refer to the [`CoordinateMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix) for details on the API.
+Refer to the [`CoordinateMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

@@ -678,12 +678,12 @@ the sub-matrix at the given index with size `rowsPerBlock` x `colsPerBlock`.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-A [`BlockMatrix`](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix) can be
+A [`BlockMatrix`](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) can be
 most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix`.
 `toBlockMatrix` creates blocks of size 1024 x 1024 by default.
 Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`.

-Refer to the [`BlockMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix) for details on the API.
+Refer to the [`BlockMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API.

 {% highlight scala %}
 import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry}

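With blocks of size `rowsPerBlock` x `colsPerBlock`, entry (i, j) of the full matrix lands in block (i / rowsPerBlock, j / colsPerBlock). A tiny sketch of that index arithmetic (illustrative; the `blockIndex` helper is not a Spark API):

```scala
// Sketch: which block of a BlockMatrix holds global entry (i, j),
// given the per-block dimensions (1024 x 1024 by default).
def blockIndex(i: Long, j: Long, rowsPerBlock: Int, colsPerBlock: Int): (Long, Long) =
  (i / rowsPerBlock, j / colsPerBlock)

println(blockIndex(2048, 1023, 1024, 1024))  // (2,0)
println(blockIndex(1024, 1024, 1024, 1024))  // (1,1)
```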
@@ -151,7 +151,7 @@ When tuning these parameters, be careful to validate on held-out test data to av

 * **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.

-* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees.
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) since those are often trained deeper than individual trees.

 * **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).

@@ -167,13 +167,13 @@ These parameters may be tuned. Be careful to validate on held-out test data whe
   * The default value is conservatively chosen to be 256 MiB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
   * *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than 1 node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per features). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics.

-* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`GradientBoostedTrees`](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.

 * **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.

 ### Caching and checkpointing

-MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) when `numTrees` is set to be large.
|
||||||
|
|
||||||
* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
|
* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
|
||||||
* This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
|
* This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
|
||||||
|
@ -207,7 +207,7 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a
|
||||||
<div class="codetabs">
|
<div class="codetabs">
|
||||||
|
|
||||||
<div data-lang="scala" markdown="1">
|
<div data-lang="scala" markdown="1">
|
||||||
Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API.
|
Refer to the [`DecisionTree` Scala docs](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API.
|
||||||
|
|
||||||
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala %}
|
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala %}
|
||||||
</div>
|
</div>
|
||||||
|
@ -238,7 +238,7 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate
|
||||||
<div class="codetabs">
|
<div class="codetabs">
|
||||||
|
|
||||||
<div data-lang="scala" markdown="1">
|
<div data-lang="scala" markdown="1">
|
||||||
Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API.
|
Refer to the [`DecisionTree` Scala docs](api/scala/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API.
|
||||||
|
|
||||||
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeRegressionExample.scala %}
|
{% include_example scala/org/apache/spark/examples/mllib/DecisionTreeRegressionExample.scala %}
|
||||||
</div>
|
</div>
|
||||||
|
|
|
@@ -77,7 +77,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`SingularValueDecomposition` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition) for details on the API.
+Refer to the [`SingularValueDecomposition` Scala docs](api/scala/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/SVDExample.scala %}

@@ -117,14 +117,14 @@ the rotation matrix are called principal components. PCA is used widely in dimen
 The following code demonstrates how to compute principal components on a `RowMatrix`
 and use them to project the vectors into a low-dimensional space.

-Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API.
+Refer to the [`RowMatrix` Scala docs](api/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/PCAOnRowMatrixExample.scala %}

 The following code demonstrates how to compute principal components on source vectors
 and use them to project the vectors into a low-dimensional space while keeping associated labels:

-Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API.
+Refer to the [`PCA` Scala docs](api/scala/org/apache/spark/mllib/feature/PCA.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/PCAOnSourceVectorExample.scala %}
@@ -24,7 +24,7 @@ license: |

 An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
 is a learning algorithm which creates a model composed of a set of other base models.
-`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`RandomForest`](api/scala/org/apache/spark/mllib/tree/RandomForest$.html).
 Both use [decision trees](mllib-decision-tree.html) as their base models.

 ## Gradient-Boosted Trees vs. Random Forests
@@ -111,7 +111,7 @@ The test error is calculated to measure the algorithm accuracy.
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`RandomForestModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
 </div>
@@ -142,7 +142,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/org/apache/spark/mllib/tree/RandomForest$.html) and [`RandomForestModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
 </div>
@@ -252,7 +252,7 @@ The test error is calculated to measure the algorithm accuracy.
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API.
+Refer to the [`GradientBoostedTrees` Scala docs](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala %}
 </div>
@@ -283,7 +283,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API.
+Refer to the [`GradientBoostedTrees` Scala docs](api/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Scala docs](api/scala/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala %}
 </div>
@@ -117,7 +117,7 @@ The following code snippets illustrate how to load a sample dataset, train a bin
 data, and evaluate the performance of the algorithm by several binary evaluation metrics.

 <div data-lang="scala" markdown="1">
-Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`BinaryClassificationMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics) for details on the API.
+Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`BinaryClassificationMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala %}

@@ -243,7 +243,7 @@ The following code snippets illustrate how to load a sample dataset, train a mul
 the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics.

 <div data-lang="scala" markdown="1">
-Refer to the [`MulticlassMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics) for details on the API.
+Refer to the [`MulticlassMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/MulticlassMetricsExample.scala %}

@@ -393,7 +393,7 @@ True classes:
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`MultilabelMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MultilabelMetrics) for details on the API.
+Refer to the [`MultilabelMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/MultiLabelMetricsExample.scala %}

@@ -521,7 +521,7 @@ expanded world of non-positive weights are "the same as never having interacted
 <div class="codetabs">

 <div data-lang="scala" markdown="1">
-Refer to the [`RegressionMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RankingMetrics) for details on the API.
+Refer to the [`RegressionMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Scala docs](api/scala/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/RankingMetricsExample.scala %}
@@ -69,12 +69,12 @@ We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-TF and IDF are implemented in [HashingTF](api/scala/index.html#org.apache.spark.mllib.feature.HashingTF)
-and [IDF](api/scala/index.html#org.apache.spark.mllib.feature.IDF).
+TF and IDF are implemented in [HashingTF](api/scala/org/apache/spark/mllib/feature/HashingTF.html)
+and [IDF](api/scala/org/apache/spark/mllib/feature/IDF.html).
 `HashingTF` takes an `RDD[Iterable[_]]` as the input.
 Each record could be an iterable of strings or other types.

-Refer to the [`HashingTF` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.HashingTF) for details on the API.
+Refer to the [`HashingTF` Scala docs](api/scala/org/apache/spark/mllib/feature/HashingTF.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/TFIDFExample.scala %}
 </div>
@@ -135,7 +135,7 @@ Here we assume the extracted file is `text8` and in same directory as you run th

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`Word2Vec` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Word2Vec) for details on the API.
+Refer to the [`Word2Vec` Scala docs](api/scala/org/apache/spark/mllib/feature/Word2Vec.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/Word2VecExample.scala %}
 </div>
@@ -159,19 +159,19 @@ against features with very large variances exerting an overly large influence du

 ### Model Fitting

-[`StandardScaler`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) has the
+[`StandardScaler`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) has the
 following parameters in the constructor:

 * `withMean` False by default. Centers the data with mean before scaling. It will build a dense
 output, so take care when applying to sparse input.
 * `withStd` True by default. Scales the data to unit standard deviation.

-We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) method in
+We provide a [`fit`](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) method in
 `StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
 return a model which can transform the input dataset into unit standard deviation and/or zero mean features
 depending how we configure the `StandardScaler`.

-This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+This model implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
 which can apply the standardization on a `Vector` to produce a transformed `Vector` or on
 an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

@@ -185,7 +185,7 @@ so that the new features have unit standard deviation and/or zero mean.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`StandardScaler` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) for details on the API.
+Refer to the [`StandardScaler` Scala docs](api/scala/org/apache/spark/mllib/feature/StandardScaler.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/StandardScalerExample.scala %}
 </div>
@@ -203,12 +203,12 @@ Normalizer scales individual samples to have unit $L^p$ norm. This is a common o
 classification or clustering. For example, the dot product of two $L^2$ normalized TF-IDF vectors
 is the cosine similarity of the vectors.

-[`Normalizer`](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) has the following
+[`Normalizer`](api/scala/org/apache/spark/mllib/feature/Normalizer.html) has the following
 parameter in the constructor:

 * `p` Normalization in $L^p$ space, $p = 2$ by default.

-`Normalizer` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
+`Normalizer` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html)
 which can apply the normalization on a `Vector` to produce a transformed `Vector` or on
 an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

@@ -221,7 +221,7 @@ with $L^2$ norm, and $L^\infty$ norm.

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-Refer to the [`Normalizer` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) for details on the API.
+Refer to the [`Normalizer` Scala docs](api/scala/org/apache/spark/mllib/feature/Normalizer.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/NormalizerExample.scala %}
 </div>
@@ -239,7 +239,7 @@ Refer to the [`Normalizer` Python docs](api/python/pyspark.mllib.html#pyspark.ml
 features for use in model construction. It reduces the size of the feature space, which can improve
 both speed and statistical learning behavior.

-[`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements
+[`ChiSqSelector`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) implements
 Chi-Squared feature selection. It operates on labeled data with categorical features. ChiSqSelector uses the
 [Chi-Squared test of independence](https://en.wikipedia.org/wiki/Chi-squared_test) to decide which
 features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
@@ -257,7 +257,7 @@ The number of features to select can be tuned using a held-out validation set.

 ### Model Fitting

-The [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method takes
+The [`fit`](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html) method takes
 an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then
 returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space.
 The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to
@@ -272,7 +272,7 @@ The following example shows the basic use of ChiSqSelector. The data set used ha
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-Refer to the [`ChiSqSelector` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector)
+Refer to the [`ChiSqSelector` Scala docs](api/scala/org/apache/spark/mllib/feature/ChiSqSelector.html)
 for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/ChiSqSelectorExample.scala %}
@@ -312,11 +312,11 @@ v_N
 \end{pmatrix}
 \]`

-[`ElementwiseProduct`](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) has the following parameter in the constructor:
+[`ElementwiseProduct`](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) has the following parameter in the constructor:

 * `scalingVec`: the transforming vector.

-`ElementwiseProduct` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.
+`ElementwiseProduct` implements [`VectorTransformer`](api/scala/org/apache/spark/mllib/feature/VectorTransformer.html) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`.

 ### Example

@@ -325,7 +325,7 @@ This example below demonstrates how to transform vectors using a transforming ve
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-Refer to the [`ElementwiseProduct` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) for details on the API.
+Refer to the [`ElementwiseProduct` Scala docs](api/scala/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/ElementwiseProductExample.scala %}
 </div>
@@ -54,18 +54,18 @@ We refer users to the papers for more details.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`FPGrowth`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) implements the
+[`FPGrowth`](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) implements the
 FP-growth algorithm.
 It takes an `RDD` of transactions, where each transaction is an `Array` of items of a generic type.
 Calling `FPGrowth.run` with transactions returns an
-[`FPGrowthModel`](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowthModel)
+[`FPGrowthModel`](api/scala/org/apache/spark/mllib/fpm/FPGrowthModel.html)
 that stores the frequent itemsets with their frequencies. The following
 example illustrates how to mine frequent itemsets and association rules
 (see [Association
 Rules](mllib-frequent-pattern-mining.html#association-rules) for
 details) from `transactions`.

-Refer to the [`FPGrowth` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) for details on the API.
+Refer to the [`FPGrowth` Scala docs](api/scala/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/SimpleFPGrowth.scala %}

@@ -111,7 +111,7 @@ Refer to the [`FPGrowth` Python docs](api/python/pyspark.mllib.html#pyspark.mlli

 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[AssociationRules](api/scala/index.html#org.apache.spark.mllib.fpm.AssociationRules)
+[AssociationRules](api/scala/org/apache/spark/mllib/fpm/AssociationRules.html)
 implements a parallel rule generation algorithm for constructing rules
 that have a single item as the consequent.

@@ -168,13 +168,13 @@ The following example illustrates PrefixSpan running on the sequences
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the
+[`PrefixSpan`](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
 PrefixSpan algorithm.
 Calling `PrefixSpan.run` returns a
-[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel)
+[`PrefixSpanModel`](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
 that stores the frequent sequences with their frequencies.

-Refer to the [`PrefixSpan` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) and [`PrefixSpanModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) for details on the API.
+Refer to the [`PrefixSpan` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Scala docs](api/scala/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API.

 {% include_example scala/org/apache/spark/examples/mllib/PrefixSpanExample.scala %}
@@ -74,7 +74,7 @@ i.e. 4710.28,500.00. The data are split to training and testing set.
 Model is created using the training set and a mean squared error is calculated from the predicted
 labels and real labels in the test set.
 
-Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel) for details on the API.
+Refer to the [`IsotonicRegression` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala %}
 </div>
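The isotonic regression hunk above mentions computing a mean squared error between predicted and real labels. As a reminder of the metric it refers to — a generic sketch, not Spark's implementation:

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predicted and real labels."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
```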
@@ -184,7 +184,7 @@ training algorithm on this training data using a static method in the algorithm
 object, and make predictions with the resulting model to compute the training
 error.
 
-Refer to the [`SVMWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) and [`SVMModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMModel) for details on the API.
+Refer to the [`SVMWithSGD` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Scala docs](api/scala/org/apache/spark/mllib/classification/SVMModel.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/SVMWithSGDExample.scala %}
 
@@ -305,11 +305,11 @@ We recommend L-BFGS over mini-batch gradient descent for faster convergence.
 <div data-lang="scala" markdown="1">
 The following code illustrates how to load a sample multiclass dataset, split it into train and
 test, and use
-[LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS)
+[LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
 to fit a logistic regression model.
 Then the model is evaluated against the test dataset and saved to disk.
 
-Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`LogisticRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel) for details on the API.
+Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`LogisticRegressionModel` Scala docs](api/scala/org/apache/spark/mllib/classification/LogisticRegressionModel.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/LogisticRegressionWithLBFGSExample.scala %}
 
@@ -438,8 +438,8 @@ regularization parameter (`regParam`) along with various parameters associated w
 gradient descent (`stepSize`, `numIterations`, `miniBatchFraction`). For each of them, we support
 all three possible regularizations (none, L1 or L2).
 
-For Logistic Regression, [L-BFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
-version is implemented under [LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS), and this
+For Logistic Regression, [L-BFGS](api/scala/org/apache/spark/mllib/optimization/LBFGS.html)
+version is implemented under [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html), and this
 version supports both binary and multinomial Logistic Regression while SGD version only supports
 binary Logistic Regression. However, L-BFGS version doesn't support L1 regularization but SGD one
 supports L1 regularization. When L1 regularization is not required, L-BFGS version is strongly
@@ -448,10 +448,10 @@ inverse Hessian matrix using quasi-Newton method.
 
 Algorithms are all implemented in Scala:
 
-* [SVMWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
-* [LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS)
-* [LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
-* [LinearRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
-* [RidgeRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
-* [LassoWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
+* [SVMWithSGD](api/scala/org/apache/spark/mllib/classification/SVMWithSGD.html)
+* [LogisticRegressionWithLBFGS](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html)
+* [LogisticRegressionWithSGD](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html)
+* [LinearRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html)
+* [RidgeRegressionWithSGD](api/scala/org/apache/spark/mllib/regression/RidgeRegressionWithSGD.html)
+* [LassoWithSGD](api/scala/org/apache/spark/mllib/regression/LassoWithSGD.html)
 
@@ -46,14 +46,14 @@ sparsity. Since the training data is only used once, it is not necessary to cach
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[NaiveBayes](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
+[NaiveBayes](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) implements
 multinomial naive Bayes. It takes an RDD of
-[LabeledPoint](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
+[LabeledPoint](api/scala/org/apache/spark/mllib/regression/LabeledPoint.html) and an optional
 smoothing parameter `lambda` as input, an optional model type parameter (default is "multinomial"), and outputs a
-[NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
+[NaiveBayesModel](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html), which
 can be used for evaluation and prediction.
 
-Refer to the [`NaiveBayes` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel) for details on the API.
+Refer to the [`NaiveBayes` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayes$.html) and [`NaiveBayesModel` Scala docs](api/scala/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala %}
 </div>
@@ -111,12 +111,12 @@ As an alternative to just use the subgradient `$R'(\wv)$` of the regularizer in
 direction, an improved update for some cases can be obtained by using the proximal operator
 instead.
 For the L1-regularizer, the proximal operator is given by soft thresholding, as implemented in
-[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater).
+[L1Updater](api/scala/org/apache/spark/mllib/optimization/L1Updater.html).
 
 
 ### Update schemes for distributed SGD
 The SGD implementation in
-[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses
+[GradientDescent](api/scala/org/apache/spark/mllib/optimization/GradientDescent.html) uses
 a simple (distributed) sampling of the data examples.
 We recall that the loss part of the optimization problem `$\eqref{eq:regPrimal}$` is
 `$\frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)$`, and therefore `$\frac1n \sum_{i=1}^n L'_{\wv,i}$` would
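The context above notes that the proximal operator for the L1 regularizer is soft thresholding (the logic `L1Updater` implements). The scalar form of that operator can be sketched independently of Spark as:

```python
import math

def soft_threshold(w, shrinkage):
    """Proximal operator of shrinkage * |w|: move w toward zero, clamping at zero."""
    return math.copysign(max(0.0, abs(w) - shrinkage), w)
```

Weights whose magnitude is below the shrinkage are set exactly to zero, which is what makes L1 regularization produce sparse models.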
@@ -169,7 +169,7 @@ are developed, see the
 section for example.
 
 The SGD class
-[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+[GradientDescent](api/scala/org/apache/spark/mllib/optimization/GradientDescent.html)
 sets the following parameters:
 
 * `Gradient` is a class that computes the stochastic gradient of the function
@@ -195,15 +195,15 @@ each iteration, to compute the gradient direction.
 L-BFGS is currently only a low-level optimization primitive in `MLlib`. If you want to use L-BFGS in various
 ML algorithms such as Linear Regression, and Logistic Regression, you have to pass the gradient of objective
 function, and updater into optimizer yourself instead of using the training APIs like
-[LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD).
+[LogisticRegressionWithSGD](api/scala/org/apache/spark/mllib/classification/LogisticRegressionWithSGD.html).
 See the example below. It will be addressed in the next release.
 
 The L1 regularization by using
-[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater) will not work since the
+[L1Updater](api/scala/org/apache/spark/mllib/optimization/L1Updater.html) will not work since the
 soft-thresholding logic in L1Updater is designed for gradient descent. See the developer's note.
 
 The L-BFGS method
-[LBFGS.runLBFGS](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS)
+[LBFGS.runLBFGS](api/scala/org/apache/spark/mllib/optimization/LBFGS.html)
 has the following parameters:
 
 * `Gradient` is a class that computes the gradient of the objective function
@@ -233,7 +233,7 @@ L-BFGS optimizer.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-Refer to the [`LBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) and [`SquaredL2Updater` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.SquaredL2Updater) for details on the API.
+Refer to the [`LBFGS` Scala docs](api/scala/org/apache/spark/mllib/optimization/LBFGS.html) and [`SquaredL2Updater` Scala docs](api/scala/org/apache/spark/mllib/optimization/SquaredL2Updater.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/LBFGSExample.scala %}
 </div>
@@ -62,7 +62,7 @@ To export a supported `model` (see table above) to PMML, simply call `model.toPM
 
 As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats.
 
-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/org/apache/spark/mllib/clustering/KMeans.html) and [`Vectors` Scala docs](api/scala/org/apache/spark/mllib/linalg/Vectors$.html) for details on the API.
 
 Here a complete example of building a KMeansModel and print it out in PMML format:
 {% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}
@@ -48,12 +48,12 @@ available in `Statistics`.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-[`colStats()`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) returns an instance of
-[`MultivariateStatisticalSummary`](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary),
+[`colStats()`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) returns an instance of
+[`MultivariateStatisticalSummary`](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html),
 which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the
 total count.
 
-Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary) for details on the API.
+Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/SummaryStatisticsExample.scala %}
 </div>
@@ -91,11 +91,11 @@ correlation methods are currently Pearson's and Spearman's correlation.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
 calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
 an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
 </div>
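The hunk above covers the correlation section, which supports Pearson's correlation among others. The statistic itself — sketched here in plain Python as a reference, not the `Statistics.corr` implementation:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```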
@@ -137,7 +137,7 @@ python.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`sampleByKeyExact()`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) allows users to
+[`sampleByKeyExact()`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) allows users to
 sample exactly $\lceil f_k \cdot n_k \rceil \, \forall k \in K$ items, where $f_k$ is the desired
 fraction for key $k$, $n_k$ is the number of key-value pairs for key $k$, and $K$ is the set of
 keys. Sampling without replacement requires one additional pass over the RDD to guarantee sample
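The stratified sampling context above cites the target of $\lceil f_k \cdot n_k \rceil$ items per key. Computing those per-key target sizes is straightforward — a hedged sketch, not the `sampleByKeyExact` implementation:

```python
import math

def exact_sample_sizes(counts, fractions):
    """ceil(f_k * n_k) for each key k: the exact per-stratum sample sizes."""
    return {k: math.ceil(fractions[k] * n) for k, n in counts.items()}
```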
@@ -181,7 +181,7 @@ independence tests.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
 run Pearson's chi-squared tests. The following example demonstrates how to run and interpret
 hypothesis tests.
 
@@ -221,11 +221,11 @@ message.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`Statistics`](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) provides methods to
+[`Statistics`](api/scala/org/apache/spark/mllib/stat/Statistics$.html) provides methods to
 run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
 and interpret the hypothesis tests.
 
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/org/apache/spark/mllib/stat/Statistics$.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
 </div>
@@ -269,7 +269,7 @@ all prior batches.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`StreamingTest`](api/scala/index.html#org.apache.spark.mllib.stat.test.StreamingTest)
+[`StreamingTest`](api/scala/org/apache/spark/mllib/stat/test/StreamingTest.html)
 provides streaming hypothesis testing.
 
 {% include_example scala/org/apache/spark/examples/mllib/StreamingTestExample.scala %}
@@ -292,12 +292,12 @@ uniform, standard normal, or Poisson.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
+[`RandomRDDs`](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) provides factory
 methods to generate random double RDDs or vector RDDs.
 The following example generates a random double RDD, whose values follows the standard normal
 distribution `N(0, 1)`, and then map it to `N(1, 4)`.
 
-Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API.
+Refer to the [`RandomRDDs` Scala docs](api/scala/org/apache/spark/mllib/random/RandomRDDs$.html) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.SparkContext
@@ -370,11 +370,11 @@ mean of PDFs of normal distributions centered around each of the samples.
 <div class="codetabs">
 
 <div data-lang="scala" markdown="1">
-[`KernelDensity`](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) provides methods
+[`KernelDensity`](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) provides methods
 to compute kernel density estimates from an RDD of samples. The following example demonstrates how
 to do so.
 
-Refer to the [`KernelDensity` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) for details on the API.
+Refer to the [`KernelDensity` Scala docs](api/scala/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/KernelDensityEstimationExample.scala %}
 </div>
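The hunk header above describes the Gaussian kernel density estimate as the mean of normal PDFs centered at each sample. That definition can be written down directly — a dependency-free sketch, not Spark's `KernelDensity`:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a density function: the mean of Gaussian PDFs centered at each sample."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density
```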
@@ -57,7 +57,7 @@ scala> val textFile = spark.read.textFile("README.md")
 textFile: org.apache.spark.sql.Dataset[String] = [value: string]
 {% endhighlight %}
 
-You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/index.html#org.apache.spark.sql.Dataset)_.
+You can get values from Dataset directly, by calling some actions, or transform the Dataset to get a new one. For more details, please read the _[API doc](api/scala/org/apache/spark/sql/Dataset.html)_.
 
 {% highlight scala %}
 scala> textFile.count() // Number of items in this Dataset
@@ -149,8 +149,8 @@ $ PYSPARK_PYTHON=/opt/pypy-2.5/bin/pypy bin/spark-submit examples/src/main/pytho
 
 <div data-lang="scala" markdown="1">
 
-The first thing a Spark program must do is to create a [SparkContext](api/scala/index.html#org.apache.spark.SparkContext) object, which tells Spark
-how to access a cluster. To create a `SparkContext` you first need to build a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object
+The first thing a Spark program must do is to create a [SparkContext](api/scala/org/apache/spark/SparkContext.html) object, which tells Spark
+how to access a cluster. To create a `SparkContext` you first need to build a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object
 that contains information about your application.
 
 Only one SparkContext should be active per JVM. You must `stop()` the active SparkContext before creating a new one.
@@ -500,7 +500,7 @@ then this approach should work well for such cases.
 
 If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to
 transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.
-A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided
+A [Converter](api/scala/org/apache/spark/api/python/Converter.html) trait is provided
 for this. Simply extend this trait and implement your transformation code in the ```convert```
 method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark
 classpath.
@@ -856,7 +856,7 @@ by a key.
 In Scala, these operations are automatically available on RDDs containing
 [Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) objects
 (the built-in tuples in the language, created by simply writing `(a, b)`). The key-value pair operations are available in the
-[PairRDDFunctions](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) class,
+[PairRDDFunctions](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) class,
 which automatically wraps around an RDD of tuples.
 
 For example, the following code uses the `reduceByKey` operation on key-value pairs to count how
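The context above introduces `reduceByKey` on key-value pairs. Its per-key fold semantics can be sketched on a local collection, with no Spark dependency (an illustration of the contract, not the distributed implementation):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, f):
    """Fold the values of each key with f, mirroring reduceByKey's semantics."""
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(f, vs) for k, vs in grouped.items()}
```

With `f = lambda x, y: x + y` and one pair per word occurrence, this is exactly the word-count pattern the surrounding text describes.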
@@ -946,12 +946,12 @@ We could also use `counts.sortByKey()`, for example, to sort the pairs alphabeti
 
 The following table lists some of the common transformations supported by Spark. Refer to the
 RDD API doc
-([Scala](api/scala/index.html#org.apache.spark.rdd.RDD),
+([Scala](api/scala/org/apache/spark/rdd/RDD.html),
 [Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
 [Python](api/python/pyspark.html#pyspark.RDD),
 [R](api/R/index.html))
 and pair RDD functions doc
-([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
+([Scala](api/scala/org/apache/spark/rdd/PairRDDFunctions.html),
 [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
 for details.
 
@@ -1060,13 +1060,13 @@ for details.
 
 The following table lists some of the common actions supported by Spark. Refer to the
 RDD API doc
-([Scala](api/scala/index.html#org.apache.spark.rdd.RDD),
+([Scala](api/scala/org/apache/spark/rdd/RDD.html),
 [Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
 [Python](api/python/pyspark.html#pyspark.RDD),
 [R](api/R/index.html))
 
 and pair RDD functions doc
-([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
+([Scala](api/scala/org/apache/spark/rdd/PairRDDFunctions.html),
 [Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
 for details.
 
@@ -1208,7 +1208,7 @@ In addition, each persisted RDD can be stored using a different *storage level*,
 to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space),
 replicate it across nodes.
 These levels are set by passing a
-`StorageLevel` object ([Scala](api/scala/index.html#org.apache.spark.storage.StorageLevel),
+`StorageLevel` object ([Scala](api/scala/org/apache/spark/storage/StorageLevel.html),
 [Java](api/java/index.html?org/apache/spark/storage/StorageLevel.html),
 [Python](api/python/pyspark.html#pyspark.StorageLevel))
 to `persist()`. The `cache()` method is a shorthand for using the default storage level,
@@ -1404,11 +1404,11 @@ res2: Long = 10
 {% endhighlight %}

 While this code used the built-in support for accumulators of type Long, programmers can also
-create their own types by subclassing [AccumulatorV2](api/scala/index.html#org.apache.spark.util.AccumulatorV2).
+create their own types by subclassing [AccumulatorV2](api/scala/org/apache/spark/util/AccumulatorV2.html).
 The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting
 the accumulator to zero, `add` for adding another value into the accumulator,
 `merge` for merging another same-type accumulator into this one. Other methods that must be overridden
-are contained in the [API documentation](api/scala/index.html#org.apache.spark.util.AccumulatorV2). For example, supposing we had a `MyVector` class
+are contained in the [API documentation](api/scala/org/apache/spark/util/AccumulatorV2.html). For example, supposing we had a `MyVector` class
 representing mathematical vectors, we could write:

 {% highlight scala %}
@@ -1457,11 +1457,11 @@ accum.value();
 {% endhighlight %}

 While this code used the built-in support for accumulators of type Long, programmers can also
-create their own types by subclassing [AccumulatorV2](api/scala/index.html#org.apache.spark.util.AccumulatorV2).
+create their own types by subclassing [AccumulatorV2](api/scala/org/apache/spark/util/AccumulatorV2.html).
 The AccumulatorV2 abstract class has several methods which one has to override: `reset` for resetting
 the accumulator to zero, `add` for adding another value into the accumulator,
 `merge` for merging another same-type accumulator into this one. Other methods that must be overridden
-are contained in the [API documentation](api/scala/index.html#org.apache.spark.util.AccumulatorV2). For example, supposing we had a `MyVector` class
+are contained in the [API documentation](api/scala/org/apache/spark/util/AccumulatorV2.html). For example, supposing we had a `MyVector` class
 representing mathematical vectors, we could write:

 {% highlight java %}
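The diff context above spells out the `AccumulatorV2` contract twice (once each for the Scala and Java guides): a subclass must provide `reset` to zero the state, `add` to fold in one value, and `merge` to combine two same-type accumulators. As a language-agnostic sketch of that contract — plain Python, not Spark's actual API, with the class name `VectorAccumulator` invented for illustration:

```python
class VectorAccumulator:
    """Illustrative analogue of Spark's AccumulatorV2 contract."""

    def __init__(self, size: int):
        self._sums = [0.0] * size

    def reset(self) -> None:
        """Zero out the accumulator state."""
        self._sums = [0.0] * len(self._sums)

    def add(self, vector: list) -> None:
        """Fold one value into the running state."""
        self._sums = [s + v for s, v in zip(self._sums, vector)]

    def merge(self, other: "VectorAccumulator") -> None:
        """Combine another same-type accumulator (e.g. from another task)."""
        self._sums = [s + o for s, o in zip(self._sums, other._sums)]

    @property
    def value(self) -> list:
        return list(self._sums)
```

In Spark itself, `merge` is what the driver calls to fold together the partial accumulators produced by each task, which is why it must accept another accumulator of the same type rather than a raw value.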
@@ -1620,4 +1620,4 @@ For help on deploying, the [cluster mode overview](cluster-overview.html) descri
 in distributed operation and supported cluster managers.

 Finally, full API documentation is available in
-[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/).
+[Scala](api/scala/org/apache/spark/), [Java](api/java/), [Python](api/python/) and [R](api/R/).

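Every Scala link change in this diff follows one mechanical rule: a pre-3.0 fragment link of the form `api/scala/index.html#fully.qualified.Name` becomes a path-style URL `api/scala/fully/qualified/Name.html`, while package links (`...package`) map to a directory listing. A hypothetical helper sketching that rewrite — not the actual command used for this PR, which per the description was found via `git grep`:

```python
import re

def scaladoc_link(old: str) -> str:
    """Convert a pre-3.0 Scaladoc fragment link to the path-style
    layout used by Scaladoc in Spark 3.0, e.g.
    'api/scala/index.html#org.apache.spark.rdd.RDD'
      -> 'api/scala/org/apache/spark/rdd/RDD.html'.
    Member anchors (e.g. 'SQLContext@read') are not handled here.
    """
    m = re.fullmatch(r"api/scala/(?:index\.html)?#([\w.$]+)", old)
    if not m:
        return old  # not an old-style Scala link; leave untouched
    parts = m.group(1).split(".")
    if parts[-1] == "package":  # package object -> directory listing
        return "api/scala/" + "/".join(parts[:-1]) + "/"
    return "api/scala/" + "/".join(parts[:-1]) + "/" + parts[-1] + ".html"
```

Object links keep their Scaladoc `$` suffix (`functions$` becomes `functions$.html`), which matches the replacements visible in the hunks below.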
@@ -118,4 +118,4 @@ To load all files recursively, you can use:
 <div data-lang="r" markdown="1">
 {% include_example recursive_file_lookup r/RSparkSQLExample.R %}
 </div>
 </div>

@@ -23,7 +23,7 @@ license: |
 {:toc}

 Spark SQL also includes a data source that can read data from other databases using JDBC. This
-functionality should be preferred over using [JdbcRDD](api/scala/index.html#org.apache.spark.rdd.JdbcRDD).
+functionality should be preferred over using [JdbcRDD](api/scala/org/apache/spark/rdd/JdbcRDD.html).
 This is because the results are returned
 as a DataFrame and they can easily be processed in Spark SQL or joined with other data sources.
 The JDBC data source is also easier to use from Java or Python as it does not require the user to

@@ -93,4 +93,4 @@ SELECT * FROM jsonTable

 </div>

 </div>

@@ -27,7 +27,7 @@ license: |
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-The entry point into all functionality in Spark is the [`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:
+The entry point into all functionality in Spark is the [`SparkSession`](api/scala/org/apache/spark/sql/SparkSession.html) class. To create a basic `SparkSession`, just use `SparkSession.builder()`:

 {% include_example init_session scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}
 </div>
@@ -104,7 +104,7 @@ As an example, the following creates a DataFrame based on the content of a JSON

 ## Untyped Dataset Operations (aka DataFrame Operations)

-DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/index.html#org.apache.spark.sql.Dataset), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html).
+DataFrames provide a domain-specific language for structured data manipulation in [Scala](api/scala/org/apache/spark/sql/Dataset.html), [Java](api/java/index.html?org/apache/spark/sql/Dataset.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame) and [R](api/R/SparkDataFrame.html).

 As mentioned above, in Spark 2.0, DataFrames are just Dataset of `Row`s in Scala and Java API. These operations are also referred as "untyped transformations" in contrast to "typed transformations" come with strongly typed Scala/Java Datasets.

@@ -114,9 +114,9 @@ Here we include some basic examples of structured data processing using Datasets
 <div data-lang="scala" markdown="1">
 {% include_example untyped_ops scala/org/apache/spark/examples/sql/SparkSQLExample.scala %}

-For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/index.html#org.apache.spark.sql.Dataset).
+For a complete list of the types of operations that can be performed on a Dataset, refer to the [API Documentation](api/scala/org/apache/spark/sql/Dataset.html).

-In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/index.html#org.apache.spark.sql.functions$).
+In addition to simple column references and expressions, Datasets also have a rich library of functions including string manipulation, date arithmetic, common math operations and more. The complete list is available in the [DataFrame Function Reference](api/scala/org/apache/spark/sql/functions$.html).
 </div>

 <div data-lang="java" markdown="1">
@@ -222,7 +222,7 @@ SELECT * FROM global_temp.temp_view
 ## Creating Datasets

 Datasets are similar to RDDs, however, instead of using Java serialization or Kryo they use
-a specialized [Encoder](api/scala/index.html#org.apache.spark.sql.Encoder) to serialize the objects
+a specialized [Encoder](api/scala/org/apache/spark/sql/Encoder.html) to serialize the objects
 for processing or transmitting over the network. While both encoders and standard serialization are
 responsible for turning an object into bytes, encoders are code generated dynamically and use a format
 that allows Spark to perform many operations like filtering, sorting and hashing without deserializing
@@ -351,16 +351,16 @@ For example:

 ## Aggregations

-The [built-in DataFrames functions](api/scala/index.html#org.apache.spark.sql.functions$) provide common
+The [built-in DataFrames functions](api/scala/org/apache/spark/sql/functions$.html) provide common
 aggregations such as `count()`, `countDistinct()`, `avg()`, `max()`, `min()`, etc.
 While those functions are designed for DataFrames, Spark SQL also has type-safe versions for some of them in
-[Scala](api/scala/index.html#org.apache.spark.sql.expressions.scalalang.typed$) and
+[Scala](api/scala/org/apache/spark/sql/expressions/scalalang/typed$.html) and
 [Java](api/java/org/apache/spark/sql/expressions/javalang/typed.html) to work with strongly typed Datasets.
 Moreover, users are not limited to the predefined aggregate functions and can create their own.

 ### Type-Safe User-Defined Aggregate Functions

-User-defined aggregations for strongly typed Datasets revolve around the [Aggregator](api/scala/index.html#org.apache.spark.sql.expressions.Aggregator) abstract class.
+User-defined aggregations for strongly typed Datasets revolve around the [Aggregator](api/scala/org/apache/spark/sql/expressions/Aggregator.html) abstract class.
 For example, a type-safe user-defined average can look like:

 <div class="codetabs">

@@ -737,11 +737,11 @@ and writing data out (`DataFrame.write`),
 and deprecated the old APIs (e.g., `SQLContext.parquetFile`, `SQLContext.jsonFile`).

 See the API docs for `SQLContext.read` (
-<a href="api/scala/index.html#org.apache.spark.sql.SQLContext@read:DataFrameReader">Scala</a>,
+<a href="api/scala/org/apache/spark/sql/SQLContext.html#read:DataFrameReader">Scala</a>,
 <a href="api/java/org/apache/spark/sql/SQLContext.html#read()">Java</a>,
 <a href="api/python/pyspark.sql.html#pyspark.sql.SQLContext.read">Python</a>
 ) and `DataFrame.write` (
-<a href="api/scala/index.html#org.apache.spark.sql.DataFrame@write:DataFrameWriter">Scala</a>,
+<a href="api/scala/org/apache/spark/sql/DataFrame.html#write:DataFrameWriter">Scala</a>,
 <a href="api/java/org/apache/spark/sql/Dataset.html#write()">Java</a>,
 <a href="api/python/pyspark.sql.html#pyspark.sql.DataFrame.write">Python</a>
 ) more information.

@@ -61,7 +61,7 @@ In Scala and Java, a DataFrame is represented by a Dataset of `Row`s.
 In [the Scala API][scala-datasets], `DataFrame` is simply a type alias of `Dataset[Row]`.
 While, in [Java API][java-datasets], users need to use `Dataset<Row>` to represent a `DataFrame`.

-[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
+[scala-datasets]: api/scala/org/apache/spark/sql/Dataset.html
 [java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html

 Throughout this document, we will often refer to Scala/Java Datasets of `Row`s as DataFrames.

@@ -106,4 +106,4 @@ ANALYZE TABLE table_identifier [ partition_spec ]
 max_col_len 13
 histogram NULL

 {% endhighlight %}

@@ -51,4 +51,4 @@ REFRESH "hdfs://path/to/table";
 - [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html)
 - [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html)
 - [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html)
 - [REFRESH TABLE](sql-ref-syntax-aux-refresh-table.html)

@@ -55,4 +55,4 @@ REFRESH TABLE tempDB.view1;
 ### Related Statements
 - [CACHE TABLE](sql-ref-syntax-aux-cache-cache-table.html)
 - [CLEAR CACHE](sql-ref-syntax-aux-cache-clear-cache.html)
 - [UNCACHE TABLE](sql-ref-syntax-aux-cache-uncache-table.html)

@@ -22,4 +22,4 @@ license: |
 * [ADD FILE](sql-ref-syntax-aux-resource-mgmt-add-file.html)
 * [ADD JAR](sql-ref-syntax-aux-resource-mgmt-add-jar.html)
 * [LIST FILE](sql-ref-syntax-aux-resource-mgmt-list-file.html)
 * [LIST JAR](sql-ref-syntax-aux-resource-mgmt-list-jar.html)

@@ -104,4 +104,4 @@ SHOW TABLES LIKE 'sam*|suj';
 - [CREATE TABLE](sql-ref-syntax-ddl-create-table.html)
 - [DROP TABLE](sql-ref-syntax-ddl-drop-table.html)
 - [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
 - [DROP DATABASE](sql-ref-syntax-ddl-drop-database.html)

@@ -25,4 +25,4 @@ license: |
 * [SHOW TABLES](sql-ref-syntax-aux-show-tables.html)
 * [SHOW TBLPROPERTIES](sql-ref-syntax-aux-show-tblproperties.html)
 * [SHOW PARTITIONS](sql-ref-syntax-aux-show-partitions.html)
 * [SHOW CREATE TABLE](sql-ref-syntax-aux-show-create-table.html)

@@ -77,4 +77,4 @@ DROP DATABASE IF EXISTS inventory_db CASCADE;
 ### Related statements
 - [CREATE DATABASE](sql-ref-syntax-ddl-create-database.html)
 - [DESCRIBE DATABASE](sql-ref-syntax-aux-describe-database.html)
 - [SHOW DATABASES](sql-ref-syntax-aux-show-databases.html)

@@ -102,4 +102,4 @@ DROP TEMPORARY FUNCTION IF EXISTS test_avg;
 ### Related statements
 - [CREATE FUNCTION](sql-ref-syntax-ddl-create-function.html)
 - [DESCRIBE FUNCTION](sql-ref-syntax-aux-describe-function.html)
 - [SHOW FUNCTION](sql-ref-syntax-aux-show-functions.html)

@@ -84,4 +84,4 @@ INSERT OVERWRITE [ LOCAL ] DIRECTORY directory_path
 ### Related Statements
 * [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html)
 * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html)
 * [INSERT OVERWRITE DIRECTORY statement](sql-ref-syntax-dml-insert-overwrite-directory.html)

@@ -82,4 +82,4 @@ INSERT OVERWRITE DIRECTORY
 ### Related Statements
 * [INSERT INTO statement](sql-ref-syntax-dml-insert-into.html)
 * [INSERT OVERWRITE statement](sql-ref-syntax-dml-insert-overwrite-table.html)
 * [INSERT OVERWRITE DIRECTORY with Hive format statement](sql-ref-syntax-dml-insert-overwrite-directory-hive.html)

@@ -22,4 +22,4 @@ license: |
 Data Manipulation Statements are used to add, change, or delete data. Spark SQL supports the following Data Manipulation Statements:

 - [INSERT](sql-ref-syntax-dml-insert.html)
 - [LOAD](sql-ref-syntax-dml-load.html)

@@ -96,4 +96,4 @@ SELECT age, name FROM person CLUSTER BY age;
 - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
 - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
 - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)

@@ -91,4 +91,4 @@ SELECT age, name FROM person DISTRIBUTE BY age;
 - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
 - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
 - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)

@@ -184,4 +184,4 @@ SELECT /*+ REPARTITION(zip_code) */ name, age, zip_code FROM person
 - [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
 - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
 - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
 - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)

@@ -143,4 +143,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression [ , ... ] }
 - [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
 - [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
 - [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
 - [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)

@@ -28,7 +28,7 @@ in Scala or Java.
 ## Implementing a Custom Receiver

 This starts with implementing a **Receiver**
-([Scala doc](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver),
+([Scala doc](api/scala/org/apache/spark/streaming/receiver/Receiver.html),
 [Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)).
 A custom receiver must extend this abstract class by implementing two methods

@@ -23,4 +23,4 @@ replicated commit log service. Please read the [Kafka documentation](https://ka
 thoroughly before starting an integration using Spark.

 At the moment, Spark requires Kafka 0.10 and higher. See
 <a href="streaming-kafka-0-10-integration.html">Kafka 0.10 integration documentation</a> for details.

@@ -59,7 +59,7 @@ A Kinesis stream can be set up at one of the valid Kinesis endpoints with 1 or m
 .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
 .build()

-See the [API docs](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisInputDStream)
+See the [API docs](api/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.html)
 and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/external/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala). Refer to the [Running the Example](#running-the-example) subsection for instructions on how to run the example.

 </div>

@@ -57,7 +57,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or
 which represents a continuous stream of data. DStreams can be created either from input data
 streams from sources such as Kafka, and Kinesis, or by applying high-level
 operations on other DStreams. Internally, a DStream is represented as a sequence of
-[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD).
+[RDDs](api/scala/org/apache/spark/rdd/RDD.html).

 This guide shows you how to start writing Spark Streaming programs with DStreams. You can
 write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2),
@@ -80,7 +80,7 @@ do is as follows.
 <div data-lang="scala" markdown="1" >
 First, we import the names of the Spark Streaming classes and some implicit
 conversions from StreamingContext into our environment in order to add useful methods to
-other classes we need (like DStream). [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the
+other classes we need (like DStream). [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) is the
 main entry point for all streaming functionality. We create a local StreamingContext with two execution threads, and a batch interval of 1 second.

 {% highlight scala %}
@@ -185,7 +185,7 @@ JavaDStream<String> words = lines.flatMap(x -> Arrays.asList(x.split(" ")).itera
 generating multiple new records from each record in the source DStream. In this case,
 each line will be split into multiple words and the stream of words is represented as the
 `words` DStream. Note that we defined the transformation using a
-[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
+[FlatMapFunction](api/scala/org/apache/spark/api/java/function/FlatMapFunction.html) object.
 As we will discover along the way, there are a number of such convenience classes in the Java API
 that help defines DStream transformations.

@@ -201,9 +201,9 @@ wordCounts.print();
 {% endhighlight %}

 The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
-1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction)
+1)` pairs, using a [PairFunction](api/scala/org/apache/spark/api/java/function/PairFunction.html)
 object. Then, it is reduced to get the frequency of words in each batch of data,
-using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object.
+using a [Function2](api/scala/org/apache/spark/api/java/function/Function2.html) object.
 Finally, `wordCounts.print()` will print a few of the counts generated every second.

 Note that when these lines are executed, Spark Streaming only sets up the computation it
@@ -435,7 +435,7 @@ To initialize a Spark Streaming program, a **StreamingContext** object has to be
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-A [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object can be created from a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object.
+A [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) object can be created from a [SparkConf](api/scala/org/apache/spark/SparkConf.html) object.

 {% highlight scala %}
 import org.apache.spark._
@ -451,7 +451,7 @@ or a special __"local[\*]"__ string to run in local mode. In practice, when runn
|
||||||
you will not want to hardcode `master` in the program,
|
you will not want to hardcode `master` in the program,
|
||||||
but rather [launch the application with `spark-submit`](submitting-applications.html) and
|
but rather [launch the application with `spark-submit`](submitting-applications.html) and
|
||||||
receive it there. However, for local testing and unit tests, you can pass "local[\*]" to run Spark Streaming
|
receive it there. However, for local testing and unit tests, you can pass "local[\*]" to run Spark Streaming
|
||||||
in-process (detects the number of cores in the local system). Note that this internally creates a [SparkContext](api/scala/index.html#org.apache.spark.SparkContext) (starting point of all Spark functionality) which can be accessed as `ssc.sparkContext`.
|
in-process (detects the number of cores in the local system). Note that this internally creates a [SparkContext](api/scala/org/apache/spark/SparkContext.html) (starting point of all Spark functionality) which can be accessed as `ssc.sparkContext`.
|
||||||
|
|
||||||
The batch interval must be set based on the latency requirements of your application
|
The batch interval must be set based on the latency requirements of your application
|
||||||
and available cluster resources. See the [Performance Tuning](#setting-the-right-batch-interval)
|
and available cluster resources. See the [Performance Tuning](#setting-the-right-batch-interval)
|
||||||
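Every Scala-link fix in this commit follows the same mechanical rule: the pre-3.0 Scaladoc addressed an entity through a fragment on `index.html`, while the Scaladoc generated for Spark 3.0 serves one page per entity path. A hypothetical helper (not part of this PR) that captures the mapping:

```scala
// Hypothetical helper mirroring this commit's link rewrite.
// Old:  api/scala/index.html#org.apache.spark.streaming.StreamingContext
// New:  api/scala/org/apache/spark/streaming/StreamingContext.html
def rewriteScaladocLink(link: String): String = {
  val prefix = "api/scala/index.html#"
  if (!link.startsWith(prefix)) link   // leave non-Scaladoc links untouched
  else {
    val fqn = link.stripPrefix(prefix)
    val path =
      if (fqn.endsWith(".package"))    // a package object's page is the package's index
        fqn.stripSuffix(".package").replace('.', '/') + "/index"
      else
        fqn.replace('.', '/')          // objects keep their trailing `$` in the file name
    s"api/scala/$path.html"
  }
}
```

The package-object branch is what turns the home-page link `api/scala/index.html#org.apache.spark.package` into `api/scala/org/apache/spark/index.html`, as in the `docs/_layouts/global.html` change.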
@@ -584,7 +584,7 @@ Input DStreams are DStreams representing the stream of input data received from
 sources. In the [quick example](#a-quick-example), `lines` was an input DStream as it represented
 the stream of data received from the netcat server. Every input DStream
 (except file stream, discussed later in this section) is associated with a **Receiver**
-([Scala doc](api/scala/index.html#org.apache.spark.streaming.receiver.Receiver),
+([Scala doc](api/scala/org/apache/spark/streaming/receiver/Receiver.html),
 [Java doc](api/java/org/apache/spark/streaming/receiver/Receiver.html)) object which receives the
 data from a source and stores it in Spark's memory for processing.

@@ -739,7 +739,7 @@ DStreams can be created with data streams received through custom receivers. See
 For testing a Spark Streaming application with test data, one can also create a DStream based on a queue of RDDs, using `streamingContext.queueStream(queueOfRDDs)`. Each RDD pushed into the queue will be treated as a batch of data in the DStream, and processed like a stream.

 For more details on streams from sockets and files, see the API documentations of the relevant functions in
-[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
+[StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) for
 Scala, [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
 for Java, and [StreamingContext](api/python/pyspark.streaming.html#pyspark.streaming.StreamingContext) for Python.

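The queue-backed test stream described in the hunk above can be mimicked with plain Scala collections; `drainAsBatches` below is an illustrative stand-in (not a Spark API) showing how each queued element is consumed as one batch, in order.

```scala
import scala.collection.mutable

// Stand-in for `streamingContext.queueStream(queueOfRDDs)`: each element
// pushed into the queue plays the role of one RDD and is processed as one
// batch. Returns the number of batches handled.
def drainAsBatches[A](queue: mutable.Queue[Seq[A]], process: Seq[A] => Unit): Int = {
  var batches = 0
  while (queue.nonEmpty) {
    process(queue.dequeue())   // each dequeued element is handled as one batch
    batches += 1
  }
  batches
}
```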
@@ -1219,8 +1219,8 @@ joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
 In fact, you can also dynamically change the dataset you want to join against. The function provided to `transform` is evaluated every batch interval and therefore will use the current dataset that `dataset` reference points to.

 The complete list of DStream transformations is available in the API documentation. For the Scala API,
-see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
-and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
+see [DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
+and [PairDStreamFunctions](api/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.html).
 For the Java API, see [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html)
 and [JavaPairDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaPairDStream.html).
 For the Python API, see [DStream](api/python/pyspark.streaming.html#pyspark.streaming.DStream).
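The per-batch re-evaluation described in this hunk (the function given to `transform` sees whatever the `dataset` reference currently points to) can be sketched without Spark; the mutable `dataset` and `transformBatch` below are illustrative stand-ins, not Spark APIs.

```scala
// The "dataset" we join against; reassigning this var between batches
// changes what subsequent batches are joined with, as the guide describes.
var dataset: Map[String, Int] = Map("spam" -> 1)

// Evaluated afresh for every batch, so it always reads the current `dataset`.
def transformBatch(batch: Seq[String]): Seq[(String, Int)] =
  batch.flatMap(w => dataset.get(w).map(v => (w, v)))  // inner join of batch vs. dataset
```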
@@ -2067,7 +2067,7 @@ for prime time, the old one be can be brought down. Note that this can be done f
 sending the data to two destinations (i.e., the earlier and upgraded applications).

 - The existing application is shutdown gracefully (see
-[`StreamingContext.stop(...)`](api/scala/index.html#org.apache.spark.streaming.StreamingContext)
+[`StreamingContext.stop(...)`](api/scala/org/apache/spark/streaming/StreamingContext.html)
 or [`JavaStreamingContext.stop(...)`](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html)
 for graceful shutdown options) which ensure data that has been received is completely
 processed before shutdown. Then the
@@ -2104,7 +2104,7 @@ In that case, consider
 [reducing](#reducing-the-batch-processing-times) the batch processing time.

 The progress of a Spark Streaming program can also be monitored using the
-[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) interface,
+[StreamingListener](api/scala/org/apache/spark/streaming/scheduler/StreamingListener.html) interface,
 which allows you to get receiver status and processing times. Note that this is a developer API
 and it is likely to be improved upon (i.e., more information reported) in the future.

@@ -2197,7 +2197,7 @@ computation is not high enough. For example, for distributed reduce operations l
 and `reduceByKeyAndWindow`, the default number of parallel tasks is controlled by
 the `spark.default.parallelism` [configuration property](configuration.html#spark-properties). You
 can pass the level of parallelism as an argument (see
-[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+[`PairDStreamFunctions`](api/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.html)
 documentation), or set the `spark.default.parallelism`
 [configuration property](configuration.html#spark-properties) to change the default.

@@ -2205,9 +2205,9 @@ documentation), or set the `spark.default.parallelism`
 {:.no_toc}
 The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized.

-* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
+* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/org/apache/spark/storage/StorageLevel$.html). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.

-* **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of [StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$), persisted RDDs generated by streaming computations are persisted with [StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) (i.e. serialized) by default to minimize GC overheads.
+* **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as they would be processed multiple times. However, unlike the Spark Core default of [StorageLevel.MEMORY_ONLY](api/scala/org/apache/spark/storage/StorageLevel$.html), persisted RDDs generated by streaming computations are persisted with [StorageLevel.MEMORY_ONLY_SER](api/scala/org/apache/spark/storage/StorageLevel$.html) (i.e. serialized) by default to minimize GC overheads.

 In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the [Spark Tuning Guide](tuning.html#data-serialization) for more details. For Kryo, consider registering custom classes, and disabling object reference tracking (see Kryo-related configurations in the [Configuration Guide](configuration.html#compression-and-serialization)).

@@ -2247,7 +2247,7 @@ A good approach to figure out the right batch size for your application is to te
 conservative batch interval (say, 5-10 seconds) and a low data rate. To verify whether the system
 is able to keep up with the data rate, you can check the value of the end-to-end delay experienced
 by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the
-[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
+[StreamingListener](api/scala/org/apache/spark/streaming/scheduler/StreamingListener.html)
 interface).
 If the delay is maintained to be comparable to the batch size, then system is stable. Otherwise,
 if the delay is continuously increasing, it means that the system is unable to keep up and it
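The stability rule quoted in this hunk, that the system keeps up when the end-to-end delay stays comparable to the batch interval, can be expressed as a small check. The exact threshold below is an assumption for illustration, not a Spark-defined rule.

```scala
// Illustrative stability check for the tuning advice above: treat a
// workload as keeping up when every recent "Total delay" sample stays
// within the batch interval. The threshold choice is an assumption.
def keepsUp(totalDelaysMs: Seq[Long], batchIntervalMs: Long): Boolean =
  totalDelaysMs.nonEmpty && totalDelaysMs.forall(_ <= batchIntervalMs)
```

In practice the delay samples would come from driver logs or a `StreamingListener`, not from a hand-built sequence.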
@@ -2485,10 +2485,10 @@ additional effort may be necessary to achieve exactly-once semantics. There are
 * Third-party DStream data sources can be found in [Third Party Projects](https://spark.apache.org/third-party-projects.html)
 * API documentation
 - Scala docs
-* [StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) and
-[DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
-* [KafkaUtils](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
-[KinesisUtils](api/scala/index.html#org.apache.spark.streaming.kinesis.KinesisInputDStream),
+* [StreamingContext](api/scala/org/apache/spark/streaming/StreamingContext.html) and
+[DStream](api/scala/org/apache/spark/streaming/dstream/DStream.html)
+* [KafkaUtils](api/scala/org/apache/spark/streaming/kafka/KafkaUtils$.html),
+[KinesisUtils](api/scala/org/apache/spark/streaming/kinesis/KinesisInputDStream.html),
 - Java docs
 * [JavaStreamingContext](api/java/index.html?org/apache/spark/streaming/api/java/JavaStreamingContext.html),
 [JavaDStream](api/java/index.html?org/apache/spark/streaming/api/java/JavaDStream.html) and

@@ -498,13 +498,13 @@ to track the read position in the stream. The engine uses checkpointing and writ

 # API using Datasets and DataFrames
 Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point `SparkSession`
-([Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs)
+([Scala](api/scala/org/apache/spark/sql/SparkSession.html)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs)
 to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the
 [DataFrame/Dataset Programming Guide](sql-programming-guide.html).

 ## Creating streaming DataFrames and streaming Datasets
 Streaming DataFrames can be created through the `DataStreamReader` interface
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs)
+([Scala](api/scala/org/apache/spark/sql/streaming/DataStreamReader.html)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs)
 returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with the `read.stream()` method. Similar to the read interface for creating static DataFrame, you can specify the details of the source – data format, schema, options, etc.

 #### Input Sources
@@ -557,7 +557,7 @@ Here are the details of all the sources in Spark.
 NOTE 3: Both delete and move actions are best effort. Failing to delete or move files will not fail the streaming query. Spark may not clean up some source files in some circumstances - e.g. the application doesn't shut down gracefully, too many files are queued to clean up.
 <br/><br/>
 For file-format-specific options, see the related methods in <code>DataStreamReader</code>
-(<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a
+(<a href="api/scala/org/apache/spark/sql/streaming/DataStreamReader.html">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a
 href="api/R/read.stream.html">R</a>).
 E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
 <br/><br/>
@@ -1622,7 +1622,7 @@ However, as a side effect, data from the slower streams will be aggressively dro
 this configuration judiciously.

 ### Arbitrary Stateful Operations
-Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).
+Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/org/apache/spark/sql/streaming/GroupState.html)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).

 Though Spark cannot check and force it, the state function should be implemented with respect to the semantics of the output mode. For example, in Update mode Spark doesn't expect that the state function will emit rows which are older than current watermark plus allowed late record delay, whereas in Append mode the state function can emit these rows.

@@ -1679,7 +1679,7 @@ end-to-end exactly once per query. Ensuring end-to-end exactly once for the last

 ## Starting Streaming Queries
 Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
+([Scala](api/scala/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
 returned through `Dataset.writeStream()`. You will have to specify one or more of the following in this interface.

 - *Details of the output sink:* Data format, location, etc.
@@ -1863,7 +1863,7 @@ Here are the details of all the sinks in Spark.
 <code>path</code>: path to the output directory, must be specified.
 <br/><br/>
 For file-format-specific options, see the related methods in DataFrameWriter
-(<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
+(<a href="api/scala/org/apache/spark/sql/DataFrameWriter.html">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
 href="api/R/write.stream.html">R</a>).
 E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
 </td>
@@ -2175,7 +2175,7 @@ Since Spark 2.4, `foreach` is available in Scala, Java and Python.
 <div class="codetabs">
 <div data-lang="scala" markdown="1">

-In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/index.html#org.apache.spark.sql.ForeachWriter)).
+In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/org/apache/spark/sql/ForeachWriter.html)).

 {% highlight scala %}
 streamingDatasetOfString.writeStream.foreach(
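The `ForeachWriter` contract referenced in this hunk (open a partition, process each of its rows, then close) can be sketched with a local trait. `SimpleForeachWriter` below is a simplified stand-in, not Spark's actual class: the real `open` also takes an epoch id and the real `close` receives an error argument.

```scala
// Simplified local sketch of the ForeachWriter lifecycle; this trait is
// an illustration, not the Spark API.
trait SimpleForeachWriter[T] {
  def open(partitionId: Long): Boolean  // return false to skip the partition
  def process(value: T): Unit           // called once per row
  def close(): Unit                     // called once, even if processing fails
}

// Drives one partition's rows through the open/process/close lifecycle.
def writePartition[T](w: SimpleForeachWriter[T], partitionId: Long, rows: Seq[T]): Unit =
  if (w.open(partitionId)) {
    try rows.foreach(w.process)
    finally w.close()
  }
```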
@@ -2564,7 +2564,7 @@ lastProgress(query) # the most recent progress update of this streaming qu
 </div>

 You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources. You can use `sparkSession.streams()` to get the `StreamingQueryManager`
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryManager)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager) docs)
+([Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager) docs)
 that can be used to manage the currently active queries.

 <div class="codetabs">
@@ -2624,7 +2624,7 @@ There are multiple ways to monitor active streaming queries. You can either push
 You can directly get the current status and metrics of an active query using
 `streamingQuery.lastProgress()` and `streamingQuery.status()`.
 `lastProgress()` returns a `StreamingQueryProgress` object
-in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryProgress)
+in [Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
 and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
 and a dictionary with the same fields in Python. It has all the information about
 the progress made in the last trigger of the stream - what data was processed,
@@ -2632,7 +2632,7 @@ what were the processing rates, latencies, etc. There is also
 `streamingQuery.recentProgress` which returns an array of last few progresses.

 In addition, `streamingQuery.status()` returns a `StreamingQueryStatus` object
-in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus)
+in [Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
 and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
 and a dictionary with the same fields in Python. It gives information about
 what the query is immediately doing - is a trigger active, is data being processed, etc.
@@ -2853,7 +2853,7 @@ Will print something like the following.

 You can also asynchronously monitor all queries associated with a
 `SparkSession` by attaching a `StreamingQueryListener`
-([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryListener)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html) docs).
+([Scala](api/scala/org/apache/spark/sql/streaming/StreamingQueryListener.html)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html) docs).
 Once you attach your custom `StreamingQueryListener` object with
 `sparkSession.streams.attachListener()`, you will get callbacks when a query is started and
 stopped and when there is progress made in an active query. Here is an example,

@@ -260,7 +260,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
 (though you can control it through optional parameters to `SparkContext.textFile`, etc), and for
 distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses the largest
 parent RDD's number of partitions. You can pass the level of parallelism as a second argument
-(see the [`spark.PairRDDFunctions`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation),
+(see the [`spark.PairRDDFunctions`](api/scala/org/apache/spark/rdd/PairRDDFunctions.html) documentation),
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
