[SPARK-9671] [MLLIB] re-org user guide and add migration guide
This PR updates the MLlib user guide and adds a migration guide for 1.4->1.5.

* merge the migration guides for the `spark.mllib` and `spark.ml` packages
* remove the dependency section from the `spark.ml` guide
* move the paragraph about `spark.mllib` and `spark.ml` to the top and recommend `spark.ml`
* move Sam's talk to a footnote to keep the section focused on dependencies

Minor changes to code examples and other wording will be in a separate PR.

jkbradley srowen feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #8498 from mengxr/SPARK-9671.
parent 45723214e6
commit 88032ecaf0
docs/ml-guide.md
@@ -21,19 +21,11 @@ title: Spark ML Programming Guide
\]
Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.

*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.

Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.

See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of `spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
The `spark.ml` package aims to provide a uniform set of high-level APIs built on top of
[DataFrames](sql-programming-guide.html#dataframes) that help users create and tune practical
machine learning pipelines.
See the [Algorithm Guides section](#algorithm-guides) below for guides on sub-packages of
`spark.ml`, including feature transformers unique to the Pipelines API, ensembles, and more.
**Table of Contents**
@@ -171,7 +163,7 @@ This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`.

# Algorithm Guides
There are now several algorithms in the Pipelines API which are not in the lower-level MLlib API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.

There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in Pipelines.
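To make the `Transformer`/`Estimator` distinction concrete, here is a minimal Scala sketch (an editorial illustration, not part of the diff); it assumes a `training` DataFrame with `text` and `label` columns:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.DataFrame

// Transformers map one DataFrame to another: raw text -> tokens -> feature vectors.
// An Estimator (LogisticRegression) is fit on a DataFrame and yields a Transformer.
def fitTextPipeline(training: DataFrame): PipelineModel = {
  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression().setMaxIter(10)
  // A Pipeline chains stages; it is itself an Estimator, so fit() returns a model.
  new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
}
```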
**Pipelines API Algorithm Guides**
@@ -880,35 +872,3 @@ jsc.stop();
</div>

</div>

# Dependencies
Spark ML currently depends on MLlib and has the same dependencies.
Please see the [MLlib Dependencies guide](mllib-guide.html#dependencies) for more info.

Spark ML also depends upon Spark SQL, but the relevant parts of Spark SQL do not bring additional dependencies.
# Migration Guide

## From 1.3 to 1.4
Several major API changes occurred, including:

* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes

Since the `spark.ml` API was an Alpha Component in Spark 1.3, we do not list all changes here.
However, now that `spark.ml` is no longer an Alpha Component, we will provide details on any API changes for future releases.
## From 1.2 to 1.3
The main API changes are from Spark SQL. We list the most important changes here:

* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._`.
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
Other changes were in `LogisticRegression`:

* The `scoreCol` output column (with default value "score") was renamed to `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future).
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
docs/mllib-guide.md
@@ -5,21 +5,28 @@ displayTitle: Machine Learning Library (MLlib) Guide
description: MLlib machine learning library overview for Spark SPARK_VERSION_SHORT
---
MLlib is Spark's scalable machine learning library consisting of common learning algorithms and utilities,
including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization primitives.
Guides for individual algorithms are listed below.
MLlib is Spark's machine learning (ML) library.
Its goal is to make practical machine learning scalable and easy.
It consists of common learning algorithms and utilities, including classification, regression,
clustering, collaborative filtering, dimensionality reduction, as well as lower-level optimization
primitives and higher-level pipeline APIs.
The API is divided into two parts:

It is divided into two packages:
* [The original `spark.mllib` API](mllib-guide.html#mllib-types-algorithms-and-utilities) is the primary API.
* [The "Pipelines" `spark.ml` API](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) is a higher-level API for constructing ML workflows.

* [`spark.mllib`](mllib-guide.html#mllib-types-algorithms-and-utilities) contains the original API
  built on top of RDDs.
* [`spark.ml`](mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) provides a higher-level API
  built on top of DataFrames for constructing ML pipelines.
Using `spark.ml` is recommended because with DataFrames the API is more versatile and flexible.
But we will keep supporting `spark.mllib` along with the development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.ml` if they fit the ML pipeline concept well,
e.g., feature extractors and transformers.
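For a side-by-side feel of the two packages, consider this editorial sketch (not part of the diff); it assumes data already in `LabeledPoint` RDD form for `spark.mllib`, and a DataFrame with `label`/`features` columns for `spark.ml`:

```scala
import org.apache.spark.ml.classification.{LogisticRegression => MLLogisticRegression}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// spark.mllib: the RDD-based API.
def trainWithMllib(data: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS().setNumClasses(2).run(data)

// spark.ml: the DataFrame-based API; expects "label" and "features" columns.
def trainWithMl(data: DataFrame) =
  new MLLogisticRegression().setMaxIter(100).fit(data)
```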
We list major functionality from both below, with links to detailed guides.
# MLlib types, algorithms and utilities

This lists functionality included in `spark.mllib`, the main MLlib API.

# spark.mllib: data types, algorithms, and utilities
* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)

@@ -56,71 +63,63 @@ This lists functionality included in `spark.mllib`, the main MLlib API.
* [limited-memory BFGS (L-BFGS)](mllib-optimization.html#limited-memory-bfgs-l-bfgs)
* [PMML model export](mllib-pmml-model-export.html)
# spark.ml: high-level APIs for ML pipelines

**[spark.ml programming guide](ml-guide.html)** provides an overview of the Pipelines API and major
concepts. It also contains sections on using algorithms within the Pipelines API, for example:
* [Feature Extraction, Transformation, and Selection](ml-features.html)
* [Decision Trees for Classification and Regression](ml-decision-tree.html)
* [Ensembles](ml-ensembles.html)
* [Linear methods with elastic net regularization](ml-linear-methods.html)
* [Multilayer perceptron classifier](ml-ann.html)
# Dependencies
MLlib uses the linear algebra package [Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised numerical processing.
If native libraries[^1] are not available at runtime, you will see a warning message and a pure JVM
implementation will be used instead.
Due to licensing issues with runtime proprietary binaries, we do not include `netlib-java`'s native
proxies by default.
To configure `netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with `-Pnetlib-lgpl`) as a dependency of your
project and read the [netlib-java](https://github.com/fommil/netlib-java) documentation for your
platform's additional installation instructions.

To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4 or newer.
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
  watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
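As an editorial illustration (not in the diff), one way to declare that dependency in an sbt build; the `pomOnly()` form is what the netlib-java docs suggest, since `all` is a POM-packaged artifact:

```scala
// build.sbt: pull in netlib-java's native system proxies (LGPL) so Breeze
// can use optimised BLAS/LAPACK instead of the pure-JVM fallback.
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
```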
# Migration guide
MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.
# spark.ml: high-level APIs for ML pipelines

## From 1.4 to 1.5
Spark 1.2 introduced a new package called `spark.ml`, which aims to provide a uniform set of
high-level APIs that help users create and tune practical machine learning pipelines.
In the `spark.mllib` package, there are no breaking API changes but several behavior changes:
*Graduated from Alpha!* The Pipelines API is no longer an alpha component, although many elements of it are still `Experimental` or `DeveloperApi`.
* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
  `RegressionMetrics.explainedVariance` returns the average regression sum of squares.
* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` are now
  sorted.
* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
  convergence tolerance of `1e-3`, and hence iterations might end earlier than in 1.4; see the sketch below.
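For example, a user who wants the 1.4 behavior of always running the full number of iterations could disable the tolerance; a hedged sketch, assuming the SGD-based API:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// With SPARK-3382, GradientDescent stops once the solution moves less than
// convergenceTol between iterations (default 1e-3). Setting it to 0.0
// effectively restores the 1.4 behavior of running all numIterations.
def trainLike14(data: RDD[LabeledPoint]) = {
  val algo = new LogisticRegressionWithSGD()
  algo.optimizer
    .setNumIterations(100)
    .setConvergenceTol(0.0)
  algo.run(data)
}
```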
Note that we will keep supporting and adding features to `spark.mllib` along with the
development of `spark.ml`.
Users should be comfortable using `spark.mllib` features and expect more features coming.
Developers should contribute new algorithms to `spark.mllib` and can optionally contribute
to `spark.ml`.
In the `spark.ml` package, there exists one breaking API change and one behavior change:
Guides for `spark.ml` include:
* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
  from `Params.setDefault` due to a
  [Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
  added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4; see the sketch below.
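A minimal sketch of how callers can now rank models without the old sign-flipping trick (an editorial example, not from the diff):

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator

// RegressionEvaluator reports raw RMSE in 1.5 (no sign flip), and
// isLargerBetter tells callers how to rank models by that metric.
val evaluator = new RegressionEvaluator().setMetricName("rmse")

def pickBetter(metricA: Double, metricB: Double): Double =
  if (evaluator.isLargerBetter) math.max(metricA, metricB)
  else math.min(metricA, metricB)  // RMSE: smaller is better
```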
* **[spark.ml programming guide](ml-guide.html)**: overview of the Pipelines API and major concepts
* Guides on using algorithms within the Pipelines API:
  * [Feature transformers](ml-features.html), including a few not in the lower-level `spark.mllib` API
  * [Decision trees](ml-decision-tree.html)
  * [Ensembles](ml-ensembles.html)
  * [Linear methods](ml-linear-methods.html)
# Dependencies
MLlib uses the linear algebra package
[Breeze](http://www.scalanlp.org/), which depends on
[netlib-java](https://github.com/fommil/netlib-java) for optimised
numerical processing. If native libraries are not available at runtime, you
will see a warning message and a pure JVM implementation will be used
instead.
To learn more about the benefits and background of system optimised
natives, you may wish to watch Sam Halliday's ScalaX talk on
[High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).
Due to licensing issues with runtime proprietary binaries, we do not
include `netlib-java`'s native proxies by default. To configure
`netlib-java` / Breeze to use system optimised binaries, include
`com.github.fommil.netlib:all:1.1.2` (or build Spark with
`-Pnetlib-lgpl`) as a dependency of your project and read the
[netlib-java](https://github.com/fommil/netlib-java) documentation for
your platform's additional installation instructions.
To use MLlib in Python, you will need [NumPy](http://www.numpy.org)
version 1.4 or newer.
---
# Migration Guide
For the `spark.ml` package, please see the [spark.ml Migration Guide](ml-guide.html#migration-guide).
## From 1.3 to 1.4
In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
* Gradient-Boosted Trees
  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm.
## Previous Spark Versions

## Previous Spark versions
Earlier migration guides are archived [on this page](mllib-migration-guides.html).
---

docs/mllib-migration-guides.md
@@ -7,6 +7,25 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT
The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).
## From 1.3 to 1.4
In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs:
* Gradient-Boosted Trees
  * *(Breaking change)* The signature of the [`Loss.gradient`](api/scala/index.html#org.apache.spark.mllib.tree.loss.Loss) method was changed. This is only an issue for users who wrote their own losses for GBTs.
  * *(Breaking change)* The `apply` and `copy` methods for the case class [`BoostingStrategy`](api/scala/index.html#org.apache.spark.mllib.tree.configuration.BoostingStrategy) have been changed because of a modification to the case class fields. This could be an issue for users who use `BoostingStrategy` to set GBT parameters.
* *(Breaking change)* The return value of [`LDA.run`](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) has changed. It now returns an abstract class `LDAModel` instead of the concrete class `DistributedLDAModel`. The object of type `LDAModel` can still be cast to the appropriate concrete type, which depends on the optimization algorithm; see the sketch below.
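A hedged sketch of the cast (an editorial example; the pattern-match form is equivalent to an `asInstanceOf` cast but safer):

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// LDA.run now returns the abstract LDAModel; members specific to
// DistributedLDAModel require a cast (or a pattern match, as here).
def logLikelihoodOf(corpus: RDD[(Long, Vector)]): Option[Double] = {
  val model: LDAModel = new LDA().setK(10).run(corpus)
  model match {
    case distributed: DistributedLDAModel => Some(distributed.logLikelihood)
    case _ => None  // e.g. a LocalLDAModel from the online optimizer
  }
}
```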
In the `spark.ml` package, several major API changes occurred, including (see the sketch below):
* `Param` and other APIs for specifying parameters
* `uid` unique IDs for Pipeline components
* Reorganization of certain classes
Since the `spark.ml` API was an alpha component in Spark 1.3, we do not list all changes here.
However, since 1.4 `spark.ml` is no longer an alpha component, we will provide details on any API
changes for future releases.
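A minimal illustration of the `Param` and `uid` concepts (an editorial sketch, not from the diff):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()  // carries an auto-generated unique uid
lr.setMaxIter(10).setRegParam(0.01)         // setters on the component itself
val overrides = ParamMap(lr.maxIter -> 20)  // or an explicit ParamMap,
                                            // e.g. for lr.fit(df, overrides)
println(lr.uid)              // something like "logreg_..."
println(lr.explainParams())  // lists every Param with doc and current value
```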
## From 1.2 to 1.3
In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.

@@ -23,6 +42,17 @@ In the `spark.mllib` package, there were several breaking changes. The first change (in `ALS`) is the only one in a component not marked as Alpha or Experimental.
* In linear regression (including Lasso and ridge regression), the squared loss is now divided by 2.
  So in order to produce the same result as in 1.2, the regularization parameter needs to be divided by 2 and the step size needs to be multiplied by 2; see the sketch below.
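A hedged sketch of that adjustment (editorial example; ridge regression chosen arbitrarily, and `regParam12`/`stepSize12` stand for values tuned against 1.2):

```scala
import org.apache.spark.mllib.regression.{LabeledPoint, RidgeRegressionWithSGD}
import org.apache.spark.rdd.RDD

// The squared loss is now divided by 2, so to reproduce a 1.2 result:
// halve the regularization parameter and double the step size.
def trainLike12(data: RDD[LabeledPoint],
                regParam12: Double, stepSize12: Double) = {
  val algo = new RidgeRegressionWithSGD()
  algo.optimizer
    .setRegParam(regParam12 / 2)
    .setStepSize(stepSize12 * 2)
  algo.run(data)
}
```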
In the `spark.ml` package, the main API changes are from Spark SQL. We list the most important changes here:
* The old [SchemaRDD](http://spark.apache.org/docs/1.2.1/api/scala/index.html#org.apache.spark.sql.SchemaRDD) has been replaced with [DataFrame](api/scala/index.html#org.apache.spark.sql.DataFrame) with a somewhat modified API. All algorithms in Spark ML which used to use SchemaRDD now use DataFrame.
* In Spark 1.2, we used implicit conversions from `RDD`s of `LabeledPoint` into `SchemaRDD`s by calling `import sqlContext._` where `sqlContext` was an instance of `SQLContext`. These implicits have been moved, so we now call `import sqlContext.implicits._` (see the sketch below).
* Java APIs for SQL have also changed accordingly. Please see the examples above and the [Spark SQL Programming Guide](sql-programming-guide.html) for details.
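A minimal sketch of the new import (an editorial example):

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SQLContext}

def toTrainingFrame(sqlContext: SQLContext,
                    data: RDD[LabeledPoint]): DataFrame = {
  // Spark 1.2 wrote `import sqlContext._`; the implicits now live in a
  // nested object, and the conversion is requested explicitly via toDF().
  import sqlContext.implicits._
  data.toDF()
}
```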
Other changes were in `LogisticRegression`:
* The `scoreCol` output column (with default value "score") was renamed to `probabilityCol` (with default value "probability"). The type was originally `Double` (for the probability of class 1.0), but it is now `Vector` (for the probability of each class, to support multiclass classification in the future); see the sketch below.
* In Spark 1.2, `LogisticRegressionModel` did not include an intercept. In Spark 1.3, it includes an intercept; however, it will always be 0.0 since it uses the default settings for [spark.mllib.LogisticRegressionWithLBFGS](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS). The option to use an intercept will be added in the future.
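A hedged sketch of reading the new column (editorial example; `predictions` stands for the output of `model.transform(...)`):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.DataFrame

// "probability" is now a Vector with one entry per class, so the value the
// old Double-typed "score" column held is element 1 (class 1.0).
def positiveClassProbabilities(predictions: DataFrame): Unit =
  predictions.select("probability").take(5).foreach { row =>
    val p = row.getAs[Vector](0)
    println(s"P(label = 1.0) = ${p(1)}")
  }
```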
## From 1.1 to 1.2
The only API changes in MLlib v1.2 are in