[SPARK-20506][DOCS] 2.2 migration guide
Update ML guide for migration `2.1` -> `2.2` and the previous version migration guide section.

## How was this patch tested?

Build doc locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17996 from MLnick/SPARK-20506-2.2-migration-guide.
parent 7f203a248f
commit b5d8d9ba17
@ -26,7 +26,7 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
* The RDD-based API is expected to be removed in Spark 3.0.

*Why is MLlib switching to the DataFrame-based API?*

@ -66,41 +66,59 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

# Highlights in 2.2

The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
release of Spark:

* `ALS` methods for _top-k_ recommendations for all users or items, matching the functionality
in `mllib` ([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)). Performance
was also improved for both `ml` and `mllib`
([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
[SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
* `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
[SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
* `FPGrowth` algorithm for frequent pattern mining
([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
* `GLM` now supports the full `Tweedie` family
([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
* `Imputer` feature transformer to impute missing values in a dataset
([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
* `LinearSVC` for linear Support Vector Machine classification
([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
* Logistic regression now supports constraints on the coefficients during training
([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))
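
To make the first highlight concrete, here is a minimal plain-Python sketch of what a _top-k_ recommendation for all users computes: each user factor vector is scored against every item factor vector, and the `k` best items are kept. This is not Spark code and not how `ALS` implements it; the function names, the dictionary layout, and the toy factor values are all assumptions made for illustration.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    return sum(a * b for a, b in zip(u, v))


def recommend_for_all_users(user_factors, item_factors, k):
    """Return {user_id: [(item_id, score), ...]} with the k best items per user.

    Hypothetical helper for illustration; scores are plain dot products.
    """
    recs = {}
    for user_id, u in user_factors.items():
        scored = ((item_id, dot(u, v)) for item_id, v in item_factors.items())
        # nlargest keeps only the k highest-scoring items, best first
        recs[user_id] = heapq.nlargest(k, scored, key=lambda pair: pair[1])
    return recs


# Toy latent factors (made-up values, rank 2)
user_factors = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_factors = {"a": [0.9, 0.1], "b": [0.2, 0.8], "c": [0.5, 0.5]}

top2 = recommend_for_all_users(user_factors, item_factors, k=2)
top2_items = {user: [item for item, _ in recs] for user, recs in top2.items()}
```

Scoring every user against every item is quadratic in the number of factors, which is presumably why the performance tickets listed in the same bullet matter for large models.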

# Migration guide

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 2.0 to 2.1
## From 2.1 to 2.2

### Breaking changes

**Deprecated methods removed**

* `setLabelCol` in `feature.ChiSqSelectorModel`
* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
* `validateParams` in `Evaluator`
There are no breaking changes.

### Deprecations and changes of behavior

**Deprecations**

* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`.
There are no deprecations.

**Changes of behavior**

* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fixed a bug in `ChiSqSelector`, which will likely change its results. `ChiSqSelector` now uses the p-value rather than the raw statistic to select a fixed number of top features.
* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
Default value of `regParam` changed from `1.0` to `0.1` for the `ALS.train` method (marked `DeveloperApi`).
**Note:** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
Fixed inconsistency between the Python and Scala APIs for the `Param.copy` method.
* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
would always be thrown regardless of the setting of the `handleInvalid` parameter.
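
The `StringIndexer` change (SPARK-11569) can be illustrated with a small plain-Python sketch. It is not Spark's implementation: `fit_index` and `transform` are hypothetical helpers, the frequency-then-alphabetical ordering and the choice to ignore nulls while fitting are assumptions, and only two `handleInvalid`-style modes (`"error"` and `"skip"`) are modelled. The point is that `None` takes the same code path as a label never seen during fitting.

```python
from collections import Counter


def fit_index(labels):
    """Map each non-null label to an index, most frequent label first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(label for label in labels if label is not None)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def transform(values, index, handle_invalid="error"):
    """Index values; nulls are treated exactly like unseen labels."""
    out = []
    for value in values:
        if value is not None and value in index:
            out.append(index[value])
        elif handle_invalid == "skip":
            continue  # drop invalid entries (null or unseen)
        elif handle_invalid == "error":
            raise ValueError(f"invalid label: {value!r}")
        else:
            raise ValueError(f"unknown handle_invalid mode: {handle_invalid!r}")
    return out


index = fit_index(["a", "b", "a", "c", "a", "b"])
skipped = transform(["a", None, "d", "c"], index, handle_invalid="skip")
```

With `handle_invalid="skip"` both the `None` and the unseen `"d"` are dropped; with the default `"error"` either one raises. Before the change described above, the null case would have raised in both modes.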

## Previous Spark versions

Earlier migration guides are archived [on this page](ml-migration-guides.html).

@ -7,6 +7,35 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT

The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).

## From 2.0 to 2.1

### Breaking changes

**Deprecated methods removed**

* `setLabelCol` in `feature.ChiSqSelectorModel`
* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
* `validateParams` in `Evaluator`

### Deprecations and changes of behavior

**Deprecations**

* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`.

**Changes of behavior**

* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fixed a bug in `ChiSqSelector`, which will likely change its results. `ChiSqSelector` now uses the p-value rather than the raw statistic to select a fixed number of top features.
* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.
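
Why ranking by p-value instead of by raw statistic can change the selected features (SPARK-17870) is easiest to see with numbers. The sketch below is plain Python, not Spark code: `chi2_sf` is a hand-written chi-square survival function for integer degrees of freedom, and the two features with their statistics and degrees of freedom are made-up values. It assumes, as is typical for chi-square tests on categorical features, that different features can have different degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with integer df >= 1.

    Uses the regularized upper incomplete gamma recurrence
    Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), with z = x / 2.
    """
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))  # base case: df = 1
    else:
        a, q = 1.0, math.exp(-z)             # base case: df = 2
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Hypothetical features: name -> (chi-square statistic, degrees of freedom)
features = {"f_binary": (10.0, 1), "f_multi": (12.0, 6)}

# Old behavior described above: larger raw statistic ranks first
by_statistic = sorted(features, key=lambda name: -features[name][0])

# New behavior described above: smaller p-value ranks first
by_p_value = sorted(features, key=lambda name: chi2_sf(*features[name]))
```

Here `f_multi` has the larger statistic but, with six degrees of freedom, a much larger p-value than `f_binary`, so the two orderings disagree on which feature comes first.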

## From 1.6 to 2.0

### Breaking changes