[SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document

### What changes were proposed in this pull request?

This follows up #31160 to update the score function names in the documentation.

### Why are the changes needed?

Currently we use `f_classif`, `chi2`, and `f_regression`, which sound to me like sklearn's naming. It is good to have them, but I think it would be nice to show the formal score function names alongside sklearn's ones.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No, only doc change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Liang-Chi Hsieh 2021-02-10 09:24:25 +09:00 committed by HyukjinKwon
parent c8628c943c
commit 1fbd576410
4 changed files with 16 additions and 10 deletions


@@ -1802,9 +1802,9 @@ User can set `featureType` and `labelType`, and Spark will pick the score functi
~~~
featureType | labelType |score function
------------|------------|--------------
- categorical |categorical | chi2
- continuous |categorical | f_classif
- continuous |continuous | f_regression
+ categorical |categorical | chi-squared (chi2)
+ continuous |categorical | ANOVATest (f_classif)
+ continuous |continuous | F-value (f_regression)
~~~
It supports five selection modes: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:


@@ -44,7 +44,7 @@ import org.apache.spark.sql.types.StructType
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
- @deprecated("use UnivariateFeatureSelector instead", "3.1.0")
+ @deprecated("use UnivariateFeatureSelector instead", "3.1.1")
@Since("1.6.0")
final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
extends Selector[ChiSqSelectorModel] {


@@ -100,9 +100,12 @@ private[feature] trait UnivariateFeatureSelectorParams extends Params
* The user can set `featureType` and labelType`, and Spark will pick the score function based on
* the specified `featureType` and labelType`.
* The following combination of `featureType` and `labelType` are supported:
- * - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi2.
- * - `featureType` `continuous` and `labelType` `categorical`: Spark uses f_classif.
- * - `featureType` `continuous` and `labelType` `continuous`: Spark uses f_regression.
+ * - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi-squared,
+ * i.e. chi2 in sklearn.
+ * - `featureType` `continuous` and `labelType` `categorical`: Spark uses ANOVATest,
+ * i.e. f_classif in sklearn.
+ * - `featureType` `continuous` and `labelType` `continuous`: Spark uses F-value,
+ * i.e. f_regression in sklearn.
*
* The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
* `percentile`, `fpr`, `fdr`, `fwe`.


@@ -5821,9 +5821,12 @@ class UnivariateFeatureSelector(JavaEstimator, _UnivariateFeatureSelectorParams,
The following combination of `featureType` and `labelType` are supported:
- - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi2.
- - `featureType` `continuous` and `labelType` `categorical`, Spark uses f_classif.
- - `featureType` `continuous` and `labelType` `continuous`, Spark uses f_regression.
+ - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi-squared,
+ i.e. chi2 in sklearn.
+ - `featureType` `continuous` and `labelType` `categorical`, Spark uses ANOVATest,
+ i.e. f_classif in sklearn.
+ - `featureType` `continuous` and `labelType` `continuous`, Spark uses F-value,
+ i.e. f_regression in sklearn.
The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
`percentile`, `fpr`, `fdr`, `fwe`.
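
For readers skimming the hunks, the documented `featureType`/`labelType` mapping can be summarized as a small lookup. This is only an illustrative sketch in plain Python, not Spark's implementation: the names `ScoreFunction`, `_SCORE_FUNCTIONS`, and `pick_score_function` are hypothetical and do not exist in Spark or PySpark. It pairs each formal score function name from the updated docs with its sklearn counterpart.

```python
from typing import NamedTuple


class ScoreFunction(NamedTuple):
    """Hypothetical record pairing the two names shown in the updated docs."""
    formal_name: str   # formal name used in the updated Spark documentation
    sklearn_name: str  # corresponding sklearn function name


# Supported (featureType, labelType) combinations, copied from the doc table.
_SCORE_FUNCTIONS = {
    ("categorical", "categorical"): ScoreFunction("chi-squared", "chi2"),
    ("continuous", "categorical"): ScoreFunction("ANOVATest", "f_classif"),
    ("continuous", "continuous"): ScoreFunction("F-value", "f_regression"),
}


def pick_score_function(feature_type: str, label_type: str) -> ScoreFunction:
    """Return the documented score function for a type pair (illustrative helper)."""
    try:
        return _SCORE_FUNCTIONS[(feature_type, label_type)]
    except KeyError:
        # The docs list only three combinations; anything else is rejected here.
        raise ValueError(
            f"unsupported combination: featureType={feature_type!r}, "
            f"labelType={label_type!r}"
        ) from None
```

For example, `pick_score_function("continuous", "categorical")` returns the `ANOVATest`/`f_classif` pair, while a combination the docs do not list, such as categorical features with a continuous label, raises `ValueError` in this sketch.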