[SPARK-34080][ML][PYTHON][FOLLOW-UP] Update score function in UnivariateFeatureSelector document

### What changes were proposed in this pull request?

This follows up #31160 to update the score function names in the documentation.

### Why are the changes needed?

Currently we use `f_classif`, `chi2`, and `f_regression`, which appear to follow sklearn's naming. It is good to keep them, but it would be nicer to show the formal score function names alongside sklearn's.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Not tested; this is a doc-only change.

Closes #31531 from viirya/SPARK-34080-minor.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-10 09:24:25 +09:00 · 2021-02-10 09:24:25 +09:00 · 1fbd576410
parent c8628c943c
commit 1fbd576410
4 changed files with 16 additions and 10 deletions
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -1802,9 +1802,9 @@ User can set `featureType` and `labelType`, and Spark will pick the score functi
 ~~~
 featureType |  labelType |score function
 ------------|------------|--------------
-categorical |categorical | chi2
-continuous  |categorical | f_classif
-continuous  |continuous  | f_regression
+categorical |categorical | chi-squared (chi2)
+continuous  |categorical | ANOVATest (f_classif)
+continuous  |continuous  | F-value (f_regression)
 ~~~

 It supports five selection modes: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -44,7 +44,7 @@ import org.apache.spark.sql.types.StructType
 * By default, the selection method is `numTopFeatures`, with the default number of top features
 * set to 50.
 */
-@deprecated("use UnivariateFeatureSelector instead", "3.1.0")
+@deprecated("use UnivariateFeatureSelector instead", "3.1.1")
 @Since("1.6.0")
 final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: String)
  extends Selector[ChiSqSelectorModel] {
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/UnivariateFeatureSelector.scala
@@ -100,9 +100,12 @@ private[feature] trait UnivariateFeatureSelectorParams extends Params
 * The user can set `featureType` and labelType`, and Spark will pick the score function based on
 * the specified `featureType` and labelType`.
 * The following combination of `featureType` and `labelType` are supported:
- *  - `featureType` `categorical` and `labelType` `categorical`:  Spark uses chi2.
- *  - `featureType` `continuous` and `labelType` `categorical`:  Spark uses f_classif.
- *  - `featureType` `continuous` and `labelType` `continuous`:  Spark uses f_regression.
+ *  - `featureType` `categorical` and `labelType` `categorical`: Spark uses chi-squared,
+ *    i.e. chi2 in sklearn.
+ *  - `featureType` `continuous` and `labelType` `categorical`: Spark uses ANOVATest,
+ *    i.e. f_classif in sklearn.
+ *  - `featureType` `continuous` and `labelType` `continuous`: Spark uses F-value,
+ *    i.e. f_regression in sklearn.
 *
 * The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
 * `percentile`, `fpr`, `fdr`, `fwe`.
--- a/python/pyspark/ml/feature.py
+++ b/python/pyspark/ml/feature.py
@@ -5821,9 +5821,12 @@ class UnivariateFeatureSelector(JavaEstimator, _UnivariateFeatureSelectorParams,

    The following combination of `featureType` and `labelType` are supported:

-    - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi2.
-    - `featureType` `continuous` and `labelType` `categorical`, Spark uses f_classif.
-    - `featureType` `continuous` and `labelType` `continuous`, Spark uses f_regression.
+    - `featureType` `categorical` and `labelType` `categorical`, Spark uses chi-squared,
+      i.e. chi2 in sklearn.
+    - `featureType` `continuous` and `labelType` `categorical`, Spark uses ANOVATest,
+      i.e. f_classif in sklearn.
+    - `featureType` `continuous` and `labelType` `continuous`, Spark uses F-value,
+      i.e. f_regression in sklearn.

    The `UnivariateFeatureSelector` supports different selection modes: `numTopFeatures`,
    `percentile`, `fpr`, `fdr`, `fwe`.
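
The naming table introduced by this change can be summarized as a small lookup, shown here as a minimal sketch only. `SCORE_FUNCTIONS` and `describe_score_function` are hypothetical names used for illustration; they are not part of Spark's API, and this is not how Spark selects the score function internally. The sketch only mirrors the three documented `featureType`/`labelType` combinations and their formal and sklearn names.

```python
# Hypothetical lookup mirroring the documented combinations.
# Keys are (featureType, labelType); values are (formal name, sklearn name).
# This is an illustration of the documentation table, not Spark's implementation.
SCORE_FUNCTIONS = {
    ("categorical", "categorical"): ("chi-squared", "chi2"),
    ("continuous", "categorical"): ("ANOVATest", "f_classif"),
    ("continuous", "continuous"): ("F-value", "f_regression"),
}


def describe_score_function(feature_type, label_type):
    """Return the documented label, e.g. 'chi-squared (chi2)'.

    Raises ValueError for combinations the documentation does not list.
    """
    key = (feature_type, label_type)
    if key not in SCORE_FUNCTIONS:
        raise ValueError(
            f"Unsupported combination: featureType={feature_type!r}, "
            f"labelType={label_type!r}"
        )
    formal_name, sklearn_name = SCORE_FUNCTIONS[key]
    return f"{formal_name} ({sklearn_name})"


if __name__ == "__main__":
    for feature_type, label_type in SCORE_FUNCTIONS:
        print(feature_type, label_type, "->",
              describe_score_function(feature_type, label_type))
```

A combination that the documentation does not list, such as a categorical feature with a continuous label, raises `ValueError` in this sketch; how Spark itself reacts to such a combination is not stated in this change.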