[SPARK-32973][ML][DOC] FeatureHasher does not check categoricalCols in inputCols

### What changes were proposed in this pull request?
1. Update the comment: `Note, the relevant columns must also be set in inputCols` -> `Note, the relevant columns should also be set in inputCols`.
2. Add a check: if any `categoricalCols` are not set in `inputCols`, log a warning that lists them.

### Why are the changes needed?
There is no check that all `categoricalCols` are also set in `inputCols`. To keep the existing behavior, the comment is updated and a warning is logged instead of enforcing the requirement.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually, in the REPL.

Closes #29868 from zhengruifeng/feature_hash_cat_doc.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
zhengruifeng 2020-09-27 10:26:05 -05:00 committed by Sean Owen
parent c65b64552f
commit bc77e5b840

@@ -91,8 +91,8 @@ class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transforme
   /**
    * Numeric columns to treat as categorical features. By default only string and boolean
    * columns are treated as categorical, so this param can be used to explicitly specify the
-   * numerical columns to treat as categorical. Note, the relevant columns must also be set in
-   * `inputCols`.
+   * numerical columns to treat as categorical. Note, the relevant columns should also be set in
+   * `inputCols`, categorical columns not set in `inputCols` will be listed in a warning.
    * @group param
    */
   @Since("2.3.0")
@@ -195,7 +195,14 @@ class FeatureHasher(@Since("2.3.0") override val uid: String) extends Transforme
   @Since("2.3.0")
   override def transformSchema(schema: StructType): StructType = {
-    val fields = schema($(inputCols).toSet)
+    val localInputCols = $(inputCols).toSet
+    if (isSet(categoricalCols)) {
+      val set = $(categoricalCols).filterNot(c => localInputCols.contains(c))
+      if (set.nonEmpty) {
+        log.warn(s"categoricalCols ${set.mkString("[", ",", "]")} do not exist in inputCols")
+      }
+    }
+    val fields = schema(localInputCols)
     fields.foreach { fieldSchema =>
       val dataType = fieldSchema.dataType
       val fieldName = fieldSchema.name
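
For readers who want to try the new check without a Spark session, the sketch below isolates the same set-difference logic in plain Scala. It is an illustration, not the code from the patch: the object name `CategoricalColsCheck`, the helpers `missingCategoricalCols` and `warningMessage`, and the column names are all assumptions, and the message is printed rather than passed to a logger.

```scala
// Standalone sketch of the categoricalCols check; no Spark dependency.
// All names here are hypothetical and exist only for illustration.
object CategoricalColsCheck {

  /** Returns the categorical columns that are absent from inputCols, in their original order. */
  def missingCategoricalCols(
      inputCols: Array[String],
      categoricalCols: Array[String]): Array[String] = {
    val localInputCols = inputCols.toSet
    categoricalCols.filterNot(c => localInputCols.contains(c))
  }

  /** Builds warning text in the same shape as the patch, or None when nothing is missing. */
  def warningMessage(
      inputCols: Array[String],
      categoricalCols: Array[String]): Option[String] = {
    val missing = missingCategoricalCols(inputCols, categoricalCols)
    if (missing.nonEmpty) {
      val listed = missing.mkString("[", ",", "]")
      Some(s"categoricalCols $listed do not exist in inputCols")
    } else {
      None
    }
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical columns: "zip" is marked categorical but is not an input column.
    val inputCols = Array("real", "bool", "string")
    val categoricalCols = Array("real", "zip")
    // The patch calls log.warn at this point; the sketch prints instead.
    warningMessage(inputCols, categoricalCols).foreach(println)
  }
}
```

With these inputs only `zip` is reported. As in the patch, the check is advisory: the schema is still built from `inputCols` alone, so an unmatched categorical column is listed in the message and otherwise ignored.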