94ac9eba21
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.
A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:
f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
(cherry picked from commit
|
||
---|---|---|
.. | ||
src | ||
pom.xml |