spark-instrumented-optimizer/mllib
Sandy Ryza 94ac9eba21 [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
This patch adds a one-hot encoder for categorical features. I plan to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates `numCategories - 1` columns and, if true, creates `numCategories` columns. The default is true, which matches the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be obtained easily from a `StringIndexer`. The names are used for the output column names, which take the form `colName_categoryName`.
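
The two choices above can be illustrated with a small library-free sketch. This is not the patch's Scala implementation and does not use any Spark API; the function names `output_column_names` and `one_hot` are hypothetical, and the sketch assumes that `includeFirst = false` drops the first category, so that category is encoded as an all-zero vector.

```python
from typing import List, Sequence


def output_column_names(col_name: str,
                        categories: Sequence[str],
                        include_first: bool = True) -> List[str]:
    """Build output column names of the form colName_categoryName.

    Assumption: when include_first is False, the first category
    gets no column of its own.
    """
    kept = categories if include_first else categories[1:]
    return [f"{col_name}_{category}" for category in kept]


def one_hot(index: int,
            num_categories: int,
            include_first: bool = True) -> List[float]:
    """Encode a category index as a one-hot vector.

    With include_first=True the vector has num_categories entries.
    With include_first=False it has num_categories - 1 entries, and
    index 0 maps to the all-zero vector (assumed behavior).
    """
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    offset = 0 if include_first else 1
    vector = [0.0] * (num_categories - offset)
    position = index - offset
    if position >= 0:
        vector[position] = 1.0
    return vector


if __name__ == "__main__":
    names = ["red", "green", "blue"]  # e.g. labels produced by an indexer
    print(output_column_names("color", names))
    print(one_hot(1, len(names)))
    print(output_column_names("color", names, include_first=False))
    print(one_hot(1, len(names), include_first=False))
```

Dropping one category is commonly done to avoid perfectly collinear columns in linear models, which is presumably the motivation for offering the option here.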

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

(cherry picked from commit 47728db7cf)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-05 12:34:11 -07:00
src [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer 2015-05-05 12:34:11 -07:00
pom.xml [SPARK-1406] Mllib pmml model export 2015-04-29 23:21:21 -07:00