spark-instrumented-optimizer

src	[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer	2015-05-05 12:34:11 -07:00
pom.xml	[SPARK-1406] Mllib pmml model export	2015-04-29 23:21:21 -07:00

Sandy Ryza 94ac9eba21 [SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer

This patch adds a one-hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple of choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which matches the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can easily be obtained from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.
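
As a rough illustration of the two choices above, here is a minimal plain-Python sketch. It is not the Spark implementation. The function names `one_hot_columns` and `one_hot_encode` and the `color` example data are hypothetical, and it assumes that setting `includeFirst` to false drops the column for the first category, which the option name suggests but the description does not state outright.

```python
def one_hot_columns(col_name, categories, include_first=True):
    """Build output column names of the form colName_categoryName."""
    # Assumption: include_first=False omits the first category's column,
    # leaving numCategories - 1 columns.
    kept = categories if include_first else categories[1:]
    return [f"{col_name}_{category}" for category in kept]


def one_hot_encode(value, categories, include_first=True):
    """Encode a single categorical value as a list of 0.0/1.0 indicators."""
    kept = categories if include_first else categories[1:]
    return [1.0 if value == category else 0.0 for category in kept]


if __name__ == "__main__":
    categories = ["red", "green", "blue"]  # hypothetical category names
    print(one_hot_columns("color", categories))
    print(one_hot_columns("color", categories, include_first=False))
    print(one_hot_encode("green", categories))
    print(one_hot_encode("red", categories, include_first=False))
```

With the first column omitted, the first category is represented by an all-zero row, so no information is lost while the output is one column narrower.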

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer

(cherry picked from commit 47728db7cf)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

2015-05-05 12:34:11 -07:00
