spark-instrumented-optimizer/python/pyspark/ml
Xiangrui Meng 509a7cafcc [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index.
2. change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndexer`, the last element is the least frequent one.
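
To make the two changes above concrete, here is a minimal pure-Python sketch of the described behavior. It is not the Spark implementation: `infer_num_categories` and `one_hot` are hypothetical helper names used only for illustration, and the dense-list output is an assumption made for readability.

```python
def infer_num_categories(indices):
    """Fallback when the attribute is unknown: scan the values for the max category index."""
    return int(max(indices)) + 1


def one_hot(index, num_categories, drop_last=True):
    """Encode one category index as a dense list (hypothetical helper, not the Spark API)."""
    if not 0 <= index < num_categories:
        raise ValueError("category index out of range: %d" % index)
    # With drop_last the vector has one slot fewer, and the last
    # category is represented by the all-zeros vector.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        # Indices of the kept categories are unchanged (no shift by 1).
        vec[index] = 1.0
    return vec


indices = [0, 1, 2, 1]
n = infer_num_categories(indices)           # 3 categories
encoded = [one_hot(i, n) for i in indices]
# encoded == [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
```

Dropping the last category rather than the first keeps position `i` in the output aligned with category index `i`, which is the point made in item (b).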

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit 23452be944)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2015-05-29 00:51:24 -07:00
param [SPARK-7762] [MLLIB] set default value for outputCol 2015-05-20 17:26:44 -07:00
__init__.py [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 2015-05-21 22:57:43 -07:00
classification.py [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random 2015-05-20 15:16:27 -07:00
evaluation.py [MINOR] fix RegressionEvaluator doc 2015-05-28 21:26:49 -07:00
feature.py [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast 2015-05-29 00:51:24 -07:00
pipeline.py [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 2015-05-21 22:57:43 -07:00
recommendation.py [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel 2015-05-28 22:38:46 -07:00
regression.py [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random 2015-05-20 15:16:27 -07:00
tests.py [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random 2015-05-20 15:16:27 -07:00
tuning.py [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python 2015-05-18 12:02:26 -07:00
util.py [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python 2015-05-18 12:02:26 -07:00
wrapper.py [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 2015-05-21 22:57:43 -07:00