spark-instrumented-optimizer


Xiangrui Meng 509a7cafcc [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast	2015-05-29 00:51:24 -07:00

This PR contains two major changes to `OneHotEncoder`:

1. More robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index.
2. Change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:
   a. It is consistent with other tutorials on one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm).
   b. It keeps the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
   c. If users use `StringIndexer`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit `23452be944`)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
param	[SPARK-7762] [MLLIB] set default value for outputCol	2015-05-20 17:26:44 -07:00
__init__.py	[SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4	2015-05-21 22:57:43 -07:00
classification.py	[SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random	2015-05-20 15:16:27 -07:00
evaluation.py	[MINOR] fix RegressionEvaluator doc	2015-05-28 21:26:49 -07:00
feature.py	[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast	2015-05-29 00:51:24 -07:00
pipeline.py	[SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4	2015-05-21 22:57:43 -07:00
recommendation.py	[SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel	2015-05-28 22:38:46 -07:00
regression.py	[SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random	2015-05-20 15:16:27 -07:00
tests.py	[SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random	2015-05-20 15:16:27 -07:00
tuning.py	[SPARK-7380] [MLLIB] pipeline stages should be copyable in Python	2015-05-18 12:02:26 -07:00
util.py	[SPARK-7380] [MLLIB] pipeline stages should be copyable in Python	2015-05-18 12:02:26 -07:00
wrapper.py	[SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4	2015-05-21 22:57:43 -07:00
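
The `dropLast` behaviour described in the latest commit message can be illustrated with a small pure-Python sketch. This is not Spark's implementation: the helper names `one_hot` and `infer_num_categories` are hypothetical, and plain lists stand in for the vectors that `OneHotEncoder` actually produces. It only shows the two ideas from the message: falling back to the maximum observed index when the number of categories is unknown, and dropping the last category so the remaining indices keep their positions.

```python
from typing import List, Sequence


def infer_num_categories(indices: Sequence[int]) -> int:
    """Hypothetical fallback: with no attribute metadata, use max index + 1."""
    return max(indices) + 1


def one_hot(index: int, num_categories: int, drop_last: bool = True) -> List[float]:
    """Encode a category index as a dense 0/1 list (illustrative only)."""
    if not 0 <= index < num_categories:
        raise ValueError("index %d out of range [0, %d)" % (index, num_categories))
    # With drop_last, the last category maps to the all-zeros vector,
    # and every other category keeps its original position.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


indices = [0, 1, 2, 0]
num_categories = infer_num_categories(indices)
encoded = [one_hot(i, num_categories) for i in indices]
full = [one_hot(i, num_categories, drop_last=False) for i in indices]
```

Here category 2, the last one, becomes the all-zeros vector when `drop_last` is true, while categories 0 and 1 stay at positions 0 and 1. Dropping the first category instead would shift every remaining index down by one, which is the drawback the commit message points out.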