spark-instrumented-optimizer

History

Xiangrui Meng 23452be944 [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast This PR contains two major changes to `OneHotEncoder`: 1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index 2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits: a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm) b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1. c. If users use `StringIndex`, the last element is the least frequent one. Sorry for including two changes in one PR! I'll update the user guide in another PR. jkbradley sryza Author: Xiangrui Meng <meng@databricks.com> Closes #6466 from mengxr/SPARK-7912 and squashes the following commits: a280dca [Xiangrui Meng] fix tests d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912 171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's 00dfd96 [Xiangrui Meng] update OneHotEncoder in Python 208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast		2015-05-29 00:51:12 -07:00
..
docs	[SPARK-7619] [PYTHON] fix docstring signature	2015-05-14 18:16:22 -07:00
lib	[SPARK-2305] [PySpark] Update Py4J to version 0.8.2.1	2014-07-29 19:02:06 -07:00
pyspark	[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast	2015-05-29 00:51:12 -07:00
test_support	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
.gitignore	[SPARK-3946] gitignore in /python includes wrong directory	2014-10-14 14:09:39 -07:00
run-tests	[SPARK-7543] [SQL] [PySpark] split dataframe.py into multiple files	2015-05-15 20:09:15 -07:00