spark-instrumented-optimizer

Xiangrui Meng 509a7cafcc [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

This PR contains two major changes to `OneHotEncoder`:

1. More robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index.
2. Change `includeFirst` to `dropLast` and keep the default as `true`. There are a couple of benefits:
   a. It is consistent with other tutorials on one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm).
   b. It keeps the indices unmodified in the output vector. If we dropped the first, all indices would be shifted by 1.
   c. If users use `StringIndexer`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast

(cherry picked from commit `23452be944`)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

2015-05-29 00:51:24 -07:00
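
The two behaviors described in the commit message above can be illustrated with a small pure-Python sketch. This is not Spark's implementation: `one_hot` and `encode_column` are hypothetical helper names, and plain lists stand in for Spark's vectors. It assumes category indices are non-negative integers, infers the number of categories from the maximum index when none is given, and drops the last category by default so the remaining indices are not shifted.

```python
def one_hot(index, num_categories, drop_last=True):
    """Encode one category index as a dense 0/1 list (hypothetical helper)."""
    # With drop_last, the output has one slot fewer and the last
    # category is represented by the all-zeros vector.
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0  # indices of kept categories are unchanged
    return vec


def encode_column(indices, num_categories=None, drop_last=True):
    """Encode a list of category indices (hypothetical helper)."""
    if num_categories is None:
        # No attribute information: infer the count from the max index seen.
        num_categories = int(max(indices)) + 1
    return [one_hot(int(i), num_categories, drop_last) for i in indices]


if __name__ == "__main__":
    print(encode_column([0, 1, 2]))
    print(encode_column([0, 1, 2], drop_last=False))
```

Under these assumptions, category 0 still maps to position 0 after the last category is dropped, which is the "indices unmodified" property the commit message refers to.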
ml	[SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast	2015-05-29 00:51:24 -07:00
mllib	[SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel	2015-05-28 22:38:46 -07:00
sql	[SPARK-7840] add insertInto() to Writer	2015-05-23 09:07:45 -07:00
streaming	[SPARK-6657] [PYSPARK] Fix doc warnings	2015-05-18 08:35:24 -07:00
__init__.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
accumulators.py	[SPARK-6661] Python type errors should print type, not object	2015-04-20 10:44:09 -07:00
broadcast.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
cloudpickle.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
conf.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
context.py	[SPARK-7711] Add a startTime property to match the corresponding one in Scala	2015-05-21 14:09:09 -07:00
daemon.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
files.py	[SPARK-3309] [PySpark] Put all public API in __all__	2014-09-03 11:49:45 -07:00
heapq3.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
java_gateway.py	[SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression	2015-04-21 00:08:18 -07:00
join.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
profiler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
rdd.py	[SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative	2015-05-21 19:43:09 +01:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
shell.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
shuffle.py	[SPARK-7339] [PYSPARK] PySpark shuffle spill memory sometimes are not correct	2015-05-26 08:36:08 -07:00
statcounter.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-3417] Use new-style classes in PySpark	2014-09-08 15:45:36 -07:00
tests.py	[SPARK-7711] Add a startTime property to match the corresponding one in Scala	2015-05-21 14:09:09 -07:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
worker.py	[SPARK-6216] [PYSPARK] check python version of worker with driver	2015-05-18 12:55:37 -07:00