ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Yanbo Liang	e51b6eaa9e	[SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark. * Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #10469 from yanboliang/spark-11939.	2016-01-29 09:22:24 -08:00
Xusen Yin	4db255c7aa	[SPARK-12780] Inconsistency returning value of ML python models' properties https://issues.apache.org/jira/browse/SPARK-12780 Author: Xusen Yin <yinxusen@gmail.com> Closes #10724 from yinxusen/SPARK-12780.	2016-01-26 21:16:56 -08:00
Holden Karau	eb917291ca	[SPARK-10509][PYSPARK] Reduce excessive param boiler plate code The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh). Author: Holden Karau <holden@us.ibm.com> Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.	2016-01-26 15:53:48 -08:00
Xusen Yin	8beab68152	[SPARK-11923][ML] Python API for ml.feature.ChiSqSelector https://issues.apache.org/jira/browse/SPARK-11923 Author: Xusen Yin <yinxusen@gmail.com> Closes #10186 from yinxusen/SPARK-11923.	2016-01-26 11:56:46 -08:00
Holden Karau	b66afdeb52	[SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer Add Python API for ml.feature.QuantileDiscretizer. One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model. cc brkyvz & mengxr Author: Holden Karau <holden@us.ibm.com> Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.	2016-01-25 22:38:31 -08:00
Yanbo Liang	dcae355c64	[SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark ```PCAModel``` can output ```explainedVariance``` at Python side. cc mengxr srowen Author: Yanbo Liang <ybliang8@gmail.com> Closes #10830 from yanboliang/spark-12905.	2016-01-25 13:54:21 -08:00
Gábor Lipták	9bb35c5b59	[SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test. Author: Gábor Lipták <gliptak@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10850 from mengxr/SPARK-11295.	2016-01-20 11:11:10 -08:00
Xiangrui Meng	beda901422	Revert "[SPARK-11295] Add packages to JUnit output for Python tests" This reverts commit `c6f971b4ae`.	2016-01-19 16:51:17 -08:00
BenFradet	f6f7ca9d2e	[SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #10472 from BenFradet/SPARK-9716.	2016-01-19 14:59:20 -08:00
Gábor Lipták	c6f971b4ae	[SPARK-11295] Add packages to JUnit output for Python tests SPARK-11295 Add packages to JUnit output for Python tests This improves grouping/display of test case results. Author: Gábor Lipták <gliptak@gmail.com> Closes #9263 from gliptak/SPARK-11295.	2016-01-19 14:06:53 -08:00
Yanbo Liang	5f843781e3	[SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA Add PySpark missing methods and params for ml.feature: * ```RegexTokenizer``` should support setting ```toLowercase```. * ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```. * ```PCAModel``` should support output ```pc```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9908 from yanboliang/spark-11925.	2016-01-15 15:54:19 -08:00
Yanbo Liang	3aa3488225	[SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9807 from yanboliang/spark-11815.	2016-01-06 10:52:25 -08:00
Yanbo Liang	95eb651633	[SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9931 from yanboliang/SPARK-11945.	2016-01-06 10:50:02 -08:00
Holden Karau	3b29004d24	[SPARK-7675][ML][PYSPARK] sparkml params type conversion From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method. A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available. This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float and other conversions (like scipymatrix to array) are left for the future. Author: Holden Karau <holden@us.ibm.com> Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.	2016-01-06 10:43:03 -08:00
Jeff Zhang	1920d72a1f	[PYSPARK] Pyspark typo & Add missing abstractmethod annotation No jira is created since this is a trivial change. davies Please help review it Author: Jeff Zhang <zjffdu@apache.org> Closes #10143 from zjffdu/pyspark_typo.	2015-12-21 08:53:46 -08:00
Martin Menestret	3a44aebd0c	[SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed Extend CrossValidator with HasSeed in PySpark. This PR replaces [https://github.com/apache/spark/pull/7997] CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks! Author: Joseph K. Bradley <joseph@databricks.com> Author: Martin MENESTRET <mmenestret@ippon.fr> Closes #10268 from jkbradley/pyspark-cv-seed.	2015-12-16 14:05:35 -08:00
Yanbo Liang	d576e76bba	[MINOR][ML] Use coefficients replace weights Use ```coefficients``` replace ```weights```, I wish they are the last two. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #10065 from yanboliang/coefficients.	2015-12-03 11:37:34 -08:00
Yanbo Liang	7216f40545	[SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval * Update doc for PySpark ```HasCheckpointInterval``` that users can understand how to disable checkpoint. * Update doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to notify the relationship between ```cacheNodeIds``` and ```checkpointInterval```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9856 from yanboliang/spark-11875.	2015-11-19 22:14:01 -08:00
Yanbo Liang	603a721c21	[SPARK-11820][ML][PYSPARK] PySpark LiR & LoR should support weightCol [SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) have already supported setting weight column for ```LogisticRegression``` and ```LinearRegression```. It's a very important feature, PySpark should also support. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9811 from yanboliang/spark-11820.	2015-11-18 13:32:06 -08:00
Yu ISHIKAWA	88a3fdcc78	[SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8690 from yu-iskw/SPARK-10280.	2015-11-09 13:16:04 -08:00
Imran Rashid	49f1a82037	[SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.	2015-11-06 20:06:24 +00:00
Yanbo Liang	9da7ceed81	[SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver Follow up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9485 from yanboliang/spark-11473.	2015-11-05 09:56:18 -08:00
Yanbo Liang	2e86cf1b01	[SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk Author: Yanbo Liang <ybliang8@gmail.com> Closes #9492 from yanboliang/spark-11527.	2015-11-05 09:00:03 -08:00
vectorijk	c020f7d9d4	[SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models Deprecated in `LogisticRegression` and `LinearRegression` Author: vectorijk <jiangkai@gmail.com> Closes #9311 from vectorijk/spark-10592.	2015-11-02 16:12:04 -08:00
lihao	ecfb3e73fd	[SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.* Author: lihao <lihaowhu@gmail.com> Closes #9275 from lidinghao/SPARK-10286.	2015-11-02 16:09:22 -08:00
Yanbo Liang	f92b7b98e9	[SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver [SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided ```WeightedLeastSquares``` solver("normal") in ```LinearRegression``` with L2 regularization in Scala and R, Python ML ```LinearRegression``` should also support setting solver("auto", "normal", "l-bfgs") Author: Yanbo Liang <ybliang8@gmail.com> Closes #9328 from yanboliang/spark-11367.	2015-10-28 08:54:20 -07:00
vectorijk	9dba5fb2b5	[SPARK-10024][PYSPARK] Python API RF and GBT related params clear up implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API in pyspark/ml/{classification, regression}.py Author: vectorijk <jiangkai@gmail.com> Closes #9233 from vectorijk/spark-10024.	2015-10-27 13:55:03 -07:00
Gábor Lipták	163d53e829	[SPARK-7021] Add JUnit output for Python unit tests WIP Author: Gábor Lipták <gliptak@gmail.com> Closes #8323 from gliptak/SPARK-7021.	2015-10-22 15:27:11 -07:00
Xiangrui Meng	135ade9050	[MINOR][ML] fix doc warnings Without an empty line, sphinx will treat doctest as docstring. holdenk ~~~ /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label\|raw \|vectors \| +-----+---------------+-------------------------+ \|0 \|[a, b, c] \|(3,[0,1,2],[1.0,1.0,1.0])". /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 \|[a, b, b, c, a]\|(3,[0,1,2],[2.0,2.0,1.0])". ~~~ Author: Xiangrui Meng <meng@databricks.com> Closes #9188 from mengxr/py-count-vec-doc-fix.	2015-10-20 18:38:06 -07:00
Holden Karau	aea7142c98	[SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent Namely "." shows up in some places in the template when using the param docstring and not in others Author: Holden Karau <holden@pigscanfly.ca> Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.	2015-10-20 16:51:32 -07:00
Holden Karau	3aff0866a8	[SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression Add the Python API for isotonicregression. Author: Holden Karau <holden@pigscanfly.ca> Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.	2015-10-07 17:50:35 -07:00
Xiangrui Meng	5e035403d4	[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectly in PySpark's AFTSurvivalRegression If user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #9001 from mengxr/SPARK-10957.	2015-10-06 14:58:42 -07:00
vectorijk	5952bdb7df	[SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression Implement Python API for AFTSurvivalRegression Author: vectorijk <jiangkai@gmail.com> Closes #8926 from vectorijk/spark-10688.	2015-10-06 12:43:28 -07:00
Eric Liang	922338812c	[SPARK-9681] [ML] Support R feature interactions in RFormula This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.	2015-09-25 00:43:22 -07:00
noelsmith	7c4f852bfc	[DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e: ![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png) .. looks like this once newline is added: ![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png) Author: noelsmith <mail@noelsmith.com> Closes #8851 from noel-smith/docstring-missing-newline-fix.	2015-09-21 14:24:19 -07:00
Holden Karau	ba882db6f4	[SPARK-9769] [ML] [PY] add python api for countvectorizermodel From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel Author: Holden Karau <holden@pigscanfly.ca> Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.	2015-09-21 13:06:23 -07:00
Yanbo Liang	35e8ab9390	[SPARK-10615] [PYSPARK] change assertEquals to assertEqual As ```assertEquals``` is deprecated, so we need to change ```assertEquals``` to ```assertEqual``` for existing python unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8814 from yanboliang/spark-10615.	2015-09-18 09:53:52 -07:00
Yu ISHIKAWA	268088b899	[SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8692 from yu-iskw/SPARK-10282.	2015-09-17 08:51:19 -07:00
Yu ISHIKAWA	0ded87a4d4	[SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8691 from yu-iskw/SPARK-10281.	2015-09-17 08:47:21 -07:00
Yu ISHIKAWA	29bf8aa5a5	[SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8693 from yu-iskw/SPARK-10283.	2015-09-17 08:45:20 -07:00
Yu ISHIKAWA	c633ed3260	[SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8694 from yu-iskw/SPARK-10284.	2015-09-17 08:43:59 -07:00
Yuhao Yang	5f46444765	[SPARK-8530] [ML] add python API for MinMaxScaler jira: https://issues.apache.org/jira/browse/SPARK-8530 add python API for MinMaxScaler jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514 Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7150 from hhbyyh/pythonMinMax.	2015-09-11 10:32:35 -07:00
Joseph K. Bradley	2e3a280754	[MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils Changes: * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited. * MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore CC: holdenk mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8679 from jkbradley/doc-fixes-1.5.	2015-09-11 08:55:35 -07:00
Yanbo Liang	b01b262606	[SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier Add Python API for ```MultilayerPerceptronClassifier```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8067 from yanboliang/SPARK-9773.	2015-09-11 08:52:28 -07:00
Yanbo Liang	b656e6134f	[SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark LinearRegression and LogisticRegression lack of some Params for Python, and some Params are not shared classes which lead we need to write them for each class. These kinds of Params are list here: ```scala HasElasticNetParam HasFitIntercept HasStandardization HasThresholds ``` Here we implement them in shared params at Python side and make LinearRegression/LogisticRegression parameters peer with Scala one. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8508 from yanboliang/spark-10026.	2015-09-11 08:50:35 -07:00
Yanbo Liang	a140dd77c6	[SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature Missing method of ml.feature are listed here: ```StringIndexer``` lacks of parameter ```handleInvalid```. ```StringIndexerModel``` lacks of method ```labels```. ```VectorIndexerModel``` lacks of methods ```numFeatures``` and ```categoryMaps```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8313 from yanboliang/spark-10027.	2015-09-10 20:43:38 -07:00
Yanbo Liang	56a0fe5c6e	[SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer Add Python API for ml.feature.VectorSlicer. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8102 from yanboliang/SPARK-9772.	2015-09-09 18:02:33 -07:00
Holden Karau	2f6fd5256c	[SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark Adds IndexToString to PySpark. Author: Holden Karau <holden@pigscanfly.ca> Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.	2015-09-08 22:13:05 -07:00
noelsmith	0e2f216331	[SPARK-10094] Pyspark ML Feature transformers marked as experimental Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental. Author: noelsmith <mail@noelsmith.com> Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.	2015-09-08 21:26:20 -07:00
Holden Karau	e6e483cc4d	[SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.	2015-09-01 10:48:57 -07:00

1 2 3

114 commits