Commit graph

89 commits

Author SHA1 Message Date
Yanbo Liang f92b7b98e9 [SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided ```WeightedLeastSquares``` solver("normal") in ```LinearRegression``` with L2 regularization in Scala and R, Python ML ```LinearRegression``` should also support setting solver("auto", "normal", "l-bfgs")

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9328 from yanboliang/spark-11367.
2015-10-28 08:54:20 -07:00
vectorijk 9dba5fb2b5 [SPARK-10024][PYSPARK] Python API RF and GBT related params clear up
implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API
in pyspark/ml/{classification, regression}.py

Author: vectorijk <jiangkai@gmail.com>

Closes #9233 from vectorijk/spark-10024.
2015-10-27 13:55:03 -07:00
Gábor Lipták 163d53e829 [SPARK-7021] Add JUnit output for Python unit tests
WIP

Author: Gábor Lipták <gliptak@gmail.com>

Closes #8323 from gliptak/SPARK-7021.
2015-10-22 15:27:11 -07:00
Xiangrui Meng 135ade9050 [MINOR][ML] fix doc warnings
Without an empty line, sphinx will treat doctest as docstring. holdenk

~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #9188 from mengxr/py-count-vec-doc-fix.
2015-10-20 18:38:06 -07:00
Holden Karau aea7142c98 [SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent
Namely "." shows up in some places in the template when using the param docstring and not in others

Author: Holden Karau <holden@pigscanfly.ca>

Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
2015-10-20 16:51:32 -07:00
Holden Karau 3aff0866a8 [SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression
Add the Python API for isotonicregression.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
2015-10-07 17:50:35 -07:00
Xiangrui Meng 5e035403d4 [SPARK-10957] [ML] setParams changes quantileProbabilities unexpectly in PySpark's AFTSurvivalRegression
If user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #9001 from mengxr/SPARK-10957.
2015-10-06 14:58:42 -07:00
vectorijk 5952bdb7df [SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression
Implement Python API for AFTSurvivalRegression

Author: vectorijk <jiangkai@gmail.com>

Closes #8926 from vectorijk/spark-10688.
2015-10-06 12:43:28 -07:00
Eric Liang 922338812c [SPARK-9681] [ML] Support R feature interactions in RFormula
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.
2015-09-25 00:43:22 -07:00
noelsmith 7c4f852bfc [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e:

![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)

.. looks like this once newline is added:

![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)

Author: noelsmith <mail@noelsmith.com>

Closes #8851 from noel-smith/docstring-missing-newline-fix.
2015-09-21 14:24:19 -07:00
Holden Karau ba882db6f4 [SPARK-9769] [ML] [PY] add python api for countvectorizermodel
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
2015-09-21 13:06:23 -07:00
Yanbo Liang 35e8ab9390 [SPARK-10615] [PYSPARK] change assertEquals to assertEqual
As ```assertEquals``` is deprecated, so we need to change ```assertEquals``` to ```assertEqual``` for existing python unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8814 from yanboliang/spark-10615.
2015-09-18 09:53:52 -07:00
Yu ISHIKAWA 268088b899 [SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8692 from yu-iskw/SPARK-10282.
2015-09-17 08:51:19 -07:00
Yu ISHIKAWA 0ded87a4d4 [SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8691 from yu-iskw/SPARK-10281.
2015-09-17 08:47:21 -07:00
Yu ISHIKAWA 29bf8aa5a5 [SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8693 from yu-iskw/SPARK-10283.
2015-09-17 08:45:20 -07:00
Yu ISHIKAWA c633ed3260 [SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8694 from yu-iskw/SPARK-10284.
2015-09-17 08:43:59 -07:00
Yuhao Yang 5f46444765 [SPARK-8530] [ML] add python API for MinMaxScaler
jira: https://issues.apache.org/jira/browse/SPARK-8530

add python API for MinMaxScaler
jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7150 from hhbyyh/pythonMinMax.
2015-09-11 10:32:35 -07:00
Joseph K. Bradley 2e3a280754 [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils
Changes:
* Make Scala doc for StringIndexerInverse clearer.  Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore

CC: holdenk mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8679 from jkbradley/doc-fixes-1.5.
2015-09-11 08:55:35 -07:00
Yanbo Liang b01b262606 [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier
Add Python API for ```MultilayerPerceptronClassifier```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8067 from yanboliang/SPARK-9773.
2015-09-11 08:52:28 -07:00
Yanbo Liang b656e6134f [SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark
LinearRegression and LogisticRegression lack of some Params for Python, and some Params are not shared classes which lead we need to write them for each class. These kinds of Params are list here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them in shared params at Python side and make LinearRegression/LogisticRegression parameters peer with Scala one.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8508 from yanboliang/spark-10026.
2015-09-11 08:50:35 -07:00
Yanbo Liang a140dd77c6 [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature
Missing method of ml.feature are listed here:
```StringIndexer``` lacks of parameter ```handleInvalid```.
```StringIndexerModel``` lacks of method ```labels```.
```VectorIndexerModel``` lacks of methods ```numFeatures``` and ```categoryMaps```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8313 from yanboliang/spark-10027.
2015-09-10 20:43:38 -07:00
Yanbo Liang 56a0fe5c6e [SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer
Add Python API for ml.feature.VectorSlicer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8102 from yanboliang/SPARK-9772.
2015-09-09 18:02:33 -07:00
Holden Karau 2f6fd5256c [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark
Adds IndexToString to PySpark.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
2015-09-08 22:13:05 -07:00
noelsmith 0e2f216331 [SPARK-10094] Pyspark ML Feature transformers marked as experimental
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.

Author: noelsmith <mail@noelsmith.com>

Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
2015-09-08 21:26:20 -07:00
Holden Karau e6e483cc4d [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover
Add a python API for the Stop Words Remover.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
2015-09-01 10:48:57 -07:00
Yanbo Liang 52ea399e6e [SPARK-10355] [ML] [PySpark] Add Python API for SQLTransformer
Add Python API for SQLTransformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8527 from yanboliang/spark-10355.
2015-08-31 16:11:27 -07:00
Yanbo Liang 5b3245d6df [SPARK-8472] [ML] [PySpark] Python API for DCT
Add Python API for ml.feature.DCT.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8485 from yanboliang/spark-8472.
2015-08-31 15:50:41 -07:00
noelsmith 7583681e6b [SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model
* Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to underlying Scala object.
* Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
* Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).

(This contribution is my original work and that I license the work to the project under Sparks' open source license)

Author: noelsmith <mail@noelsmith.com>

Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
2015-08-27 23:59:30 -07:00
Feynman Liang 28a98464ea [SPARK-10097] Adds shouldMaximize flag to ml.evaluation.Evaluator
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.

This PR adds a `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.

CC jkbradley

Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8290 from feynmanliang/SPARK-10097.
2015-08-19 11:35:05 -07:00
Yanbo Liang 0076e82123 [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8061 from yanboliang/SPARK-9768.
2015-08-17 17:25:41 -07:00
MechCoder ffa05c84fe [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8110 from MechCoder/spark-9828.
2015-08-14 12:46:05 -07:00
Xiangrui Meng 68f9957149 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.

This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.

jkbradley yu-iskw

Author: Xiangrui Meng <meng@databricks.com>

Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:

149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
2015-08-12 23:04:59 -07:00
Joseph K. Bradley 551def5d69 [SPARK-9789] [ML] Added logreg threshold param back
Reinstated LogisticRegression.threshold Param for binary compatibility.  Param thresholds overrides threshold, if set.

CC: mengxr dbtsai feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8079 from jkbradley/logreg-reinstate-threshold.
2015-08-12 14:27:13 -07:00
Yanbo Liang 762bacc16a [SPARK-9766] [ML] [PySpark] check and add miss docs for PySpark ML
Check and add miss docs for PySpark ML (this issue only check miss docs for o.a.s.ml not o.a.s.mllib).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8059 from yanboliang/SPARK-9766.
2015-08-12 13:24:18 -07:00
MechCoder 076ec05681 [SPARK-9533] [PYSPARK] [ML] Add missing methods in Word2Vec ML
After https://github.com/apache/spark/pull/7263 it is pretty straightforward to Python wrappers.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7930 from MechCoder/spark-9533 and squashes the following commits:

1bea394 [MechCoder] make getVectors a lazy val
5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
2015-08-06 10:09:58 -07:00
Joseph K. Bradley e375456063 [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier
Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns.

CC: holdenk yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7903 from jkbradley/rf-prob-python and squashes the following commits:

c62a83f [Joseph K. Bradley] made unit test more robust
14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark
2015-08-04 14:54:26 -07:00
Holden Karau 5a23213c14 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
This PR replaces the old "threshold" with a generalized "thresholds" Param.  We keep getThreshold,setThreshold for backwards compatibility for binary classification.

Note that the primary author of this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:

3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
2015-08-04 10:12:22 -07:00
Xiangrui Meng e4765a4683 [SPARK-9544] [MLLIB] add Python API for RFormula
Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:

3d5ff03 [Xiangrui Meng] add an doctest for . and -
5e969a5 [Xiangrui Meng] fix pydoc
1cd41f8 [Xiangrui Meng] organize imports
3c18b10 [Xiangrui Meng] add Python API for RFormula
2015-08-03 13:59:35 -07:00
Yanbo Liang 4cdd8ecd66 [SPARK-9536] [SPARK-9537] [SPARK-9538] [ML] [PYSPARK] ml.classification support raw and probability prediction for PySpark
Make the following ml.classification class support raw and probability prediction for PySpark:
```scala
NaiveBayesModel
DecisionTreeClassifierModel
LogisticRegressionModel
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7866 from yanboliang/spark-9536-9537 and squashes the following commits:

2934dab [Yanbo Liang] ml.NaiveBayes, ml.DecisionTreeClassifier and ml.LogisticRegression support probability prediction
2015-08-02 22:19:27 -07:00
Yanbo Liang 69b62f76fc [SPARK-9214] [ML] [PySpark] support ml.NaiveBayes for Python
support ml.NaiveBayes for Python

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7568 from yanboliang/spark-9214 and squashes the following commits:

5ee3fd6 [Yanbo Liang] fix typos
3ecd046 [Yanbo Liang] fix typos
f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
180452a [Yanbo Liang] fix typos
7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
2015-07-30 23:03:48 -07:00
Ram Sriharsha 4e5919bfb4 [SPARK-7690] [ML] Multiclass classification Evaluator
Multiclass Classification Evaluator for ML Pipelines. F1 score, precision, recall, weighted precision and weighted recall are supported as available metrics.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #7475 from harsha2010/SPARK-7690 and squashes the following commits:

9bf4ec7 [Ram Sriharsha] fix indentation
3f09a85 [Ram Sriharsha] cleanup doc
16115ae [Ram Sriharsha] code review fixes
032d2a3 [Ram Sriharsha] fix test
eec9865 [Ram Sriharsha] Fix Python Indentation
1dbeffd [Ram Sriharsha] Merge branch 'master' into SPARK-7690
68cea85 [Ram Sriharsha] Merge branch 'master' into SPARK-7690
54c03de [Ram Sriharsha] [SPARK-7690][ml][WIP] Multiclass Evaluator for ML Pipeline
2015-07-30 23:02:11 -07:00
Xiangrui Meng 81464f2a82 [MINOR] [MLLIB] fix doc for RegexTokenizer
This is #7791 for Python. hhbyyh

Author: Xiangrui Meng <meng@databricks.com>

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
2015-07-30 09:45:17 -07:00
Holden Karau 37c2d1927c [SPARK-9016] [ML] make random forest classifiers implement classification trait
Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to providing thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:

bf22fa6 [Holden Karau] Add missing imports for testing suite
e948f0d [Holden Karau] Check the prediction generation from rawprediciton
25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
1a67e04 [Holden Karau] Use old decission tree stuff instead
673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
0d15b96 [Holden Karau] FIx typo
5eafad4 [Holden Karau] add a constructor for rootnode + num classes
fc6156f [Holden Karau] scala style fix
2597915 [Holden Karau] take num classes in constructor
3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
16aea1c [Holden Karau] Make tests match the new models
b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
77b4114 [Holden Karau] Import vectors lib
2015-07-29 18:18:29 -07:00
Yu ISHIKAWA 34a889db85 [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines
I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.

[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:

be752de [Yu ISHIKAWA] Add assertions
a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
fb2417c [Yu ISHIKAWA] Use getInt, instead of get
f397be4 [Yu ISHIKAWA] Switch the comparisons.
ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
f8338bc [Yu ISHIKAWA] Add the placeholders in Python
4a03003 [Yu ISHIKAWA] Test for contains in Python
6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
2ec80bc [Yu ISHIKAWA] Fit on 1 line
e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
4d2ad1e [Yu ISHIKAWA] Modify the indentations
0ae422f [Yu ISHIKAWA] Add a test for `setParams`
4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
5bedc51 [Yu ISHIKAWA] Remve an extra new line
444c289 [Yu ISHIKAWA] Add the validation for `runs`
e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
7991e15 [Yu ISHIKAWA] Add a validation for `k`
c2df35d [Yu ISHIKAWA] Make `predict` private
93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
11c2a12 [Yu ISHIKAWA] Limit the imports
badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
2015-07-17 18:30:04 -07:00
Yanbo Liang 830666f6fe [SPARK-8792] [ML] Add Python API for PCA transformer
Add Python API for PCA transformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7190 from yanboliang/spark-8792 and squashes the following commits:

8f4ac31 [Yanbo Liang] address comments
8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
2015-07-17 14:08:06 -07:00
MechCoder 20bb10f864 [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
This adds Pylint checks to PySpark.

For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script).
We still need to figure out what rules to be allowed.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7241 from MechCoder/pylint and squashes the following commits:

2fc7291 [MechCoder] Remove pylint test fail
6d883a2 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
f3a5e17 [MechCoder] undefined-variable
ca8b749 [MechCoder] Minor changes
71629f8 [MechCoder] remove trailing whitespace
8498ff9 [MechCoder] Remove blacklisted arguments and pointless statements check
1dbd094 [MechCoder] Disable all checks for now
8b8aa8a [MechCoder] Add pylint configuration file
7871bb1 [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
2015-07-15 08:25:53 -07:00
Davies Liu 79c35826e6 Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark"
This reverts commit 9b62e9375f.
2015-07-13 11:30:36 -07:00
MechCoder 9b62e9375f [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
This adds Pylint checks to PySpark.

For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script).
We still need to figure out what rules to be allowed.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7241 from MechCoder/pylint and squashes the following commits:

8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
57393a3 [MechCoder] undefined-variable
a8e2547 [MechCoder] Minor changes
7753810 [MechCoder] remove trailing whitespace
75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check
6bde250 [MechCoder] Disable all checks for now
3464666 [MechCoder] Add pylint configuration file
d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
2015-07-13 09:47:53 -07:00
MechCoder 35d781e71b [SPARK-8704] [ML] [PySpark] Add missing methods in StandardScaler
Add std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:

9fbae90 [MechCoder] Add type
6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
2015-07-07 12:35:40 -07:00
MechCoder 1dbc4a155f [SPARK-8711] [ML] Add additional methods to PySpark ML tree models
Add numNodes and depth to treeModels, add treeWeights to ensemble Models.
Add __repr__ to all models.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7095 from MechCoder/missing_methods_tree and squashes the following commits:

23b08be [MechCoder] private [spark]
38a0860 [MechCoder] rename pyTreeWeights to javaTreeWeights
6d16ad8 [MechCoder] Fix Python 3 Error
47d7023 [MechCoder] Use np.allclose and treeEnsembleModel -> TreeEnsembleMethods
819098c [MechCoder] [SPARK-8711] [ML] Add additional methods ot PySpark ML tree models
2015-07-07 08:58:08 -07:00