Commit graph

536 commits

Author SHA1 Message Date
Ilya Matiach 1edb3175d8 [SPARK-21866][ML][PYSPARK] Adding spark image reader
## What changes were proposed in this pull request?
Adding spark image reader, an implementation of schema for representing images in spark DataFrames

The code is taken from the spark package located here:
(https://github.com/Microsoft/spark-images)

Please see the JIRA for more information (https://issues.apache.org/jira/browse/SPARK-21866)

Please see mailing list for SPIP vote and approval information:
(http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-SPARK-21866-Image-support-in-Apache-Spark-td22510.html)

# Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases are emerging for different data formats beyond the traditional SQL types or the numerical types (vectors and matrices). Deep Learning applications commonly deal with image processing. A number of projects add some Deep Learning capabilities to Spark (see list below), but they struggle to communicate with each other or with MLlib pipelines because there is no standard way to represent an image in Spark DataFrames. We propose to federate efforts for representing images in Spark by defining a representation that caters to the most common needs of users and library developers.
This SPIP proposes a specification to represent images in Spark DataFrames and Datasets (based on existing industrial standards), and an interface for loading sources of images. It is not meant to be a full-fledged image processing library, but rather the core description that other libraries and users can rely on. Several packages already offer various processing facilities for transforming images or doing more complex operations, and each has various design tradeoffs that make them better as standalone solutions.
This project is a joint collaboration between Microsoft and Databricks, which have been testing this design in two open source packages: MMLSpark and Deep Learning Pipelines.
The proposed image format is an in-memory, decompressed representation that targets low-level applications. It is significantly more liberal in memory usage than compressed image representations such as JPEG, PNG, etc., but it allows easy communication with popular image processing libraries and has no decoding overhead.

## How was this patch tested?

Unit tests in scala ImageSchemaSuite, unit tests in python

Author: Ilya Matiach <ilmat@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19439 from imatiach-msft/ilmat/spark-images.
2017-11-22 15:45:45 -08:00
WeichenXu 2d868d9398 [SPARK-22521][ML] VectorIndexerModel support handle unseen categories via handleInvalid: Python API
## What changes were proposed in this pull request?

Add python api for VectorIndexerModel support handle unseen categories via handleInvalid.

## How was this patch tested?

doctest added.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19753 from WeichenXu123/vector_indexer_invalid_py.
2017-11-21 10:53:53 -08:00
WeichenXu 20eb95e5e9 [SPARK-21911][ML][PYSPARK] Parallel Model Evaluation for ML Tuning in PySpark
## What changes were proposed in this pull request?

Add parallelism support for ML tuning in pyspark.

## How was this patch tested?

Test updated.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19122 from WeichenXu123/par-ml-tuning-py.
2017-10-27 15:19:27 -07:00
hyukjinkwon d9798c834f [SPARK-22313][PYTHON] Mark/print deprecation warnings as DeprecationWarning for deprecated APIs
## What changes were proposed in this pull request?

This PR proposes to mark the existing warnings as `DeprecationWarning` and print out warnings for deprecated functions.

This could be actually useful for Spark app developers. I use (old) PyCharm and this IDE can detect this specific `DeprecationWarning` in some cases:

**Before**

<img src="https://user-images.githubusercontent.com/6477701/31762664-df68d9f8-b4f6-11e7-8773-f0468f70a2cc.png" height="45" />

**After**

<img src="https://user-images.githubusercontent.com/6477701/31762662-de4d6868-b4f6-11e7-98dc-3c8446a0c28a.png" height="70" />

For console usage, `DeprecationWarning` is usually disabled (see https://docs.python.org/2/library/warnings.html#warning-categories and https://docs.python.org/3/library/warnings.html#warning-categories):

```
>>> import warnings
>>> filter(lambda f: f[2] == DeprecationWarning, warnings.filters)
[('ignore', <_sre.SRE_Pattern object at 0x10ba58c00>, <type 'exceptions.DeprecationWarning'>, <_sre.SRE_Pattern object at 0x10bb04138>, 0), ('ignore', None, <type 'exceptions.DeprecationWarning'>, None, 0)]
```

so, it won't actually mess up the terminal much unless it is intended.

If this is intendedly enabled, it'd should as below:

```
>>> import warnings
>>> warnings.simplefilter('always', DeprecationWarning)
>>>
>>> from pyspark.sql import functions
>>> functions.approxCountDistinct("a")
.../spark/python/pyspark/sql/functions.py:232: DeprecationWarning: Deprecated in 2.1, use approx_count_distinct instead.
  "Deprecated in 2.1, use approx_count_distinct instead.", DeprecationWarning)
...
```

These instances were found by:

```
cd python/pyspark
grep -r "Deprecated" .
grep -r "deprecated" .
grep -r "deprecate" .
```

## How was this patch tested?

Manually tested.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #19535 from HyukjinKwon/deprecated-warning.
2017-10-24 12:44:47 +09:00
Nick Pentreath 98057583dd [SPARK-20679][ML] Support recommending for a subset of users/items in ALSModel
This PR adds methods `recommendForUserSubset` and `recommendForItemSubset` to `ALSModel`. These allow recommending for a specified set of user / item ids rather than for every user / item (as in the `recommendForAllX` methods).

The subset methods take a `DataFrame` as input, containing ids in the column specified by the param `userCol` or `itemCol`. The model will generate recommendations for each _unique_ id in this input dataframe.

## How was this patch tested?
New unit tests in `ALSSuite` and Python doctests in `ALS`. Ran updated examples locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18748 from MLnick/als-recommend-df.
2017-10-09 10:42:33 +02:00
Marco Gaido 5ac96854cc [SPARK-21981][PYTHON][ML] Added Python interface for ClusteringEvaluator
## What changes were proposed in this pull request?

Added Python interface for ClusteringEvaluator

## How was this patch tested?

Manual test, eg. the example Python code in the comments.

cc yanboliang

Author: Marco Gaido <mgaido@hortonworks.com>
Author: Marco Gaido <marcogaido91@gmail.com>

Closes #19204 from mgaido91/SPARK-21981.
2017-09-22 13:12:33 +08:00
Yanbo Liang 2f962422a2 [MINOR][ML] Remove unnecessary default value setting for evaluators.
## What changes were proposed in this pull request?
Remove unnecessary default value setting for all evaluators, as we have set them in corresponding _HasXXX_ base classes.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #19262 from yanboliang/evaluation.
2017-09-19 22:22:35 +08:00
Yanbo Liang c76153cc7d [SPARK-18608][ML][FOLLOWUP] Fix double caching for PySpark OneVsRest.
## What changes were proposed in this pull request?
#19197 fixed double caching for MLlib algorithms, but missed PySpark ```OneVsRest```, this PR fixed it.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #19220 from yanboliang/SPARK-18608.
2017-09-14 14:09:44 +08:00
Ming Jiang 8d8641f122 [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
## What changes were proposed in this pull request?

Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API

## How was this patch tested?

Added unit test

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Ming Jiang <mjiang@fanatics.com>
Author: Ming Jiang <jmwdpk@gmail.com>
Author: jmwdpk <jmwdpk@gmail.com>

Closes #19185 from jmwdpk/SPARK-21854.
2017-09-14 13:53:28 +08:00
Ajay Saini 720c94fe77 [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark
# What changes were proposed in this pull request?

Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.

I take this PR #18281 over because the original author is busy but we need merge this PR soon.
After this been merged, we can close #18281 .

## How was this patch tested?

Test suite added.

Author: Ajay Saini <ajays725@gmail.com>
Author: WeichenXu <weichen.xu@databricks.com>

Closes #19110 from WeichenXu123/spark-21027.
2017-09-12 10:02:27 -07:00
Chunsheng Ji 4bab8f5996 [SPARK-21856] Add probability and rawPrediction to MLPC for Python
Probability and rawPrediction has been added to MultilayerPerceptronClassifier for Python

Add unit test.

Author: Chunsheng Ji <chunsheng.ji@gmail.com>

Closes #19172 from chunshengji/SPARK-21856.
2017-09-11 16:52:48 +08:00
Xin Ren 31c74fec24 [SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API
https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test
`./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`

Author: Xin Ren <iamshrek@126.com>
Author: Xin Ren <renxin.ubc@gmail.com>
Author: Xin Ren <keypointt@users.noreply.github.com>

Closes #17451 from keypointt/SPARK-19866.
2017-09-08 12:09:00 -07:00
hyukjinkwon 8598d03a00 [SPARK-15243][ML][SQL][PYTHON] Add missing support for unicode in Param methods & functions in dataframe
## What changes were proposed in this pull request?

This PR proposes to support unicodes in Param methods in ML, other missed functions in DataFrame.

For example, this causes a `ValueError` in Python 2.x when param is a unicode string:

```python
>>> from pyspark.ml.classification import LogisticRegression
>>> lr = LogisticRegression()
>>> lr.hasParam("threshold")
True
>>> lr.hasParam(u"threshold")
Traceback (most recent call last):
 ...
    raise TypeError("hasParam(): paramName must be a string")
TypeError: hasParam(): paramName must be a string
```

This PR is based on https://github.com/apache/spark/pull/13036

## How was this patch tested?

Unit tests in `python/pyspark/ml/tests.py` and `python/pyspark/sql/tests.py`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: sethah <seth.hendrickson16@gmail.com>

Closes #17096 from HyukjinKwon/SPARK-15243.
2017-09-08 11:57:33 -07:00
Weichen Xu d6b30edd49 [SPARK-12664][ML] Expose probability in mlp model
## What changes were proposed in this pull request?

Modify MLP model to inherit `ProbabilisticClassificationModel` and so that it can expose the probability  column when transforming data.

## How was this patch tested?

Test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.
2017-08-22 21:16:34 -07:00
Bryan Cutler 41bb1ddc63 [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator
## What changes were proposed in this pull request?

Added call to copy values of Params from Estimator to Model after fit in PySpark ML.  This will copy values for any params that are also defined in the Model.  Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object.  This is a temporary fix that can be removed once the PySpark models properly define the params themselves.

## How was this patch tested?

Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.
2017-08-22 17:40:50 -07:00
Nick Pentreath 988b84d7ed [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher
Add Python API for `FeatureHasher` transformer.

## How was this patch tested?

New doc test.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
2017-08-21 14:35:38 +02:00
Ajay Saini 35db3b9fe3 [SPARK-17025][ML][PYTHON] Persistence for Pipelines with Python-only Stages
## What changes were proposed in this pull request?

Implemented a Python-only persistence framework for pipelines containing stages that cannot be saved using Java.

## How was this patch tested?

Created a custom Python-only UnaryTransformer, included it in a Pipeline, and saved/loaded the pipeline. The loaded pipeline was compared against the original using _compare_pipelines() in tests.py.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18888 from ajaysaini725/PythonPipelines.
2017-08-11 23:57:08 -07:00
WeichenXu b35660dd0e [SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
## What changes were proposed in this pull request?

Update breeze to 0.13.1 for an emergency bugfix in strong wolfe line search
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.
2017-08-09 14:44:10 +08:00
Yanbo Liang f763d8464b [SPARK-19270][FOLLOW-UP][ML] PySpark GLR model.summary should return a printable representation.
## What changes were proposed in this pull request?
PySpark GLR ```model.summary``` should return a printable representation by calling Scala ```toString```.

## How was this patch tested?
```
from pyspark.ml.regression import GeneralizedLinearRegression
dataset = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
glr = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3)
model = glr.fit(dataset)
model.summary
```
Before this PR:
![image](https://user-images.githubusercontent.com/1962026/29021059-e221633e-7b96-11e7-8d77-5d53f89c81a9.png)
After this PR:
![image](https://user-images.githubusercontent.com/1962026/29021097-fce80fa6-7b96-11e7-8ab4-7e113d447d5d.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18870 from yanboliang/spark-19270.
2017-08-08 08:43:58 +08:00
Ajay Saini fdcee028af [SPARK-21542][ML][PYTHON] Python persistence helper functions
## What changes were proposed in this pull request?

Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of Json-serializable parameters.

## How was this patch tested?

Instantiated an estimator with Json-serializable parameters (ex. LogisticRegression), saved it using the added helper functions, and loaded it back, and compared it to the original instance to make sure it is the same. This test was both done in the Python REPL and implemented in the unit tests.

Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
2017-08-07 17:03:20 -07:00
Ajay Saini 1347b2a697 [SPARK-21633][ML][PYTHON] UnaryTransformer in Python
## What changes were proposed in this pull request?

Implemented UnaryTransformer in Python.

## How was this patch tested?

This patch was tested by creating a MockUnaryTransformer class in the unit tests that extends UnaryTransformer and testing that the transform function produced correct output.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18746 from ajaysaini725/AddPythonUnaryTransformer.
2017-08-04 01:01:32 -07:00
zero323 845c039ceb [SPARK-20601][ML] Python API for Constrained Logistic Regression
## What changes were proposed in this pull request?
Python API for Constrained Logistic Regression based on #17922 , thanks for the original contribution from zero323 .

## How was this patch tested?
Unit tests.

Author: zero323 <zero323@users.noreply.github.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18759 from yanboliang/SPARK-20601.
2017-08-02 18:10:26 +08:00
Zheng RuiFeng 253a07e43a [SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC from HasThreshold
## What changes were proposed in this pull request?
GBTs inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #18612 from zhengruifeng/override_HasXXX.
2017-08-01 21:34:26 +08:00
Yan Facai (颜发才) a5a3189974 [SPARK-21306][ML] OneVsRest should support setWeightCol
## What changes were proposed in this pull request?

add `setWeightCol` method for OneVsRest.

`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.

## How was this patch tested?

+ [x] add an unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
2017-07-28 10:10:35 +08:00
Ajay Saini 7047f49f45 [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidation split persistence to pyspark.

## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
2017-07-17 10:07:32 -07:00
Yanbo Liang 69e5282d3c [SPARK-20307][ML][SPARKR][FOLLOW-UP] RFormula should handle invalid for both features and label column.
## What changes were proposed in this pull request?
```RFormula``` should handle invalid for both features and label column.
#18496 only handle invalid values in features column. This PR add handling invalid values for label column and test cases.

## How was this patch tested?
Add test cases.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18613 from yanboliang/spark-20307.
2017-07-15 20:56:38 +08:00
Zheng RuiFeng d2d2a5de18 [SPARK-18619][ML] Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid
## What changes were proposed in this pull request?
1, HasHandleInvaild support override
2, Make QuantileDiscretizer/Bucketizer/StringIndexer/RFormula inherit from HasHandleInvalid

## How was this patch tested?
existing tests

[JIRA](https://issues.apache.org/jira/browse/SPARK-18619)

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18582 from zhengruifeng/heritate_HasHandleInvalid.
2017-07-12 22:09:03 +08:00
actuaryzhang 4852b7d447 [SPARK-21310][ML][PYSPARK] Expose offset in PySpark
## What changes were proposed in this pull request?
Add offset to PySpark in GLM as in #16699.

## How was this patch tested?
Python test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18534 from actuaryzhang/pythonOffset.
2017-07-05 18:41:00 +08:00
Yanbo Liang c19680be1c [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data
## What changes were proposed in this pull request?
This PR is to maintain API parity with changes made in SPARK-17498 to support a new option
'keep' in StringIndexer to handle unseen labels or NULL values with PySpark.

Note: This is updated version of #17237 , the primary author of this PR is VinceShieh .
## How was this patch tested?
Unit tests.

Author: VinceShieh <vincent.xie@intel.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18453 from yanboliang/spark-19852.
2017-07-02 16:17:03 +08:00
Ruifeng Zheng e0b047eafe [SPARK-18518][ML] HasSolver supports override
## What changes were proposed in this pull request?
1, make param support non-final with `finalFields` option
2, generate `HasSolver` with `finalFields = false`
3, override `solver` in LiR, GLR, and make MLPC inherit `HasSolver`

## How was this patch tested?
existing tests

Author: Ruifeng Zheng <ruifengz@foxmail.com>
Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16028 from zhengruifeng/param_non_final.
2017-07-01 15:37:41 +08:00
Joseph K. Bradley cc67bd5732 [SPARK-20929][ML] LinearSVC should use its own threshold param
## What changes were proposed in this pull request?

LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.

## How was this patch tested?

New unit test to make sure the threshold can be set to any Double value.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
2017-06-19 23:04:17 -07:00
actuaryzhang ff5676b01f [SPARK-20899][PYSPARK] PySpark supports stringIndexerOrderType in RFormula
## What changes were proposed in this pull request?

PySpark supports stringIndexerOrderType in RFormula as in #17967.

## How was this patch tested?
docstring test

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18122 from actuaryzhang/PythonRFormula.
2017-05-31 01:02:19 +08:00
Yan Facai (颜发才) 139da116f1 [SPARK-20768][PYSPARK][ML] Expose numPartitions (expert) param of PySpark FPGrowth.
## What changes were proposed in this pull request?

Expose numPartitions (expert) param of PySpark FPGrowth.

## How was this patch tested?

+ [x] Pass all unit tests.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18058 from facaiy/ENH/pyspark_fpg_add_num_partition.
2017-05-25 21:40:39 +08:00
Yanbo Liang 913a6bfe4b [SPARK-19281][FOLLOWUP][ML] Minor fix for PySpark FPGrowth.
## What changes were proposed in this pull request?
Follow-up for #17218, some minor fix for PySpark ```FPGrowth```.

## How was this patch tested?
Existing UT.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18089 from yanboliang/spark-19281.
2017-05-25 20:15:15 +08:00
zero323 1816eb3bef [SPARK-20631][FOLLOW-UP] Fix incorrect tests.
## What changes were proposed in this pull request?

- Fix incorrect tests for `_check_thresholds`.
- Move test to `ParamTests`.

## How was this patch tested?

Unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.
2017-05-24 19:57:44 +08:00
Peng 9afcf127d3 [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
## What changes were proposed in this pull request?
Add test cases for PR-18062

## How was this patch tested?
The existing UT

Author: Peng <peng.meng@intel.com>

Closes #18068 from mpjlu/moreTest.
2017-05-24 19:54:17 +08:00
Bago Amirbekian 9434280cfd [SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators
Changes:

pyspark.ml Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass through lists of param maps to the underlying estimators so that those estimators can handle parallelization when appropriate (eg distributed hyper parameter tuning).

Testing:

Existing unit tests.

Author: Bago Amirbekian <bago@databricks.com>

Closes #18077 from MrBago/delegate_params.
2017-05-23 20:56:01 -07:00
Peng cfca01136b [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version
## What changes were proposed in this pull request?

SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. Python API should be updated to reflect these changes.

## How was this patch tested?
The existing UT

Author: Peng <peng.meng@intel.com>

Closes #18062 from mpjlu/spark-20764.
2017-05-22 22:42:37 +08:00
Wayne Zhang 0f2f56c37b [SPARK-20736][PYTHON] PySpark StringIndexer supports StringOrderType
## What changes were proposed in this pull request?
PySpark StringIndexer supports StringOrderType added in #17879.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17978 from actuaryzhang/PythonStringIndexer.
2017-05-21 16:51:55 -07:00
Yanbo Liang dbe81633a7 [SPARK-20501][ML] ML 2.2 QA: New Scala APIs, docs
## What changes were proposed in this pull request?
Review new Scala APIs introduced in 2.2.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17934 from yanboliang/spark-20501.
2017-05-15 21:21:54 -07:00
Yanbo Liang d4022d4951 [SPARK-20707][ML] ML deprecated APIs should be removed in major release.
## What changes were proposed in this pull request?
Before 2.2, MLlib keep to remove APIs deprecated in last feature/minor release. But from Spark 2.2, we decide to remove deprecated APIs in a major release, so we need to change corresponding annotations to tell users those will be removed in 3.0.
Meanwhile, this fixed bugs in ML documents. The original ML docs can't show deprecated annotations in ```MLWriter``` and ```MLReader``` related class, we correct it in this PR.

Before:
![image](https://cloud.githubusercontent.com/assets/1962026/25939889/f8c55f20-3666-11e7-9fa2-0605bfb3ed06.png)

After:
![image](https://cloud.githubusercontent.com/assets/1962026/25939870/e9b0d5be-3666-11e7-9765-5e04885e4b32.png)

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17946 from yanboliang/spark-20707.
2017-05-16 10:08:23 +08:00
Yanbo Liang 0698e6c88c [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
This reverts commit b8733e0ad9.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17944 from yanboliang/spark-20606-revert.
2017-05-11 14:48:13 +08:00
zero323 804949c6bf [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
## What changes were proposed in this pull request?

- Replace `getParam` calls with `getOrDefault` calls.
- Fix exception message to avoid unintended `TypeError`.
- Add unit tests

## How was this patch tested?

New unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17891 from zero323/SPARK-20631.
2017-05-10 16:57:52 +08:00
Yanbo Liang b8733e0ad9 [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove ML methods we deprecated in 2.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17867 from yanboliang/spark-20606.
2017-05-09 17:30:37 +08:00
Nick Pentreath e300a5a145 [SPARK-20300][ML][PYSPARK] Python API for ALSModel.recommendForAllUsers,Items
Add Python API for `ALSModel` methods `recommendForAllUsers`, `recommendForAllItems`

## How was this patch tested?

New doc tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17622 from MLnick/SPARK-20300-pyspark-recall.
2017-05-02 10:49:13 +02:00
Yanbo Liang dbb06c689c [MINOR][ML] Fix some PySpark & SparkR flaky tests
## What changes were proposed in this pull request?
Some PySpark & SparkR tests run with tiny dataset and tiny ```maxIter```, which means they are not converged. I don’t think checking intermediate result during iteration make sense, and these intermediate result may vulnerable and not stable, so we should switch to check the converged result. We hit this issue at #17746 when we upgrade breeze to 0.13.1.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17757 from yanboliang/flaky-test.
2017-04-26 21:34:18 +08:00
Yanbo Liang 67eef47acf
[SPARK-20449][ML] Upgrade breeze version to 0.13.1
## What changes were proposed in this pull request?
Upgrade breeze version to 0.13.1, which fixed some critical bugs of L-BFGS-B.

## How was this patch tested?
Existing unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17746 from yanboliang/spark-20449.
2017-04-25 17:10:41 +00:00
Liang-Chi Hsieh 1a52a62377 [SPARK-20076][ML][PYSPARK] Add Python interface for ml.stats.Correlation
## What changes were proposed in this pull request?

The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it.

## How was this patch tested?

Python unit test.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17494 from viirya/correlation-python-api.
2017-04-07 11:00:10 +02:00
Liang-Chi Hsieh 12206058e8 [SPARK-20214][ML] Make sure converted csc matrix has sorted indices
## What changes were proposed in this pull request?

`_convert_to_vector` converts a scipy sparse matrix to csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices and so a failure happens when you do something like that:

    from scipy.sparse import lil_matrix
    lil = lil_matrix((4, 1))
    lil[1, 0] = 1
    lil[3, 0] = 2
    _convert_to_vector(lil.todok())

    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
      return SparseVector(l.shape[0], csc.indices, csc.data)
    File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
      % (self.indices[i], self.indices[i + 1]))
    TypeError: Indices 3 and 1 are not strictly increasing

A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:

    >>> from scipy.sparse import lil_matrix
    >>> lil = lil_matrix((4, 1))
    >>> lil[1, 0] = 1
    >>> lil[3, 0] = 2
    >>> dok = lil.todok()
    >>> csc = dok.tocsc()
    >>> csc.has_sorted_indices
    0
    >>> csc.indices
    array([3, 1], dtype=int32)

I checked the source codes of scipy. The only way to guarantee it is `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.

## How was this patch tested?

Existing tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #17532 from viirya/make-sure-sorted-indices.
2017-04-05 17:46:44 -07:00
Bago Amirbekian a5c87707ea [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest
## What changes were proposed in this pull request?

A pyspark wrapper for spark.ml.stat.ChiSquareTest

## How was this patch tested?

unit tests
doctests

Author: Bago Amirbekian <bago@databricks.com>

Closes #17421 from MrBago/chiSquareTestWrapper.
2017-03-28 19:19:16 -07:00
zero323 0bc8847aa2 [SPARK-19281][PYTHON][ML] spark.ml Python API for FPGrowth
## What changes were proposed in this pull request?

- Add `HasSupport` and `HasConfidence` `Params`.
- Add new module `pyspark.ml.fpm`.
- Add `FPGrowth` / `FPGrowthModel` wrappers.
- Provide tests for new features.

## How was this patch tested?

Unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17218 from zero323/SPARK-19281.
2017-03-26 16:49:27 -07:00
Nick Pentreath d9f4ce6943 [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark
Add Python wrapper for `Imputer` feature transformer.

## How was this patch tested?

New doc tests and tweak to PySpark ML `tests.py`

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
2017-03-24 08:01:15 -07:00
Yanbo Liang 81303f7ca7 [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution.
## What changes were proposed in this pull request?
PySpark ```GeneralizedLinearRegression``` supports tweedie distribution.

## How was this patch tested?
Add unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #17146 from yanboliang/spark-19806.
2017-03-08 02:09:36 -08:00
Bryan Cutler 44281ca81d [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe
## What changes were proposed in this pull request?
The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.

This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.

## How was this patch tested?
Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
2017-03-03 16:43:45 -08:00
Mark Grover d2a879762a [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLast
## What changes were proposed in this pull request?
Updates the doc string to match up with the code
i.e. say dropLast instead of includeFirst

## How was this patch tested?
Not much, since it's a doc-like change. Will run unit tests via Jenkins job.

Author: Mark Grover <mark@apache.org>

Closes #17127 from markgrover/spark_19734.
2017-03-01 22:57:34 -08:00
Yun Ni 3bd8ddf7c3 [MINOR][ML] Fix comments in LSH Examples and Python API
## What changes were proposed in this pull request?
Remove `org.apache.spark.examples.` in
Add slash in one of the python doc.

## How was this patch tested?
Run examples using the commands in the comments.

Author: Yun Ni <yunn@uber.com>

Closes #17104 from Yunni/yunn_minor.
2017-03-01 22:55:13 -08:00
Nick Pentreath b405466513 [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy
This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.

The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible.

The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
## How was this patch tested?

New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12896 from MLnick/SPARK-14489-als-nan.
2017-02-28 16:17:35 +02:00
Bryan Cutler 2f69e3f60f [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation
## What changes were proposed in this pull request?
Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.

## How was this patch tested?
Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16772 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772.
2017-02-23 18:05:58 -08:00
Yun Ni 08c1972a06 [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing
## What changes were proposed in this pull request?
This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

## How was this patch tested?
API and examples are tested using spark-submit:
`bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
`bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

User guide changes are generated and manually inspected:
`SKIP_API=1 jekyll build`

Author: Yun Ni <yunn@uber.com>
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Yunni <Euler57721@gmail.com>

Closes #16715 from Yunni/spark-18080.
2017-02-15 16:26:05 -08:00
VinceShieh 6eca21ba88 [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark
## What changes were proposed in this pull request?
This PR is to document the changes on QuantileDiscretizer in pyspark for PR:
https://github.com/apache/spark/pull/15428

## How was this patch tested?
No test needed

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #16922 from VinceShieh/spark-19590.
2017-02-15 10:12:07 -08:00
zero323 5e7cd3322b [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util
## What changes were proposed in this pull request?

Add missing `warnings` import.

## How was this patch tested?

Manual tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #16846 from zero323/SPARK-19506.
2017-02-13 09:26:49 -08:00
zero323 fab0d62a71 [SPARK-19467][ML][PYTHON] Remove cyclic imports from pyspark.ml.pipeline
## What changes were proposed in this pull request?

Remove cyclic imports between `pyspark.ml.pipeline` and `pyspark.ml`.

## How was this patch tested?

Existing unit tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #16814 from zero323/SPARK-19467.
2017-02-06 18:12:20 -08:00
Zheng RuiFeng 317fa75081 [SPARK-19421][ML][PYSPARK] Remove numClasses and numFeatures methods in LinearSVC
## What changes were proposed in this pull request?
Methods `numClasses` and `numFeatures` in LinearSVCModel are already usable by inheriting `JavaClassificationModel`
we should not explicitly add them.

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16727 from zhengruifeng/nits_in_linearSVC.
2017-02-05 19:06:51 -08:00
Joseph K. Bradley 1d5d2a9d09 [SPARK-19389][ML][PYTHON][DOC] Minor doc fixes for ML Python Params and LinearSVC
## What changes were proposed in this pull request?

* Removed Since tags in Python Params since they are inherited by other classes
* Fixed doc links for LinearSVC

## How was this patch tested?

* doc tests
* generating docs locally and checking manually

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #16723 from jkbradley/pyparam-fix-doc.
2017-02-02 11:58:46 -08:00
Bryan Cutler 57d70d26c8 [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
## What changes were proposed in this pull request?

Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
2017-01-31 15:42:36 -08:00
wm624@hotmail.com bb1a1fe05e [SPARK-19336][ML][PYSPARK] LinearSVC Python API
## What changes were proposed in this pull request?

Add Python API for the newly added LinearSVC algorithm.

## How was this patch tested?

Add new doc string test.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16694 from wangmiao1981/ser.
2017-01-27 16:03:53 -08:00
Zheng RuiFeng 8ccca9170f [SPARK-14272][ML] Add Loglikelihood in GaussianMixtureSummary
## What changes were proposed in this pull request?

add loglikelihood in GMM.summary

## How was this patch tested?

added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #12064 from zhengruifeng/gmm_metric.
2017-01-19 03:46:37 -08:00
Peng, Meng 32286ba68a
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
2017-01-10 13:09:58 +00:00
Yanbo Liang 3ef6d98a80 [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy `GaussianMixture` implementation from mllib to ml, then we can add new features to it.
I left mllib `GaussianMixture` untouched, unlike some other algorithms to wrap the ml implementation. For the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting initial model, but ml does not support currently. (We will definitely add this feature for ml in the future)

We can get around these issues to make mllib as a wrapper calling into ml, but I'd prefer to leave mllib untouched which can make ml clean.

Meanwhile, There is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of multivariate gaussian distribution is symmetric, we can only store the upper triangular part of the matrix and it will greatly reduce the shuffled data size. In my test, this change will reduce shuffled data size by about 50% and accelerate the job execution.

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
2017-01-09 21:38:46 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Peng 79ff853631 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

ut will be added soon

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes #15212 from mpjlu/fdr_fwe.
2016-12-28 00:49:36 -08:00
krishnakalyan3 c802ad8718
[SPARK-18628][ML] Update Scala param and Python param to have quotes
## What changes were proposed in this pull request?

Updated Scala param and Python param to have quotes around the options making it easier for users to read.

## How was this patch tested?

Manually checked the docstrings

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16242 from krishnakalyan3/doc-string.
2016-12-11 09:28:16 +00:00
Sandeep Singh 78bb7f8071 [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <sandeep@techaddict.me>
Author: jkbradley <joseph.kurata.bradley@gmail.com>

Closes #15843 from techaddict/SPARK-18274.
2016-12-01 13:22:40 -08:00
Sandeep Singh fe854f2e4f [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for QuantileDiscretizer and Bucketizer
## What changes were proposed in this pull request?
added the new handleInvalid param for these transformers to Python to maintain API parity.

## How was this patch tested?
existing tests
testing is done with new doctests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #15817 from techaddict/SPARK-18366.
2016-11-30 11:33:15 +02:00
Jeff Zhang 4c82ca86d9 [SPARK-15819][PYSPARK][ML] Add KMeanSummary in KMeans of PySpark
## What changes were proposed in this pull request?

Add python api for KMeansSummary
## How was this patch tested?

unit test added

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13557 from zjffdu/SPARK-15819.
2016-11-29 20:51:27 -08:00
Yuhao 9b670bcaec [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, sealed audit
## What changes were proposed in this pull request?
make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs.

Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319

## How was this patch tested?
existing ut

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #15972 from hhbyyh/experimental21.
2016-11-29 18:46:59 -08:00
Yanbo Liang c4a7eef0ce [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for ML
## What changes were proposed in this pull request?
Remove deprecated methods for ML.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15913 from yanboliang/spark-18481.
2016-11-26 05:28:41 -08:00
hyukjinkwon 933a6548d4
[SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation
## What changes were proposed in this pull request?

It seems in Python, there are

- `Note:`
- `NOTE:`
- `Note that`
- `.. note::`

This PR proposes to fix those to `.. note::` to be consistent.

**Before**

<img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png">

<img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png">

**After**

<img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png">

<img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png">

## How was this patch tested?

The notes were found via

```bash
grep -r "Note: " .
grep -r "NOTE: " .
grep -r "Note that " .
```

And then fixed one by one comparing with API documentation.

After that, manually tested via `make html` under `./python/docs`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15947 from HyukjinKwon/SPARK-18447.
2016-11-22 11:40:18 +00:00
sethah e811fbf9ed [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM
## What changes were proposed in this pull request?

Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.

## How was this patch tested?

Unit tests.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15777 from sethah/pyspark_cluster_summaries.
2016-11-21 05:36:49 -08:00
Felix Cheung 55964c15a7 [SPARK-18239][SPARKR] Gradient Boosted Tree for R
## What changes were proposed in this pull request?

Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.

Since this is relatively isolated I'd like to target this for branch-2.1

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15746 from felixcheung/rgbt.
2016-11-08 16:00:45 -08:00
Zheng RuiFeng 9dc9f9a5dd [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier
## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15692 from zhengruifeng/gbt_subsamplingRate.
2016-11-03 07:45:20 -07:00
Joseph K. Bradley 91c33a0ca5 [SPARK-18088][ML] Various ChiSqSelector cleanups
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15647 from jkbradley/chisqselector-follow-ups.
2016-11-01 17:00:00 -07:00
Felix Cheung 7c37869292 [SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest regression and classification
## What changes were proposed in this pull request?

Add subsmaplingRate to randomForestClassifier
Add varianceCol to randomForestRegressor
In Python

## How was this patch tested?

manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #15638 from felixcheung/pyrandomforest.
2016-10-30 16:21:37 -07:00
VinceShieh 0b076d4cb6 [SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2.
NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
          .setInputCol("feature")
          .setOutputCol("result")
          .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
          .setInputCol("feature")
          .setOutputCol("result")
          .setSplits(splits)
          .setHandleNaN("keep")

## How was this patch tested?
Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite

Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
2016-10-27 11:52:15 -07:00
Peng c8b612decb
[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
## What changes were proposed in this pull request?

For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.

So we change statistic to pValue for SelectKBest and SelectPercentile

## How was this patch tested?
change existing test

Author: Peng <peng.meng@intel.com>

Closes #15444 from mpjlu/chisqure-bug.
2016-10-14 12:48:57 +01:00
Yanbo Liang 1db8feab8c [SPARK-15402][ML][PYSPARK] PySpark ml.evaluation should support save/load
## What changes were proposed in this pull request?
Since ```ml.evaluation``` has supported save/load at Scala side, supporting it at Python side is very straightforward and easy.

## How was this patch tested?
Add python doctest.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13194 from yanboliang/spark-15402.
2016-10-14 04:17:03 -07:00
Yanbo Liang 44cbb61b34 [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula forceIndexLabel.
## What changes were proposed in this pull request?
Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15430 from yanboliang/spark-15957-python.
2016-10-13 19:44:24 -07:00
WeichenXu 0d4a695279 [SPARK-17745][ML][PYSPARK] update NB python api - add weight col parameter
## What changes were proposed in this pull request?

update python api for NaiveBayes: add weight col parameter.

## How was this patch tested?

doctests added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15406 from WeichenXu123/nb_python_update.
2016-10-12 19:52:57 -07:00
Zheng RuiFeng c17f971839 [SPARK-17744][ML] Parity check between the ml and mllib test suites for NB
## What changes were proposed in this pull request?
1,parity check and add missing test suites for ml's NB
2,remove some unused imports

## How was this patch tested?
 manual tests in spark-shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #15312 from zhengruifeng/nb_test_parity.
2016-10-04 06:54:48 -07:00
zero323 d8399b600c [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract
## What changes were proposed in this pull request?

Replaces` ValueError` with `IndexError` when index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.

Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparkMatrix` in `ml` / `mllib`.

## How was this patch tested?

PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.

Author: zero323 <zero323@users.noreply.github.com>

Closes #15144 from zero323/SPARK-17587.
2016-10-03 17:57:54 -07:00
Jason White 1f31bdaef6 [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch
## What changes were proposed in this pull request?

This PR removes a patch on ListConverter from https://github.com/apache/spark/pull/5570, as it is no longer necessary. The underlying issue in Py4J https://github.com/bartdag/py4j/issues/160 was patched in 224b94b666 and is present in 0.10.3, the version currently in use in Spark.

## How was this patch tested?

The original test added in https://github.com/apache/spark/pull/5570 remains.

Author: Jason White <jason.white@shopify.com>

Closes #15254 from JasonMWhite/remove_listconverter_patch.
2016-10-03 14:12:03 -07:00
Sean Owen b88cb63da3
[SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.
## What changes were proposed in this pull request?

Partial revert of #15277 to instead sort and store input to model rather than require sorted input

## How was this patch tested?

Existing tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #15299 from srowen/SPARK-17704.2.
2016-10-01 16:10:39 -04:00
WeichenXu 7f16affa26 [SPARK-17138][ML][MLIB] Add Python API for multinomial logistic regression
## What changes were proposed in this pull request?

Add Python API for multinomial logistic regression.

- add `family` param in python api.
- expose `coefficientMatrix` and `interceptVector` for `LogisticRegressionModel`
- add python-side testcase for multinomial logistic regression
- update python doc.

## How was this patch tested?

existing and added doc tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14852 from WeichenXu123/add_MLOR_python.
2016-09-27 00:00:21 -07:00
Yanbo Liang ac65139be9
[SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15214 from yanboliang/spark-17017.
2016-09-26 09:45:33 +01:00
Sean Owen 248916f558
[SPARK-17057][ML] ProbabilisticClassifierModels' thresholds should have at most one 0
## What changes were proposed in this pull request?

Match ProbabilisticClassifer.thresholds requirements to R randomForest cutoff, requiring all > 0

## How was this patch tested?

Jenkins tests plus new test cases

Author: Sean Owen <sowen@cloudera.com>

Closes #15149 from srowen/SPARK-17057.
2016-09-24 08:15:55 +01:00
WeichenXu 72d9fba26c [SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for AFTSurvivalRegression
## What changes were proposed in this pull request?

Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR.

## How was this patch tested?

Existing tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
2016-09-22 04:35:54 -07:00
VinceShieh 57dc326bd0
[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value.
Sometimes, null value might also be useful to users, so in these cases, Bucketizer should
reserve one extra bucket for NaN values, instead of throwing an illegal exception.
Before:
```
Bucketizer.transform on NaN value threw an illegal exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xieintel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
2016-09-21 10:20:57 +01:00
Yanbo Liang 883c763184 [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15050 from yanboliang/spark-17389.
2016-09-11 13:47:13 +01:00
Yanbo Liang 39d538dddf [MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel.
## What changes were proposed in this pull request?
```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it.

## How was this patch tested?
Doc change.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14967 from yanboliang/mlp-weights.
2016-09-06 03:30:37 -07:00
Yanbo Liang 6b8cb1fe52 [SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level configurable.
## What changes were proposed in this pull request?
[SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes tree aggregation level in LiR/LoR configurable, this PR makes PySpark support this function.

## How was this patch tested?
Since ```aggregationDepth``` is an expert param, I'm not prefer to test it in doctest which is also used for example. Here is the offline test result:
![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14766 from yanboliang/spark-17197.
2016-08-25 02:26:33 -07:00
Holden Karau b264cbb16f [SPARK-15113][PYSPARK][ML] Add missing num features num classes
## What changes were proposed in this pull request?

Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc.

## How was this patch tested?

Extended doctests

Author: Holden Karau <holden@us.ibm.com>

Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
2016-08-22 12:21:22 +02:00
Bryan Cutler 39f328ba35 [SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used without stages
## What changes were proposed in this pull request?

When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as attempts to iterate over the pipeline stages.  A pipeline with no stages should act as an identity transform, however the `stages` param still needs to be set to an empty list.  This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input.  Also minor cleanup of related code.

## How was this patch tested?
Added new unit tests to verify an empty Pipeline acts as an identity transformer

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
2016-08-19 23:46:36 -07:00
Jeff Zhang 072acf5e14 [SPARK-16965][MLLIB][PYSPARK] Fix bound checking for SparseVector.
## What changes were proposed in this pull request?

1. In scala, add negative low bound checking and put all the low/upper bound checking in one place
2. In python, add low/upper bound checking of indices.

## How was this patch tested?

unit test added

Author: Jeff Zhang <zjffdu@apache.org>

Closes #14555 from zjffdu/SPARK-16965.
2016-08-19 12:38:15 +01:00
Nick Lavers 5377fc6236 [SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place.

Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
2016-08-19 10:11:59 +01:00
Yanbo Liang ccc6dc0f4b [MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark
## What changes were proposed in this pull request?
Fix the typo of ```TreeEnsembleModels``` for PySpark, it should ```TreeEnsembleModel``` which will be consistent with Scala. What's more, it represents a tree ensemble model, so  ```TreeEnsembleModel``` should be more reasonable. This should not be used public, so it will not involve  breaking change.

## How was this patch tested?
No new tests, should pass existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14454 from yanboliang/TreeEnsembleModel.
2016-08-11 22:39:19 -07:00
=^_^= 639df046a2 [SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics
## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds

Author: =^_^= <maxmoroz@gmail.com>

Closes #14456 from pkch/pkch-patch-1.
2016-08-03 04:18:28 -07:00
krishnakalyan3 7e8279fde1 [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc
## What changes were proposed in this pull request?
Updated ML pipeline Cross Validation Scaladoc & PyDoc.

## How was this patch tested?

Documentation update

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #13894 from krishnakalyan3/kfold-cv.
2016-07-27 15:37:38 +02:00
WeichenXu ad3708e783 [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6
## What changes were proposed in this pull request?

replace ANN convergence tolerance param default
from 1e-4 to 1e-6

so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer.

## How was this patch tested?

Existing Test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14286 from WeichenXu123/update_ann_tol.
2016-07-25 20:00:37 +01:00
WeichenXu 37bed97de5 [PYSPARK] add picklable SparseMatrix in pyspark.ml.common
## What changes were proposed in this pull request?

add `SparseMatrix` class whick support pickler.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14265 from WeichenXu123/picklable_py.
2016-07-24 02:29:08 -07:00
Yanbo Liang 670891496a [SPARK-16494][ML] Upgrade breeze version to 0.12
## What changes were proposed in this pull request?
breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes.
One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case.
We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12.
For more features, improvements and bug fixes of breeze 0.12, you can refer the following link:
https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c

## How was this patch tested?
No new tests, should pass the existing ones.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14150 from yanboliang/spark-16494.
2016-07-19 12:31:04 +01:00
Joseph K. Bradley 5ffd5d3838 [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide
## What changes were proposed in this pull request?

Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added

Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page

## How was this patch tested?

Generated docs locally

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14213 from jkbradley/ml-guide-2.0.
2016-07-15 13:38:23 -07:00
Joseph K. Bradley 01f09b1612 [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML
## What changes were proposed in this pull request?

General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

## How was this patch tested?

N/A

Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14147 from jkbradley/experimental-audit.
2016-07-13 12:33:39 -07:00
Joseph K. Bradley fdde7d0aa0 [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls
## What changes were proposed in this pull request?

Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.

This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X

## How was this patch tested?

Existing unit tests.  Manual testing in an environment where this was an issue.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14023 from jkbradley/SPARK-16348.
2016-07-05 17:00:24 -07:00
Nick Pentreath 18faa588ca [SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
2016-06-22 10:05:25 -07:00
Holden Karau d281b0bafe [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs
## What changes were proposed in this pull request?

Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.

## How was this patch tested?

Built docs locally & PySpark SQL tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
2016-06-22 11:54:49 +02:00
Bryan Cutler b76e355376 [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None
## What changes were proposed in this pull request?

Several places set the seed Param default value to None which will translate to a zero value on the Scala side.  This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation.  These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero.

## How was this patch tested?

Ran PySpark tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
2016-06-21 11:43:25 -07:00
Nick Pentreath 37494a18e8 [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature
This PR adds missing `Since` annotations to `ml.feature` package.

Closes #8505.

## How was this patch tested?

Existing tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13641 from MLnick/add-since-annotations.
2016-06-21 00:39:47 -07:00
Bryan Cutler a42bf55532 [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
## What changes were proposed in this pull request?

Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.

## How was this patch tested?

Local tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
2016-06-20 16:28:11 -07:00
Liang-Chi Hsieh baa3e633e1 [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
## What changes were proposed in this pull request?

Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13219 from viirya/pyspark-pickler-ml.
2016-06-13 19:59:53 -07:00
Bryan Cutler 7d7a0a5e07 [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula.  This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
2016-06-10 11:27:30 -07:00
WeichenXu cdd7f5a57a [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter
## What changes were proposed in this pull request?

Word2vec python add maxsentence parameter.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.
2016-06-10 12:26:53 +01:00
Jeff Zhang e594b49283 [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property
## What changes were proposed in this pull request?

add method idf to IDF in pyspark

## How was this patch tested?

add unit test

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13540 from zjffdu/SPARK-15788.
2016-06-09 09:54:38 -07:00
Yanbo Liang a95252823e [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13519 from yanboliang/spark-15771.
2016-06-06 09:36:34 +01:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Holden Karau 67cc89ff02 [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.

Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )

## How was this patch tested?

Doc tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
2016-06-03 15:56:17 -07:00
Holden Karau 72353311d3 [SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods
## What changes were proposed in this pull request?

Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`

## How was this patch tested?

Extended doc tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
2016-06-02 15:55:14 -07:00
Yanbo Liang 07a98ca4ce [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.

## How was this patch tested?
Exist tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13410 from yanboliang/spark-15587.
2016-06-01 10:49:51 -07:00
Yanbo Liang 594484cd83 [MINOR][DOC][ML] ml.clustering scala & python api doc sync
## What changes were proposed in this pull request?
Since we done Scala API audit for ml.clustering at #13148, we should also fix and update the corresponding Python API docs to keep them in sync.

## How was this patch tested?
Docs change, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13291 from yanboliang/spark-15361-followup.
2016-05-31 14:56:43 -07:00
yinxusen 130b8d07b8 [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest
## What changes were proposed in this pull request?

1. Add `_transfer_param_map_to/from_java` for OneVsRest;

2. Add `_compare_params` in ml/tests.py to help compare params.

3. Add `test_onevsrest` as the integration test for OneVsRest.

## How was this patch tested?

Python unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #12875 from yinxusen/SPARK-15008.
2016-05-27 13:18:29 -07:00
Nick Pentreath 1cb347fbc4 [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.

We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

Tests N/A.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
2016-05-25 20:41:53 +02:00
Holden Karau cd9f16906c [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build insturctions
## What changes were proposed in this pull request?

PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).

## How was this patch tested?

built pydocs locally, tested new user build instructions

Author: Holden Karau <holden@us.ibm.com>

Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
2016-05-24 22:20:00 -07:00
Nick Pentreath 6075f5b4d8 [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer
This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

## How was this patch tested?

A little doctest and built API docs locally to check HTML doc generation.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
2016-05-24 10:02:10 +02:00
WeichenXu a15ca5533d [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
## What changes were proposed in this pull request?

Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
2016-05-23 18:14:48 -07:00
Liang-Chi Hsieh 4e73933118 [SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression
## What changes were proposed in this pull request?

Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression between PySpark and Scala. That is because default value conflict between #13106 and #13129. This causes ml.tests failed.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13220 from viirya/hotfix-regresstion.
2016-05-20 13:40:13 +02:00
Yanbo Liang 6643677817 [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync
## What changes were proposed in this pull request?
```ml.evaluation``` Scala and Python API sync.

## How was this patch tested?
Only API docs change, no new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13195 from yanboliang/evaluation-doc.
2016-05-19 17:56:21 -07:00
Holden Karau e71cd96bf7 [SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression
## What changes were proposed in this pull request?

Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list

## How was this patch tested?

doctests & built docs locally

Author: Holden Karau <holden@us.ibm.com>

Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
2016-05-19 20:59:19 +02:00
Bryan Cutler b1bc5ebdd5 [DOC][MINOR] ml.feature Scala and Python API sync
## What changes were proposed in this pull request?

I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.

## How was this patch tested?

Built docs locally, ran style checks

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13159 from BryanCutler/ml.feature-api-sync.
2016-05-19 04:48:36 +02:00
Nick Pentreath e8b79afa02 [SPARK-14891][ML] Add schema validation for ALS
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).

## How was this patch tested?
New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
2016-05-18 21:13:12 +02:00
Takuya Kuwahara 411c04adb5 [SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics
## What changes were proposed in this pull request?

This pull request includes supporting validationMetrics for TrainValidationSplitModel with Python and test for it.

## How was this patch tested?

test in `python/pyspark/ml/tests.py`

Author: Takuya Kuwahara <taakuu19@gmail.com>

Closes #12767 from taku-k/spark-14978.
2016-05-18 08:29:47 +02:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
Xiangrui Meng 8ad9f08c94 [SPARK-14906][ML] Copy linalg in PySpark to new ML package
## What changes were proposed in this pull request?

Copy the linalg (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to new ML package.

## How was this patch tested?
Existing tests.

Author: Xiangrui Meng <meng@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
2016-05-17 00:08:02 -07:00
sethah 5b849766ab [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.
## What changes were proposed in this pull request?

This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

## How was this patch tested?

Added a unit test to `pyspark.ml.tests`

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12961 from sethah/GLR_summary.
2016-05-13 09:01:20 +02:00
Zheng RuiFeng 87d69a01f0 [MINOR][PYSPARK] update _shared_params_code_gen.py
## What changes were proposed in this pull request?

1, add arg-checkings for `tol` and `stepSize` to  keep in line with `SharedParamsCodeGen.scala`
2, fix one typo

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12996 from zhengruifeng/py_args_checking.
2016-05-13 08:52:06 +02:00
Holden Karau d1aadea05a [SPARK-15188] Add missing thresholds param to NaiveBayes in PySpark
## What changes were proposed in this pull request?

Add missing thresholds param to NiaveBayes

## How was this patch tested?
doctests

Author: Holden Karau <holden@us.ibm.com>

Closes #12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.
2016-05-13 08:39:59 +02:00
Holden Karau 5207a005cc [SPARK-15281][PYSPARK][ML][TRIVIAL] Add impurity param to GBTRegressor & add experimental inside of regression.py
## What changes were proposed in this pull request?

Add impurity param to  GBTRegressor and mark the of the models & regressors in regression.py as experimental to match Scaladoc.

## How was this patch tested?

Added default value to init, tested with unit/doc tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #13071 from holdenk/SPARK-15281-GBTRegressor-impurity.
2016-05-12 09:19:27 +02:00
Sandeep Singh 2931437972 [SPARK-15037] [SQL] [MLLIB] Part2: Use SparkSession instead of SQLContext in Python TestSuites
## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Python TestSuites

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13044 from techaddict/SPARK-15037-python.
2016-05-11 11:24:16 -07:00
Holden Karau 007882c7ee [SPARK-15189][PYSPARK][DOCS] Update ml.evaluation PyDoc
## What changes were proposed in this pull request?

Fix doctest issue, short param description, and tag items as Experimental

## How was this patch tested?

build docs locally & doctests

Author: Holden Karau <holden@us.ibm.com>

Closes #12964 from holdenk/SPARK-15189-ml.Evaluation-PyDoc-issues.
2016-05-11 08:33:29 +02:00
Holden Karau 93353b0113 [SPARK-15195][PYSPARK][DOCS] Update ml.tuning PyDocs
## What changes were proposed in this pull request?

Tag classes in ml.tuning as experimental, add docs for kfolds avg metric, and copy TrainValidationSplit scaladoc for more detailed explanation.

## How was this patch tested?

built docs locally

Author: Holden Karau <holden@us.ibm.com>

Closes #12967 from holdenk/SPARK-15195-pydoc-ml-tuning.
2016-05-10 21:20:19 +02:00
Holden Karau 12fe2ecd19 [SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default param doc note
## What changes were proposed in this pull request?

PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. Also add a note about default value in one place. Copy some extended docs from scala for GBT

## How was this patch tested?

Built docs locally.

Author: Holden Karau <holden@us.ibm.com>

Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
2016-05-09 09:11:17 +01:00
Burak Köse e20cd9f4ce [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* covert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>

Closes #12843 from mengxr/SPARK-14050.
2016-05-06 13:58:12 -07:00
Holden Karau 4c0d827cfc [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA"
## What changes were proposed in this pull request?

Copy the package documentation from Scala/Java to Python for ML package and remove beta tags. Not super sure if we want to keep the BETA tag but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).

## How was this patch tested?

Python documentation built locally as HTML and text and verified output.

Author: Holden Karau <holden@us.ibm.com>

Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
2016-05-05 10:52:25 +01:00
Yanbo Liang d26f7cb012 [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up
## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For examples,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweeps, and we cleaned wherever possible.
## How was this patch tested?
Exist unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12749 from yanboliang/spark-14971.
2016-05-03 16:46:13 +02:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Junyang 1192fe4cd2 [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel
## What changes were proposed in this pull request?

This PR fixes the bug that generates infinite distances between word vectors. For example,

Before this PR, we have
```
val synonyms = model.findSynonyms("who", 40)
```
will give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by number of iteration, to be consistent with Google Word2Vec implementation

## How was this patch tested?

Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
2016-04-30 10:16:35 +01:00
Xiangrui Meng 7fbe1bb24d [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS
## What changes were proposed in this pull request?

As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
2016-04-30 00:41:28 -07:00
Nick Pentreath 90fa2c6e7f [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.

## How was this patch tested?

New test cases in `ALSSuite` and `tests.py`.

cc yanboliang jkbradley sethah rishabhbhardwaj

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12660 from MLnick/SPARK-14412-als-storage-params.
2016-04-29 22:01:41 -07:00
Joseph K. Bradley 09da43d514 [SPARK-13786][ML][PYTHON] Removed save/load for python tuning
## What changes were proposed in this pull request?

Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily.  This support should be re-designed and added in the next release.

## How was this patch tested?

Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12782 from jkbradley/remove-python-tuning-saveload.
2016-04-29 20:51:24 -07:00
Jeff Zhang 775772de36 [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
2016-04-29 10:42:52 -07:00
Kai Jiang d584a2b8ac [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics
## What changes were proposed in this pull request?
support avgMetrics in CrossValidatorModel with Python
## How was this patch tested?
Doctest and `test_save_load` in `pyspark/ml/test.py`
[JIRA](https://issues.apache.org/jira/browse/SPARK-12810)

Author: Kai Jiang <jiangkai@gmail.com>

Closes #12464 from vectorijk/spark-12810.
2016-04-28 14:19:11 -07:00
Yanbo Liang 4672e9838b [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12702 from yanboliang/spark-14899.
2016-04-27 14:08:26 -07:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joseph K. Bradley 89f082de0e [SPARK-14903][SPARK-14071][ML][PYTHON] Revert : MLWritable.write property
## What changes were proposed in this pull request?

SPARK-14071 changed MLWritable.write to be a property.  This reverts that change since there was not a good way to make MLReadable.read appear to be a property.

## How was this patch tested?

existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12671 from jkbradley/revert-MLWritable-write-py.
2016-04-26 12:00:57 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated  ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Yanbo Liang 425f691646 [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3
## What changes were proposed in this pull request?
As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method.

Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.

## How was this patch tested?
unit tests.

cc jkbradley MLnick

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12498 from yanboliang/spark-10574.
2016-04-25 12:08:43 -07:00
Joseph K. Bradley c7758ba384 [MINOR][ML][PYTHON][DOC] Remove use of JavaMLWriter/Reader in public Python API docs
## What changes were proposed in this pull request?

Removed instances of JavaMLWriter, JavaMLReader appearing in public Python API docs

## How was this patch tested?

n/a

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12542 from jkbradley/javamlwriter-doc.
2016-04-25 11:02:32 -07:00
wm624@hotmail.com b50e2eca93 [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture
## What changes were proposed in this pull request?

Add Python API in ML for GaussianMixture

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Add doctest and test cases are the same as mllib Python tests
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12402 from wangmiao1981/gmm.
2016-04-25 10:48:15 -07:00
Jason Lee bfda099913 [SPARK-14768][ML][PYSPARK] removed expectedType from Param __init__()
## What changes were proposed in this pull request?
Removed expectedType arg from PySpark Param __init__, as suggested by the JIRA.

## How was this patch tested?
Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix.

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12581 from jasoncl/SPARK-14768.
2016-04-25 15:32:11 +02:00
Yanbo Liang 296c384aff [MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter
## What changes were proposed in this pull request?
#11663 adds type conversion functionality for parameters in Pyspark. This PR find out the omissive ```Param``` that did not pass corresponding ```TypeConverter``` argument and fix them. After this PR, all params in pyspark/ml/ used ```TypeConverter```.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12529 from yanboliang/typeConverter.
2016-04-20 13:02:37 -07:00
Yanbo Liang 08f84d7a9a [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set method
## What changes were proposed in this pull request?
#11939 make Python param setters use the `_set` method. This PR fix omissive ones.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12531 from yanboliang/setters-omissive.
2016-04-20 20:06:27 +02:00
Burak Yavuz 80bf48f437 [SPARK-14555] First cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.
2016-04-20 10:32:01 -07:00
Joseph K. Bradley d29e429eeb [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for Param constructor
## What changes were proposed in this pull request?

PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.

This PR changes all usages in pyspark/ml/ to keyword args.

## How was this patch tested?

Existing unit tests.  I will not test type conversion for every Param unless we really think it necessary.

Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
```
/Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
  "Use typeConverter instead, as a keyword argument.")
```
That warning came from the typeConverter argument being passes as the expectedType arg by mistake.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12480 from jkbradley/typeconverter-fix.
2016-04-18 17:15:12 -07:00
Xusen Yin f31a62d1b2 [SPARK-14440][PYSPARK] Remove pipeline specific reader and writer
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14440

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.

## How was this patch tested?

test with unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12216 from yinxusen/SPARK-14440.
2016-04-18 13:31:48 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Xusen Yin b64482f49f [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14306

Add PySpark OneVsRest save/load supports.

## How was this patch tested?

Test with Python unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12439 from yinxusen/SPARK-14306-0415.
2016-04-18 11:52:29 -07:00
Joseph K. Bradley 36da5e3234 [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable
## What changes were proposed in this pull request?

Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.

This PR: Use unicode everywhere in Python.

## How was this patch tested?

Updated persistence unit test to check uid type

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12368 from jkbradley/python-uid-unicode.
2016-04-16 11:23:28 -07:00
Xusen Yin 90b46e014a [SPARK-7861][ML] PySpark OneVsRest
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implement it with Python since it's a meta-pipeline.

## How was this patch tested?

Test with doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12124 from yinxusen/SPARK-14306-7861.
2016-04-15 12:58:38 -07:00
sethah 129f2f455d [SPARK-14104][PYSPARK][ML] All Python param setters should use the _set method
## What changes were proposed in this pull request?

Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11939 from sethah/SPARK-14104.
2016-04-15 12:14:41 -07:00
Joseph K. Bradley d6ae7d4637 [SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwords
## What changes were proposed in this pull request?

The default stopwords were a Java object.  They are no longer.

## How was this patch tested?

Unit test which failed before the fix

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12422 from jkbradley/pyspark-stopwords.
2016-04-15 11:50:21 -07:00
Yanbo Liang b9613239d3 [SPARK-14374][ML][PYSPARK] PySpark ml GBTClassifier, Regressor support export/import
## What changes were proposed in this pull request?
PySpark ml GBTClassifier, Regressor support export/import.

## How was this patch tested?
Doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12383 from yanboliang/spark-14374.
2016-04-14 21:36:03 -07:00
Yong Tang bc748b7b8f [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib
## What changes were proposed in this pull request?

This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.

Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.

## How was this patch tested?

This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12079 from yongtang/SPARK-14238.
2016-04-14 21:53:32 +02:00
Bryan Cutler c5172f8205 [SPARK-13967][PYSPARK][ML] Added binary Param to Python CountVectorizer
Added binary toggle param to CountVectorizer feature transformer in PySpark.

Created a unit test for using CountVectorizer with the binary toggle on.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
2016-04-14 20:47:31 +02:00
Holden Karau 478af2f455 [SPARK-14573][PYSPARK][BUILD] Fix PyDoc Makefile & highlighting issues
## What changes were proposed in this pull request?

The PyDoc Makefile used "=" rather than "?=" for setting env variables so it overwrote the user values. This ignored the environment variables we set for linting allowing warnings through. This PR also fixes the warnings that had been introduced.

## How was this patch tested?

manual local export & make

Author: Holden Karau <holden@us.ibm.com>

Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
2016-04-14 09:42:15 +01:00
Bryan Cutler fc3cd2f509 [SPARK-14472][PYSPARK][ML] Cleanup ML JavaWrapper and related class hierarchy
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls.  This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls.  Also, renames Java wrapper classes to better reflect their purpose.

Ran existing Python ml tests and generated documentation to test this change.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
2016-04-13 14:08:57 -07:00
Kai Jiang 7f024c4744 [SPARK-13597][PYSPARK][ML] Python API for GeneralizedLinearRegression
## What changes were proposed in this pull request?

Python API for GeneralizedLinearRegression
JIRA: https://issues.apache.org/jira/browse/SPARK-13597

## How was this patch tested?

The patch is tested with Python doctest.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11468 from vectorijk/spark-13597.
2016-04-12 11:29:12 -07:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
wm624@hotmail.com e0ad75f2b5 [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API
## What changes were proposed in this pull request?

A new column VarianceCol has been added to DecisionTreeRegressor in ML scala code.

This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.

## How was this patch tested?
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (12s)
Finished test(python2.7): pyspark.ml.clustering (18s)
Finished test(python2.7): pyspark.ml.classification (30s)
Finished test(python2.7): pyspark.ml.recommendation (28s)
Finished test(python2.7): pyspark.ml.feature (43s)
Finished test(python2.7): pyspark.ml.regression (31s)
Finished test(python2.7): pyspark.ml.tuning (19s)
Finished test(python2.7): pyspark.ml.tests (34s)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12116 from wangmiao1981/fix_api.
2016-04-08 10:47:05 -07:00
Kai Jiang e5d8d6e09c [SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support export/import
## What changes were proposed in this pull request?
supporting `RandomForest{Classifier, Regressor}` save/load for Python API.
[JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
## How was this patch tested?
doctest

Author: Kai Jiang <jiangkai@gmail.com>

Closes #12238 from vectorijk/spark-14373.
2016-04-08 10:39:12 -07:00
Bryan Cutler 9c6556c5f8 [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression
## What changes were proposed in this pull request?

Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.

## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes.  Also, manually verified values are expected and match those from Scala directly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
2016-04-06 12:07:47 -07:00
Xusen Yin db0b06c6ea [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13786

Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.

## How was this patch tested?

Test with Python doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12020 from yinxusen/SPARK-13786.
2016-04-06 11:24:11 -07:00
Yanbo Liang 381358fbe9 [SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support export/import
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12112 from yanboliang/spark-14305.
2016-04-01 12:53:39 -07:00
Alexander Ulanov 26867ebc67 [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased flexibility of
internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>

Closes #9229 from avulanov/mlp-refactoring.
2016-03-31 23:48:36 -07:00
sethah b11887c086 [SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark
## What changes were proposed in this pull request?

Feature importances are exposed in the python API for GBTs.

Other changes:
* Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it.

## How was this patch tested?

Python doc tests were updated to validate GBT feature importance.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12056 from sethah/Pyspark_GBT_feature_importance.
2016-03-31 13:00:10 -07:00
Yanbo Liang f301df37cb [SPARK-14152][ML][PYSPARK] MultilayerPerceptronClassifier supports save/load for Python API
## What changes were proposed in this pull request?
```MultilayerPerceptronClassifier``` supports save/load for Python API.

## How was this patch tested?
doctest.

cc mengxr jkbradley yinxusen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11952 from yanboliang/spark-14152.
2016-03-30 15:47:01 -07:00
wm624@hotmail.com 63b200e8d4 [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property
Add property to MLWritable.write method, so we can use .write instead of .write()

Add a new test to ml/test.py to check whether the write is a property.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml

Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (11s)
Finished test(python2.7): pyspark.ml.clustering (16s)
Finished test(python2.7): pyspark.ml.classification (24s)
Finished test(python2.7): pyspark.ml.recommendation (24s)
Finished test(python2.7): pyspark.ml.feature (39s)
Finished test(python2.7): pyspark.ml.regression (26s)
Finished test(python2.7): pyspark.ml.tuning (15s)
Finished test(python2.7): pyspark.ml.tests (30s)
Tests passed in 55 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #11945 from wangmiao1981/fix_property.
2016-03-28 22:33:25 -07:00
GayathriMurali 0874ff3aad [SPARK-13949][ML][PYTHON] PySpark ml DecisionTreeClassifier, Regressor support export/import
## What changes were proposed in this pull request?

Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests.

## How was this patch tested?

Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor.

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11892 from GayathriMurali/SPARK-13949.
2016-03-24 19:20:49 -07:00
sethah 585097716c [SPARK-14107][PYSPARK][ML] Add seed as named argument to GBTs in pyspark
## What changes were proposed in this pull request?

GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value.

## How was this patch tested?

Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11944 from sethah/SPARK-14107.
2016-03-24 19:14:24 -07:00
Joseph K. Bradley cf823bead1 [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
2016-03-23 21:16:00 -07:00
sethah 30bdb5cbd9 [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params
## What changes were proposed in this pull request?

This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.

This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.

## How was this patch tested?

Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11663 from sethah/SPARK-13068-tc.
2016-03-23 11:20:44 -07:00
Joseph K. Bradley 7e3423b9c0 [SPARK-13951][ML][PYTHON] Nested Pipeline persistence
Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.

Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java

Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11866 from jkbradley/nested-pipeline-io.
Closes #11835
2016-03-22 12:11:37 -07:00
Xusen Yin 454a00df2a [SPARK-13993][PYSPARK] Add pyspark Rformula/RforumlaModel save/load
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13993

## How was this patch tested?

doctest

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11807 from yinxusen/SPARK-13993.
2016-03-20 15:34:34 -07:00
Bryan Cutler 828213d4ca [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.

## How was this patch tested?
Ran python tests for ML and MLlib.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
2016-03-17 10:16:51 -07:00
GayathriMurali 27e1f38851 [SPARK-13034] PySpark ml.classification support export/import
## What changes were proposed in this pull request?

Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.

## How was this patch tested?

./python/run-tests
./dev/lint-python
Unit tests added to check persistence in Logistic Regression

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11707 from GayathriMurali/SPARK-13034.
2016-03-16 14:21:42 -07:00
Xusen Yin ae6c677c8a [SPARK-13038][PYSPARK] Add load/save to pipeline
## What changes were proposed in this pull request?

JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038

1. Add load/save to PySpark Pipeline and PipelineModel

2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.

## How was this patch tested?

Test with doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11683 from yinxusen/SPARK-13038-only.
2016-03-16 13:49:40 -07:00
sethah 234f781ae1 [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and random forest
## What changes were proposed in this pull request?

This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.

## How was this patch tested?

Python doc tests for the affected classes were updated to check feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11622 from sethah/SPARK-13787.
2016-03-11 09:54:23 +02:00
Bryan Cutler d8813fa043 [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property when getting param list
## What changes were proposed in this pull request?

Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.

## How was this patch tested?

Added a test case with a class has a property that will raise an error when invoked and then call`Param.params` to verify that the property is not invoked, but still able to find another property in the class.  Also ran pyspark-ml test before fix that will trigger an error, and again after the fix to verify that the error was resolved and the method was working properly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
2016-03-08 17:34:25 -08:00
Xusen Yin 83302c3bff [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.

In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11203 from yinxusen/SPARK-13036.
2016-03-04 08:32:24 -08:00
Dongjoon Hyun c8f25459ed [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression
## What changes were proposed in this pull request?

The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.

**Scala**
```
scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```

**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```

## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11519 from dongjoon-hyun/SPARK-13676.
2016-03-04 08:25:41 -08:00
JeremyNixon 511d4929c8 [SPARK-12877][ML] Add train-validation-split to pyspark
## What changes were proposed in this pull request?
The changes proposed were to add train-validation-split to pyspark.ml.tuning.

## How was the this patch tested?
This patch was tested through unit tests located in pyspark/ml/test.py.

This is my original work and I license it to Spark.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11335 from JeremyNixon/tvs_pyspark.
2016-03-03 09:50:05 -08:00
Joseph K. Bradley 9495c40f22 [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists
This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.

This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.

CC: thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10927 from jkbradley/ml-python-all-list.
2016-03-01 21:26:47 -08:00
zlpmichelle 1e5fcdf96c [SPARK-13505][ML] add python api for MaxAbsScaler
## What changes were proposed in this pull request?
After SPARK-13028, we should add Python API for MaxAbsScaler.

## How was this patch tested?
unit test

Author: zlpmichelle <zlpmichelle@gmail.com>

Closes #11393 from zlpmichelle/master.
2016-02-26 14:37:44 -08:00
Tommy YU f3be369ef7 [SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression
Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/regression.py.

yanboliang Please help to review.
For doctest, I though it's enough to add one since it's common usage. But I can add to all if we want it.

Author: Tommy YU <tummyyu@163.com>

Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:

3646b36 [Tommy YU] address review comments
9cddc98 [Tommy YU] change base on review and pr 11197
cc61d9d [Tommy YU] remove default parameter set
19535d4 [Tommy YU] add export/import to regression
44a9dc2 [Tommy YU] add import/export for ml.regression
2016-02-25 21:09:02 -08:00
Yu ISHIKAWA 35316cb0b7 [SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark
## What changes were proposed in this pull request?
QuantileDiscretizer in Python should also specify a random seed.

## How was this patch tested?
unit tests

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:

02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
2016-02-25 13:29:10 -08:00
Yong Gang Cao ef1047fca7 [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
add support of arbitrary length sentence by using the nature representation of sentences in the input.

add new similarity functions and add normalization option for distances in synonym finding
add new accessor for internal structure(the vocabulary and wordindex) for convenience

need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>

Closes #10152 from ygcao/improvementForSentenceBoundary.
2016-02-22 09:47:36 +00:00
Holden Karau 9ca79c1ece [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests
Some of the new doctests in ml/clustering.py have a lot of setup code, move the setup code to the general test init to keep the doctest more example-style looking.
In part this is a follow up to https://github.com/apache/spark/pull/10999
Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.

Author: Holden Karau <holden@us.ibm.com>

Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
2016-02-20 09:07:19 +00:00
Yanbo Liang a183dda6ab [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means
Add Python API for spark.ml bisecting k-means.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10889 from yanboliang/spark-12974.
2016-02-12 01:45:45 -08:00
Tommy YU d3e2e20299 [SPARK-13153][PYSPARK] ML persistence failed when handle no default value parameter
Fix this defect by check default value exist or not.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
2016-02-11 18:38:49 -08:00
sethah b354673886 [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.
2016-02-11 16:42:44 -08:00
Yanbo Liang 30e0095566 [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import
PySpark ml.clustering support export/import.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10999 from yanboliang/spark-13035.
2016-02-11 15:55:40 -08:00
Yanbo Liang 2426eb3e16 [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py
Test cases should be removed from annotation of ```setXXX``` function, otherwise it will be parts of [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10975 from yanboliang/clustering-cleanup.
2016-02-11 15:53:45 -08:00
Kai Jiang c8f667d7c1 [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import
PySpark ml.recommendation support export/import.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11044 from vectorijk/spark-13037.
2016-02-11 15:50:33 -08:00
Yanbo Liang e51b6eaa9e [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10469 from yanboliang/spark-11939.
2016-01-29 09:22:24 -08:00
Xusen Yin 4db255c7aa [SPARK-12780] Inconsistency returning value of ML python models' properties
https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10724 from yinxusen/SPARK-12780.
2016-01-26 21:16:56 -08:00
Holden Karau eb917291ca [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).

Author: Holden Karau <holden@us.ibm.com>

Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
2016-01-26 15:53:48 -08:00
Xusen Yin 8beab68152 [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11923

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10186 from yinxusen/SPARK-11923.
2016-01-26 11:56:46 -08:00
Holden Karau b66afdeb52 [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
Add Python API for ml.feature.QuantileDiscretizer.

One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model.
cc brkyvz & mengxr

Author: Holden Karau <holden@us.ibm.com>

Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
2016-01-25 22:38:31 -08:00
Yanbo Liang dcae355c64 [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark
```PCAModel```  can output ```explainedVariance``` at Python side.

cc mengxr srowen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10830 from yanboliang/spark-12905.
2016-01-25 13:54:21 -08:00
Gábor Lipták 9bb35c5b59 [SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test.

Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10850 from mengxr/SPARK-11295.
2016-01-20 11:11:10 -08:00
Xiangrui Meng beda901422 Revert "[SPARK-11295] Add packages to JUnit output for Python tests"
This reverts commit c6f971b4ae.
2016-01-19 16:51:17 -08:00
BenFradet f6f7ca9d2e [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10472 from BenFradet/SPARK-9716.
2016-01-19 14:59:20 -08:00
Gábor Lipták c6f971b4ae [SPARK-11295] Add packages to JUnit output for Python tests
SPARK-11295 Add packages to JUnit output for Python tests

This improves grouping/display of test case results.

Author: Gábor Lipták <gliptak@gmail.com>

Closes #9263 from gliptak/SPARK-11295.
2016-01-19 14:06:53 -08:00
Yanbo Liang 5f843781e3 [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA
Add PySpark missing methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support output ```pc```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9908 from yanboliang/spark-11925.
2016-01-15 15:54:19 -08:00
Yanbo Liang 3aa3488225 [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9807 from yanboliang/spark-11815.
2016-01-06 10:52:25 -08:00
Yanbo Liang 95eb651633 [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml
Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9931 from yanboliang/SPARK-11945.
2016-01-06 10:50:02 -08:00
Holden Karau 3b29004d24 [SPARK-7675][ML][PYSPARK] sparkml params type conversion
From JIRA:
Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.

A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.

This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float and other conversions (like scipymatrix to array) are left for the future.

Author: Holden Karau <holden@us.ibm.com>

Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
2016-01-06 10:43:03 -08:00
Jeff Zhang 1920d72a1f [PYSPARK] Pyspark typo & Add missing abstractmethod annotation
No jira is created since this is a trivial change.

davies  Please help review it

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10143 from zjffdu/pyspark_typo.
2015-12-21 08:53:46 -08:00
Martin Menestret 3a44aebd0c [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed
Extend CrossValidator with HasSeed in PySpark.

This PR replaces [https://github.com/apache/spark/pull/7997]

CC: yanboliang thunterdb mmenestret  Would one of you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>

Closes #10268 from jkbradley/pyspark-cv-seed.
2015-12-16 14:05:35 -08:00
Yanbo Liang d576e76bba [MINOR][ML] Use coefficients replace weights
Use ```coefficients``` replace ```weights```, I wish they are the last two.
mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10065 from yanboliang/coefficients.
2015-12-03 11:37:34 -08:00
Yanbo Liang 7216f40545 [SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval
* Update doc for PySpark ```HasCheckpointInterval``` that users can understand how to disable checkpoint.
* Update doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to notify the relationship between ```cacheNodeIds``` and ```checkpointInterval```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9856 from yanboliang/spark-11875.
2015-11-19 22:14:01 -08:00
Yanbo Liang 603a721c21 [SPARK-11820][ML][PYSPARK] PySpark LiR & LoR should support weightCol
[SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) have already supported setting weight column for ```LogisticRegression``` and ```LinearRegression```. It's a very important feature, PySpark should also support. mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9811 from yanboliang/spark-11820.
2015-11-18 13:32:06 -08:00
Yu ISHIKAWA 88a3fdcc78 [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8690 from yu-iskw/SPARK-10280.
2015-11-09 13:16:04 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Yanbo Liang 9da7ceed81 [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver
Follow up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9485 from yanboliang/spark-11473.
2015-11-05 09:56:18 -08:00
Yanbo Liang 2e86cf1b01 [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale
PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9492 from yanboliang/spark-11527.
2015-11-05 09:00:03 -08:00
vectorijk c020f7d9d4 [SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models
Deprecated in `LogisticRegression` and `LinearRegression`

Author: vectorijk <jiangkai@gmail.com>

Closes #9311 from vectorijk/spark-10592.
2015-11-02 16:12:04 -08:00
lihao ecfb3e73fd [SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.*
Author: lihao <lihaowhu@gmail.com>

Closes #9275 from lidinghao/SPARK-10286.
2015-11-02 16:09:22 -08:00
Yanbo Liang f92b7b98e9 [SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided ```WeightedLeastSquares``` solver("normal") in ```LinearRegression``` with L2 regularization in Scala and R, Python ML ```LinearRegression``` should also support setting solver("auto", "normal", "l-bfgs")

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9328 from yanboliang/spark-11367.
2015-10-28 08:54:20 -07:00
vectorijk 9dba5fb2b5 [SPARK-10024][PYSPARK] Python API RF and GBT related params clear up
implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API
in pyspark/ml/{classification, regression}.py

Author: vectorijk <jiangkai@gmail.com>

Closes #9233 from vectorijk/spark-10024.
2015-10-27 13:55:03 -07:00
Gábor Lipták 163d53e829 [SPARK-7021] Add JUnit output for Python unit tests
WIP

Author: Gábor Lipták <gliptak@gmail.com>

Closes #8323 from gliptak/SPARK-7021.
2015-10-22 15:27:11 -07:00