Commit graph

39 commits

Author SHA1 Message Date
zero323 e1b3e9a3d2 [SPARK-29212][ML][PYSPARK] Add common classes without using JVM backend
### What changes were proposed in this pull request?

Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClasssificationModel` `ProbabilisticClassifier`, `ProbabilisticClasssificationModel`, `Regressor`, `RegrssionModel`) for non-Java backends.

Note

- `Predictor` and `JavaClassifier` should be abstract as `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.

### Why are the changes needed?

To provide extensions points for non-JVM algorithms, as well as a public (as opposed to `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.

For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776.

### Does this PR introduce any user-facing change?

It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.

Additionally "private" `Java*` classes in`ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).

It is for discussion if the same should be done to equivalent classes from `ml.wrapper`.

If we take `JavaClassifier` as an example, type hierarchy will change from

![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)

to

![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)

Similarly the old model

![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)

will become

![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)

### How was this patch tested?

Existing unit tests.

Closes #27245 from zero323/SPARK-29212.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2020-03-04 12:20:02 +08:00
Huaxin Gao 1112fc6029 [SPARK-29867][ML][PYTHON] Add __repr__ in Python ML Models
### What changes were proposed in this pull request?
Add ```__repr__``` in Python ML Models

### Why are the changes needed?
In Python ML Models, some of them have ```__repr__```, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the  ```__repr__``` method. For example:
```
    >>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
    >>> model = gm.fit(df)
    >>> model.setPredictionCol("newPrediction")
    GaussianMixture...
```
After the change, the above code will become the following:
```
    >>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
    >>> model = gm.fit(df)
    >>> model.setPredictionCol("newPrediction")
    GaussianMixtureModel...
```

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
doctest

Closes #26489 from huaxingao/spark-29876.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-11-15 21:44:39 -08:00
Huaxin Gao 901ff92969 [SPARK-29464][PYTHON][ML] PySpark ML should expose Params.clear() to unset a user supplied Param
### What changes were proposed in this pull request?
change PySpark ml ```Params._clear``` to ```Params.clear```

### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param.

### Does this PR introduce any user-facing change?
Yes. PySpark ml ```Params._clear``` ---> ```Params.clear```

### How was this patch tested?
Add test.

Closes #26130 from huaxingao/spark-29464.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2019-10-17 17:02:31 -07:00
Huaxin Gao cfcaf528cd [SPARK-29381][PYTHON][ML] Add _ before the XXXParams classes
### What changes were proposed in this pull request?
Add _ before XXXParams classes to indicate internal usage

### Why are the changes needed?
Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
use existing tests

Closes #26103 from huaxingao/spark-29381.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-10-14 10:52:23 -05:00
Huaxin Gao e97b55d322 [SPARK-28985][PYTHON][ML] Add common classes (JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier) in PYTHON
### What changes were proposed in this pull request?

Add some common classes in Python to make it have the same structure as Scala

1. Scala has ClassifierParams/Classifier/ClassificationModel:

```
trait ClassifierParams
    extends PredictorParams with HasRawPredictionCol

abstract class Classifier
    extends Predictor with ClassifierParams {
    def setRawPredictionCol
}

abstract class ClassificationModel
  extends PredictionModel with ClassifierParams {
    def setRawPredictionCol
}

```
This PR makes Python has the following:

```
class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
    pass

class JavaClassifier(JavaPredictor, JavaClassifierParams):
    def setRawPredictionCol

class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams):
    def setRawPredictionCol
```
2. Scala has ProbabilisticClassifierParams/ProbabilisticClassifier/ProbabilisticClassificationModel:
```
trait ProbabilisticClassifierParams
    extends ClassifierParams with HasProbabilityCol with HasThresholds

abstract class ProbabilisticClassifier
    extends Classifier with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}

abstract class ProbabilisticClassificationModel
    extends ClassificationModel with ProbabilisticClassifierParams {
    def setProbabilityCol
    def setThresholds
}
```
This PR makes Python have the following:
```
class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams):
    pass

class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds

class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams):
    def setProbabilityCol
    def setThresholds
```
3. Scala has PredictorParams/Predictor/PredictionModel:
```
trait PredictorParams extends Params
    with HasLabelCol with HasFeaturesCol with HasPredictionCol

abstract class Predictor
    extends Estimator with PredictorParams {
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol
  }

abstract class PredictionModel
    extends Model with PredictorParams {
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
}
```
This PR makes Python have the following:
```
class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
    pass

class JavaPredictor(JavaEstimator, JavaPredictorParams):
    def setLabelCol
    def setFeaturesCol
    def setPredictionCol

class JavaPredictionModel(JavaModel, JavaPredictorParams):
    def setFeaturesCol
    def setPredictionCol
    def numFeatures
    def predict
```

### Why are the changes needed?
Have parity between Python and Scala ML

### Does this PR introduce any user-facing change?
Yes. Add the following changes:

```
LinearSVCModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict
```

```
LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict
```
```
LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel

- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- predict
```

### How was this patch tested?
Add a few doc tests.

Closes #25776 from huaxingao/spark-28985.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-19 08:17:25 -05:00
Huaxin Gao 74e9e1c192 [SPARK-22798][PYTHON][ML] Add multiple column support to PySpark StringIndexer
## What changes were proposed in this pull request?

Add multiple column support to PySpark StringIndexer

## How was this patch tested?

Add doctest

Closes #23741 from huaxingao/spark-22798.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-20 08:52:46 -06:00
Bryan Cutler cb43bbe136 [SPARK-21685][PYTHON][ML] PySpark Params isSet state should not change after transform
## What changes were proposed in this pull request?

Currently when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`.  This incorrectly changes the state of the Param as it should still be marked as a default value only.

## How was this patch tested?

Added a new test to verify that when transferring Params to Java, default params have their state preserved.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #18982 from BryanCutler/pyspark-ml-param-to-java-defaults-SPARK-21685.
2018-03-23 11:42:40 -07:00
Yogesh Garg 7706eea6a8 [SPARK-18630][PYTHON][ML] Move del method from JavaParams to JavaWrapper; add tests
The `__del__` method that explicitly detaches the object was moved from `JavaParams` to `JavaWrapper` class, this way model summaries could also be garbage collected in Java. A test case was added to make sure that relevant error messages are thrown after the objects are deleted.

I ran pyspark tests  agains `pyspark-ml` module
`./python/run-tests --python-executables=$(which python) --modules=pyspark-ml`

Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>

Closes #20724 from yogeshg/java_wrapper_memory.
2018-03-05 15:53:10 -08:00
Bryan Cutler 41bb1ddc63 [SPARK-10931][ML][PYSPARK] PySpark Models Copy Param Values from Estimator
## What changes were proposed in this pull request?

Added call to copy values of Params from Estimator to Model after fit in PySpark ML.  This will copy values for any params that are also defined in the Model.  Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object.  This is a temporary fix that can be removed once the PySpark models properly define the params themselves.

## How was this patch tested?

Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.
2017-08-22 17:40:50 -07:00
Ajay Saini 7047f49f45 [SPARK-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidation split persistence to pyspark.

## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
2017-07-17 10:07:32 -07:00
Bryan Cutler 57d70d26c8 [SPARK-17161][PYSPARK][ML] Add PySpark-ML JavaWrapper convenience function to create Py4J JavaArrays
## What changes were proposed in this pull request?

Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor.  The function takes a Java class as input that is used by Py4J to create the Java array of the given class.  As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.

## How was this patch tested?

Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
2017-01-31 15:42:36 -08:00
Sandeep Singh 78bb7f8071 [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`

## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer

l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.

Author: Sandeep Singh <sandeep@techaddict.me>
Author: jkbradley <joseph.kurata.bradley@gmail.com>

Closes #15843 from techaddict/SPARK-18274.
2016-12-01 13:22:40 -08:00
Liang-Chi Hsieh baa3e633e1 [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
## What changes were proposed in this pull request?

Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13219 from viirya/pyspark-pickler-ml.
2016-06-13 19:59:53 -07:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Joseph K. Bradley 36da5e3234 [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable
## What changes were proposed in this pull request?

Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.

This PR: Use unicode everywhere in Python.

## How was this patch tested?

Updated persistence unit test to check uid type

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12368 from jkbradley/python-uid-unicode.
2016-04-16 11:23:28 -07:00
sethah 129f2f455d [SPARK-14104][PYSPARK][ML] All Python param setters should use the _set method
## What changes were proposed in this pull request?

Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11939 from sethah/SPARK-14104.
2016-04-15 12:14:41 -07:00
Bryan Cutler fc3cd2f509 [SPARK-14472][PYSPARK][ML] Cleanup ML JavaWrapper and related class hierarchy
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls.  This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls.  Also, renames Java wrapper classes to better reflect their purpose.

Ran existing Python ml tests and generated documentation to test this change.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
2016-04-13 14:08:57 -07:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
Bryan Cutler 9c6556c5f8 [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression
## What changes were proposed in this pull request?

Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.

## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes.  Also, manually verified values are expected and match those from Scala directly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
2016-04-06 12:07:47 -07:00
Xusen Yin db0b06c6ea [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13786

Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.

## How was this patch tested?

Test with Python doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12020 from yinxusen/SPARK-13786.
2016-04-06 11:24:11 -07:00
Joseph K. Bradley 7e3423b9c0 [SPARK-13951][ML][PYTHON] Nested Pipeline persistence
Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.

Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java

Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11866 from jkbradley/nested-pipeline-io.
Closes #11835
2016-03-22 12:11:37 -07:00
Bryan Cutler 828213d4ca [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.

## How was this patch tested?
Ran python tests for ML and MLlib.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
2016-03-17 10:16:51 -07:00
Xusen Yin ae6c677c8a [SPARK-13038][PYSPARK] Add load/save to pipeline
## What changes were proposed in this pull request?

JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038

1. Add load/save to PySpark Pipeline and PipelineModel

2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.

## How was this patch tested?

Test with doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11683 from yinxusen/SPARK-13038-only.
2016-03-16 13:49:40 -07:00
Tommy YU d3e2e20299 [SPARK-13153][PYSPARK] ML persistence failed when handle no default value parameter
Fix this defect by check default value exist or not.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
2016-02-11 18:38:49 -08:00
Yanbo Liang e51b6eaa9e [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10469 from yanboliang/spark-11939.
2016-01-29 09:22:24 -08:00
Jeff Zhang 1920d72a1f [PYSPARK] Pyspark typo & Add missing abstractmethod annotation
No jira is created since this is a trivial change.

davies  Please help review it

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10143 from zjffdu/pyspark_typo.
2015-12-21 08:53:46 -08:00
noelsmith 7c4f852bfc [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e:

![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)

.. looks like this once newline is added:

![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)

Author: noelsmith <mail@noelsmith.com>

Closes #8851 from noel-smith/docstring-missing-newline-fix.
2015-09-21 14:24:19 -07:00
Holden Karau 2f6fd5256c [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark
Adds IndexToString to PySpark.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
2015-09-08 22:13:05 -07:00
MechCoder 5fa0863626 [SPARK-8679] [PYSPARK] [MLLIB] Default values in Pipeline API should be immutable
It might be dangerous to have a mutable as value for default param. (http://stackoverflow.com/a/11416002/1170730)

e.g

    def func(example, f={}):
        f[example] = 1
        return f

    func(2)

    {2: 1}
    func(3)
    {2:1, 3:1}

mengxr

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7058 from MechCoder/pipeline_api_playground and squashes the following commits:

40a5eb2 [MechCoder] copy
95f7ff2 [MechCoder] [SPARK-8679] [PySpark] [MLlib] Default values in Pipeline API should be immutable
2015-06-30 10:27:29 -07:00
Xiangrui Meng 8f11c6116b [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4
Some changes to the pipeilne APIs:

1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does.
1. Move Evaluator to ml.evaluation.
1. Mention larger metric values are better.
1. PipelineModel doc. “compiled” -> “fitted”
1. Hide object PolynomialExpansion.
1. Hide object VectorAssembler.
1. Word2Vec.minCount (and other) -> group param
1. ParamValidators -> DeveloperApi
1. Hide MetadataUtils/SchemaUtils.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:

9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
08ef61f [Xiangrui Meng] update pipieline APIs
2015-05-21 22:57:33 -07:00
Xiangrui Meng 9c7e802a5a [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
2015-05-18 12:02:18 -07:00
Xiangrui Meng 723853edab [SPARK-7648] [MLLIB] Add weights and intercept to GLM wrappers in spark.ml
Otherwise, users can only use `transform` on the models. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #6156 from mengxr/SPARK-7647 and squashes the following commits:

1ae3d2d [Xiangrui Meng] add weights and intercept to LogisticRegression in Python
f49eb46 [Xiangrui Meng] add weights and intercept to LinearRegressionModel
2015-05-14 18:13:58 -07:00
Burak Yavuz 9e2ffb1328 [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in Python
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5930 from brkyvz/ml-feat and squashes the following commits:

73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
2015-05-07 10:25:41 -07:00
Xiangrui Meng 32cdc815c6 [SPARK-6940] [MLLIB] Add CrossValidator to Python ML pipeline API
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5926 from mengxr/SPARK-6940 and squashes the following commits:

6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
2015-05-06 01:28:43 -07:00
Xiangrui Meng ee374e89cd [SPARK-7333] [MLLIB] Add BinaryClassificationEvaluator to PySpark
This PR adds `BinaryClassificationEvaluator` to Python ML Pipelines API, which is a simple wrapper of the Scala implementation. oefirouz

Author: Xiangrui Meng <meng@databricks.com>

Closes #5885 from mengxr/SPARK-7333 and squashes the following commits:

25d7451 [Xiangrui Meng] fix tests in python 3
babdde7 [Xiangrui Meng] fix doc
cb51e6a [Xiangrui Meng] add BinaryClassificationEvaluator in PySpark
2015-05-05 11:45:37 -07:00
Xiangrui Meng 57cd1e86d1 [SPARK-6893][ML] default pipeline parameter handling in python
Same as #5431 but for Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #5534 from mengxr/SPARK-6893 and squashes the following commits:

d3b519b [Xiangrui Meng] address comments
ebaccc6 [Xiangrui Meng] style update
fce244e [Xiangrui Meng] update explainParams with test
4d6b07a [Xiangrui Meng] add tests
5294500 [Xiangrui Meng] update default param handling in python
2015-04-15 23:49:42 -07:00
Reynold Xin 70f88148bb [Docs] Replace references to SchemaRDD with DataFrame
Author: Reynold Xin <rxin@databricks.com>

Closes #4952 from rxin/schemardd-df-reference and squashes the following commits:

b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
2015-03-09 13:29:19 -07:00
Joseph K. Bradley 4a17eedb16 [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
2015-02-20 02:31:32 -08:00
Xiangrui Meng e80dc1c5a8 [SPARK-4586][MLLIB] Python API for ML pipeline and parameters
This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.

TODO:
- [x] handle parameters in LRModel
- [x] unit tests
- [x] missing some docs

CC: davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:

415268e [Xiangrui Meng] remove inherit_doc from __init__
edbd6fe [Xiangrui Meng] move Identifiable to ml.util
44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
14ae7e2 [Davies Liu] fix docs
54ca7df [Davies Liu] fix tests
78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
1dca16a [Davies Liu] refactor
090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
0882513 [Xiangrui Meng] update doc style
a4f4dbf [Xiangrui Meng] add unit test for LR
7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
ba0ba1e [Xiangrui Meng] add unit tests for pipeline
0586c7b [Xiangrui Meng] add more comments to the example
5153cff [Xiangrui Meng] simplify java models
036ca04 [Xiangrui Meng] gen numFeatures
46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
f66ba0c [Xiangrui Meng] make params a property
d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
05e3e40 [Xiangrui Meng] update example
d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
56de571 [Xiangrui Meng] fix style
d0c5bb8 [Xiangrui Meng] a working copy
bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
17ecfb9 [Xiangrui Meng] code gen for shared params
d9ea77c [Xiangrui Meng] update doc
c18dca1 [Xiangrui Meng] make the example working
dadd84e [Xiangrui Meng] add base classes and docs
a3015cf [Xiangrui Meng] add Estimator and Transformer
46eea43 [Xiangrui Meng] a pipeline in python
33b68e0 [Xiangrui Meng] a working LR
2015-01-28 17:14:23 -08:00