### What changes were proposed in this pull request?
Replace `__metaclass__` fields with `metaclass` keyword in the class statements.
### Why are the changes needed?
`__metaclass__` is no longer supported in Python 3. This means, for example, that types are no longer handled as singletons.
```
>>> from pyspark.sql.types import BooleanType
>>> BooleanType() is BooleanType()
False
```
and classes, which suppose to be abstract, are not
```
>>> import inspect
>>> from pyspark.ml import Estimator
>>> inspect.isabstract(Estimator)
False
```
### Does this PR introduce _any_ user-facing change?
Yes (classes which were no longer abstract or singleton in Python 3, are now), though visible changes should be consider a bug-fix.
### How was this patch tested?
Existing tests.
Closes#29664 from zero323/SPARK-32138-FOLLOW-UP-METACLASS.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to drop Python 2.7, 3.4 and 3.5.
Roughly speaking, it removes all the widely known Python 2 compatibility workarounds such as `sys.version` comparison, `__future__`. Also, it removes the Python 2 dedicated codes such as `ArrayConstructor` in Spark.
### Why are the changes needed?
1. Unsupport EOL Python versions
2. Reduce maintenance overhead and remove a bit of legacy codes and hacks for Python 2.
3. PyPy2 has a critical bug that causes a flaky test, SPARK-28358 given my testing and investigation.
4. Users can use Python type hints with Pandas UDFs without thinking about Python version
5. Users can leverage one latest cloudpickle, https://github.com/apache/spark/pull/28950. With Python 3.8+ it can also leverage C pickle.
### Does this PR introduce _any_ user-facing change?
Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.
### How was this patch tested?
Manually tested and also tested in Jenkins.
Closes#28957 from HyukjinKwon/SPARK-32138.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Implement common base ML classes (`Predictor`, `PredictionModel`, `Classifier`, `ClasssificationModel` `ProbabilisticClassifier`, `ProbabilisticClasssificationModel`, `Regressor`, `RegrssionModel`) for non-Java backends.
Note
- `Predictor` and `JavaClassifier` should be abstract as `_fit` method is not implemented.
- `PredictionModel` should be abstract as `_transform` is not implemented.
### Why are the changes needed?
To provide extensions points for non-JVM algorithms, as well as a public (as opposed to `Java*` variants, which are commonly described in docstrings as private) hierarchy which can be used to distinguish between different classes of predictors.
For longer discussion see [SPARK-29212](https://issues.apache.org/jira/browse/SPARK-29212) and / or https://github.com/apache/spark/pull/25776.
### Does this PR introduce any user-facing change?
It adds new base classes as listed above, but effective interfaces (method resolution order notwithstanding) stay the same.
Additionally "private" `Java*` classes in`ml.regression` and `ml.classification` have been renamed to follow PEP-8 conventions (added leading underscore).
It is for discussion if the same should be done to equivalent classes from `ml.wrapper`.
If we take `JavaClassifier` as an example, type hierarchy will change from
![old pyspark ml classification JavaClassifier](https://user-images.githubusercontent.com/1554276/72657093-5c0b0c80-39a0-11ea-9069-a897d75de483.png)
to
![new pyspark ml classification _JavaClassifier](https://user-images.githubusercontent.com/1554276/72657098-64fbde00-39a0-11ea-8f80-01187a5ea5a6.png)
Similarly the old model
![old pyspark ml classification JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657103-7513bd80-39a0-11ea-9ffc-59eb6ab61fde.png)
will become
![new pyspark ml classification _JavaClassificationModel](https://user-images.githubusercontent.com/1554276/72657110-80ff7f80-39a0-11ea-9f5c-fe408664e827.png)
### How was this patch tested?
Existing unit tests.
Closes#27245 from zero323/SPARK-29212.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Add ```__repr__``` in Python ML Models
### Why are the changes needed?
In Python ML Models, some of them have ```__repr__```, others don't. In the doctest, when calling Model.setXXX, some of the Models print out the xxxModel... correctly, some of them can't because of lacking the ```__repr__``` method. For example:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixture...
```
After the change, the above code will become the following:
```
>>> gm = GaussianMixture(k=3, tol=0.0001, seed=10)
>>> model = gm.fit(df)
>>> model.setPredictionCol("newPrediction")
GaussianMixtureModel...
```
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
doctest
Closes#26489 from huaxingao/spark-29876.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
change PySpark ml ```Params._clear``` to ```Params.clear```
### Why are the changes needed?
PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param.
### Does this PR introduce any user-facing change?
Yes. PySpark ml ```Params._clear``` ---> ```Params.clear```
### How was this patch tested?
Add test.
Closes#26130 from huaxingao/spark-29464.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Add _ before XXXParams classes to indicate internal usage
### Why are the changes needed?
Follow the PEP 8 convention to use _single_leading_underscore to indicate internal use
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
use existing tests
Closes#26103 from huaxingao/spark-29381.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Add some common classes in Python to make it have the same structure as Scala
1. Scala has ClassifierParams/Classifier/ClassificationModel:
```
trait ClassifierParams
extends PredictorParams with HasRawPredictionCol
abstract class Classifier
extends Predictor with ClassifierParams {
def setRawPredictionCol
}
abstract class ClassificationModel
extends PredictionModel with ClassifierParams {
def setRawPredictionCol
}
```
This PR makes Python has the following:
```
class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
pass
class JavaClassifier(JavaPredictor, JavaClassifierParams):
def setRawPredictionCol
class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams):
def setRawPredictionCol
```
2. Scala has ProbabilisticClassifierParams/ProbabilisticClassifier/ProbabilisticClassificationModel:
```
trait ProbabilisticClassifierParams
extends ClassifierParams with HasProbabilityCol with HasThresholds
abstract class ProbabilisticClassifier
extends Classifier with ProbabilisticClassifierParams {
def setProbabilityCol
def setThresholds
}
abstract class ProbabilisticClassificationModel
extends ClassificationModel with ProbabilisticClassifierParams {
def setProbabilityCol
def setThresholds
}
```
This PR makes Python have the following:
```
class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams):
pass
class JavaProbabilisticClassifier(JavaClassifier, JavaProbabilisticClassifierParams):
def setProbabilityCol
def setThresholds
class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams):
def setProbabilityCol
def setThresholds
```
3. Scala has PredictorParams/Predictor/PredictionModel:
```
trait PredictorParams extends Params
with HasLabelCol with HasFeaturesCol with HasPredictionCol
abstract class Predictor
extends Estimator with PredictorParams {
def setLabelCol
def setFeaturesCol
def setPredictionCol
}
abstract class PredictionModel
extends Model with PredictorParams {
def setFeaturesCol
def setPredictionCol
def numFeatures
def predict
}
```
This PR makes Python have the following:
```
class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol):
pass
class JavaPredictor(JavaEstimator, JavaPredictorParams):
def setLabelCol
def setFeaturesCol
def setPredictionCol
class JavaPredictionModel(JavaModel, JavaPredictorParams):
def setFeaturesCol
def setPredictionCol
def numFeatures
def predict
```
### Why are the changes needed?
Have parity between Python and Scala ML
### Does this PR introduce any user-facing change?
Yes. Add the following changes:
```
LinearSVCModel
- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- predict
```
```
LogisticRegressionModel
DecisionTreeClassificationModel
RandomForestClassificationModel
GBTClassificationModel
NaiveBayesModel
MultilayerPerceptronClassificationModel
- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- get/setRawPredictionCol
- get/setProbabilityCol
- predict
```
```
LinearRegressionModel
IsotonicRegressionModel
DecisionTreeRegressionModel
RandomForestRegressionModel
GBTRegressionModel
AFTSurvivalRegressionModel
GeneralizedLinearRegressionModel
- get/setFeatureCol
- get/setPredictionCol
- get/setLabelCol
- predict
```
### How was this patch tested?
Add a few doc tests.
Closes#25776 from huaxingao/spark-28985.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Add multiple column support to PySpark StringIndexer
## How was this patch tested?
Add doctest
Closes#23741 from huaxingao/spark-22798.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Currently when a PySpark Model is transformed, default params that have not been explicitly set are then set on the Java side on the call to `wrapper._transfer_values_to_java`. This incorrectly changes the state of the Param as it should still be marked as a default value only.
## How was this patch tested?
Added a new test to verify that when transferring Params to Java, default params have their state preserved.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#18982 from BryanCutler/pyspark-ml-param-to-java-defaults-SPARK-21685.
The `__del__` method that explicitly detaches the object was moved from `JavaParams` to `JavaWrapper` class, this way model summaries could also be garbage collected in Java. A test case was added to make sure that relevant error messages are thrown after the objects are deleted.
I ran pyspark tests agains `pyspark-ml` module
`./python/run-tests --python-executables=$(which python) --modules=pyspark-ml`
Author: Yogesh Garg <yogesh(dot)garg()databricks(dot)com>
Closes#20724 from yogeshg/java_wrapper_memory.
## What changes were proposed in this pull request?
Added call to copy values of Params from Estimator to Model after fit in PySpark ML. This will copy values for any params that are also defined in the Model. Since currently most Models do not define the same params from the Estimator, also added method to create new Params from looking at the Java object if they do not exist in the Python object. This is a temporary fix that can be removed once the PySpark models properly define the params themselves.
## How was this patch tested?
Refactored the `check_params` test to optionally check if the model params for Python and Java match and added this check to an existing fitted model that shares params between Estimator and Model.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#17849 from BryanCutler/pyspark-models-own-params-SPARK-10931.
## What changes were proposed in this pull request?
Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidation split persistence to pyspark.
## How was this patch tested?
Performed both cross validation and train validation split with a one vs. rest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.
Author: Ajay Saini <ajays725@gmail.com>
Closes#18428 from ajaysaini725/MetaAlgorithmPersistNestedEstimators.
## What changes were proposed in this pull request?
Adding convenience function to Python `JavaWrapper` so that it is easy to create a Py4J JavaArray that is compatible with current class constructors that have a Scala `Array` as input so that it is not necessary to have a Java/Python friendly constructor. The function takes a Java class as input that is used by Py4J to create the Java array of the given class. As an example, `OneVsRest` has been updated to use this and the alternate constructor is removed.
## How was this patch tested?
Added unit tests for the new convenience function and updated `OneVsRest` doctests which use this to persist the model.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#14725 from BryanCutler/pyspark-new_java_array-CountVectorizer-SPARK-17161.
## What changes were proposed in this pull request?
In`JavaWrapper `'s destructor make Java Gateway dereference object in destructor, using `SparkContext._active_spark_context._gateway.detach`
Fixing the copying parameter bug, by moving the `copy` method from `JavaModel` to `JavaParams`
## How was this patch tested?
```scala
import random, string
from pyspark.ml.feature import StringIndexer
l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))] # 700000 random strings of 10 characters
df = spark.createDataFrame(l, ['string'])
for i in range(50):
indexer = StringIndexer(inputCol='string', outputCol='index')
indexer.fit(df)
```
* Before: would keep StringIndexer strong reference, causing GC issues and is halted midway
After: garbage collection works as the object is dereferenced, and computation completes
* Mem footprint tested using profiler
* Added a parameter copy related test which was failing before.
Author: Sandeep Singh <sandeep@techaddict.me>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Closes#15843 from techaddict/SPARK-18274.
## What changes were proposed in this pull request?
Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.
## How was this patch tested?
Existing tests.
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes#13219 from viirya/pyspark-pickler-ml.
## What changes were proposed in this pull request?
This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
* This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
Defaults changed:
* Scala
* LogisticRegression: weightCol defaults to not set (instead of empty string)
* StringIndexer: labels default to not set (instead of empty array)
* GeneralizedLinearRegression:
* maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
* weightCol defaults to not set (instead of empty string)
* LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
* MultilayerPerceptron: layers default to not set (instead of [1,1])
* ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
## How was this patch tested?
Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>
Closes#12816 from jkbradley/yinxusen-SPARK-14931.
## What changes were proposed in this pull request?
Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.
This PR: Use unicode everywhere in Python.
## How was this patch tested?
Updated persistence unit test to check uid type
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12368 from jkbradley/python-uid-unicode.
## What changes were proposed in this pull request?
Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.
## How was this patch tested?
Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
Author: sethah <seth.hendrickson16@gmail.com>
Closes#11939 from sethah/SPARK-14104.
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls. Also, renames Java wrapper classes to better reflect their purpose.
Ran existing Python ml tests and generated documentation to test this change.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
## What changes were proposed in this pull request?
Cleanups to documentation. No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last. Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only
## How was this patch tested?
Doc build
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#12266 from jkbradley/ml-doc-cleanups.
## What changes were proposed in this pull request?
Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.
## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13786
Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
## How was this patch tested?
Test with Python doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#12020 from yinxusen/SPARK-13786.
Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#11866 from jkbradley/nested-pipeline-io.
Closes#11835
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.
## How was this patch tested?
Ran python tests for ML and MLlib.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes#11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
## What changes were proposed in this pull request?
JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
1. Add load/save to PySpark Pipeline and PipelineModel
2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
## How was this patch tested?
Test with doctest.
Author: Xusen Yin <yinxusen@gmail.com>
Closes#11683 from yinxusen/SPARK-13038-only.
Fix this defect by check default value exist or not.
yanboliang Please help to review.
Author: Tommy YU <tummyyu@163.com>
Closes#11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#10469 from yanboliang/spark-11939.
No jira is created since this is a trivial change.
davies Please help review it
Author: Jeff Zhang <zjffdu@apache.org>
Closes#10143 from zjffdu/pyspark_typo.
It might be dangerous to have a mutable as value for default param. (http://stackoverflow.com/a/11416002/1170730)
e.g
def func(example, f={}):
f[example] = 1
return f
func(2)
{2: 1}
func(3)
{2:1, 3:1}
mengxr
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#7058 from MechCoder/pipeline_api_playground and squashes the following commits:
40a5eb2 [MechCoder] copy
95f7ff2 [MechCoder] [SPARK-8679] [PySpark] [MLlib] Default values in Pipeline API should be immutable
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#6088 from mengxr/SPARK-7380 and squashes the following commits:
413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
Otherwise, users can only use `transform` on the models. brkyvz
Author: Xiangrui Meng <meng@databricks.com>
Closes#6156 from mengxr/SPARK-7647 and squashes the following commits:
1ae3d2d [Xiangrui Meng] add weights and intercept to LogisticRegression in Python
f49eb46 [Xiangrui Meng] add weights and intercept to LinearRegressionModel
The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#5930 from brkyvz/ml-feat and squashes the following commits:
73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
c221db9 [Xiangrui Meng] overload StringArrayParam.w
c81072d [Burak Yavuz] addressed comments
99c2ebf [Burak Yavuz] add to python_shared_params
39ecb07 [Burak Yavuz] fix scalastyle
7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
Since CrossValidator is a meta algorithm, we copy the implementation in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5926 from mengxr/SPARK-6940 and squashes the following commits:
6af181f [Xiangrui Meng] add TODOs
8285134 [Xiangrui Meng] update doc
060f7c3 [Xiangrui Meng] update doctest
acac727 [Xiangrui Meng] add keyword args
cdddecd [Xiangrui Meng] add CrossValidator in Python
This PR adds `BinaryClassificationEvaluator` to Python ML Pipelines API, which is a simple wrapper of the Scala implementation. oefirouz
Author: Xiangrui Meng <meng@databricks.com>
Closes#5885 from mengxr/SPARK-7333 and squashes the following commits:
25d7451 [Xiangrui Meng] fix tests in python 3
babdde7 [Xiangrui Meng] fix doc
cb51e6a [Xiangrui Meng] add BinaryClassificationEvaluator in PySpark
Same as #5431 but for Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5534 from mengxr/SPARK-6893 and squashes the following commits:
d3b519b [Xiangrui Meng] address comments
ebaccc6 [Xiangrui Meng] style update
fce244e [Xiangrui Meng] update explainParams with test
4d6b07a [Xiangrui Meng] add tests
5294500 [Xiangrui Meng] update default param handling in python
Author: Reynold Xin <rxin@databricks.com>
Closes#4952 from rxin/schemardd-df-reference and squashes the following commits:
b2b1dbe [Reynold Xin] [Docs] Replace references to SchemaRDD with DataFrame
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.
TODO:
- [x] handle parameters in LRModel
- [x] unit tests
- [x] missing some docs
CC: davies jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes#4151 from mengxr/SPARK-4586 and squashes the following commits:
415268e [Xiangrui Meng] remove inherit_doc from __init__
edbd6fe [Xiangrui Meng] move Identifiable to ml.util
44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
14ae7e2 [Davies Liu] fix docs
54ca7df [Davies Liu] fix tests
78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
1dca16a [Davies Liu] refactor
090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
0882513 [Xiangrui Meng] update doc style
a4f4dbf [Xiangrui Meng] add unit test for LR
7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
ba0ba1e [Xiangrui Meng] add unit tests for pipeline
0586c7b [Xiangrui Meng] add more comments to the example
5153cff [Xiangrui Meng] simplify java models
036ca04 [Xiangrui Meng] gen numFeatures
46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
f66ba0c [Xiangrui Meng] make params a property
d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
05e3e40 [Xiangrui Meng] update example
d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
56de571 [Xiangrui Meng] fix style
d0c5bb8 [Xiangrui Meng] a working copy
bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
17ecfb9 [Xiangrui Meng] code gen for shared params
d9ea77c [Xiangrui Meng] update doc
c18dca1 [Xiangrui Meng] make the example working
dadd84e [Xiangrui Meng] add base classes and docs
a3015cf [Xiangrui Meng] add Estimator and Transformer
46eea43 [Xiangrui Meng] a pipeline in python
33b68e0 [Xiangrui Meng] a working LR