... and normL2.
Add test cases to insufficient unit test for `normL1` and `normL2`.
Ref: https://github.com/apache/spark/pull/5359
Author: lewuathe <lewuathe@me.com>
Closes#5374 from Lewuathe/SPARK-6720 and squashes the following commits:
5541b24 [lewuathe] More accurate tests
dc5718c [lewuathe] [SPARK-6720] PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
Add below methods in pyspark for MultivariateStatisticalSummary
- normL1
- normL2
Author: lewuathe <lewuathe@me.com>
Closes#5359 from Lewuathe/SPARK-6262 and squashes the following commits:
cbe439e [lewuathe] Implement missing methods for MultivariateStatisticalSummary
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5296 from Lewuathe/SPARK-6615 and squashes the following commits:
f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#5318 from mengxr/SPARK-6660 and squashes the following commits:
0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
Support indexing in DenseMatrices in PySpark
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#5232 from MechCoder/SPARK-6576 and squashes the following commits:
a735078 [MechCoder] Change bounds
a062025 [MechCoder] Matrices are stored in column order
7917bc1 [MechCoder] [SPARK-6576] DenseMatrix in PySpark should support indexing
This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#5314 from mengxr/SPARK-6642 and squashes the following commits:
dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
fixed python doc build warnings
CC whomever wants to review: rxin mengxr davies
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#5317 from jkbradley/python-doc-warnings and squashes the following commits:
4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
Users should be able to use numpy operators directly on dense vectors. davies atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes#5312 from mengxr/SPARK-6651 and squashes the following commits:
e665c5c [Xiangrui Meng] wrap the result in a dense vector
23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
```scala
LogisticRegressionWithLBFGS
setNumClasses
setValidateData
LogisticRegressionModel
getThreshold
numClasses
numFeatures
SVMWithSGD
setValidateData
SVMModel
getThreshold
```
For users the greatest benefit in this PR is multiclass classification was supported by Python API.
Users can train multiclass classification model and use it to predict in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5137 from yanboliang/spark-6255 and squashes the following commits:
0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
This is the sub-task of SPARK-6254.
Wrapping IDFModel `idf` member function for pyspark.
Author: lewuathe <lewuathe@me.com>
Closes#5264 from Lewuathe/SPARK-6598 and squashes the following commits:
1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
This fixes `predictAll` after load. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#5243 from mengxr/SPARK-6571 and squashes the following commits:
82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
MLlib Python API parity check for Regression, major disparities need to be added for Python list following:
```scala
LinearRegressionWithSGD
setValidateData
LassoWithSGD
setIntercept
setValidateData
RidgeRegressionWithSGD
setIntercept
setValidateData
```
setFeatureScaling is mllib private function which is not needed to expose in pyspark.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4997 from yanboliang/spark-6256 and squashes the following commits:
102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
Weight parameters must be initialized correctly even when numpy array is passed as initial weights.
Author: lewuathe <lewuathe@me.com>
Closes#5101 from Lewuathe/SPARK-6421 and squashes the following commits:
7795201 [lewuathe] Fix lint-python errors
21d4fe3 [lewuathe] Fix init logic of weights
For Python's linear models, weights and intercept are stored in Python.
This PR implements Python's linear models sava/load functions which do the same thing as scala.
It can also make model import/export cross languages.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5016 from yanboliang/spark-6095 and squashes the following commits:
d9bb824 [Yanbo Liang] fix python style
b3813ca [Yanbo Liang] linear model save/load for Python reuse the Scala implementation
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
Author: Xiangrui Meng <meng@databricks.com>
Closes#5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
Add LassoModel to __all__ in regression.py
LassoModel does not show up in Python docs
This should be merged into branch-1.3 and master.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4970 from jkbradley/SPARK-6253 and squashes the following commits:
c2cb533 [Joseph K. Bradley] Add LassoModel to __all__ in regression.py
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that returns `RDD[(Double, Double)]`. Is it easy to register serializer for `Product` in PySpark?
Author: Xiangrui Meng <meng@databricks.com>
Closes#4863 from mengxr/SPARK-6090 and squashes the following commits:
009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
Similar to `MatrixFactorizaionModel`, we only need wrappers to support save/load for tree models in Python.
jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4854 from mengxr/SPARK-6097 and squashes the following commits:
4586a4d [Xiangrui Meng] fix more typos
8ebcac2 [Xiangrui Meng] fix python style
91172d8 [Xiangrui Meng] fix typos
201b3b9 [Xiangrui Meng] update user guide
b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
`df.dtypes` shows `null` for UDTs. This PR uses `udt` by default and `VectorUDT` overwrites it with `vector`.
jkbradley davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#4858 from mengxr/SPARK-6121 and squashes the following commits:
34f0a77 [Xiangrui Meng] simpleString for UDT
Currently LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py will invoke callMLlibFunc with a wrong "regType" parameter.
It was assigned to "str(regType)" which translate None(Python) to "None"(Java/Scala). The right way should be translate None(Python) to null(Java/Scala) just as what we did at LogisticRegressionWithSGD.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#4831 from yanboliang/pyspark_classification and squashes the following commits:
12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes#4811 from mengxr/SPARK-5991 and squashes the following commits:
f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
* Add GradientBoostedTrees Python examples to ML guide
* I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4750 from jkbradley/SPARK-5974 and squashes the following commits:
c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.
For SPARK-5892:
* Fix Python docs
* Various other cleanups
BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
CC: mengxr (ML), davies (Python docs)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4675 from jkbradley/doc-review-1.3 and squashes the following commits:
f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
Add a seed for tests.
Author: Davies Liu <davies@databricks.com>
Closes#4358 from davies/flaky_test and squashes the following commits:
02371c3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_test
ced499b [Davies Liu] add seed for test
The only issue is that `analyzeBlock` is removed, which was marked as a developer API. I didn't change other tests in the ALSSuite under `spark.mllib` to ensure that the implementation is correct.
CC: srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes#4321 from mengxr/SPARK-5536 and squashes the following commits:
5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
This PR is implementing the Gradient Boosted Trees for Python API.
Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
Closes#3951 from kazk1018/gbt_for_py and squashes the following commits:
620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.
davies
Author: Xiangrui Meng <meng@databricks.com>
Closes#4266 from mengxr/py-stat-refactor and squashes the following commits:
1a5e1db [Xiangrui Meng] refactor stat.py
This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark
Author: nate.crosswhite <nate.crosswhite@stresearch.com>
Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes#3610 from nxwhite-str/master and squashes the following commits:
a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
7668124 [Xiangrui Meng] minor updates
f8d5928 [nate.crosswhite] Addressing PR issues
277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
Slightly different than the scala code which converts the sparsevector into a densevector and then checks the index.
I also hope I've added tests in the right place.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#4025 from MechCoder/spark-2909 and squashes the following commits:
07d0f26 [MechCoder] STY: Rename item to index
f02148b [MechCoder] [SPARK-2909] [Mlib] SparseVector in pyspark now supports indexing
It will introduce problems if the object in dict/list/tuple can not support by py4j, such as Vector.
Also, pickle may have better performance for larger object (less RPC).
In some cases that the object in dict/list can not be pickled (such as JavaObject), we should still use MapConvert/ListConvert.
This PR should be ported into branch-1.2
Author: Davies Liu <davies@databricks.com>
Closes#4023 from davies/listconvert and squashes the following commits:
55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib
...ySpark MLlib
This is a follow up to PR3680 https://github.com/apache/spark/pull/3680 .
Author: RJ Nowling <rnowling@gmail.com>
Closes#3955 from rnowling/spark4891 and squashes the following commits:
1236a01 [RJ Nowling] Fix Python style issues
7a01a78 [RJ Nowling] Fix Python style issues
174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib
This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The automatic conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64s, but currently this wasn't actually happening. As a result, non-float64 would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans.
The PR includes the fix, as well as a new test for the correct conversion behavior.
davies
Author: freeman <the.freeman.lab@gmail.com>
Closes#3902 from freeman-lab/fix-vector-convert and squashes the following commits:
764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type
Modify python annotations for sphinx. There is no change to build process from.
https://github.com/apache/spark/blob/master/docs/README.md
Author: lewuathe <lewuathe@me.com>
Closes#3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
+ small doc edit
+ include edit to make IntelliJ happy
CC: davies mengxr
Note to davies -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like. (Those warnings occur when generating the Python docs.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3669 from jkbradley/python-warnings and squashes the following commits:
4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc. Small doc edit. Small include edit to make IntelliJ happy.
This PR tests the pyspark Chi-squared hypothesis test from this commit: c8abddc516 and moves some of the error messaging in to python.
It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)
Hopefully, SPARK-2980 can be closed.
Author: jbencook <jbenjamincook@gmail.com>
Closes#3679 from jbencook/master and squashes the following commits:
44078e0 [jbencook] checking that bad input throws the correct exceptions
f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
7536cf1 [jbencook] removing python checks for invalid input
a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
I improved `IDFModel.transform` to allow using a single vector.
[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#3603 from yu-iskw/idf and squashes the following commits:
256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)
Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
* Use train/test split, and compute test error instead of training error.
* Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests. (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming
I have run all examples and relevant unit tests.
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Closes#3461 from jkbradley/ensemble-docs and squashes the following commits:
70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file. added header. Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split. tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide. still need to update their examples
This PR change the underline array of DenseVector to numpy.ndarray to avoid the conversion, because most of the users will using numpy.array.
It also improve the serialization of DenseVector.
Before this change:
trial | trainingTime | testTime
-------|--------|--------
0 | 5.126 | 1.786
1 |2.698 |1.693
After the change:
trial | trainingTime | testTime
-------|--------|--------
0 |4.692 |0.554
1 |2.307 |0.525
This could partially fix the performance regression during test.
Author: Davies Liu <davies@databricks.com>
Closes#3420 from davies/ser2 and squashes the following commits:
0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector
The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.
This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.
Author: Davies Liu <davies@databricks.com>
Closes#3397 from davies/cache and squashes the following commits:
7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
```
class RandomForestModel
| A model trained by RandomForest
|
| numTrees(self)
| Get number of trees in forest.
|
| predict(self, x)
| Predict values for a single data point or an RDD of points using the model trained.
|
| toDebugString(self)
| Full model
|
| totalNumNodes(self)
| Get total number of nodes, summed over all trees in the forest.
|
class RandomForest
| trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for binary or multiclass classification.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels should take values {0, 1, ..., numClasses-1}.
| :param numClassesForClassification: number of classes for classification.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "sqrt".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "gini" (recommended) or "entropy".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes. (default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 100)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
| trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
| Method to train a decision tree model for regression.
|
| :param data: Training dataset: RDD of LabeledPoint.
| Labels are real numbers.
| :param categoricalFeaturesInfo: Map storing arity of categorical features.
| E.g., an entry (n -> k) indicates that feature n is categorical
| with k categories indexed from 0: {0, 1, ..., k-1}.
| :param numTrees: Number of trees in the random forest.
| :param featureSubsetStrategy: Number of features to consider for splits at each node.
| Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
| If "auto" is set, this parameter is set based on numTrees:
| if numTrees == 1, set to "all";
| if numTrees > 1 (forest) set to "onethird".
| :param impurity: Criterion used for information gain calculation.
| Supported values: "variance".
| :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
| 1 internal node + 2 leaf nodes.(default: 4)
| :param maxBins: maximum number of bins used for splitting features (default: 100)
| :param seed: Random seed for bootstrapping and choosing feature subsets.
| :return: RandomForestModel that can be used for prediction
|
```
Author: Davies Liu <davies@databricks.com>
Closes#3320 from davies/forest and squashes the following commits:
8003dfc [Davies Liu] reorder
53cf510 [Davies Liu] fix docs
4ca593d [Davies Liu] fix docs
e0df852 [Davies Liu] fix docs
0431746 [Davies Liu] rebased
2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
885abee [Davies Liu] address comments
dae7fc0 [Davies Liu] address comments
89a000f [Davies Liu] fix docs
565d476 [Davies Liu] add python api for random forest
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10, tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training
| our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use
| or not of the augmented representation for
| training data (i.e. whether bias features
| are activated or not).
| :param corrections: The number of corrections used in the LBFGS update (default: 10).
| :param tolerance: The convergence tolerance of iterations for L-BFGS (default: 1e-4).
|
| >>> data = [
| ... LabeledPoint(0.0, [0.0, 1.0]),
| ... LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu <davies@databricks.com>
Closes#3307 from davies/lbfgs and squashes the following commits:
34bd986 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input. However, model.predict outputs an RDD of Rating. So on the input side, users can use r[0], r[1], r[2], while on the output side, users have to use r.user, r.product, r.rating. We should allow lookup by index in Rating by making Rating a namedtuple.
davies
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/3261)
<!-- Reviewable:end -->
Author: Xiangrui Meng <meng@databricks.com>
Closes#3261 from mengxr/SPARK-4396 and squashes the following commits:
543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
This PR add setThrehold() and clearThreshold() for LogisticRegressionModel and SVMModel, also support RDD of vector in LogisticRegressionModel.predict(), SVNModel.predict() and NaiveBayes.predict()
Author: Davies Liu <davies@databricks.com>
Closes#3305 from davies/setThreshold and squashes the following commits:
d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy.
In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.
davies atalwalkar
Author: Xiangrui Meng <meng@databricks.com>
Closes#3232 from mengxr/SPARK-4372 and squashes the following commits:
9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
d3ba096 [Xiangrui Meng] change 'none' back to None
1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
This PR rename random.py to rand.py to avoid the side affects of conflict with random module, but still keep the same interface as before.
```
>>> from pyspark.mllib.random import RandomRDDs
```
```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:
NAME
random - Python package for random data generation.
FILE
/Users/davies/work/spark/python/pyspark/mllib/rand.py
CLASSES
__builtin__.object
pyspark.mllib.random.RandomRDDs
class RandomRDDs(__builtin__.object)
| Generator methods for creating RDDs comprised of i.i.d samples from
| some distribution.
|
| Static methods defined here:
|
| normalRDD(sc, size, numPartitions=None, seed=None)
```
cc mengxr
reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/
Author: Davies Liu <davies@databricks.com>
Closes#3216 from davies/random and squashes the following commits:
7ac4e8b [Davies Liu] rename random.py to rand.py
Fix TreeModel.predict() with RDD, added tests for it.
(Also checked that other models don't have this issue)
Author: Davies Liu <davies@databricks.com>
Closes#3230 from davies/predict and squashes the following commits:
81172aa [Davies Liu] fix predict