Commit graph

125 commits

Author SHA1 Message Date
Yanbo Liang f4f39981f4 [SPARK-6827] [MLLIB] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
Make PySpark ```FPGrowthModel.freqItemsets``` consistent with the Java/Scala API, like ```MatrixFactorizationModel.userFeatures```.
It returns an RDD in which each tuple is composed of an array and a long value.
I think it's difficult to wrap the output in namedtuples, because the items of freqItemsets can be of any type and arbitrary length, which makes implementing the corresponding SerDe functions tedious.
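
A minimal usage sketch, added here for illustration (not part of the original commit message); it assumes a live SparkContext `sc`, and the `items`/`freq` field access assumes the namedtuple wrapper this PR settled on:
```python
from pyspark.mllib.fpm import FPGrowth

transactions = sc.parallelize([["a", "b", "c"], ["a", "b"], ["a"]])
model = FPGrowth.train(transactions, minSupport=0.6, numPartitions=2)
# each element pairs an items array with its frequency count (a long)
for itemset in model.freqItemsets().collect():
    print(itemset.items, itemset.freq)
```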

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5614 from yanboliang/spark-6827 and squashes the following commits:

da8c404 [Yanbo Liang] use namedtuple
5532e78 [Yanbo Liang] Wrap FPGrowthModel.freqItemsets and make it consistent with Java API
2015-04-22 17:22:26 -07:00
Reynold Xin 3134c3fe49 [SPARK-6953] [PySpark] speed up python tests
This PR tries to speed up some Python tests:

```
tests.py                       144s -> 103s      -41s
mllib/classification.py         24s -> 17s        -7s
mllib/regression.py             27s -> 15s       -12s
mllib/tree.py                   27s -> 13s       -14s
mllib/tests.py                  64s -> 31s       -33s
streaming/tests.py             185s -> 84s      -101s
```
Taking Python 3 into account, the total saving will be 558s (almost 10 minutes), since core and streaming run three times and mllib runs twice.

During testing, the elapsed time for each test file is shown:
```
Run core tests ...
Running test: pyspark/rdd.py ... ok (22s)
Running test: pyspark/context.py ... ok (16s)
Running test: pyspark/conf.py ... ok (4s)
Running test: pyspark/broadcast.py ... ok (4s)
Running test: pyspark/accumulators.py ... ok (4s)
Running test: pyspark/serializers.py ... ok (6s)
Running test: pyspark/profiler.py ... ok (5s)
Running test: pyspark/shuffle.py ... ok (1s)
Running test: pyspark/tests.py ... ok (103s)   144s
```

Author: Reynold Xin <rxin@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #5605 from rxin/python-tests-speed and squashes the following commits:

d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
89321ee [Xiangrui Meng] fix seed in tests
3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
2015-04-21 17:49:55 -07:00
Xiangrui Meng 686dd742e1 [SPARK-7036][MLLIB] ALS.train should support DataFrames in PySpark
SchemaRDD works with ALS.train in 1.2, so we should continue to support DataFrames for compatibility. coderxiang
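
A hedged sketch of what this enables (illustrative, not the PR's test code); it assumes an existing `sqlContext`:
```python
from pyspark.mllib.recommendation import ALS

ratings = sqlContext.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0)], ["user", "product", "rating"])
# a DataFrame is now accepted where an RDD of ratings used to be required
model = ALS.train(ratings, rank=10, iterations=5)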

Author: Xiangrui Meng <meng@databricks.com>

Closes #5619 from mengxr/SPARK-7036 and squashes the following commits:

dfcaf5a [Xiangrui Meng] ALS.train should support DataFrames in PySpark
2015-04-21 16:44:52 -07:00
MechCoder 45c47fa417 [SPARK-6845] [MLlib] [PySpark] Add isTransposed flag to DenseMatrix
Since sparse matrices now support an isTransposed flag for row-major data, DenseMatrices should do the same.
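
A brief constructor sketch, added for illustration:
```python
from pyspark.mllib.linalg import DenseMatrix

# values given in row-major order, signalled by the new flag
m = DenseMatrix(2, 2, [1.0, 2.0, 3.0, 4.0], isTransposed=True)
m.toArray()  # array([[1., 2.], [3., 4.]])
```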

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5455 from MechCoder/spark-6845 and squashes the following commits:

525c370 [MechCoder] minor
004a37f [MechCoder] Cast boolean to int
151f3b6 [MechCoder] [WIP] Add isTransposed to pickle DenseMatrix
cc0b90a [MechCoder] [SPARK-6845] Add isTranposed flag to DenseMatrix
2015-04-21 14:36:50 -07:00
Elisey Zanko 77176619a9 [SPARK-6661] Python type errors should print type, not object
Author: Elisey Zanko <elisey.zanko@gmail.com>

Closes #5361 from 31z4/spark-6661 and squashes the following commits:

73c5d79 [Elisey Zanko] Python type errors should print type, not object
2015-04-20 10:44:09 -07:00
Davies Liu 04e44b37cc [SPARK-4897] [PySpark] Python 3 support
This PR updates PySpark to support Python 3 (tested with 3.4).

Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.

TODO: ec2/spark-ec2.py is not fully tested with Python 3.
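
A minimal sketch of the kind of 2/3 compatibility shim such a port relies on (illustrative only; the PR's actual changes are spread across many files):
```python
import sys

if sys.version_info[0] >= 3:
    xrange = range               # Python 3 removed xrange
    imap = map                   # map is already lazy in Python 3
else:
    from itertools import imap   # keep a lazy map on Python 2
```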

Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #5173 from davies/python3 and squashes the following commits:

d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
2015-04-16 16:20:57 -07:00
lewuathe fc17661475 [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods
This is a sub-task of SPARK-6254.
It wraps the missing methods for `StandardScalerModel`.
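
A hedged usage sketch (assumes a live `sc`; the exact set of newly wrapped methods is as per the PR):
```python
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([1.0, 2.0]), Vectors.dense([3.0, 6.0])])
model = StandardScaler(withMean=True, withStd=True).fit(data)
model.transform(Vectors.dense([2.0, 4.0]))  # scale a single vector
```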

Author: lewuathe <lewuathe@me.com>

Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:

fafd690 [lewuathe] Fix for lint-python
bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
578f5ee [lewuathe] Remove unnecessary class
a38f155 [lewuathe] Merge master
66bb2ab [lewuathe] Fix typos
82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
2015-04-12 22:17:16 -07:00
MechCoder e2360810f5 [SPARK-6577] [MLlib] [PySpark] SparseMatrix should be supported in PySpark
Support for SparseMatrix in PySpark.
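
A short construction sketch, added for illustration:
```python
from pyspark.mllib.linalg import SparseMatrix

# 2x2 matrix in CSC form: colPtrs, rowIndices, values
sm = SparseMatrix(2, 2, [0, 1, 2], [0, 1], [9.0, 5.0])
sm.toArray()  # array([[9., 0.], [0., 5.]])
```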

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5355 from MechCoder/spark-6577 and squashes the following commits:

7492190 [MechCoder] More readable code for densifying
ea2c54b [MechCoder] Check bounds for indexing
454ef2c [MechCoder] Made the following changes 1. Used convert_to_array for array conversion. 2. Used F order for toArray 3. Minor improvements in speed.
db76caf [MechCoder] Add support for CSR matrix
29653e7 [MechCoder] Renamed indices to rowIndices and indptr to colPtrs
b6384fe [MechCoder] [SPARK-6577] SparseMatrix should be supported in PySpark
2015-04-09 23:10:13 -07:00
Yanbo Liang a0411aebee [SPARK-6264] [MLLIB] Support FPGrowth algorithm in Python API
Support the FPGrowth algorithm in the Python API.
Should we remove the "Experimental" annotation that marks FPGrowth and FPGrowthModel in Scala? jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5213 from yanboliang/spark-6264 and squashes the following commits:

ed62ead [Yanbo Liang] trigger jenkins
8ce0359 [Yanbo Liang] fix docstring style
544c725 [Yanbo Liang] address comments
a2d7cf7 [Yanbo Liang] add doc for FPGrowth.train()
dcf7d73 [Yanbo Liang] add python doc
b18fd07 [Yanbo Liang] trigger jenkins
2c951b8 [Yanbo Liang] fix typos
7f62c8f [Yanbo Liang] add fpm to __init__.py
b96206a [Yanbo Liang] Support FPGrowth algorithm in Python API
2015-04-09 15:10:10 -07:00
lewuathe fc957dc781 [SPARK-6720][MLLIB] PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
Add test cases for the previously insufficient unit tests of `normL1` and `normL2`.

Ref: https://github.com/apache/spark/pull/5359

Author: lewuathe <lewuathe@me.com>

Closes #5374 from Lewuathe/SPARK-6720 and squashes the following commits:

5541b24 [lewuathe] More accurate tests
dc5718c [lewuathe] [SPARK-6720] PySpark MultivariateStatisticalSummary unit test for normL1 and normL2
2015-04-07 14:36:57 -07:00
lewuathe acffc43455 [SPARK-6262][MLLIB]Implement missing methods for MultivariateStatisticalSummary
Add the methods below to PySpark's MultivariateStatisticalSummary (a usage sketch follows the list):
- normL1
- normL2
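
An illustrative sketch (not from the PR), assuming a live `sc`:
```python
from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

rdd = sc.parallelize([Vectors.dense([1.0, -2.0]), Vectors.dense([3.0, 4.0])])
summary = Statistics.colStats(rdd)
summary.normL1()  # per-column sum of absolute values
summary.normL2()  # per-column Euclidean norm
```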

Author: lewuathe <lewuathe@me.com>

Closes #5359 from Lewuathe/SPARK-6262 and squashes the following commits:

cbe439e [lewuathe] Implement missing methods for MultivariateStatisticalSummary
2015-04-05 16:13:31 -07:00
lewuathe 512a2f191a [SPARK-6615][MLLIB] Python API for Word2Vec
This is a sub-task of SPARK-6254.
It wraps the missing methods for `Word2Vec` and `Word2VecModel`.
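
A hedged usage sketch (assumes a live `sc`; the toy corpus is illustrative):
```python
from pyspark.mllib.feature import Word2Vec

corpus = sc.parallelize([["hello", "world"], ["hello", "spark"]])
model = Word2Vec().setVectorSize(10).setSeed(42).fit(corpus)
model.transform("hello")        # the learned vector for a word
model.findSynonyms("hello", 1)  # nearest words with similarity scores
```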

Author: lewuathe <lewuathe@me.com>

Closes #5296 from Lewuathe/SPARK-6615 and squashes the following commits:

f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
2015-04-03 09:49:50 -07:00
Xiangrui Meng 4815bc2128 [SPARK-6660][MLLIB] pythonToJava doesn't recognize object arrays
davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:

0f66ec2 [Xiangrui Meng] recognize object arrays
ad8c42f [Xiangrui Meng] add a test for SPARK-6660
2015-04-01 18:17:07 -07:00
MechCoder 2fa3b47dbf [SPARK-6576] [MLlib] [PySpark] DenseMatrix in PySpark should support indexing
Support indexing in DenseMatrices in PySpark
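
A short indexing sketch, added for illustration:
```python
from pyspark.mllib.linalg import DenseMatrix

dm = DenseMatrix(2, 2, [0.0, 1.0, 4.0, 5.0])  # values stored column-major
dm[0, 1]  # 4.0
dm[1, 0]  # 1.0
```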

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5232 from MechCoder/SPARK-6576 and squashes the following commits:

a735078 [MechCoder] Change bounds
a062025 [MechCoder] Matrices are stored in column order
7917bc1 [MechCoder] [SPARK-6576] DenseMatrix in PySpark should support indexing
2015-04-01 17:03:39 -07:00
Xiangrui Meng ccafd757ed [SPARK-6642][MLLIB] use 1.2 lambda scaling and remove addImplicit from NormalEquation
This PR changes lambda scaling from the number of users/items to the number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:

dc655a1 [Xiangrui Meng] relax python tests
f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
2015-04-01 16:47:18 -07:00
Joseph K. Bradley fb25e8c7f4 [SPARK-6657] [Python] [Docs] fixed python doc build warnings
fixed python doc build warnings

CC whomever wants to review: rxin mengxr davies

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:

4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
2015-04-01 15:15:47 -07:00
Xiangrui Meng 2275acce7b [SPARK-6651][MLLIB] delegate dense vector arithmetic to the underlying numpy array
Users should be able to use numpy operators directly on dense vectors. davies atalwalkar
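
An illustrative sketch of the resulting behavior:
```python
from pyspark.mllib.linalg import DenseVector

v = DenseVector([1.0, 2.0])
v + v  # DenseVector([2.0, 4.0]) -- computed by numpy, wrapped back in a DenseVector
v / 2  # DenseVector([0.5, 1.0])
```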

Author: Xiangrui Meng <meng@databricks.com>

Closes #5312 from mengxr/SPARK-6651 and squashes the following commits:

e665c5c [Xiangrui Meng] wrap the result in a dense vector
23dfca3 [Xiangrui Meng] delegate dense vector arithmetics to the underlying numpy array
2015-04-01 13:29:04 -07:00
Yanbo Liang b5bd75d90a [SPARK-6255] [MLLIB] Support multiclass classification in Python API
Python API parity check for classification and multiclass classification support; the major disparities that need to be added for Python:
```scala
LogisticRegressionWithLBFGS
    setNumClasses
    setValidateData
LogisticRegressionModel
    getThreshold
    numClasses
    numFeatures
SVMWithSGD
    setValidateData
SVMModel
    getThreshold
```
For users, the greatest benefit of this PR is that multiclass classification is now supported by the Python API.
Users can train a multiclass classification model and use it for prediction in PySpark.
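
A minimal training sketch (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [0.0, 1.0]),
                       LabeledPoint(1.0, [1.0, 0.0]),
                       LabeledPoint(2.0, [1.0, 1.0])])
model = LogisticRegressionWithLBFGS.train(data, numClasses=3)
model.predict([1.0, 0.0])
```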

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5137 from yanboliang/spark-6255 and squashes the following commits:

0bd531e [Yanbo Liang] address comments
444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
fc7990b [Yanbo Liang] address comments
b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
2015-03-31 11:32:14 -07:00
lewuathe 46de6c05e0 [SPARK-6598][MLLIB] Python API for IDFModel
This is a sub-task of SPARK-6254.
It wraps the IDFModel `idf` member function for PySpark.
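
A brief sketch of the wrapped accessor (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors

tf = sc.parallelize([Vectors.dense([1.0, 2.0]), Vectors.dense([0.0, 1.0])])
model = IDF().fit(tf)
model.idf()  # the IDF vector, now exposed to Python
```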

Author: lewuathe <lewuathe@me.com>

Closes #5264 from Lewuathe/SPARK-6598 and squashes the following commits:

1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
2015-03-31 11:25:21 -07:00
Xiangrui Meng f75f633b21 [SPARK-6571][MLLIB] use wrapper in MatrixFactorizationModel.load
This fixes `predictAll` after load. jkbradley
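
A hedged sketch of the fixed path (the model path is hypothetical; assumes a live `sc`):
```python
from pyspark.mllib.recommendation import MatrixFactorizationModel

model = MatrixFactorizationModel.load(sc, "/tmp/als-model")  # hypothetical path
model.predictAll(sc.parallelize([(0, 0), (0, 1)])).collect()
```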

Author: Xiangrui Meng <meng@databricks.com>

Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:

82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
2015-03-28 15:08:05 -07:00
Yanbo Liang 435337381f [SPARK-6256] [MLlib] MLlib Python API parity check for regression
MLlib Python API parity check for regression; the major disparities that need to be added for Python are listed below:
```scala
LinearRegressionWithSGD
    setValidateData
LassoWithSGD
    setIntercept
    setValidateData
RidgeRegressionWithSGD
    setIntercept
    setValidateData
```
setFeatureScaling is a private MLlib function that does not need to be exposed in PySpark.
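
A short sketch of the newly exposed parameters (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.regression import LabeledPoint, LassoWithSGD

data = sc.parallelize([LabeledPoint(1.0, [1.0]), LabeledPoint(2.0, [2.0])])
model = LassoWithSGD.train(data, intercept=True, validateData=True)
```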

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4997 from yanboliang/spark-6256 and squashes the following commits:

102f498 [Yanbo Liang] fix intercept issue & add doc test
1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
2015-03-25 13:38:33 -07:00
lewuathe 257cde7c36 [SPARK-6421][MLLIB] _regression_train_wrapper does not test initialWeights correctly
Weight parameters must be initialized correctly even when a numpy array is passed as the initial weights.
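
An illustrative sketch of the case being fixed (assumes a live `sc`):
```python
import numpy as np
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

data = sc.parallelize([LabeledPoint(1.0, [1.0]), LabeledPoint(2.0, [2.0])])
model = LinearRegressionWithSGD.train(data, initialWeights=np.array([1.0]))
```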

Author: lewuathe <lewuathe@me.com>

Closes #5101 from Lewuathe/SPARK-6421 and squashes the following commits:

7795201 [lewuathe] Fix lint-python errors
21d4fe3 [lewuathe] Fix init logic of weights
2015-03-20 17:18:18 -04:00
Xusen Yin 25636d9867 [SPARK-6096][MLlib] Add Naive Bayes load/save methods in Python
See [SPARK-6096](https://issues.apache.org/jira/browse/SPARK-6096).
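
A minimal save/load sketch (the path is hypothetical; assumes a live `sc`):
```python
from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [1.0, 0.0]), LabeledPoint(1.0, [0.0, 1.0])])
model = NaiveBayes.train(data)
model.save(sc, "/tmp/nb-model")  # hypothetical path
same_model = NaiveBayesModel.load(sc, "/tmp/nb-model")
```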

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5090 from yinxusen/SPARK-6096 and squashes the following commits:

bd0fea5 [Xusen Yin] fix style problem, etc.
3fd41f2 [Xusen Yin] use hanging indent in Python style
e83803d [Xusen Yin] fix Python style
d6dbde5 [Xusen Yin] fix python call java error
a054bb3 [Xusen Yin] add save load for NaiveBayes python
2015-03-20 14:53:59 -04:00
Yanbo Liang 48866f7897 [SPARK-6095] [MLLIB] Support model save/load in Python's linear models
For Python's linear models, weights and intercept are stored in Python.
This PR implements Python's linear models sava/load functions which do the same thing as scala.
It can also make model import/export cross languages.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5016 from yanboliang/spark-6095 and squashes the following commits:

d9bb824 [Yanbo Liang] fix python style
b3813ca [Yanbo Liang] linear model save/load for Python reuse the Scala implementation
2015-03-20 14:44:21 -04:00
Xiangrui Meng c94d062647 [SPARK-6226][MLLIB] add save/load in PySpark's KMeansModel
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen
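
A hedged sketch of the conversion idiom; these are private helpers, shown for illustration only:
```python
from pyspark.mllib.common import _py2java, _java2py

# round-trip a Python object through the JVM
jobj = _py2java(sc, [1.0, 2.0])
back = _java2py(sc, jobj)
```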

Author: Xiangrui Meng <meng@databricks.com>

Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:

570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
2015-03-17 12:14:40 -07:00
Joseph K. Bradley 17c309c87e [mllib] [python] Add LassoModel to __all__ in regression.py
Add LassoModel to __all__ in regression.py

LassoModel does not show up in the Python docs otherwise.
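
A hedged sketch of the change (the surrounding export names are approximate, not copied from the file):
```python
# regression.py's export list with the missing entry added
__all__ = ['LabeledPoint', 'LinearModel', 'LinearRegressionModel',
           'LinearRegressionWithSGD', 'RidgeRegressionModel',
           'RidgeRegressionWithSGD', 'LassoModel', 'LassoWithSGD']
```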

This should be merged into branch-1.3 and master.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4970 from jkbradley/SPARK-6253 and squashes the following commits:

c2cb533 [Joseph K. Bradley] Add LassoModel to __all__ in regression.py
2015-03-12 16:46:29 -07:00
Xiangrui Meng 0bfacd5c5d [SPARK-6090][MLLIB] add a basic BinaryClassificationMetrics to PySpark/MLlib
A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
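
A minimal usage sketch (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.evaluation import BinaryClassificationMetrics

score_and_labels = sc.parallelize([(0.1, 0.0), (0.4, 0.0), (0.8, 1.0)])
metrics = BinaryClassificationMetrics(score_and_labels)
metrics.areaUnderROC
metrics.areaUnderPR
```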

davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that return `RDD[(Double, Double)]`. Is it easy to register a serializer for `Product` in PySpark?

Author: Xiangrui Meng <meng@databricks.com>

Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:

009a3a3 [Xiangrui Meng] provide schema
dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
2015-03-05 11:50:09 -08:00
Xiangrui Meng 7e53a79c30 [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib
Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:

4586a4d [Xiangrui Meng] fix more typos
8ebcac2 [Xiangrui Meng] fix python style
91172d8 [Xiangrui Meng] fix typos
201b3b9 [Xiangrui Meng] update user guide
b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
2015-03-02 22:27:01 -08:00
Xiangrui Meng 2db6a853a5 [SPARK-6121][SQL][MLLIB] simpleString for UDT
`df.dtypes` shows `null` for UDTs. This PR uses `udt` by default and `VectorUDT` overwrites it with `vector`.
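
An illustrative sketch of the visible effect (assumes an existing `sqlContext`):
```python
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(Vectors.dense([1.0, 2.0]),)], ["features"])
df.dtypes  # [('features', 'vector')] after this change, instead of null
```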

jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #4858 from mengxr/SPARK-6121 and squashes the following commits:

34f0a77 [Xiangrui Meng] simpleString for UDT
2015-03-02 17:14:34 -08:00
Yanbo Liang af2effdd7b [SPARK-6080] [PySpark] correct LogisticRegressionWithLBFGS regType parameter for pyspark
Currently, LogisticRegressionWithLBFGS in python/pyspark/mllib/classification.py invokes callMLlibFunc with a wrong "regType" parameter.
It was assigned "str(regType)", which translates None (Python) into "None" (Java/Scala). The right way is to translate None (Python) into null (Java/Scala), just as we did for LogisticRegressionWithSGD.
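
A tiny sketch of the bug, added for illustration:
```python
regType = None
str(regType)  # 'None' -- the literal string "None" reaches Java/Scala
# the fix passes regType through unchanged, so Py4J maps Python None to null
```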

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #4831 from yanboliang/pyspark_classification and squashes the following commits:

12db65a [Yanbo Liang] correct LogisticRegressionWithLBFGS regType parameter for pyspark
2015-03-02 10:17:24 -08:00
Xiangrui Meng aedbbaa3dd [SPARK-6053][MLLIB] support save/load in PySpark's ALS
A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:

f135dac [Xiangrui Meng] update save doc
57e5200 [Xiangrui Meng] address comments
06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
2015-03-01 16:26:57 -08:00
Joseph K. Bradley d20559b157 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added a note to the Python docs about predict/transform not working within RDD actions/transformations in some cases (see SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
2015-02-25 16:13:17 -08:00
Joseph K. Bradley 4a17eedb16 [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
2015-02-20 02:31:32 -08:00
Reynold Xin e98dfe627c [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
- The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
- toDataFrame -> toDF
- Dsl -> functions
- implicits moved into SQLContext.implicits
- addColumn -> withColumn
- renameColumn -> withColumnRenamed

Python changes (see the sketch after this list):
- toDataFrame -> toDF
- Dsl -> functions package
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
- add toDF functions to RDD on SQLContext init
- add flatMap to DataFrame
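
A hedged sketch of the renamed Python API (illustrative; assumes a SQLContext has been created, which patches `toDF` onto RDDs):
```python
from pyspark.sql import functions as F  # the former Dsl module

df = sc.parallelize([(1, "a")]).toDF(["id", "val"])  # toDF added to RDD
df = df.withColumn("id2", df.id + 1)                 # was addColumn
df = df.withColumnRenamed("val", "value")            # was renameColumn
```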

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4556 from rxin/SPARK-5752 and squashes the following commits:

5ef9910 [Reynold Xin] More fix
61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
ff5832c [Reynold Xin] Fix python
749c675 [Reynold Xin] count(*) fixes.
5806df0 [Reynold Xin] Fix build break again.
d941f3d [Reynold Xin] Fixed explode compilation break.
fe1267a [Davies Liu] flatMap
c4afb8e [Reynold Xin] style
d9de47f [Davies Liu] add comment
b783994 [Davies Liu] add comment for toDF
e2154e5 [Davies Liu] schema() -> schema
3a1004f [Davies Liu] Dsl -> functions, toDF()
fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
97dd47c [Davies Liu] fix mistake
6168f74 [Davies Liu] fix test
1fc0199 [Davies Liu] fix test
a075cd5 [Davies Liu] clean up, toPandas
663d314 [Davies Liu] add test for agg('*')
9e214d5 [Reynold Xin] count(*) fixes.
1ed7136 [Reynold Xin] Fix build break again.
921b2e3 [Reynold Xin] Fixed explode compilation break.
14698d4 [Davies Liu] flatMap
ba3e12d [Reynold Xin] style
d08c92d [Davies Liu] add comment
5c8b524 [Davies Liu] add comment for toDF
a4e5e66 [Davies Liu] schema() -> schema
d377fc9 [Davies Liu] Dsl -> functions, toDF()
6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
2015-02-13 23:03:22 -08:00
Davies Liu 08488c175f [SPARK-5469] restructure pyspark.sql into multiple files
All the DataTypes moved into pyspark.sql.types
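
For illustration, imports after the split look like this (a sketch, not from the PR):
```python
from pyspark.sql import SQLContext, DataFrame  # context.py, dataframe.py
from pyspark.sql.types import StructType, StructField, IntegerType
```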

The changes can be tracked by `--find-copies-harder -M25`
```
davieslocalhost:~/work/spark/python$ git diff --find-copies-harder -M25 --numstat master..
2       5       python/docs/pyspark.ml.rst
0       3       python/docs/pyspark.mllib.rst
10      2       python/docs/pyspark.sql.rst
1       1       python/pyspark/mllib/linalg.py
21      14      python/pyspark/{mllib => sql}/__init__.py
14      2108    python/pyspark/{sql.py => sql/context.py}
10      1772    python/pyspark/{sql.py => sql/dataframe.py}
7       6       python/pyspark/{sql_tests.py => sql/tests.py}
8       1465    python/pyspark/{sql.py => sql/types.py}
4       2       python/run-tests
1       1       sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
```

Also `git blame -C -C python/pyspark/sql/context.py` to track the history.

Author: Davies Liu <davies@databricks.com>

Closes #4479 from davies/sql and squashes the following commits:

1b5f0a5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sql
2b2b983 [Davies Liu] restructure pyspark.sql
2015-02-09 20:49:22 -08:00
Davies Liu 38a416f036 [SPARK-5585] Flaky test in MLlib python
Add a seed for tests.

Author: Davies Liu <davies@databricks.com>

Closes #4358 from davies/flaky_test and squashes the following commits:

02371c3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_test
ced499b [Davies Liu] add seed for test
2015-02-04 08:54:20 -08:00
Xiangrui Meng 0cc7b88c99 [SPARK-5536] replace old ALS implementation by the new one
The only issue is that `analyzeBlocks` was removed, which was marked as a developer API. I didn't change the other tests in the ALSSuite under `spark.mllib`, so that they can verify that the new implementation is correct.

CC: srowen coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #4321 from mengxr/SPARK-5536 and squashes the following commits:

5a3cee8 [Xiangrui Meng] update python tests that are too strict
e840acf [Xiangrui Meng] ignore scala style check for ALS.train
e9a721c [Xiangrui Meng] update mima excludes
9ee6a36 [Xiangrui Meng] merge master
9a8aeac [Xiangrui Meng] update tests
d8c3271 [Xiangrui Meng] remove analyzeBlocks
d68eee7 [Xiangrui Meng] add checkpoint to new ALS
22a56f8 [Xiangrui Meng] wrap old ALS
c387dff [Xiangrui Meng] support random seed
3bdf24b [Xiangrui Meng] make storage level configurable in the new ALS
2015-02-02 23:49:09 -08:00
FlytxtRnD 50a1a874e1 [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model
Python API for the Gaussian Mixture Model clustering algorithm in MLlib.
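
A minimal usage sketch (illustrative toy data; assumes a live `sc`):
```python
from pyspark.mllib.clustering import GaussianMixture

data = sc.parallelize([[1.0], [1.1], [9.0], [9.1]])
model = GaussianMixture.train(data, 2, seed=10)
model.weights           # mixing weights
model.gaussians[0].mu   # per-component MultivariateGaussian parameters
model.predict(data).collect()
```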

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:

c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
2015-02-02 23:04:55 -08:00
Kazuki Taniguchi bc1fc9b60d [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
This PR implements Gradient Boosted Trees for the Python API.
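
A minimal training sketch (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.tree import GradientBoostedTrees
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])
model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=3)
model.predict([1.0])
```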

Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>

Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:

620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
2015-01-30 00:39:44 -08:00
Xiangrui Meng a3dc618486 [SPARK-5477] refactor stat.py
There is only a single `stat.py` file for the `mllib.stat` package. We recently added `MultivariateGaussian` under `mllib.stat.distribution` in Scala/Java. It would be nice to refactor `stat.py` and make it easy to expand. Note that `ChiSqTestResult` is moved from `mllib.stat` to `mllib.stat.test`. The latter is used in Scala/Java. It is only used in the return value of `Statistics.chiSqTest`, so this should be an okay change.
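
For illustration, the package layout after the refactor supports imports like these (a sketch):
```python
from pyspark.mllib.stat import Statistics
from pyspark.mllib.stat.distribution import MultivariateGaussian
from pyspark.mllib.stat.test import ChiSqTestResult
```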

davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #4266 from mengxr/py-stat-refactor and squashes the following commits:

1a5e1db [Xiangrui Meng] refactor stat.py
2015-01-29 10:11:44 -08:00
nate.crosswhite 7450a992b3 [SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed
This implements the functionality for SPARK-4749 and provides unit tests in Scala and PySpark.
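
A short sketch of the new parameter (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.clustering import KMeans

data = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(data, k=2, seed=42)  # the new seed parameter
model.clusterCenters
```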

Author: nate.crosswhite <nate.crosswhite@stresearch.com>
Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3610 from nxwhite-str/master and squashes the following commits:

a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
7668124 [Xiangrui Meng] minor updates
f8d5928 [nate.crosswhite] Addressing PR issues
277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
2015-01-21 10:32:10 -08:00
MechCoder 5840f5464b [SPARK-2909] [MLlib] [PySpark] SparseVector in pyspark now supports indexing
Slightly different from the Scala code, which converts the SparseVector into a DenseVector and then checks the index.

I also hope I've added tests in the right place.
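
An illustrative indexing sketch:
```python
from pyspark.mllib.linalg import SparseVector

sv = SparseVector(4, {1: 1.0, 3: 5.5})
sv[1]  # 1.0 -- a stored value
sv[0]  # 0.0 -- an implicit zero
```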

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #4025 from MechCoder/spark-2909 and squashes the following commits:

07d0f26 [MechCoder] STY: Rename item to index
f02148b [MechCoder] [SPARK-2909] [Mlib] SparseVector in pyspark now supports indexing
2015-01-14 11:03:11 -08:00
Davies Liu 8ead999fd6 [SPARK-5223] [MLlib] [PySpark] fix MapConverter and ListConverter in MLlib
It will introduce problems if an object in a dict/list/tuple cannot be supported by Py4J, such as Vector.
Also, pickle may have better performance for larger objects (fewer RPCs).

In some cases where the object in a dict/list cannot be pickled (such as JavaObject), we should still use MapConverter/ListConverter.

This PR should be ported into branch-1.2

Author: Davies Liu <davies@databricks.com>

Closes #4023 from davies/listconvert and squashes the following commits:

55d4ab2 [Davies Liu] fix MapConverter and ListConverter in MLlib
2015-01-13 12:50:31 -08:00
RJ Nowling c9c8b219ad [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib

This is a follow-up to PR #3680: https://github.com/apache/spark/pull/3680.
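
A short sketch of the new samplers (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.random import RandomRDDs

RandomRDDs.gammaRDD(sc, shape=2.0, scale=2.0, size=100, seed=1)
RandomRDDs.logNormalRDD(sc, mean=0.0, std=1.0, size=100, seed=1)
RandomRDDs.exponentialRDD(sc, mean=2.0, size=100, seed=1)
```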

Author: RJ Nowling <rnowling@gmail.com>

Closes #3955 from rnowling/spark4891 and squashes the following commits:

1236a01 [RJ Nowling] Fix Python style issues
7a01a78 [RJ Nowling] Fix Python style issues
174beab [RJ Nowling] [SPARK-4891][PySpark][MLlib] Add gamma/log normal/exp dist sampling to PySpark MLlib
2015-01-08 15:03:43 -08:00
freeman 6c6f325740 [SPARK-5089][PYSPARK][MLLIB] Fix vector convert
This is a small change addressing a potentially significant bug in how PySpark + MLlib handles non-float64 numpy arrays. The conversion to `DenseVector` that occurs when passing RDDs to MLlib algorithms in PySpark should automatically upcast to float64, but currently this wasn't actually happening. As a result, non-float64 arrays would be silently parsed inappropriately during SerDe, yielding erroneous results when running, for example, KMeans.

The PR includes the fix, as well as a new test for the correct conversion behavior.
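
A tiny sketch of the upcast idea (illustrative, not the PR's exact code):
```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int32)
if arr.dtype != np.float64:
    arr = arr.astype(np.float64)  # upcast before SerDe so values survive the round-trip
```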

davies

Author: freeman <the.freeman.lab@gmail.com>

Closes #3902 from freeman-lab/fix-vector-convert and squashes the following commits:

764db47 [freeman] Add a test for proper conversion behavior
704f97e [freeman] Return array after changing type
2015-01-05 13:10:59 -08:00
lewuathe 3cd516191b [SPARK-4822] Use sphinx tags for Python doc annotations
Modify the Python annotations for Sphinx. There is no change to the build process; see:
https://github.com/apache/spark/blob/master/docs/README.md
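
A hedged sketch of the annotation style (an illustrative docstring, not the PR's code):
```python
def train(data):
    """Train a model.

    .. note:: Experimental

    :param data: an RDD of LabeledPoint.
    """
```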

Author: lewuathe <lewuathe@me.com>

Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:

88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
2014-12-17 17:31:24 -08:00
Joseph K. Bradley affc3f460f [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc
+ small doc edit
+ include edit to make IntelliJ happy

CC: davies  mengxr

Note to davies  -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like.  (Those warnings occur when generating the Python docs.)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3669 from jkbradley/python-warnings and squashes the following commits:

4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc.  Small doc edit.  Small include edit to make IntelliJ happy.
2014-12-17 14:12:46 -08:00
jbencook cb48447493 [SPARK-4855][mllib] testing the Chi-squared hypothesis test
This PR tests the PySpark Chi-squared hypothesis test from this commit: c8abddc516 and moves some of the error messaging into Python.

It is a port of the Scala tests here: [HypothesisTestSuite.scala](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/stat/HypothesisTestSuite.scala)

Hopefully, SPARK-2980 can be closed.
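
A minimal usage sketch of the API under test (illustrative):
```python
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

observed = Vectors.dense([4.0, 6.0, 5.0])
result = Statistics.chiSqTest(observed)  # goodness of fit against a uniform expectation
result.pValue, result.statistic
```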

Author: jbencook <jbenjamincook@gmail.com>

Closes #3679 from jbencook/master and squashes the following commits:

44078e0 [jbencook] checking that bad input throws the correct exceptions
f12ee10 [jbencook] removing checks for ValueError since input tests are on the Scala side
7536cf1 [jbencook] removing python checks for invalid input
a17ee84 [jbencook] [SPARK-2980][mllib] adding unit tests for the pyspark chi-squared test
3aeb0d9 [jbencook] [SPARK-2980][mllib] bringing Chi-squared error messages to the python side
2014-12-16 11:37:23 -08:00
Yuu ISHIKAWA 8098fab06c [SPARK-4494][mllib] IDFModel.transform() add support for single vector
I improved `IDFModel.transform` to allow using a single vector.

[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)
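
A brief sketch of the improved call (illustrative; assumes a live `sc`):
```python
from pyspark.mllib.feature import IDF
from pyspark.mllib.linalg import Vectors

tf = sc.parallelize([Vectors.dense([1.0, 2.0]), Vectors.dense([0.0, 1.0])])
model = IDF().fit(tf)
model.transform(Vectors.dense([1.0, 2.0]))  # a single vector now works, not only an RDD
```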

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #3603 from yu-iskw/idf and squashes the following commits:

256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
2014-12-15 13:44:15 -08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00