Commit graph

59 commits

Author SHA1 Message Date
Sean Owen 6378d4bc06 [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3
### What changes were proposed in this pull request?

- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc

Notes:

- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.

### Why are the changes needed?

Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.

### Does this PR introduce any user-facing change?

Yes, in that deprecated items are removed from some public APIs.

### How was this patch tested?

Existing tests.

Closes #25684 from srowen/SPARK-28980.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-09 10:19:40 -05:00
HyukjinKwon ef142371e7 [SPARK-28736][SPARK-28735][PYTHON][ML] Fix PySpark ML tests to pass in JDK 11
<!--
Thanks for sending a pull request!  Here are some tips for you:
  1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
  2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
  3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
  4. Be sure to keep the PR description updated to reflect all changes.
  5. Please write your PR title to summarize what this PR proposes.
  6. If possible, provide a concise example to reproduce the issue for a faster review.
-->

### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
  1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
  2. If you fix some SQL features, you can provide some references of other DBMSes.
  3. If there is design documentation, please add the link.
  4. If there is a discussion in the mailing list, please add the link.
-->

This PR proposes to fix both tests below:

```
======================================================================
FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction
    self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4))
AssertionError: False is not true
```

```
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[0] - 1.0) < 0.001
Expected:
    True
Got:
    False
**********************************************************************
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel
Failed example:
    abs(softPredicted[1] - 0.0) < 0.001
Expected:
    True
Got:
    False
```

to pass in JDK 11.

The root cause seems to be different float values being understood via Py4J. This issue also was found in https://github.com/apache/spark/pull/25132 before.

When floats are transferred from Python to JVM, the values are sent as are. Python floats are not "precise" due to its own limitation - https://docs.python.org/3/tutorial/floatingpoint.html.
For some reasons, the floats from Python on JDK 8 and JDK 11 are different, which is already explicitly not guaranteed.

This seems why only some tests in PySpark with floats are being failed.

So, this PR fixes it by increasing tolerance in identified test cases in PySpark.

### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
  1. If you propose a new API, clarify the use case for a new API.
  2. If you fix a bug, you can clarify why it is a bug.
-->

To fully support JDK 11. See, for instance, https://github.com/apache/spark/pull/25443 and https://github.com/apache/spark/pull/25423 for ongoing efforts.

### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->

No.

### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->

Manually tested as described in JIRAs:

```
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python
```

```
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python
```

Closes #25475 from HyukjinKwon/SPARK-28735.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-08-16 19:47:29 +09:00
HyukjinKwon fe75ff8bea [SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation
## What changes were proposed in this pull request?

Seems like we used to generate PySpark API documentation by Epydoc almost at the very first place (see 85b8f2c64f).

This fixes an actual issue:

Before:

![Screen Shot 2019-07-05 at 8 20 01 PM](https://user-images.githubusercontent.com/6477701/60720491-e9879180-9f65-11e9-9562-100830a456cd.png)

After:

![Screen Shot 2019-07-05 at 8 20 05 PM](https://user-images.githubusercontent.com/6477701/60720495-ec828200-9f65-11e9-8277-8f689e292cb0.png)

It seems apparently a bug within `epytext` plugin during the conversion between`param` and `:param` syntax. See also [Epydoc syntax](http://epydoc.sourceforge.net/manual-epytext.html).

Actually, Epydoc syntax violates [PEP-257](https://www.python.org/dev/peps/pep-0257/) IIRC and blocks us to enable some rules for doctest linter as well.

We should remove this legacy away and I guess Spark 3 is good timing to do it.

## How was this patch tested?

Manually built the doc and check each.

I had to manually find the Epydoc syntax by `git grep -r "{L"`, for instance.

Closes #25060 from HyukjinKwon/SPARK-28206.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-07-05 10:08:22 -07:00
Sean Owen c2d0d700b5 [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis
## What changes were proposed in this pull request?

Misc code cleanup from lgtm.com analysis. See comments below for details.

## How was this patch tested?

Existing tests.

Closes #23571 from srowen/SPARK-26640.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-17 19:40:39 -06:00
李亮 e503065fd8 [SPARK-25868][MLLIB] One part of Spark MLlib Kmean Logic Performance problem
## What changes were proposed in this pull request?

Fix fastSquaredDistance to calculate dense-dense situation calculation performance problem and meanwhile enhance the calculation accuracy.

## How was this patch tested?
From different point to test after add this patch, the dense-dense calculation situation performance is enhanced and will do influence other calculation situation like (sparse-sparse, sparse-dense)

**For calculation logic test**
There is my test for sparse-sparse, dense-dense, sparse-dense case

There is test result:
First we need define some branch path logic for sparse-sparse and sparse-dense case
if meet precisionBound1, we define it as LOGIC1
if not meet precisionBound1, and not meet precisionBound2, we define it as LOGIC2
if not meet precisionBound1, but meet precisionBound2, we define it as LOGIC3
(There is a trick, you can manually change the precision value to meet above situation)

sparse- sparse case time cost situation (milliseconds)
LOGIC1
Before add patch: 7786, 7970, 8086
After add patch: 7729, 7653, 7903
LOGIC2
Before add patch: 8412, 9029, 8606
After add patch: 8603, 8724, 9024
LOGIC3
Before add patch: 19365, 19146, 19351
After add patch: 18917, 19007, 19074

sparse-dense case time cost situation (milliseconds)
LOGIC1
Before add patch: 4195, 4014, 4409
After add patch: 4081,3971, 4151
LOGIC2
Before add patch: 4968, 5579, 5080
After add patch: 4980, 5472, 5148
LOGIC3
Before add patch: 11848, 12077, 12168
After add patch: 11718, 11874, 11743

And for dense-dense case like we already discussed in comment, only use sqdist to calculate distance

dense-dense case time cost situation (milliseconds)
Before add patch: 7340, 7816, 7672
After add patch: 5752, 5800, 5753

**For real world data test**
There is my test data situation
I use the data
http://archive.ics.uci.edu/ml/datasets/Condition+monitoring+of+hydraulic+systems
extract file (PS1, PS2, PS3, PS4, PS5, PS6) to form the test data

total instances are 13230
the attributes for line are 6000

Result for sparse-sparse situation time cost (milliseconds)
Before Enhance: 7670, 7704, 7652
After Enhance: 7634, 7729, 7645

Closes #22893 from KyleLi1985/updatekmeanpatch.

Authored-by: 李亮 <liang.li.work@outlook.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2018-11-14 07:24:13 -08:00
Sean Owen 08c76b5d39 [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)

## What changes were proposed in this pull request?

Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.

## How was this patch tested?

Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.

Closes #22400 from srowen/SPARK-25238.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-13 11:19:43 +08:00
hyukjinkwon 044b33b2ed [SPARK-24740][PYTHON][ML] Make PySpark's tests compatible with NumPy 1.14+
## What changes were proposed in this pull request?

This PR proposes to make PySpark's tests compatible with NumPy 0.14+
NumPy 0.14.x introduced rather radical changes about its string representation.

For example, the tests below are failed:

```
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 895, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  2.],
                 [ 1.,  3.]])
Got:
    DenseMatrix([[0., 2.],
                 [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 899, in __main__.DenseMatrix.__str__
Failed example:
    print(dm)
Expected:
    DenseMatrix([[ 0.,  1.],
                 [ 2.,  3.]])
Got:
    DenseMatrix([[0., 1.],
                 [2., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 939, in __main__.DenseMatrix.toArray
Failed example:
    m.toArray()
Expected:
    array([[ 0.,  2.],
           [ 1.,  3.]])
Got:
    array([[0., 2.],
           [1., 3.]])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 324, in __main__.DenseVector.dot
Failed example:
    dense.dot(np.reshape([1., 2., 3., 4.], (2, 2), order='F'))
Expected:
    array([  5.,  11.])
Got:
    array([ 5., 11.])
**********************************************************************
File "/.../spark/python/pyspark/ml/linalg/__init__.py", line 567, in __main__.SparseVector.dot
Failed example:
    a.dot(np.array([[1, 1], [2, 2], [3, 3], [4, 4]]))
Expected:
    array([ 22.,  22.])
Got:
    array([22., 22.])
```

See [release note](https://docs.scipy.org/doc/numpy-1.14.0/release.html#compatibility-notes).

## How was this patch tested?

Manually tested:

```
$ ./run-tests --python-executables=python3.6,python2.7 --modules=pyspark-ml,pyspark-mllib
Running PySpark tests. Output is in /.../spark/python/unit-tests.log
Will test against the following Python executables: ['python3.6', 'python2.7']
Will test the following Python modules: ['pyspark-ml', 'pyspark-mllib']
Starting test(python2.7): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.classification
Starting test(python3.6): pyspark.mllib.tests
Starting test(python2.7): pyspark.ml.clustering
Finished test(python2.7): pyspark.ml.clustering (54s)
Starting test(python2.7): pyspark.ml.evaluation
Finished test(python2.7): pyspark.ml.classification (74s)
Starting test(python2.7): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.evaluation (27s)
Starting test(python2.7): pyspark.ml.fpm
Finished test(python2.7): pyspark.ml.fpm (0s)
Starting test(python2.7): pyspark.ml.image
Finished test(python2.7): pyspark.ml.image (17s)
Starting test(python2.7): pyspark.ml.linalg.__init__
Finished test(python2.7): pyspark.ml.linalg.__init__ (1s)
Starting test(python2.7): pyspark.ml.recommendation
Finished test(python2.7): pyspark.ml.feature (76s)
Starting test(python2.7): pyspark.ml.regression
Finished test(python2.7): pyspark.ml.recommendation (69s)
Starting test(python2.7): pyspark.ml.stat
Finished test(python2.7): pyspark.ml.regression (45s)
Starting test(python2.7): pyspark.ml.tests
Finished test(python2.7): pyspark.ml.stat (28s)
Starting test(python2.7): pyspark.ml.tuning
Finished test(python2.7): pyspark.ml.tuning (20s)
Starting test(python2.7): pyspark.mllib.classification
Finished test(python2.7): pyspark.mllib.classification (31s)
Starting test(python2.7): pyspark.mllib.clustering
Finished test(python2.7): pyspark.mllib.tests (260s)
Starting test(python2.7): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.tests (266s)
Starting test(python2.7): pyspark.mllib.feature
Finished test(python2.7): pyspark.mllib.evaluation (21s)
Starting test(python2.7): pyspark.mllib.fpm
Finished test(python2.7): pyspark.mllib.feature (38s)
Starting test(python2.7): pyspark.mllib.linalg.__init__
Finished test(python2.7): pyspark.mllib.linalg.__init__ (1s)
Starting test(python2.7): pyspark.mllib.linalg.distributed
Finished test(python2.7): pyspark.mllib.fpm (34s)
Starting test(python2.7): pyspark.mllib.random
Finished test(python2.7): pyspark.mllib.clustering (64s)
Starting test(python2.7): pyspark.mllib.recommendation
Finished test(python2.7): pyspark.mllib.random (15s)
Starting test(python2.7): pyspark.mllib.regression
Finished test(python2.7): pyspark.mllib.linalg.distributed (47s)
Starting test(python2.7): pyspark.mllib.stat.KernelDensity
Finished test(python2.7): pyspark.mllib.stat.KernelDensity (0s)
Starting test(python2.7): pyspark.mllib.stat._statistics
Finished test(python2.7): pyspark.mllib.recommendation (40s)
Starting test(python2.7): pyspark.mllib.tree
Finished test(python2.7): pyspark.mllib.regression (38s)
Starting test(python2.7): pyspark.mllib.util
Finished test(python2.7): pyspark.mllib.stat._statistics (19s)
Starting test(python3.6): pyspark.ml.classification
Finished test(python2.7): pyspark.mllib.tree (26s)
Starting test(python3.6): pyspark.ml.clustering
Finished test(python2.7): pyspark.mllib.util (27s)
Starting test(python3.6): pyspark.ml.evaluation
Finished test(python3.6): pyspark.ml.evaluation (30s)
Starting test(python3.6): pyspark.ml.feature
Finished test(python2.7): pyspark.ml.tests (234s)
Starting test(python3.6): pyspark.ml.fpm
Finished test(python3.6): pyspark.ml.fpm (1s)
Starting test(python3.6): pyspark.ml.image
Finished test(python3.6): pyspark.ml.clustering (55s)
Starting test(python3.6): pyspark.ml.linalg.__init__
Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
Starting test(python3.6): pyspark.ml.recommendation
Finished test(python3.6): pyspark.ml.classification (71s)
Starting test(python3.6): pyspark.ml.regression
Finished test(python3.6): pyspark.ml.image (18s)
Starting test(python3.6): pyspark.ml.stat
Finished test(python3.6): pyspark.ml.stat (37s)
Starting test(python3.6): pyspark.ml.tests
Finished test(python3.6): pyspark.ml.regression (59s)
Starting test(python3.6): pyspark.ml.tuning
Finished test(python3.6): pyspark.ml.feature (93s)
Starting test(python3.6): pyspark.mllib.classification
Finished test(python3.6): pyspark.ml.recommendation (83s)
Starting test(python3.6): pyspark.mllib.clustering
Finished test(python3.6): pyspark.ml.tuning (29s)
Starting test(python3.6): pyspark.mllib.evaluation
Finished test(python3.6): pyspark.mllib.evaluation (26s)
Starting test(python3.6): pyspark.mllib.feature
Finished test(python3.6): pyspark.mllib.classification (43s)
Starting test(python3.6): pyspark.mllib.fpm
Finished test(python3.6): pyspark.mllib.clustering (81s)
Starting test(python3.6): pyspark.mllib.linalg.__init__
Finished test(python3.6): pyspark.mllib.linalg.__init__ (2s)
Starting test(python3.6): pyspark.mllib.linalg.distributed
Finished test(python3.6): pyspark.mllib.fpm (48s)
Starting test(python3.6): pyspark.mllib.random
Finished test(python3.6): pyspark.mllib.feature (54s)
Starting test(python3.6): pyspark.mllib.recommendation
Finished test(python3.6): pyspark.mllib.random (18s)
Starting test(python3.6): pyspark.mllib.regression
Finished test(python3.6): pyspark.mllib.linalg.distributed (55s)
Starting test(python3.6): pyspark.mllib.stat.KernelDensity
Finished test(python3.6): pyspark.mllib.stat.KernelDensity (1s)
Starting test(python3.6): pyspark.mllib.stat._statistics
Finished test(python3.6): pyspark.mllib.recommendation (51s)
Starting test(python3.6): pyspark.mllib.tree
Finished test(python3.6): pyspark.mllib.regression (45s)
Starting test(python3.6): pyspark.mllib.util
Finished test(python3.6): pyspark.mllib.stat._statistics (21s)
Finished test(python3.6): pyspark.mllib.tree (27s)
Finished test(python3.6): pyspark.mllib.util (27s)
Finished test(python3.6): pyspark.ml.tests (264s)
```

Author: hyukjinkwon <gurwls223@apache.org>

Closes #21715 from HyukjinKwon/SPARK-24740.
2018-07-07 11:39:29 +08:00
Benjamin Peterson 7013eea11c [SPARK-23522][PYTHON] always use sys.exit over builtin exit
The exit() builtin is only for interactive use. applications should use sys.exit().

## What changes were proposed in this pull request?

All usage of the builtin `exit()` function is replaced by `sys.exit()`.

## How was this patch tested?

I ran `python/run-tests`.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Benjamin Peterson <benjamin@python.org>

Closes #20682 from benjaminp/sys-exit.
2018-03-08 20:38:34 +09:00
bomeng aa6db57e39 [SPARK-22399][ML] update the location of reference paper
## What changes were proposed in this pull request?
Update the url of reference paper.

## How was this patch tested?
It is comments, so nothing tested.

Author: bomeng <bmeng@us.ibm.com>

Closes #19614 from bomeng/22399.
2017-10-31 08:20:23 +00:00
hyukjinkwon 933a6548d4
[SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation
## What changes were proposed in this pull request?

It seems in Python, there are

- `Note:`
- `NOTE:`
- `Note that`
- `.. note::`

This PR proposes to fix those to `.. note::` to be consistent.

**Before**

<img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png">

<img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png">

**After**

<img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png">

<img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png">

## How was this patch tested?

The notes were found via

```bash
grep -r "Note: " .
grep -r "NOTE: " .
grep -r "Note that " .
```

And then fixed one by one comparing with API documentation.

After that, manually tested via `make html` under `./python/docs`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15947 from HyukjinKwon/SPARK-18447.
2016-11-22 11:40:18 +00:00
Yanbo Liang 883c763184 [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from 5 to 2.
## What changes were proposed in this pull request?
#14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15050 from yanboliang/spark-17389.
2016-09-11 13:47:13 +01:00
Nick Lavers 5377fc6236 [SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place.

Created a unit test that runs a Pearson's chi squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lavers@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
2016-08-19 10:11:59 +01:00
Joseph K. Bradley 01f09b1612 [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML
## What changes were proposed in this pull request?

General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

## How was this patch tested?

N/A

Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14147 from jkbradley/experimental-audit.
2016-07-13 12:33:39 -07:00
hyukjinkwon 4e14199ff7 [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation
## What changes were proposed in this pull request?

This PR fixes wrongly formatted examples in PySpark documentation as below:

- **`SparkSession`**

  - **Before**

    ![2016-07-06 11 34 41](https://cloud.githubusercontent.com/assets/6477701/16605847/ae939526-436d-11e6-8ab8-6ad578362425.png)

  - **After**

    ![2016-07-06 11 33 56](https://cloud.githubusercontent.com/assets/6477701/16605845/ace9ee78-436d-11e6-8923-b76d4fc3e7c3.png)

- **`Builder`**

  - **Before**
    ![2016-07-06 11 34 44](https://cloud.githubusercontent.com/assets/6477701/16605844/aba60dbc-436d-11e6-990a-c87bc0281c6b.png)

  - **After**
    ![2016-07-06 1 26 37](https://cloud.githubusercontent.com/assets/6477701/16607562/586704c0-437d-11e6-9483-e0af93d8f74e.png)

This PR also fixes several similar instances across the documentation in `sql` PySpark module.

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14063 from HyukjinKwon/minor-pyspark-builder.
2016-07-06 10:45:51 -07:00
Joseph K. Bradley fdde7d0aa0 [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls
## What changes were proposed in this pull request?

Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.

This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X

## How was this patch tested?

Existing unit tests.  Manual testing in an environment where this was an issue.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14023 from jkbradley/SPARK-16348.
2016-07-05 17:00:24 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated  ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Zheng RuiFeng d18276cb1d [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB
JIRA: https://issues.apache.org/jira/browse/SPARK-13672

## What changes were proposed in this pull request?

add two python examples of BisectingKMeans for ml and mllib

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11515 from zhengruifeng/mllib_bkm_pe.
2016-03-11 09:21:12 +02:00
Liang-Chi Hsieh e3441e3f68 [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.
2016-02-13 15:56:20 -08:00
Bryan Cutler cba1d6b659 [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the clustering module.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
2016-02-02 10:50:22 -08:00
Holden Karau 0ddba6d88f [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means
From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans.

Author: Holden Karau <holden@us.ibm.com>

Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
2016-01-19 10:15:54 -08:00
Yanbo Liang ee4ee02b86 [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10552 from yanboliang/spark-12603.
2016-01-11 14:43:25 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
zero323 592f64985d [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None
If initial model passed to GMM is not empty it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #10644 from zero323/SPARK-12006.
2016-01-07 10:32:56 -08:00
Yin Huai e5cde7ab11 Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None"
This reverts commit fcd013cf70.

Author: Yin Huai <yhuai@databricks.com>

Closes #10632 from yhuai/pythonStyle.
2016-01-06 22:03:31 -08:00
zero323 fcd013cf70 [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None
If initial model passed to GMM is not empty it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9986 from zero323/SPARK-12006.
2016-01-06 11:58:33 -08:00
Yu ISHIKAWA c0e48dfa61 [SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in Python
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9534 from yu-iskw/SPARK-11566.
2015-11-10 16:42:28 -08:00
Yu ISHIKAWA 7dc9d8dba6 [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9577 from yu-iskw/SPARK-11610.
2015-11-09 16:25:29 -08:00
Yu ISHIKAWA 2ff0e79a86 [SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python
Could jkbradley and davies review it?

- Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it.
- Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8643 from yu-iskw/SPARK-8467-2.
2015-11-06 22:56:29 -08:00
Xiangrui Meng 33ae7a35da [SPARK-11358][MLLIB] deprecate runs in k-means
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.

cc: srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9322 from mengxr/SPARK-11358.
2015-11-02 13:42:16 -08:00
Sean Owen 826e1e304b [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.

Supersedes https://github.com/apache/spark/pull/9293

Author: Sean Owen <sowen@cloudera.com>

Closes #9309 from srowen/SPARK-11302.2.
2015-10-27 23:07:37 -07:00
noelsmith 5d4f6abec4 [SPARK-10271][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.clustering
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).

Author: noelsmith <mail@noelsmith.com>

Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
2015-10-26 21:28:18 -07:00
Evan Chen da936fbb74 [SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib)
Provide initialModel param for pyspark.mllib.clustering.KMeans

Author: Evan Chen <chene@us.ibm.com>

Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
2015-10-07 15:04:53 -07:00
MechCoder 198d181dfb [SPARK-7105] [PYSPARK] [MLLIB] Support model save/load in GMM
This PR introduces save / load for GMM's in python API.

Also I refactored `GaussianMixtureModel` and inherited it from `JavaModelWrapper` with model being `GaussianMixtureModelWrapper`, a wrapper which provides convenience methods to `GaussianMixtureModel` (due to serialization and deserialization issues) and I moved the creation of gaussians to the scala backend.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7617 from MechCoder/python_gmm_save_load and squashes the following commits:

9c305aa [MechCoder] [SPARK-7105] [PySpark] [MLlib] Support model save/load in GMM
2015-07-28 15:00:25 -07:00
MechCoder 5307c9d3f7 [SPARK-9223] [PYSPARK] [MLLIB] Support model save/load in LDA
Since save / load has been merged in LDA, it takes no time to write the wrappers in Python as well.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7587 from MechCoder/python_lda_save_load and squashes the following commits:

c8e4ea7 [MechCoder] [SPARK-9223] [PySpark] Support model save/load in LDA
2015-07-22 17:22:12 -07:00
Yu ISHIKAWA 4692769655 [SPARK-6259] [MLLIB] Python API for LDA
I implemented the Python API for LDA. But I didn't implemented a method for `LDAModel.describeTopics()`, beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue.

TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:

6855f59 [Yu ISHIKAWA] LDA inherits object
28bd165 [Yu ISHIKAWA] Change the place of testing code
d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
9f8bed8 [Yu ISHIKAWA] Simplify casting
faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
73412c3 [Yu ISHIKAWA] Fix the typo
2278829 [Yu ISHIKAWA] Fix the indentation
39514ec [Yu ISHIKAWA] Modify how to cast the input data
8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
2015-07-14 23:27:42 -07:00
Yanbo Liang 0effe180f4 [SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue
PySpark PowerIterationClustering test failure due to bad demo data.
If the data is small,  PowerIterationClustering will behavior indeterministic.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7177 from yanboliang/spark-8765 and squashes the following commits:

392ae54 [Yanbo Liang] fix model.assignments output
5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
2015-07-06 16:15:12 -07:00
Joseph K. Bradley b8faa32875 [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test
See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]

CC yanboliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7164 from jkbradley/pic-python-test and squashes the following commits:

156d55b [Joseph K. Bradley] removed flaky python PIC test
2015-07-01 11:57:52 -07:00
Yanbo Liang dfde31da5c [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering
Python support for Power Iteration Clustering
https://issues.apache.org/jira/browse/SPARK-5962

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:

6b03d82 [Yanbo Liang] address comments
4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
2015-06-28 22:38:04 -07:00
Yanbo Liang 2519dcc33b [MINOR] [MLLIB] rename some functions of PythonMLLibAPI
Keep the same naming conventions for PythonMLLibAPI.
Only the following three functions is different from others
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So change them to
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
It does not affect any users and public APIs, only to make better understand for developer and code hacker.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:

771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
2015-06-25 08:13:17 -07:00
Yu ISHIKAWA 5d89d9f00b [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in regression.py
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
2015-06-22 11:53:11 -07:00
MechCoder 54976e55e3 [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans
Python bindings for StreamingKMeans

Will change status to MRG once docs, tests and examples are updated.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6499 from MechCoder/spark-4118 and squashes the following commits:

7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
2015-06-19 12:23:15 -07:00
FlytxtRnD 8f4aaba0e4 [SPARK-7651] [MLLIB] [PYSPARK] GMM predict, predictSoft should raise error on bad input
In the Python API for Gaussian Mixture Model, predict() and predictSoft() methods should raise an error when the input argument is not an RDD.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #6180 from FlytxtRnD/GmmPredictException and squashes the following commits:

4b6aa11 [FlytxtRnD] Raise error if the input to predict()/predictSoft() is not an RDD
2015-05-15 10:43:18 -07:00
Yanbo Liang 94761485b2 [SPARK-6258] [MLLIB] GaussianMixture Python API parity check
Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python
```scala
GaussianMixture
    setInitialModel
GaussianMixtureModel
    k
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6087 from yanboliang/spark-6258 and squashes the following commits:

b3af21c [Yanbo Liang] fix typo
2b645c1 [Yanbo Liang] fix doc
638b4b7 [Yanbo Liang] address comments
b5bcade [Yanbo Liang] GaussianMixture Python API parity check
2015-05-15 00:18:39 -07:00
Hrishikesh Subramonian 5995ada96b [SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity
The following items are added to Python kmeans:

kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k

Author: Hrishikesh Subramonian <hrishikesh.subramonian@flytxt.com>

Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following commits:

b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon, setInitializationSteps, k and computeCost added.
2015-05-05 07:57:39 -07:00
Davies Liu 04e44b37cc [SPARK-4897] [PySpark] Python 3 support
This PR update PySpark to support Python 3 (tested with 3.4).

Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.

TODO: ec2/spark-ec2.py is not fully tested with python3.

Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #5173 from davies/python3 and squashes the following commits:

d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
2015-04-16 16:20:57 -07:00
Xiangrui Meng c94d062647 [SPARK-6226][MLLIB] add save/load in PySpark's KMeansModel
Use `_py2java` and `_java2py` to convert Python model to/from Java model. yinxusen

Author: Xiangrui Meng <meng@databricks.com>

Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:

570ba81 [Xiangrui Meng] fix python style
b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
2015-03-17 12:14:40 -07:00
Joseph K. Bradley 4a17eedb16 [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release
For SPARK-5867:
* The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
* It should also include Python examples now.

For SPARK-5892:
* Fix Python docs
* Various other cleanups

BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]

CC: mengxr  (ML),  davies  (Python docs)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:

f191bb0 [Joseph K. Bradley] small cleanups
e786efa [Joseph K. Bradley] small doc corrections
6b1ab4a [Joseph K. Bradley] fixed python lint test
946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
34b067f [Joseph K. Bradley] small doc correction
da16aef [Joseph K. Bradley] Fixed python mllib docs
8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
b05a80d [Joseph K. Bradley] organize imports. doc cleanups
e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
2015-02-20 02:31:32 -08:00
FlytxtRnD 50a1a874e1 [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model
Python API for the Gaussian Mixture Model clustering algorithm in MLLib.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:

c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
2015-02-02 23:04:55 -08:00
nate.crosswhite 7450a992b3 [SPARK-4749] [mllib]: Allow initializing KMeans clusters using a seed
This implements the functionality for SPARK-4749 and provides units tests in Scala and PySpark

Author: nate.crosswhite <nate.crosswhite@stresearch.com>
Author: nxwhite-str <nxwhite-str@users.noreply.github.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3610 from nxwhite-str/master and squashes the following commits:

a2ebbd3 [nxwhite-str] Merge pull request #1 from mengxr/SPARK-4749-kmeans-seed
7668124 [Xiangrui Meng] minor updates
f8d5928 [nate.crosswhite] Addressing PR issues
277d367 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
9156a57 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
5d087b4 [nate.crosswhite] Adding KMeans train with seed and Scala unit test
616d111 [nate.crosswhite] Merge remote-tracking branch 'upstream/master'
35c1884 [nate.crosswhite] Add kmeans initial seed to pyspark API
2015-01-21 10:32:10 -08:00
Davies Liu ce95bd8e13 [SPARK-4531] [MLlib] cache serialized java object
The Pyrolite is pretty slow (comparing to the adhoc serializer in 1.1), it cause much performance regression in 1.2, because we cache the serialized Python object in JVM, deserialize them into Java object in each step.

This PR change to cache the deserialized JavaRDD instead of PythonRDD to avoid the deserialization of Pyrolite. It should have similar memory usage as before, but much faster.

Author: Davies Liu <davies@databricks.com>

Closes #3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
2014-11-21 15:02:31 -08:00