Commit graph

45 commits

Author SHA1 Message Date
Jatin Puri d2e86cb3cd [SPARK-26616][MLLIB] Expose document frequency in IDFModel
## What changes were proposed in this pull request?

This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.

* The document frequency is returned as an `Array[Long]`
* If the minimum  document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value
* Pyspark changes

## How was this patch tested?

The existing test case was edited to also check for the document frequency values.

I  am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything

Reviewer request: mengxr  zjffdu yinxusen

Closes #23549 from purijatin/master.

Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 07:41:54 -06:00
Sean Owen c2d0d700b5 [SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis
## What changes were proposed in this pull request?

Misc code cleanup from lgtm.com analysis. See comments below for details.

## How was this patch tested?

Existing tests.

Closes #23571 from srowen/SPARK-26640.

Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-17 19:40:39 -06:00
Sean Owen 08c76b5d39 [SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)

## What changes were proposed in this pull request?

Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.

## How was this patch tested?

Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.

Closes #22400 from srowen/SPARK-25238.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2018-09-13 11:19:43 +08:00
Benjamin Peterson 7013eea11c [SPARK-23522][PYTHON] always use sys.exit over builtin exit
The exit() builtin is only for interactive use. applications should use sys.exit().

## What changes were proposed in this pull request?

All usage of the builtin `exit()` function is replaced by `sys.exit()`.

## How was this patch tested?

I ran `python/run-tests`.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Benjamin Peterson <benjamin@python.org>

Closes #20682 from benjaminp/sys-exit.
2018-03-08 20:38:34 +09:00
Peng, Meng 32286ba68a
[SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
2017-01-10 13:09:58 +00:00
Peng 79ff853631 [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add  FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

ut will be added soon

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes #15212 from mpjlu/fdr_fwe.
2016-12-28 00:49:36 -08:00
hyukjinkwon 933a6548d4
[SPARK-18447][DOCS] Fix the markdown for Note:/NOTE:/Note that across Python API documentation
## What changes were proposed in this pull request?

It seems in Python, there are

- `Note:`
- `NOTE:`
- `Note that`
- `.. note::`

This PR proposes to fix those to `.. note::` to be consistent.

**Before**

<img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png">

<img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png">

**After**

<img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png">

<img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png">

## How was this patch tested?

The notes were found via

```bash
grep -r "Note: " .
grep -r "NOTE: " .
grep -r "Note that " .
```

And then fixed one by one comparing with API documentation.

After that, manually tested via `make html` under `./python/docs`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15947 from HyukjinKwon/SPARK-18447.
2016-11-22 11:40:18 +00:00
Joseph K. Bradley 91c33a0ca5 [SPARK-18088][ML] Various ChiSqSelector cleanups
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?

Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15647 from jkbradley/chisqselector-follow-ups.
2016-11-01 17:00:00 -07:00
Peng c8b612decb
[SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and SelectPercentile because of DoF difference
## What changes were proposed in this pull request?

For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.

So we change statistic to pValue for SelectKBest and SelectPercentile

## How was this patch tested?
change existing test

Author: Peng <peng.meng@intel.com>

Closes #15444 from mpjlu/chisqure-bug.
2016-10-14 12:48:57 +01:00
Yanbo Liang ac65139be9
[SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15214 from yanboliang/spark-17017.
2016-09-26 09:45:33 +01:00
Peng, Meng b366f18496
[SPARK-17017][MLLIB][ML] add a chiSquare Selector based on False Positive Rate (FPR) test
## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

## How was this patch tested?

Add Scala ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #14597 from mpjlu/fprChiSquare.
2016-09-21 10:17:38 +01:00
William Benton 25cbbe6ca3
[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector
## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary.  Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.
2016-09-17 12:49:58 +01:00
Sean Owen cdeb97a8cd [SPARK-17311][MLLIB] Standardize Python-Java MLlib API to accept optional long seeds in all cases
## What changes were proposed in this pull request?

Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change.

- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14826 from srowen/SPARK-16832.2.
2016-09-04 12:40:51 +01:00
Sean Owen e07baf1412 [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True
## What changes were proposed in this pull request?

Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

## How was this patch tested?

Jenkins tests, including new caes to reflect the new behavior.

Author: Sean Owen <sowen@cloudera.com>

Closes #14663 from srowen/SPARK-17001.
2016-08-27 08:48:56 +01:00
Joseph K. Bradley 01f09b1612 [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML
## What changes were proposed in this pull request?

General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.

Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier

DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

## How was this patch tested?

N/A

Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14147 from jkbradley/experimental-audit.
2016-07-13 12:33:39 -07:00
Joseph K. Bradley fdde7d0aa0 [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls
## What changes were proposed in this pull request?

Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.

This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X

## How was this patch tested?

Existing unit tests.  Manual testing in an environment where this was an issue.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #14023 from jkbradley/SPARK-16348.
2016-07-05 17:00:24 -07:00
WeichenXu a15ca5533d [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
## What changes were proposed in this pull request?

Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
2016-05-23 18:14:48 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Yong Tang bc748b7b8f [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib
## What changes were proposed in this pull request?

This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.

Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.

## How was this patch tested?

This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12079 from yongtang/SPARK-14238.
2016-04-14 21:53:32 +02:00
Holden Karau 969d5665bb [SPARK-12296][PYSPARK][MLLIB] Feature parity for pyspark mllib standard scaler model
Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel.

Author: Holden Karau <holden@us.ibm.com>

Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
2015-12-22 09:14:12 +02:00
Davies Liu 27b98e99d2 [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
MLlib should use SQLContext.getOrCreate() instead of creating new SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes #10338 from davies/create_context.
2015-12-16 15:48:11 -08:00
Liang-Chi Hsieh b51a4cdff3 [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark
JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #10100 from viirya/fix-load-py-wordvecmodel.
2015-12-14 09:59:42 -08:00
noelsmith 610971ecfe [SPARK-10273] Add @since annotation to pyspark.mllib.feature
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).

Author: noelsmith <mail@noelsmith.com>

Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
2015-09-14 21:58:52 -07:00
Yu ISHIKAWA 488bad319a [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:

975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
2015-07-02 15:55:16 -07:00
Yanbo Liang f9b6bf2f83 [SPARK-7667] [MLLIB] MLlib Python API consistency check
MLlib Python API consistency check

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6856 from yanboliang/spark-7667 and squashes the following commits:

21bae35 [Yanbo Liang] remove duplicate code
eb12f95 [Yanbo Liang] fix doc inherit problem
9e7ec3c [Yanbo Liang] address comments
e763d32 [Yanbo Liang] MLlib Python API consistency check
2015-06-29 18:50:23 -07:00
Yanbo Liang 2519dcc33b [MINOR] [MLLIB] rename some functions of PythonMLLibAPI
Keep the same naming conventions for PythonMLLibAPI.
Only the following three functions is different from others
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So change them to
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
It does not affect any users and public APIs, only to make better understand for developer and code hacker.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:

771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
2015-06-25 08:13:17 -07:00
Yanbo Liang 32e3cdaa64 [SPARK-7604] [MLLIB] Python API for PCA and PCAModel
Python API for PCA and PCAModel

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6315 from yanboliang/spark-7604 and squashes the following commits:

1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
2015-06-21 12:04:20 -07:00
MechCoder 22732e1eca [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct
Python API for org.apache.spark.mllib.feature.ElementwiseProduct

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6346 from MechCoder/spark-7605 and squashes the following commits:

79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
2015-06-17 22:08:38 -07:00
Yanbo Liang 1617363fbb [SPARK-7918] [MLLIB] MLlib Python doc parity check for evaluation and feature
Check then make the MLlib Python evaluation and feature doc to be as complete as the Scala doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6461 from yanboliang/spark-7918 and squashes the following commits:

940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse
a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
2015-05-30 16:24:07 -07:00
Yanbo Liang 35c9599b94 [SPARK-5913] [MLLIB] Python API for ChiSqSelector
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #5939 from yanboliang/spark-5913 and squashes the following commits:

cdaac99 [Yanbo Liang] Python API for ChiSqSelector
2015-05-08 15:48:39 -07:00
Davies Liu 04e44b37cc [SPARK-4897] [PySpark] Python 3 support
This PR update PySpark to support Python 3 (tested with 3.4).

Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.

TODO: ec2/spark-ec2.py is not fully tested with python3.

Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #5173 from davies/python3 and squashes the following commits:

d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
2015-04-16 16:20:57 -07:00
lewuathe fc17661475 [SPARK-6643][MLLIB] Implement StandardScalerModel missing methods
This is the sub-task of SPARK-6254.
Wrap missing method for `StandardScalerModel`.

Author: lewuathe <lewuathe@me.com>

Closes #5310 from Lewuathe/SPARK-6643 and squashes the following commits:

fafd690 [lewuathe] Fix for lint-python
bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
578f5ee [lewuathe] Remove unnecessary class
a38f155 [lewuathe] Merge master
66bb2ab [lewuathe] Fix typos
82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
2015-04-12 22:17:16 -07:00
lewuathe 512a2f191a [SPARK-6615][MLLIB] Python API for Word2Vec
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.

Author: lewuathe <lewuathe@me.com>

Closes #5296 from Lewuathe/SPARK-6615 and squashes the following commits:

f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
2015-04-03 09:49:50 -07:00
lewuathe 46de6c05e0 [SPARK-6598][MLLIB] Python API for IDFModel
This is the sub-task of SPARK-6254.
Wrapping IDFModel `idf` member function for pyspark.

Author: lewuathe <lewuathe@me.com>

Closes #5264 from Lewuathe/SPARK-6598 and squashes the following commits:

1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
2015-03-31 11:25:21 -07:00
Joseph K. Bradley d20559b157 [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT
* Add GradientBoostedTrees Python examples to ML guide
  * I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:

c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
2015-02-25 16:13:17 -08:00
lewuathe 3cd516191b [SPARK-4822] Use sphinx tags for Python doc annotations
Modify python annotations for sphinx. There is no change to build process from.
https://github.com/apache/spark/blob/master/docs/README.md

Author: lewuathe <lewuathe@me.com>

Closes #3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:

88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
2014-12-17 17:31:24 -08:00
Joseph K. Bradley affc3f460f [SPARK-4821] [mllib] [python] [docs] Fix for pyspark.mllib.rand doc
+ small doc edit
+ include edit to make IntelliJ happy

CC: davies  mengxr

Note to davies  -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like.  (Those warnings occur when generating the Python docs.)

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3669 from jkbradley/python-warnings and squashes the following commits:

4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc.  Small doc edit.  Small include edit to make IntelliJ happy.
2014-12-17 14:12:46 -08:00
Yuu ISHIKAWA 8098fab06c [SPARK-4494][mllib] IDFModel.transform() add support for single vector
I improved `IDFModel.transform` to allow using a single vector.

[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)

Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #3603 from yu-iskw/idf and squashes the following commits:

256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
2014-12-15 13:44:15 -08:00
Davies Liu ce0333f9a0 [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py
This PR rename random.py to rand.py to avoid the side affects of conflict with random module, but still keep the same interface as before.

```
>>> from pyspark.mllib.random import RandomRDDs
```

```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:
NAME
    random - Python package for random data generation.

FILE
    /Users/davies/work/spark/python/pyspark/mllib/rand.py

CLASSES
    __builtin__.object
        pyspark.mllib.random.RandomRDDs

    class RandomRDDs(__builtin__.object)
     |  Generator methods for creating RDDs comprised of i.i.d samples from
     |  some distribution.
     |
     |  Static methods defined here:
     |
     |  normalRDD(sc, size, numPartitions=None, seed=None)
```

cc mengxr

reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/

Author: Davies Liu <davies@databricks.com>

Closes #3216 from davies/random and squashes the following commits:

7ac4e8b [Davies Liu] rename random.py to rand.py
2014-11-13 10:24:54 -08:00
Davies Liu 65083e93dd [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API
This PR check all of the existing Python MLlib API to make sure that numpy.array is supported as Vector (also RDD of numpy.array).

It also improve some docstring and doctest.

cc mateiz mengxr

Author: Davies Liu <davies@databricks.com>

Closes #3189 from davies/numpy and squashes the following commits:

d5057c4 [Davies Liu] fix tests
6987611 [Davies Liu] support numpy.array for all MLlib API
2014-11-10 22:26:16 -08:00
Davies Liu 872fc669b4 [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API
Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.

After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2995 from davies/cleanup and squashes the following commits:

8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API
2014-10-30 22:25:18 -07:00
Davies Liu fae095bc7c [SPARK-3961] [MLlib] [PySpark] Python API for mllib.feature
Added completed Python API for MLlib.feature

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF

cc mengxr

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2819 from davies/feature and squashes the following commits:

4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
2014-10-28 03:50:22 -07:00
Davies Liu 091d32c52e [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode
Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.

So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2830 from davies/fix_pickle and squashes the following commits:

0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
2014-10-16 14:56:50 -07:00
cocoatomo 7a3f589ef8 [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
Sphinx documents contains a corrupted ReST format and have some warnings.

The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.

commit: 0e8203f4fb

output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees   . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.

Build finished. The HTML pages are in _build/html.
```

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:

2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
2014-10-11 11:51:59 -07:00
Liquan Pei 098c7344e6 [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec
mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string sequence both on python side and JVM side
(3) Test for SerDe of string sequence on JVM side

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2356 from Ishiihara/Word2Vec-python and squashes the following commits:

476ea34 [Liquan Pei] style fixes
b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes
8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
daf88a6 [Liquan Pei] modification according to feedback
a73fa19 [Liquan Pei] clean up
3d8007b [Liquan Pei] fix findSynonyms for vector
1bdcd2e [Liquan Pei] minor fixes
cdef9f4 [Liquan Pei] add missing comments
b7447eb [Liquan Pei] modify according to feedback
b9a7383 [Liquan Pei] cache words RDD in fit
89490bf [Liquan Pei] add tests and Word2VecModelWrapper
78bbb53 [Liquan Pei] use pickle for seq string SerDe
a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
ca1e5ff [Liquan Pei] fix test
68e7276 [Liquan Pei] minor style fixes
48d5e72 [Liquan Pei] Functionality improvement
0ad3ac1 [Liquan Pei] minor fix
c867fdf [Liquan Pei] add Word2Vec to pyspark
2014-10-07 16:43:34 -07:00