### What changes were proposed in this pull request?
jira link: https://issues.apache.org/jira/browse/SPARK-30930
Remove ML/MLLIB DeveloperApi annotations.
### Why are the changes needed?
The Developer APIs in ML/MLLIB have been there for a long time. They are stable now and are very unlikely to be changed or removed, so I unmark these Developer APIs in this PR.
### Does this PR introduce any user-facing change?
Yes. DeveloperApi annotations are removed from docs.
### How was this patch tested?
existing tests
Closes#27859 from huaxingao/spark-30930.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
## What changes were proposed in this pull request?
This change exposes the `df` (document frequency) as a public val along with the number of documents (`m`) as part of the IDF model.
* The document frequency is returned as an `Array[Long]`
* If the minimum document frequency is set, this is considered in the df calculation. If the count is less than minDocFreq, the df is 0 for such terms
* numDocs is not very required. But it can be useful, if we plan to provide a provision in future for user to give their own idf function, instead of using a default (log((1+m)/(1+df))). In such cases, the user can provide a function taking input of `m` and `df` and returning the idf value
* Pyspark changes
## How was this patch tested?
The existing test case was edited to also check for the document frequency values.
I am not very good with python or pyspark. I have committed and run tests based on my understanding. Kindly let me know if I have missed anything
Reviewer request: mengxr zjffdu yinxusen
Closes#23549 from purijatin/master.
Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Misc code cleanup from lgtm.com analysis. See comments below for details.
## How was this patch tested?
Existing tests.
Closes#23571 from srowen/SPARK-26640.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(This change is a subset of the changes needed for the JIRA; see https://github.com/apache/spark/pull/22231)
## What changes were proposed in this pull request?
Use raw strings and simpler regex syntax consistently in Python, which also avoids warnings from pycodestyle about accidentally relying Python's non-escaping of non-reserved chars in normal strings. Also, fix a few long lines.
## How was this patch tested?
Existing tests, and some manual double-checking of the behavior of regexes in Python 2/3 to be sure.
Closes#22400 from srowen/SPARK-25238.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
The exit() builtin is only for interactive use. applications should use sys.exit().
## What changes were proposed in this pull request?
All usage of the builtin `exit()` function is replaced by `sys.exit()`.
## How was this patch tested?
I ran `python/run-tests`.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Benjamin Peterson <benjamin@python.org>
Closes#20682 from benjaminp/sys-exit.
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.
## How was this patch tested?
ut
Author: Peng, Meng <peng.meng@intel.com>
Closes#16434 from mpjlu/fdr_fwe_update.
## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate
We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?
ut will be added soon
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>
Closes#15212 from mpjlu/fdr_fwe.
## What changes were proposed in this pull request?
- Renamed kbest to numTopFeatures
- Renamed alpha to fpr
- Added missing Since annotations
- Doc cleanups
## How was this patch tested?
Added new standardized unit tests for spark.ml.
Improved existing unit test coverage a bit.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#15647 from jkbradley/chisqselector-follow-ups.
## What changes were proposed in this pull request?
For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features.
So we change statistic to pValue for SelectKBest and SelectPercentile
## How was this patch tested?
change existing test
Author: Peng <peng.meng@intel.com>
Closes#15444 from mpjlu/chisqure-bug.
## What changes were proposed in this pull request?
#14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed:
* We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```.
* Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also.
* If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620)
* We should use lower case of the selector type names to follow MLlib convention.
* Add ML Python API.
## How was this patch tested?
Unit test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#15214 from yanboliang/spark-17017.
## What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests. False Positive Rate (FPR) is a popular univariate statistical test for feature selection. We add a chiSquare Selector based on False Positive Rate (FPR) test in this PR, like it is implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?
Add Scala ut
Author: Peng, Meng <peng.meng@intel.com>
Closes#14597 from mpjlu/fprChiSquare.
## What changes were proposed in this pull request?
This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
## How was this patch tested?
I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
Author: William Benton <willb@redhat.com>
Closes#15105 from willb/fix/findSynonyms.
## What changes were proposed in this pull request?
Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change.
- PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed"
- .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed
- BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it.
## How was this patch tested?
Existing tests
Author: Sean Owen <sowen@cloudera.com>
Closes#14826 from srowen/SPARK-16832.2.
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
## How was this patch tested?
Jenkins tests, including new caes to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes#14663 from srowen/SPARK-17001.
## What changes were proposed in this pull request?
General decisions to follow, except where noted:
* spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
* spark.ml, pyspark.ml
** Annotate Estimator-Model pairs of classes and companion objects the same way.
** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
* DeveloperApi annotations are left alone, except where noted.
* No changes to which types are sealed.
Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
* Model Summary classes
* MLWriter, MLReader, MLWritable, MLReadable
* Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
* RFormula: Its behavior may need to change slightly to match R in edge cases.
* AFTSurvivalRegression
* MultilayerPerceptronClassifier
DeveloperApi changes:
* ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
## How was this patch tested?
N/A
Note to reviewers:
* spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
* Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14147 from jkbradley/experimental-audit.
## What changes were proposed in this pull request?
Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.
This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X
## How was this patch tested?
Existing unit tests. Manual testing in an environment where this was an issue.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#14023 from jkbradley/SPARK-16348.
## What changes were proposed in this pull request?
Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
## How was this patch tested?
Existing test.
Author: WeichenXu <WeichenXu123@outlook.com>
Closes#13242 from WeichenXu123/python_doctest_update_sparksession.
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib
## How was this patch tested?
Added test cases in tests.py under both ML and MLlib
Author: Jason Lee <cjlee@us.ibm.com>
Closes#12428 from jasoncl/SPARK-14564.
## What changes were proposed in this pull request?
This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.
Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.
## How was this patch tested?
This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes#12079 from yongtang/SPARK-14238.
Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel.
Author: Holden Karau <holden@us.ibm.com>
Closes#10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
MLlib should use SQLContext.getOrCreate() instead of creating new SQLContext.
Author: Davies Liu <davies@databricks.com>
Closes#10338 from davies/create_context.
JIRA: https://issues.apache.org/jira/browse/SPARK-12016
We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes#10100 from viirya/fix-load-py-wordvecmodel.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
Author: noelsmith <mail@noelsmith.com>
Closes#8633 from noel-smith/SPARK-10273-since-mllib-feature.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#6821 from yu-iskw/SPARK-7104 and squashes the following commits:
975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
Keep the same naming conventions for PythonMLLibAPI.
Only the following three functions is different from others
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So change them to
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
It does not affect any users and public APIs, only to make better understand for developer and code hacker.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#7011 from yanboliang/py-mllib-api-rename and squashes the following commits:
771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
Python API for PCA and PCAModel
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6315 from yanboliang/spark-7604 and squashes the following commits:
1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
Python API for org.apache.spark.mllib.feature.ElementwiseProduct
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes#6346 from MechCoder/spark-7605 and squashes the following commits:
79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
Check then make the MLlib Python evaluation and feature doc to be as complete as the Scala doc.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#6461 from yanboliang/spark-7918 and squashes the following commits:
940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse
a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
Add a Python API for mllib.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-5913
Author: Yanbo Liang <ybliang8@gmail.com>
Closes#5939 from yanboliang/spark-5913 and squashes the following commits:
cdaac99 [Yanbo Liang] Python API for ChiSqSelector
This PR update PySpark to support Python 3 (tested with 3.4).
Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
TODO: ec2/spark-ec2.py is not fully tested with python3.
Author: Davies Liu <davies@databricks.com>
Author: twneale <twneale@gmail.com>
Author: Josh Rosen <joshrosen@databricks.com>
Closes#5173 from davies/python3 and squashes the following commits:
d7d6323 [Davies Liu] fix tests
6c52a98 [Davies Liu] fix mllib test
99e334f [Davies Liu] update timeout
b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
cafd5ec [Davies Liu] adddress comments from @mengxr
bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
179fc8d [Davies Liu] tuning flaky tests
8c8b957 [Davies Liu] fix ResourceWarning in Python 3
5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
4006829 [Davies Liu] fix test
2fc0066 [Davies Liu] add python3 path
71535e9 [Davies Liu] fix xrange and divide
5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ed498c8 [Davies Liu] fix compatibility with python 3
820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ad7c374 [Davies Liu] fix mllib test and warning
ef1fc2f [Davies Liu] fix tests
4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
59bb492 [Davies Liu] fix tests
1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
ca0fdd3 [Davies Liu] fix code style
9563a15 [Davies Liu] add imap back for python 2
0b1ec04 [Davies Liu] make python examples work with Python 3
d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
a716d34 [Davies Liu] test with python 3.4
f1700e8 [Davies Liu] fix test in python3
671b1db [Davies Liu] fix test in python3
692ff47 [Davies Liu] fix flaky test
7b9699f [Davies Liu] invalidate import cache for Python 3.3+
9c58497 [Davies Liu] fix kill worker
309bfbf [Davies Liu] keep compatibility
5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
f53e1f0 [Davies Liu] fix tests
70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
a39167e [Davies Liu] support customize class in __main__
814c77b [Davies Liu] run unittests with python 3
7f4476e [Davies Liu] mllib tests passed
d737924 [Davies Liu] pass ml tests
375ea17 [Davies Liu] SQL tests pass
6cc42a9 [Davies Liu] rename
431a8de [Davies Liu] streaming tests pass
78901a7 [Davies Liu] fix hash of serializer in Python 3
24b2f2e [Davies Liu] pass all RDD tests
35f48fe [Davies Liu] run future again
1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
6e3c21d [Davies Liu] make cloudpickle work with Python3
2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
7354371 [twneale] buffer --> memoryview I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
f40d925 [twneale] xrange --> range
e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
854be27 [Josh Rosen] Run `futurize` on Python code:
7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
This is the sub-task of SPARK-6254.
Wrap missing method for `StandardScalerModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5310 from Lewuathe/SPARK-6643 and squashes the following commits:
fafd690 [lewuathe] Fix for lint-python
bd31a64 [lewuathe] Merge branch 'master' into SPARK-6643
578f5ee [lewuathe] Remove unnecessary class
a38f155 [lewuathe] Merge master
66bb2ab [lewuathe] Fix typos
82683a0 [lewuathe] [SPARK-6643] Implement StandardScalerModel missing methods
This is the sub-task of SPARK-6254.
Wrap missing method for `Word2Vec` and `Word2VecModel`.
Author: lewuathe <lewuathe@me.com>
Closes#5296 from Lewuathe/SPARK-6615 and squashes the following commits:
f14c304 [lewuathe] Reorder tests
1d326b9 [lewuathe] Merge master
e2bedfb [lewuathe] Modify test cases
afb866d [lewuathe] [SPARK-6615] Python API for Word2Vec
This is the sub-task of SPARK-6254.
Wrapping IDFModel `idf` member function for pyspark.
Author: lewuathe <lewuathe@me.com>
Closes#5264 from Lewuathe/SPARK-6598 and squashes the following commits:
1dc522c [lewuathe] [SPARK-6598] Python API for IDFModel
* Add GradientBoostedTrees Python examples to ML guide
* I ran these in the pyspark shell, and they worked.
* Add save/load to examples in ML guide
* Added note to python docs about predict,transform not working within RDD actions,transformations in some cases (See SPARK-5981)
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#4750 from jkbradley/SPARK-5974 and squashes the following commits:
c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide. Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
6d81c3e [Joseph K. Bradley] completed python GBT examples
9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide. Added GBT examples to ML guide
Modify python annotations for sphinx. There is no change to build process from.
https://github.com/apache/spark/blob/master/docs/README.md
Author: lewuathe <lewuathe@me.com>
Closes#3685 from Lewuathe/sphinx-tag-for-pydoc and squashes the following commits:
88a0fd9 [lewuathe] [SPARK-4822] Fix DevelopApi and WARN tags
3d7a398 [lewuathe] [SPARK-4822] Use sphinx tags for Python doc annotations
+ small doc edit
+ include edit to make IntelliJ happy
CC: davies mengxr
Note to davies -- this does not fix the "WARNING: Literal block expected; none found." warnings since that seems to involve spacing which IntelliJ does not like. (Those warnings occur when generating the Python docs.)
Author: Joseph K. Bradley <joseph@databricks.com>
Closes#3669 from jkbradley/python-warnings and squashes the following commits:
4587868 [Joseph K. Bradley] fixed warning
8cb073c [Joseph K. Bradley] Updated based on davies recommendation
c51eca4 [Joseph K. Bradley] Updated rst file for pyspark.mllib.rand doc. Small doc edit. Small include edit to make IntelliJ happy.
I improved `IDFModel.transform` to allow using a single vector.
[[SPARK-4494] IDFModel.transform() add support for single vector - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-4494)
Author: Yuu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes#3603 from yu-iskw/idf and squashes the following commits:
256ff3d [Yuu ISHIKAWA] Fix typo
a3bf566 [Yuu ISHIKAWA] - Fix typo - Optimize import order - Aggregate the assertion tests - Modify `IDFModel.transform` API for pyspark
d25e49b [Yuu ISHIKAWA] Add the implementation of `IDFModel.transform` for a term frequency vector
This PR rename random.py to rand.py to avoid the side affects of conflict with random module, but still keep the same interface as before.
```
>>> from pyspark.mllib.random import RandomRDDs
```
```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:
NAME
random - Python package for random data generation.
FILE
/Users/davies/work/spark/python/pyspark/mllib/rand.py
CLASSES
__builtin__.object
pyspark.mllib.random.RandomRDDs
class RandomRDDs(__builtin__.object)
| Generator methods for creating RDDs comprised of i.i.d samples from
| some distribution.
|
| Static methods defined here:
|
| normalRDD(sc, size, numPartitions=None, seed=None)
```
cc mengxr
reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/
Author: Davies Liu <davies@databricks.com>
Closes#3216 from davies/random and squashes the following commits:
7ac4e8b [Davies Liu] rename random.py to rand.py
This PR check all of the existing Python MLlib API to make sure that numpy.array is supported as Vector (also RDD of numpy.array).
It also improve some docstring and doctest.
cc mateiz mengxr
Author: Davies Liu <davies@databricks.com>
Closes#3189 from davies/numpy and squashes the following commits:
d5057c4 [Davies Liu] fix tests
6987611 [Davies Liu] support numpy.array for all MLlib API
Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.
After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.
cc mengxr
Author: Davies Liu <davies@databricks.com>
Closes#2995 from davies/cleanup and squashes the following commits:
8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API
Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.
So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.
Author: Davies Liu <davies.liu@gmail.com>
Closes#2830 from davies/fix_pickle and squashes the following commits:
0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
Sphinx documents contains a corrupted ReST format and have some warnings.
The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.
commit: 0e8203f4fb
output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.
Build finished. The HTML pages are in _build/html.
```
Author: cocoatomo <cocoatomo77@gmail.com>
Closes#2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string sequence both on python side and JVM side
(3) Test for SerDe of string sequence on JVM side
Author: Liquan Pei <liquanpei@gmail.com>
Closes#2356 from Ishiihara/Word2Vec-python and squashes the following commits:
476ea34 [Liquan Pei] style fixes
b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes
8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
daf88a6 [Liquan Pei] modification according to feedback
a73fa19 [Liquan Pei] clean up
3d8007b [Liquan Pei] fix findSynonyms for vector
1bdcd2e [Liquan Pei] minor fixes
cdef9f4 [Liquan Pei] add missing comments
b7447eb [Liquan Pei] modify according to feedback
b9a7383 [Liquan Pei] cache words RDD in fit
89490bf [Liquan Pei] add tests and Word2VecModelWrapper
78bbb53 [Liquan Pei] use pickle for seq string SerDe
a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
ca1e5ff [Liquan Pei] fix test
68e7276 [Liquan Pei] minor style fixes
48d5e72 [Liquan Pei] Functionality improvement
0ad3ac1 [Liquan Pei] minor fix
c867fdf [Liquan Pei] add Word2Vec to pyspark